linux/kernel/trace/trace_syscalls.c

826 lines
20 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
// SPDX-License-Identifier: GPL-2.0
tracing/syscalls: use a dedicated file header Impact: fix build warnings and possibe compat misbehavior on IA64 Building a kernel on ia64 might trigger these ugly build warnings: CC arch/ia64/ia32/sys_ia32.o In file included from arch/ia64/ia32/sys_ia32.c:55: arch/ia64/ia32/ia32priv.h:290:1: warning: "elf_check_arch" redefined In file included from include/linux/elf.h:7, from include/linux/module.h:14, from include/linux/ftrace.h:8, from include/linux/syscalls.h:68, from arch/ia64/ia32/sys_ia32.c:18: arch/ia64/include/asm/elf.h:19:1: warning: this is the location of the previous definition [...] sys_ia32.c includes linux/syscalls.h which in turn includes linux/ftrace.h to import the syscalls tracing prototypes. But including ftrace.h can pull too much things for a low level file, especially on ia64 where the ia32 private headers conflict with higher level headers. Now we isolate the syscall tracing headers in their own lightweight file. Reported-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Baron <jbaron@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Davidson <md@google.com> LKML-Reference: <20090408184058.GB6017@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 18:40:59 +00:00
#include <trace/syscall.h>
#include <trace/events/syscalls.h>
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
#include <linux/syscalls.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/module.h> /* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
#include <linux/ftrace.h>
perf: Do the big rename: Performance Counters -> Performance Events Bye-bye Performance Counters, welcome Performance Events! In the past few months the perfcounters subsystem has grown out its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility. Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate. All in one, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion) The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well. Thanks goes to Stephane Eranian who first observed this misnomer and suggested a rename. User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.) This patch has been generated via the following script: FILES=$(find * -type f | grep -vE 'oprofile|[^K]config') sed -i \ -e 's/PERF_EVENT_/PERF_RECORD_/g' \ -e 's/PERF_COUNTER/PERF_EVENT/g' \ -e 's/perf_counter/perf_event/g' \ -e 's/nb_counters/nb_events/g' \ -e 's/swcounter/swevent/g' \ -e 's/tpcounter_event/tp_event/g' \ $FILES for N in $(find . -name perf_counter.[ch]); do M=$(echo $N | sed 's/perf_counter/perf_event/g') mv $N $M done FILES=$(find . -name perf_event.*) sed -i \ -e 's/COUNTER_MASK/REG_MASK/g' \ -e 's/COUNTER/EVENT/g' \ -e 's/\<event\>/event_id/g' \ -e 's/counter/event/g' \ -e 's/Counter/Event/g' \ $FILES ... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window. Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch. ( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. ) Suggested-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 10:02:48 +00:00
#include <linux/perf_event.h>
#include <linux/xarray.h>
#include <asm/syscall.h>
#include "trace_output.h"
#include "trace.h"
static DEFINE_MUTEX(syscall_trace_lock);
static int syscall_enter_register(struct trace_event_call *event,
enum trace_reg type, void *data);
static int syscall_exit_register(struct trace_event_call *event,
enum trace_reg type, void *data);
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
tracing: Move fields from event to class structure Move the defined fields from the event to the class structure. Since the fields of the event are defined by the class they belong to, it makes sense to have the class hold the information instead of the individual events. The events of the same class would just hold duplicate information. After this change the size of the kernel dropped another 3K: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900252 1057412 861512 6819176 680d68 vmlinux.regs 4900375 1053380 861512 6815267 67fe23 vmlinux.fields Although the text increased, this was mainly due to the C files having to adapt to the change. This is a constant increase, where new tracepoints will not increase the Text. But the big drop is in the data size (as well as needed allocations to hold the fields). This will give even more savings as more tracepoints are created. Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS() with several DEFINE_EVENT()s, then the savings will be lost. But we are pushing developers to consolidate events with DEFINE_EVENT() so this should not be an issue. The kprobes define a unique class to every new event, but are dynamic so it should not be a issue. The syscalls however have a single class but the fields for the individual events are different. The syscalls use a metadata to define the fields. I moved the fields list from the event to the metadata and added a "get_fields()" function to the class. This function is used to find the fields. For normal events and kprobes, get_fields() just returns a pointer to the fields list_head in the class. For syscall events, it returns the fields list_head in the metadata for the event. v2: Fixed the syscall fields. The syscall metadata needs a list of fields for both enter and exit. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-22 14:35:55 +00:00
static struct list_head *
syscall_get_enter_fields(struct trace_event_call *call)
tracing: Move fields from event to class structure Move the defined fields from the event to the class structure. Since the fields of the event are defined by the class they belong to, it makes sense to have the class hold the information instead of the individual events. The events of the same class would just hold duplicate information. After this change the size of the kernel dropped another 3K: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900252 1057412 861512 6819176 680d68 vmlinux.regs 4900375 1053380 861512 6815267 67fe23 vmlinux.fields Although the text increased, this was mainly due to the C files having to adapt to the change. This is a constant increase, where new tracepoints will not increase the Text. But the big drop is in the data size (as well as needed allocations to hold the fields). This will give even more savings as more tracepoints are created. Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS() with several DEFINE_EVENT()s, then the savings will be lost. But we are pushing developers to consolidate events with DEFINE_EVENT() so this should not be an issue. The kprobes define a unique class to every new event, but are dynamic so it should not be a issue. The syscalls however have a single class but the fields for the individual events are different. The syscalls use a metadata to define the fields. I moved the fields list from the event to the metadata and added a "get_fields()" function to the class. This function is used to find the fields. For normal events and kprobes, get_fields() just returns a pointer to the fields list_head in the class. For syscall events, it returns the fields list_head in the metadata for the event. v2: Fixed the syscall fields. The syscall metadata needs a list of fields for both enter and exit. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-22 14:35:55 +00:00
{
struct syscall_metadata *entry = call->data;
return &entry->enter_fields;
}
tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 22:06:09 +00:00
extern struct syscall_metadata *__start_syscalls_metadata[];
extern struct syscall_metadata *__stop_syscalls_metadata[];
static DEFINE_XARRAY(syscalls_metadata_sparse);
static struct syscall_metadata **syscalls_metadata;
#ifndef ARCH_HAS_SYSCALL_MATCH_SYM_NAME
static inline bool arch_syscall_match_sym_name(const char *sym, const char *name)
{
/*
* Only compare after the "sys" prefix. Archs that use
* syscall wrappers may have syscalls symbols aliases prefixed
* with ".SyS" or ".sys" instead of "sys", leading to an unwanted
* mismatch.
*/
return !strcmp(sym + 3, name + 3);
}
#endif
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
#ifdef ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
/*
* Some architectures that allow for 32bit applications
* to run on a 64bit kernel, do not map the syscalls for
* the 32bit tasks the same as they do for 64bit tasks.
*
* *cough*x86*cough*
*
* In such a case, instead of reporting the wrong syscalls,
* simply ignore them.
*
* For an arch to ignore the compat syscalls it needs to
* define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS as well as
* define the function arch_trace_is_compat_syscall() to let
* the tracing system know that it should ignore it.
*/
static int
trace_get_syscall_nr(struct task_struct *task, struct pt_regs *regs)
{
if (unlikely(arch_trace_is_compat_syscall(regs)))
return -1;
return syscall_get_nr(task, regs);
}
#else
static inline int
trace_get_syscall_nr(struct task_struct *task, struct pt_regs *regs)
{
return syscall_get_nr(task, regs);
}
#endif /* ARCH_TRACE_IGNORE_COMPAT_SYSCALLS */
tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 22:06:09 +00:00
static __init struct syscall_metadata *
find_syscall_meta(unsigned long syscall)
{
tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 22:06:09 +00:00
struct syscall_metadata **start;
struct syscall_metadata **stop;
char str[KSYM_SYMBOL_LEN];
tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 22:06:09 +00:00
start = __start_syscalls_metadata;
stop = __stop_syscalls_metadata;
kallsyms_lookup(syscall, NULL, NULL, NULL, str);
if (arch_syscall_match_sym_name(str, "sys_ni_syscall"))
return NULL;
for ( ; start < stop; start++) {
if ((*start)->name && arch_syscall_match_sym_name(str, (*start)->name))
tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 22:06:09 +00:00
return *start;
}
return NULL;
}
static struct syscall_metadata *syscall_nr_to_meta(int nr)
{
if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR))
return xa_load(&syscalls_metadata_sparse, (unsigned long)nr);
if (!syscalls_metadata || nr >= NR_syscalls || nr < 0)
return NULL;
return syscalls_metadata[nr];
}
const char *get_syscall_name(int syscall)
{
struct syscall_metadata *entry;
entry = syscall_nr_to_meta(syscall);
if (!entry)
return NULL;
return entry->name;
}
static enum print_line_t
print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event)
{
struct trace_array *tr = iter->tr;
struct trace_seq *s = &iter->seq;
struct trace_entry *ent = iter->ent;
struct syscall_trace_enter *trace;
struct syscall_metadata *entry;
int i, syscall;
trace = (typeof(trace))ent;
syscall = trace->nr;
entry = syscall_nr_to_meta(syscall);
if (!entry)
goto end;
if (entry->enter_event->event.type != ent->type) {
WARN_ON_ONCE(1);
goto end;
}
trace_seq_printf(s, "%s(", entry->name);
for (i = 0; i < entry->nb_args; i++) {
if (trace_seq_has_overflowed(s))
goto end;
/* parameter types */
if (tr->trace_flags & TRACE_ITER_VERBOSE)
trace_seq_printf(s, "%s ", entry->types[i]);
/* parameter values */
trace_seq_printf(s, "%s: %lx%s", entry->args[i],
trace->args[i],
i == entry->nb_args - 1 ? "" : ", ");
}
trace_seq_putc(s, ')');
end:
trace_seq_putc(s, '\n');
return trace_handle_return(s);
}
static enum print_line_t
print_syscall_exit(struct trace_iterator *iter, int flags,
struct trace_event *event)
{
struct trace_seq *s = &iter->seq;
struct trace_entry *ent = iter->ent;
struct syscall_trace_exit *trace;
int syscall;
struct syscall_metadata *entry;
trace = (typeof(trace))ent;
syscall = trace->nr;
entry = syscall_nr_to_meta(syscall);
if (!entry) {
trace_seq_putc(s, '\n');
goto out;
}
if (entry->exit_event->event.type != ent->type) {
WARN_ON_ONCE(1);
return TRACE_TYPE_UNHANDLED;
}
trace_seq_printf(s, "%s -> 0x%lx\n", entry->name,
trace->ret);
out:
return trace_handle_return(s);
}
extern char *__bad_type_size(void);
ftrace: Rework event_create_dir() Rework event_create_dir() to use an array of static data instead of function pointers where possible. The problem is that it would call the function pointer on module load before parse_args(), possibly even before jump_labels were initialized. Luckily the generated functions don't use jump_labels but it still seems fragile. It also gets in the way of changing when we make the module map executable. The generated function are basically calling trace_define_field() with a bunch of static arguments. So instead of a function, capture these arguments in a static array, avoiding the function call. Now there are a number of cases where the fields are dynamic (syscall arguments, kprobes and uprobes), in which case a static array does not work, for these we preserve the function call. Luckily all these cases are not related to modules and so we can retain the function call for them. Also fix up all broken tracepoint definitions that now generate a compile error. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-24 20:26:59 +00:00
#define SYSCALL_FIELD(_type, _name) { \
.type = #_type, .name = #_name, \
.size = sizeof(_type), .align = __alignof__(_type), \
.is_signed = is_signed_type(_type), .filter_type = FILTER_OTHER }
static int __init
__set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
{
int i;
int pos = 0;
/* When len=0, we just calculate the needed length */
#define LEN_OR_ZERO (len ? len - pos : 0)
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
for (i = 0; i < entry->nb_args; i++) {
pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx%s",
entry->args[i], sizeof(unsigned long),
i == entry->nb_args - 1 ? "" : ", ");
}
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
for (i = 0; i < entry->nb_args; i++) {
pos += snprintf(buf + pos, LEN_OR_ZERO,
", ((unsigned long)(REC->%s))", entry->args[i]);
}
#undef LEN_OR_ZERO
/* return the length of print_fmt */
return pos;
}
static int __init set_syscall_print_fmt(struct trace_event_call *call)
{
char *print_fmt;
int len;
struct syscall_metadata *entry = call->data;
if (entry->enter_event != call) {
call->print_fmt = "\"0x%lx\", REC->ret";
return 0;
}
/* First: called with 0 length to calculate the needed length */
len = __set_enter_print_fmt(entry, NULL, 0);
print_fmt = kmalloc(len + 1, GFP_KERNEL);
if (!print_fmt)
return -ENOMEM;
/* Second: actually write the @print_fmt */
__set_enter_print_fmt(entry, print_fmt, len + 1);
call->print_fmt = print_fmt;
return 0;
}
static void __init free_syscall_print_fmt(struct trace_event_call *call)
{
struct syscall_metadata *entry = call->data;
if (entry->enter_event == call)
kfree(call->print_fmt);
}
static int __init syscall_enter_define_fields(struct trace_event_call *call)
{
struct syscall_trace_enter trace;
struct syscall_metadata *meta = call->data;
int offset = offsetof(typeof(trace), args);
int ret = 0;
int i;
for (i = 0; i < meta->nb_args; i++) {
ret = trace_define_field(call, meta->types[i],
meta->args[i], offset,
sizeof(unsigned long), 0,
FILTER_OTHER);
ftrace: Rework event_create_dir() Rework event_create_dir() to use an array of static data instead of function pointers where possible. The problem is that it would call the function pointer on module load before parse_args(), possibly even before jump_labels were initialized. Luckily the generated functions don't use jump_labels but it still seems fragile. It also gets in the way of changing when we make the module map executable. The generated function are basically calling trace_define_field() with a bunch of static arguments. So instead of a function, capture these arguments in a static array, avoiding the function call. Now there are a number of cases where the fields are dynamic (syscall arguments, kprobes and uprobes), in which case a static array does not work, for these we preserve the function call. Luckily all these cases are not related to modules and so we can retain the function call for them. Also fix up all broken tracepoint definitions that now generate a compile error. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-24 20:26:59 +00:00
if (ret)
break;
offset += sizeof(unsigned long);
}
return ret;
}
static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
struct trace_array *tr = data;
struct trace_event_file *trace_file;
struct syscall_trace_enter *entry;
struct syscall_metadata *sys_data;
struct ring_buffer_event *event;
struct trace_buffer *buffer;
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
unsigned int trace_ctx;
unsigned long args[6];
int syscall_nr;
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
int size;
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
syscall_nr = trace_get_syscall_nr(current, regs);
tracing/syscalls: Ignore numbers outside NR_syscalls' range ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-10-29 22:06:58 +00:00
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
tracing: Check invalid syscall nr while tracing syscalls Most arch syscall_get_nr() implementations returns -1 if the syscall number is not valid. Accessing the bit field without a check might result in a kernel oops (at least I saw it on s390 for ftrace selftest). Before this change, this problem did not occur, because the invalid syscall number (-1) caused syscall_nr_to_meta() to return NULL. There are at least two scenarios where syscall_get_nr() can return -1: 1. For example, ptrace stores an invalid syscall number, and thus, tracing code resets it. (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c) 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which include kernel threads. However, the ftrace selftest triggers a kernel oops when testing syscall trace points: - The kernel thread is started as ususal (do_fork()), - tracing code sets TIF_SYSCALL_FTRACE, - the ret_from_fork() function is triggered and starts ftrace_syscall_exit() with an invalid syscall number. To avoid these scenarios, I suggest to check the syscall_nr. For instance, the ftrace selftest fails for s390 (with config option CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops. Unable to handle kernel pointer dereference at virtual kernel address 2000000000 Oops: 0038 [#1] PREEMPT SMP Modules linked in: CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18 Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8) Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3 Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650 0000000000000007 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78 000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78 Krnl Code: 00000000000ea540: e330d0000008 ag %r3,0(%r13) 00000000000ea546: a7480007 lhi %r4,7 00000000000ea54a: 1442 nr %r4,%r2 >00000000000ea54c: e31030000090 llgc %r1,0(%r3) 00000000000ea552: 5410d008 n %r1,8(%r13) 00000000000ea556: 8a104000 sra %r1,0(%r4) 00000000000ea55a: 5410d00c n %r1,12(%r13) 00000000000ea55e: 1211 ltr %r1,%r1 Call Trace: ([<0000000000000000>] 0x0) [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c [<000000000002d0c4>] sysc_return+0x0/0x8 [<000000000001c738>] kernel_thread_starter+0x0/0xc Last Breaking-Event-Address: [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> LKML-Reference: <20090825125027.GE4639@cetus.boeblingen.de.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-08-25 12:50:27 +00:00
return;
/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
if (!trace_file)
return;
if (trace_trigger_soft_disabled(trace_file))
return;
sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
return;
size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
trace_ctx = tracing_gen_ctx();
tracing: Fix irqs-off tag display in syscall tracing All syscall tracing irqs-off tags are wrong, the syscall enter entry doesn't disable irqs. [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 13/13 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-513 [000] d... 56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-513 [000] d... 56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) sendmail-771 [000] d... 56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6) The reason is syscall tracing doesn't record irq_flags into buffer. The proper display is: [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 14/14 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-514 [001] .... 46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-514 [001] .... 46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) <...>-920 [001] .... 47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0) Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com Cc: stable@vger.kernel.org # 2.6.35 Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-10 03:26:23 +00:00
buffer = tr->array_buffer.buffer;
event = trace_buffer_lock_reserve(buffer,
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
sys_data->enter_event->event.type, size, trace_ctx);
if (!event)
return;
entry = ring_buffer_event_data(event);
entry->nr = syscall_nr;
syscalls: Remove start and number from syscall_get_arguments() args At Linux Plumbers, Andy Lutomirski approached me and pointed out that the function call syscall_get_arguments() implemented in x86 was horribly written and not optimized for the standard case of passing in 0 and 6 for the starting index and the number of system calls to get. When looking at all the users of this function, I discovered that all instances pass in only 0 and 6 for these arguments. Instead of having this function handle different cases that are never used, simply rewrite it to return the first 6 arguments of a system call. This should help out the performance of tracing system calls by ptrace, ftrace and perf. Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Martin <dave.martin@arm.com> Cc: "Dmitry V. Levin" <ldv@altlinux.org> Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: uclinux-h8-devel@lists.sourceforge.jp Cc: linux-hexagon@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: nios2-dev@lists.rocketboards.org Cc: openrisc@lists.librecores.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86 Reviewed-by: Dmitry V. Levin <ldv@altlinux.org> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2016-11-07 21:26:37 +00:00
syscall_get_arguments(current, regs, args);
memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
event_trigger_unlock_commit(trace_file, buffer, event, entry,
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
trace_ctx);
}
static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
{
struct trace_array *tr = data;
struct trace_event_file *trace_file;
struct syscall_trace_exit *entry;
struct syscall_metadata *sys_data;
struct ring_buffer_event *event;
struct trace_buffer *buffer;
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
unsigned int trace_ctx;
int syscall_nr;
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
syscall_nr = trace_get_syscall_nr(current, regs);
tracing/syscalls: Ignore numbers outside NR_syscalls' range ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-10-29 22:06:58 +00:00
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
tracing: Check invalid syscall nr while tracing syscalls Most arch syscall_get_nr() implementations returns -1 if the syscall number is not valid. Accessing the bit field without a check might result in a kernel oops (at least I saw it on s390 for ftrace selftest). Before this change, this problem did not occur, because the invalid syscall number (-1) caused syscall_nr_to_meta() to return NULL. There are at least two scenarios where syscall_get_nr() can return -1: 1. For example, ptrace stores an invalid syscall number, and thus, tracing code resets it. (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c) 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which include kernel threads. However, the ftrace selftest triggers a kernel oops when testing syscall trace points: - The kernel thread is started as ususal (do_fork()), - tracing code sets TIF_SYSCALL_FTRACE, - the ret_from_fork() function is triggered and starts ftrace_syscall_exit() with an invalid syscall number. To avoid these scenarios, I suggest to check the syscall_nr. For instance, the ftrace selftest fails for s390 (with config option CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops. Unable to handle kernel pointer dereference at virtual kernel address 2000000000 Oops: 0038 [#1] PREEMPT SMP Modules linked in: CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18 Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8) Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3 Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650 0000000000000007 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78 000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78 Krnl Code: 00000000000ea540: e330d0000008 ag %r3,0(%r13) 00000000000ea546: a7480007 lhi %r4,7 00000000000ea54a: 1442 nr %r4,%r2 >00000000000ea54c: e31030000090 llgc %r1,0(%r3) 00000000000ea552: 5410d008 n %r1,8(%r13) 00000000000ea556: 8a104000 sra %r1,0(%r4) 00000000000ea55a: 5410d00c n %r1,12(%r13) 00000000000ea55e: 1211 ltr %r1,%r1 Call Trace: ([<0000000000000000>] 0x0) [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c [<000000000002d0c4>] sysc_return+0x0/0x8 [<000000000001c738>] kernel_thread_starter+0x0/0xc Last Breaking-Event-Address: [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> LKML-Reference: <20090825125027.GE4639@cetus.boeblingen.de.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-08-25 12:50:27 +00:00
return;
/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
if (!trace_file)
return;
if (trace_trigger_soft_disabled(trace_file))
return;
sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
return;
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
trace_ctx = tracing_gen_ctx();
tracing: Fix irqs-off tag display in syscall tracing All syscall tracing irqs-off tags are wrong, the syscall enter entry doesn't disable irqs. [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 13/13 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-513 [000] d... 56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-513 [000] d... 56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) sendmail-771 [000] d... 56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6) The reason is syscall tracing doesn't record irq_flags into buffer. The proper display is: [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 14/14 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-514 [001] .... 46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-514 [001] .... 46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) <...>-920 [001] .... 47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0) Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com Cc: stable@vger.kernel.org # 2.6.35 Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-10 03:26:23 +00:00
buffer = tr->array_buffer.buffer;
event = trace_buffer_lock_reserve(buffer,
tracing: Fix irqs-off tag display in syscall tracing All syscall tracing irqs-off tags are wrong, the syscall enter entry doesn't disable irqs. [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 13/13 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-513 [000] d... 56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-513 [000] d... 56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) sendmail-771 [000] d... 56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6) The reason is syscall tracing doesn't record irq_flags into buffer. The proper display is: [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 14/14 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-514 [001] .... 46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-514 [001] .... 46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) <...>-920 [001] .... 47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0) Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com Cc: stable@vger.kernel.org # 2.6.35 Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-10 03:26:23 +00:00
sys_data->exit_event->event.type, sizeof(*entry),
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
trace_ctx);
if (!event)
return;
entry = ring_buffer_event_data(event);
entry->nr = syscall_nr;
entry->ret = syscall_get_return_value(current, regs);
event_trigger_unlock_commit(trace_file, buffer, event, entry,
tracing: Merge irqflags + preempt counter. The state of the interrupts (irqflags) and the preemption counter are both passed down to tracing_generic_entry_update(). Only one bit of irqflags is actually required: The on/off state. The complete 32bit of the preemption counter isn't needed. Just whether of the upper bits (softirq, hardirq and NMI) are set and the preemption depth is needed. The irqflags and the preemption counter could be evaluated early and the information stored in an integer `trace_ctx'. tracing_generic_entry_update() would use the upper bits as the TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth (considering that one must be substracted from the counter in one special cases). The actual preemption value is not used except for the tracing record. The `irqflags' variable is mostly used only for the tracing record. An exception here is for instance wakeup_tracer_call() or probe_wakeup_sched_switch() which explicilty disable interrupts and use that `irqflags' to save (and restore) the IRQ state and to record the state. Struct trace_event_buffer has also the `pc' and flags' members which can be replaced with `trace_ctx' since their actual value is not used outside of trace recording. This will reduce tracing_generic_entry_update() to simply assign values to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved to _tracing_gen_ctx_flags() which replaces preempt_count() and local_save_flags() invocations. As an example, ftrace_syscall_enter() may invoke: - trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update() - event_trigger_unlock_commit() -> ftrace_trace_stack() -> … -> tracing_generic_entry_update() -> ftrace_trace_userstack() -> … -> tracing_generic_entry_update() In this case the TRACE_FLAG_* bits were evaluated three times. By using the `trace_ctx' they are evaluated once and assigned three times. A build with all tracers enabled on x86-64 with and without the patch: text data bss dec hex filename 21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old 21970293 17084168 7639260 46693721 2c87d59 vmlinux.new text shrank by 379 bytes, data remained constant. Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-25 19:45:08 +00:00
trace_ctx);
}
static int reg_event_syscall_enter(struct trace_event_file *file,
struct trace_event_call *call)
{
struct trace_array *tr = file->tr;
int ret = 0;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_enter)
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
if (!ret) {
rcu_assign_pointer(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
}
mutex_unlock(&syscall_trace_lock);
return ret;
}
static void unreg_event_syscall_enter(struct trace_event_file *file,
struct trace_event_call *call)
{
struct trace_array *tr = file->tr;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_enter--;
RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
mutex_unlock(&syscall_trace_lock);
}
static int reg_event_syscall_exit(struct trace_event_file *file,
struct trace_event_call *call)
{
struct trace_array *tr = file->tr;
int ret = 0;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_exit)
ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
if (!ret) {
rcu_assign_pointer(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
}
mutex_unlock(&syscall_trace_lock);
return ret;
}
static void unreg_event_syscall_exit(struct trace_event_file *file,
struct trace_event_call *call)
{
struct trace_array *tr = file->tr;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_exit--;
RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
if (!tr->sys_refcount_exit)
unregister_trace_sys_exit(ftrace_syscall_exit, tr);
mutex_unlock(&syscall_trace_lock);
}
static int __init init_syscall_trace(struct trace_event_call *call)
{
int id;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
if (num < 0 || num >= NR_syscalls) {
pr_debug("syscall %s metadata not mapped, disabling ftrace event\n",
((struct syscall_metadata *)call->data)->name);
return -ENOSYS;
}
if (set_syscall_print_fmt(call) < 0)
return -ENOMEM;
id = trace_event_raw_init(call);
if (id < 0) {
free_syscall_print_fmt(call);
return id;
}
return id;
}
ftrace: Rework event_create_dir() Rework event_create_dir() to use an array of static data instead of function pointers where possible. The problem is that it would call the function pointer on module load before parse_args(), possibly even before jump_labels were initialized. Luckily the generated functions don't use jump_labels but it still seems fragile. It also gets in the way of changing when we make the module map executable. The generated function are basically calling trace_define_field() with a bunch of static arguments. So instead of a function, capture these arguments in a static array, avoiding the function call. Now there are a number of cases where the fields are dynamic (syscall arguments, kprobes and uprobes), in which case a static array does not work, for these we preserve the function call. Luckily all these cases are not related to modules and so we can retain the function call for them. Also fix up all broken tracepoint definitions that now generate a compile error. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-24 20:26:59 +00:00
static struct trace_event_fields __refdata syscall_enter_fields_array[] = {
SYSCALL_FIELD(int, __syscall_nr),
{ .type = TRACE_FUNCTION_TYPE,
.define_fields = syscall_enter_define_fields },
{}
};
struct trace_event_functions enter_syscall_print_funcs = {
.trace = print_syscall_enter,
};
struct trace_event_functions exit_syscall_print_funcs = {
.trace = print_syscall_exit,
};
struct trace_event_class __refdata event_class_syscall_enter = {
.system = "syscalls",
.reg = syscall_enter_register,
ftrace: Rework event_create_dir() Rework event_create_dir() to use an array of static data instead of function pointers where possible. The problem is that it would call the function pointer on module load before parse_args(), possibly even before jump_labels were initialized. Luckily the generated functions don't use jump_labels but it still seems fragile. It also gets in the way of changing when we make the module map executable. The generated function are basically calling trace_define_field() with a bunch of static arguments. So instead of a function, capture these arguments in a static array, avoiding the function call. Now there are a number of cases where the fields are dynamic (syscall arguments, kprobes and uprobes), in which case a static array does not work, for these we preserve the function call. Luckily all these cases are not related to modules and so we can retain the function call for them. Also fix up all broken tracepoint definitions that now generate a compile error. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-24 20:26:59 +00:00
.fields_array = syscall_enter_fields_array,
.get_fields = syscall_get_enter_fields,
.raw_init = init_syscall_trace,
};
struct trace_event_class __refdata event_class_syscall_exit = {
.system = "syscalls",
.reg = syscall_exit_register,
ftrace: Rework event_create_dir() Rework event_create_dir() to use an array of static data instead of function pointers where possible. The problem is that it would call the function pointer on module load before parse_args(), possibly even before jump_labels were initialized. Luckily the generated functions don't use jump_labels but it still seems fragile. It also gets in the way of changing when we make the module map executable. The generated function are basically calling trace_define_field() with a bunch of static arguments. So instead of a function, capture these arguments in a static array, avoiding the function call. Now there are a number of cases where the fields are dynamic (syscall arguments, kprobes and uprobes), in which case a static array does not work, for these we preserve the function call. Luckily all these cases are not related to modules and so we can retain the function call for them. Also fix up all broken tracepoint definitions that now generate a compile error. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-24 20:26:59 +00:00
.fields_array = (struct trace_event_fields[]){
SYSCALL_FIELD(int, __syscall_nr),
SYSCALL_FIELD(long, ret),
{}
},
.fields = LIST_HEAD_INIT(event_class_syscall_exit.fields),
.raw_init = init_syscall_trace,
};
unsigned long __init __weak arch_syscall_addr(int nr)
{
return (unsigned long)sys_call_table[nr];
}
void __init init_ftrace_syscalls(void)
{
struct syscall_metadata *meta;
unsigned long addr;
int i;
void *ret;
if (!IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR)) {
syscalls_metadata = kcalloc(NR_syscalls,
sizeof(*syscalls_metadata),
GFP_KERNEL);
if (!syscalls_metadata) {
WARN_ON(1);
return;
}
}
for (i = 0; i < NR_syscalls; i++) {
addr = arch_syscall_addr(i);
meta = find_syscall_meta(addr);
if (!meta)
continue;
meta->syscall_nr = i;
if (!IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR)) {
syscalls_metadata[i] = meta;
} else {
ret = xa_store(&syscalls_metadata_sparse, i, meta,
GFP_KERNEL);
WARN(xa_is_err(ret),
"Syscall memory allocation failed\n");
}
}
}
#ifdef CONFIG_PERF_EVENTS
2009-08-11 18:22:53 +00:00
static DECLARE_BITMAP(enabled_perf_enter_syscalls, NR_syscalls);
static DECLARE_BITMAP(enabled_perf_exit_syscalls, NR_syscalls);
static int sys_perf_refcount_enter;
static int sys_perf_refcount_exit;
static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *regs,
struct syscall_metadata *sys_data,
struct syscall_trace_enter *rec)
{
struct syscall_tp_t {
unsigned long long regs;
unsigned long syscall_nr;
unsigned long args[SYSCALL_DEFINE_MAXARGS];
} param;
int i;
*(struct pt_regs **)&param = regs;
param.syscall_nr = rec->nr;
for (i = 0; i < sys_data->nb_args; i++)
param.args[i] = rec->args[i];
return trace_call_bpf(call, &param);
}
tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-20 21:04:50 +00:00
static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
{
struct syscall_metadata *sys_data;
struct syscall_trace_enter *rec;
struct hlist_head *head;
unsigned long args[6];
bool valid_prog_array;
int syscall_nr;
int rctx;
2009-08-11 18:22:53 +00:00
int size;
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
syscall_nr = trace_get_syscall_nr(current, regs);
tracing/syscalls: Ignore numbers outside NR_syscalls' range ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-10-29 22:06:58 +00:00
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
return;
sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
return;
head = this_cpu_ptr(sys_data->enter_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
if (!valid_prog_array && hlist_empty(head))
return;
2009-08-11 18:22:53 +00:00
/* get the size after alignment with the u32 buffer size field */
size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
size = ALIGN(size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
return;
rec->nr = syscall_nr;
syscalls: Remove start and number from syscall_get_arguments() args At Linux Plumbers, Andy Lutomirski approached me and pointed out that the function call syscall_get_arguments() implemented in x86 was horribly written and not optimized for the standard case of passing in 0 and 6 for the starting index and the number of system calls to get. When looking at all the users of this function, I discovered that all instances pass in only 0 and 6 for these arguments. Instead of having this function handle different cases that are never used, simply rewrite it to return the first 6 arguments of a system call. This should help out the performance of tracing system calls by ptrace, ftrace and perf. Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Martin <dave.martin@arm.com> Cc: "Dmitry V. Levin" <ldv@altlinux.org> Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: uclinux-h8-devel@lists.sourceforge.jp Cc: linux-hexagon@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: nios2-dev@lists.rocketboards.org Cc: openrisc@lists.librecores.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86 Reviewed-by: Dmitry V. Levin <ldv@altlinux.org> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2016-11-07 21:26:37 +00:00
syscall_get_arguments(current, regs, args);
memcpy(&rec->args, args, sizeof(unsigned long) * sys_data->nb_args);
if ((valid_prog_array &&
!perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
return;
}
perf_trace_buf_submit(rec, size, rctx,
sys_data->enter_event->event.type, 1, regs,
head, NULL);
}
static int perf_sysenter_enable(struct trace_event_call *call)
{
int ret = 0;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_enter)
tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-20 21:04:50 +00:00
ret = register_trace_sys_enter(perf_syscall_enter, NULL);
if (ret) {
pr_info("event trace: Could not activate syscall entry trace point");
} else {
set_bit(num, enabled_perf_enter_syscalls);
sys_perf_refcount_enter++;
}
mutex_unlock(&syscall_trace_lock);
return ret;
}
static void perf_sysenter_disable(struct trace_event_call *call)
{
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
mutex_lock(&syscall_trace_lock);
sys_perf_refcount_enter--;
clear_bit(num, enabled_perf_enter_syscalls);
if (!sys_perf_refcount_enter)
tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-20 21:04:50 +00:00
unregister_trace_sys_enter(perf_syscall_enter, NULL);
mutex_unlock(&syscall_trace_lock);
}
static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *regs,
struct syscall_trace_exit *rec)
{
struct syscall_tp_t {
unsigned long long regs;
unsigned long syscall_nr;
unsigned long ret;
} param;
*(struct pt_regs **)&param = regs;
param.syscall_nr = rec->nr;
param.ret = rec->ret;
return trace_call_bpf(call, &param);
}
tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-20 21:04:50 +00:00
static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
{
struct syscall_metadata *sys_data;
struct syscall_trace_exit *rec;
struct hlist_head *head;
bool valid_prog_array;
int syscall_nr;
int rctx;
int size;
tracing/syscalls: Allow archs to ignore tracing compat syscalls The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 21:18:59 +00:00
syscall_nr = trace_get_syscall_nr(current, regs);
tracing/syscalls: Ignore numbers outside NR_syscalls' range ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-10-29 22:06:58 +00:00
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
return;
sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
return;
head = this_cpu_ptr(sys_data->exit_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
if (!valid_prog_array && hlist_empty(head))
return;
/* We can probably do that at build time */
size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
2009-08-11 18:22:53 +00:00
rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
return;
rec->nr = syscall_nr;
rec->ret = syscall_get_return_value(current, regs);
if ((valid_prog_array &&
!perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
return;
}
perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
1, regs, head, NULL);
}
static int perf_sysexit_enable(struct trace_event_call *call)
{
int ret = 0;
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_exit)
tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-20 21:04:50 +00:00
ret = register_trace_sys_exit(perf_syscall_exit, NULL);
if (ret) {
pr_info("event trace: Could not activate syscall exit trace point");
} else {
set_bit(num, enabled_perf_exit_syscalls);
sys_perf_refcount_exit++;
}
mutex_unlock(&syscall_trace_lock);
return ret;
}
static void perf_sysexit_disable(struct trace_event_call *call)
{
int num;
num = ((struct syscall_metadata *)call->data)->syscall_nr;
mutex_lock(&syscall_trace_lock);
sys_perf_refcount_exit--;
clear_bit(num, enabled_perf_exit_syscalls);
if (!sys_perf_refcount_exit)
tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-20 21:04:50 +00:00
unregister_trace_sys_exit(perf_syscall_exit, NULL);
mutex_unlock(&syscall_trace_lock);
}
#endif /* CONFIG_PERF_EVENTS */
static int syscall_enter_register(struct trace_event_call *event,
enum trace_reg type, void *data)
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
{
struct trace_event_file *file = data;
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
switch (type) {
case TRACE_REG_REGISTER:
return reg_event_syscall_enter(file, event);
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
case TRACE_REG_UNREGISTER:
unreg_event_syscall_enter(file, event);
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
return 0;
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return perf_sysenter_enable(event);
case TRACE_REG_PERF_UNREGISTER:
perf_sysenter_disable(event);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
#endif
}
return 0;
}
static int syscall_exit_register(struct trace_event_call *event,
enum trace_reg type, void *data)
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
{
struct trace_event_file *file = data;
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
switch (type) {
case TRACE_REG_REGISTER:
return reg_event_syscall_exit(file, event);
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
case TRACE_REG_UNREGISTER:
unreg_event_syscall_exit(file, event);
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
return 0;
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return perf_sysexit_enable(event);
case TRACE_REG_PERF_UNREGISTER:
perf_sysexit_disable(event);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-21 16:27:06 +00:00
#endif
}
return 0;
}