Commit Graph

33541 Commits

Author SHA1 Message Date
Steven Rostedt (VMware)
38ce2a9e33 tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
As trace_array_printk() used with not global instances will not add noise to
the main buffer, they are OK to have in the kernel (unlike trace_printk()).
This require the subsystem to create their own tracing instance, and the
trace_array_printk() only writes into those instances.

Add trace_array_init_printk() to initialize the trace_printk() buffers
without printing out the WARNING message.

Reported-by: Sean Paul <sean@poorly.run>
Reviewed-by: Sean Paul <sean@poorly.run>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-07 17:05:01 -04:00
Muchun Song
10de795a5a kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
Fix compiler warning(as show below) for !CONFIG_KPROBES_ON_FTRACE.

kernel/kprobes.c: In function 'kill_kprobe':
kernel/kprobes.c:1116:33: warning: statement with no effect
[-Wunused-value]
 1116 | #define disarm_kprobe_ftrace(p) (-ENODEV)
      |                                 ^
kernel/kprobes.c:2154:3: note: in expansion of macro
'disarm_kprobe_ftrace'
 2154 |   disarm_kprobe_ftrace(p);

Link: https://lore.kernel.org/r/20200805142136.0331f7ea@canb.auug.org.au
Link: https://lkml.kernel.org/r/20200805172046.19066-1-songmuchun@bytedance.com

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 0cb2f1372b ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-06 09:16:27 -04:00
Steven Rostedt (VMware)
afcab63665 tracing: Use trace_sched_process_free() instead of exit() for pid tracing
On exit, if a process is preempted after the trace_sched_process_exit()
tracepoint but before the process is done exiting, then when it gets
scheduled in, the function tracers will not filter it properly against the
function tracing pid filters.

That is because the function tracing pid filters hooks to the
sched_process_exit() tracepoint to remove the exiting task's pid from the
filter list. Because the filtering happens at the sched_switch tracepoint,
when the exiting task schedules back in to finish up the exit, it will no
longer be in the function pid filtering tables.

This was noticeable in the notrace self tests on a preemptable kernel, as
the tests would fail as it exits and preempted after being taken off the
notrace filter table and on scheduling back in it would not be in the
notrace list, and then the ending of the exit function would trace. The test
detected this and would fail.

Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 1e10486ffe ("ftrace: Add 'function-fork' trace option")
Fixes: c37775d578 ("tracing: Add infrastructure to allow set_event_pid to follow children"
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-04 20:15:07 -04:00
Peng Fan
231621d0c5 tracing/uprobe: Remove dead code in trace_uprobe_register()
In the function trace_uprobe_register(), the statement "return 0;"
out of switch case is dead code, remove it.

Link: https://lkml.kernel.org/r/1595561064-29186-1-git-send-email-fanpeng@loongson.cn

Signed-off-by: Peng Fan <fanpeng@loongson.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 16:16:46 -04:00
Muchun Song
0cb2f1372b kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
We found a case of kernel panic on our server. The stack trace is as
follows(omit some irrelevant information):

  BUG: kernel NULL pointer dereference, address: 0000000000000080
  RIP: 0010:kprobe_ftrace_handler+0x5e/0xe0
  RSP: 0018:ffffb512c6550998 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffff8e9d16eea018 RCX: 0000000000000000
  RDX: ffffffffbe1179c0 RSI: ffffffffc0535564 RDI: ffffffffc0534ec0
  RBP: ffffffffc0534ec1 R08: ffff8e9d1bbb0f00 R09: 0000000000000004
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff8e9d1f797060 R14: 000000000000bacc R15: ffff8e9ce13eca00
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000080 CR3: 00000008453d0005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   ftrace_ops_assist_func+0x56/0xe0
   ftrace_call+0x5/0x34
   tcpa_statistic_send+0x5/0x130 [ttcp_engine]

The tcpa_statistic_send is the function being kprobed. After analysis,
the root cause is that the fourth parameter regs of kprobe_ftrace_handler
is NULL. Why regs is NULL? We use the crash tool to analyze the kdump.

  crash> dis tcpa_statistic_send -r
         <tcpa_statistic_send>: callq 0xffffffffbd8018c0 <ftrace_caller>

The tcpa_statistic_send calls ftrace_caller instead of ftrace_regs_caller.
So it is reasonable that the fourth parameter regs of kprobe_ftrace_handler
is NULL. In theory, we should call the ftrace_regs_caller instead of the
ftrace_caller. After in-depth analysis, we found a reproducible path.

  Writing a simple kernel module which starts a periodic timer. The
  timer's handler is named 'kprobe_test_timer_handler'. The module
  name is kprobe_test.ko.

  1) insmod kprobe_test.ko
  2) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'
  3) echo 0 > /proc/sys/kernel/ftrace_enabled
  4) rmmod kprobe_test
  5) stop step 2) kprobe
  6) insmod kprobe_test.ko
  7) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'

We mark the kprobe as GONE but not disarm the kprobe in the step 4).
The step 5) also do not disarm the kprobe when unregister kprobe. So
we do not remove the ip from the filter. In this case, when the module
loads again in the step 6), we will replace the code to ftrace_caller
via the ftrace_module_enable(). When we register kprobe again, we will
not replace ftrace_caller to ftrace_regs_caller because the ftrace is
disabled in the step 3). So the step 7) will trigger kernel panic. Fix
this problem by disarming the kprobe when the module is going away.

Link: https://lkml.kernel.org/r/20200728064536.24405-1-songmuchun@bytedance.com

Cc: stable@vger.kernel.org
Fixes: ae6aa16fdc ("kprobes: introduce ftrace based optimization")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 16:14:54 -04:00
Josef Bacik
c58b6b0372 ftrace: Fix ftrace_trace_task return value
I was attempting to use pid filtering with function_graph, but it wasn't
allowing anything to make it through.  Turns out ftrace_trace_task
returns false if ftrace_ignore_pid is not-empty, which isn't correct
anymore.  We're now setting it to FTRACE_PID_IGNORE if we need to ignore
that pid, otherwise it's set to the pid (which is weird considering the
name) or to FTRACE_PID_TRACE.  Fix the check to check for !=
FTRACE_PID_IGNORE.  With this we can now use function_graph with pid
filtering.

Link: https://lkml.kernel.org/r/20200725005048.1790-1-josef@toxicpanda.com

Fixes: 717e3f5ebc ("ftrace: Make function trace pid filtering a bit more exact")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 16:12:31 -04:00
Zhaoyang Huang
0f69dae4d1 trace : Have tracing buffer info use kvzalloc instead of kzalloc
High order memory stuff within trace could introduce OOM, use kvzalloc instead.

Please find the bellowing for the call stack we run across in an android system.
The scenario happens when traced_probes is woken up to get a large quantity of
trace even if free memory is even higher than watermark_low. 

traced_probes invoked oom-killer: gfp_mask=0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),  order=2, oom_score_adj=-1

traced_probes cpuset=system-background mems_allowed=0
CPU: 3 PID: 588 Comm: traced_probes Tainted: G        W  O    4.14.181 #1
Hardware name: Generic DT based system
(unwind_backtrace) from [<c010d824>] (show_stack+0x20/0x24)
(show_stack) from [<c0b2e174>] (dump_stack+0xa8/0xec)
(dump_stack) from [<c027d584>] (dump_header+0x9c/0x220)
(dump_header) from [<c027cfe4>] (oom_kill_process+0xc0/0x5c4)
(oom_kill_process) from [<c027cb94>] (out_of_memory+0x220/0x310)
(out_of_memory) from [<c02816bc>] (__alloc_pages_nodemask+0xff8/0x13a4)
(__alloc_pages_nodemask) from [<c02a6a1c>] (kmalloc_order+0x30/0x48)
(kmalloc_order) from [<c02a6a64>] (kmalloc_order_trace+0x30/0x118)
(kmalloc_order_trace) from [<c0223d7c>] (tracing_buffers_open+0x50/0xfc)
(tracing_buffers_open) from [<c02e6f58>] (do_dentry_open+0x278/0x34c)
(do_dentry_open) from [<c02e70d0>] (vfs_open+0x50/0x70)
(vfs_open) from [<c02f7c24>] (path_openat+0x5fc/0x169c)
(path_openat) from [<c02f75c4>] (do_filp_open+0x94/0xf8)
(do_filp_open) from [<c02e7650>] (do_sys_open+0x168/0x26c)
(do_sys_open) from [<c02e77bc>] (SyS_openat+0x34/0x38)
(SyS_openat) from [<c0108bc0>] (ret_fast_syscall+0x0/0x28)

Link: https://lkml.kernel.org/r/1596155265-32365-1-git-send-email-zhaoyang.huang@unisoc.com

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 11:52:20 -04:00
Vincent Whitchurch
ee896ee805 tracing: Remove outdated comment in stack handling
This comment describes the behaviour before commit 2a820bf749
("tracing: Use percpu stack trace buffer more intelligently").  Since
that commit, interrupts and NMIs do use the per-cpu stacks so the
comment is no longer correct.  Remove it.

(Note that the FTRACE_STACK_SIZE mentioned in the comment has never
existed, it probably should have said FTRACE_STACK_ENTRIES.)

Link: https://lkml.kernel.org/r/20200727092840.18659-1-vincent.whitchurch@axis.com

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 22:54:50 -04:00
Chengming Zhou
c5f51572a7 ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
When inserting a module, we find all ftrace_ops referencing it on the
ftrace_ops_list. But FTRACE_OPS_FL_DIRECT and FTRACE_OPS_FL_IPMODIFY
flags are special, and should not be set automatically. So warn and
skip ftrace_ops that have these two flags set and adding new code.
Also check if only one ftrace_ops references the module, in which case
we can use a trampoline as an optimization.

Link: https://lkml.kernel.org/r/20200728180554.65203-2-zhouchengming@bytedance.com

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 22:45:31 -04:00
Chengming Zhou
8a224ffb3f ftrace: Setup correct FTRACE_FL_REGS flags for module
When module loaded and enabled, we will use __ftrace_replace_code
for module if any ftrace_ops referenced it found. But we will get
wrong ftrace_addr for module rec in ftrace_get_addr_new, because
rec->flags has not been setup correctly. It can cause the callback
function of a ftrace_ops has FTRACE_OPS_FL_SAVE_REGS to be called
with pt_regs set to NULL.
So setup correct FTRACE_FL_REGS flags for rec when we call
referenced_filters to find ftrace_ops references it.

Link: https://lkml.kernel.org/r/20200728180554.65203-1-zhouchengming@bytedance.com

Cc: stable@vger.kernel.org
Fixes: 8c4f3c3fa9 ("ftrace: Check module functions being traced on reload")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 19:35:19 -04:00
Kevin Hao
96b4833b68 tracing/hwlat: Honor the tracing_cpumask
In calculation of the cpu mask for the hwlat kernel thread, the wrong
cpu mask is used instead of the tracing_cpumask, this causes the
tracing/tracing_cpumask useless for hwlat tracer. Fixes it.

Link: https://lkml.kernel.org/r/20200730082318.42584-2-haokexin@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 19:35:04 -04:00
Kevin Hao
a9d0ba6772 tracing/hwlat: Drop the duplicate assignment in start_kthread()
We have set 'current_mask' to '&save_cpumask' in its declaration,
so there is no need to assign again.

Link: https://lkml.kernel.org/r/20200730082318.42584-1-haokexin@gmail.com

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 19:35:04 -04:00
Wei Yang
36b8aacf2a tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
Static defined trace_event->type stops at (__TRACE_LAST_TYPE - 1) and
dynamic trace_event->type starts from (__TRACE_LAST_TYPE + 1).

To save one trace_event->type index, let's use __TRACE_LAST_TYPE.

Link: https://lkml.kernel.org/r/20200703020612.12930-3-richard.weiyang@linux.alibaba.com

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-09 18:14:58 -04:00
Wei Yang
746cf3459f tracing: Simplify defining of the next event id
The value to be used and compared in trace_search_list() is "last + 1".
Let's just define next to be "last + 1" instead of doing the addition
each time.

Link: https://lkml.kernel.org/r/20200703020612.12930-2-richard.weiyang@linux.alibaba.com

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-09 18:00:47 -04:00
Steven Rostedt (VMware)
29ce24519c ring-buffer: Do not trigger a WARN if clock going backwards is detected
After tweaking the ring buffer to be a bit faster, a warning is triggering
on one of my machines, and causing my tests to fail. This warning is caused
when the delta (current time stamp minus previous time stamp), is larger
than the max time held by the ring buffer (59 bits).

If the clock were to go backwards slightly, this would then easily trigger
this warning. The machine that it triggered on, the clock did go backwards
by around 450 nanoseconds, and this happened after a recalibration of the
TSC clock. Now that the ring buffer is faster, it detects this, and the
delta that is used larger than the max, the warning is triggered and my test
fails.

To handle the clock going backwards, look at the saved before and after time
stamps. If they are the same, it means that the current event did not
interrupt another event, and that those timestamp are of a previous event
that was recorded. If the max delta is triggered, look at those time stamps,
make sure they are the same, then use them to compare with the current
timestamp. If the current timestamp is less than the before/after time
stamps, then that means the clock being used went backward.

Print out a message that this has happened, but do not warn about it (and
only print the message once).

Still do the warning if the delta is indeed larger than what can be used.

Also remove the unneeded KERN_WARNING from the WARN_ONCE() print.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)
bbeba3e58f ring-buffer: Call trace_clock_local() directly for RETPOLINE kernels
After doing some benchmarks and examining the code, I found that the ring
buffer clock calls were quite expensive, and noticed that it uses
retpolines. This is because the ring buffer clock is programmable, and can
be set. But in most cases it simply uses the fastest ns unit clock which is
the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer
clock is set to trace_clock_local() and then calling it directly has brought
the time of an event on my i7 box from an average of 93 nanoseconds an event
down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to
68 nanoseconds!

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)
74e879373b ring-buffer: Move the add_timestamp into its own function
Make a helper function rb_add_timestamp() that moves the adding of the
extended time stamps into its own function. Also, remove the noinline and
inline for the functions it calls, as recent benchmarks appear they do not
make a difference (just let gcc decide).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:06 -04:00
Steven Rostedt (VMware)
58fbc3c632 ring-buffer: Consolidate add_timestamp to remove some branches
Reorganize a little the logic to handle adding the absolute time stamp,
extended and forced time stamps, in such a way to remove a branch or two.
This is just a micro optimization.

Also add before and after time stamps to the rb_event_info structure to
display those values in the rb_check_timestamps() code, if something were to
go wrong.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:11:22 -04:00
Steven Rostedt (VMware)
75b21c6dfa ring-buffer: Mark the !tail (crossing a page) as unlikely
It is the uncommon case where an event crosses a sub buffer boundary (page)
mark that check at the end of reserving an event as unlikely.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:56 -04:00
Nicholas Piggin
b23d7a5f4a ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU
On a 144 thread system, `perf ftrace` takes about 20 seconds to start
up, due to calling synchronize_rcu() for each CPU.

  cat /proc/108560/stack
    0xc0003e7eb336f470
    __switch_to+0x2e0/0x480
    __wait_rcu_gp+0x20c/0x220
    synchronize_rcu+0x9c/0xc0
    ring_buffer_reset_cpu+0x88/0x2e0
    tracing_reset_online_cpus+0x84/0xe0
    tracing_open+0x1d4/0x1f0

On a system with 10x more threads, it starts to become an annoyance.

Batch these up so we disable all the per-cpu buffers first, then
synchronize_rcu() once, then reset each of the buffers. This brings
the time down to about 0.5s.

Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.com

Tested-by: Anton Blanchard <anton@ozlabs.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:56 -04:00
Steven Rostedt (VMware)
10464b4aa6 ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit
After a discussion with the new time algorithm to have nested events still
have proper time keeping but required using local64_t atomic operations.
Mathieu was concerned about the performance this would have on 32 bit
machines, as in most cases, atomic 64 bit operations on them can be
expensive.

As the ring buffer's timing needs do not require full features of local64_t,
a wrapper is made to implement a new rb_time_t operation that uses two longs
on 32 bit machines but still uses the local64_t operations on 64 bit
machines. There's a switch that can be made in the file to force 64 bit to
use the 32 bit version just for testing purposes.

All reads do not need to succeed if a read happened while the stamp being
read is in the process of being updated. The requirement is that all reads
must succed that were done by an interrupting event (where this event was
interrupted by another event that did the write). Or if the event itself did
the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will
always succeed (even if it gets interrupted by another event that writes to
t. The result of the read will be either the previous set, or a set
performed by an interrupting event.

If the read is done by an event that interrupted another event that was in
the process of setting the time stamp, and no other event came along to
write to that time stamp, it will fail and the rb_time_read() will return
that it failed (the value to read will be undefined).

A set will always write to the time stamp and return with a valid time
stamp, such that any read after it will be valid.

A cmpxchg may fail if it interrupted an event that was in the process of
updating the time stamp just like the reads do. Other than that, it will act
like a normal cmpxchg.

The way this works is that the rb_time_t is made of of three fields. A cnt,
that gets updated atomically everyting a modification is made. A top that
represents the most significant 30 bits of the time, and a bottom to
represent the least significant 30 bits of the time. Notice, that the time
values is only 60 bits long (where the ring buffer only uses 59 bits, which
gives us 18 years of nanoseconds!).

The top two bits of both the top and bottom is a 2 bit counter that gets set
by the value of the least two significant bits of the cnt. A read of the top
and the bottom where both the top and bottom have the same most significant
top 2 bits, are considered a match and a valid 60 bit number can be created
from it. If they do not match, then the number is considered invalid, and
this must only happen if an event interrupted another event in the midst of
updating the time stamp.

This is only used for 32 bits machines as 64 bit machines can get better
performance out of the local64_t. This has been tested heavily by forcing 64
bit to use this logic.

Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home
Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.org

Inspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:51 -04:00
Steven Rostedt (VMware)
7c4b4a5164 ring-buffer: Incorporate absolute timestamp into add_timestamp logic
Instead of calling out the absolute test for each time to check if the
ring buffer wants absolute time stamps for all its recording, incorporate it
with the add_timestamp field and turn it into flags for faster processing
between wanting a absolute tag and needing to force one.

Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 16:16:14 -04:00
Steven Rostedt (VMware)
a389d86f7f ring-buffer: Have nested events still record running time stamp
Up until now, if an event is interrupted while it is recorded by an
interrupt, and that interrupt records events, the time of those events will
all be the same. This is because events only record the delta of the time
since the previous event (or beginning of a page), and to handle updating
the time keeping for that of nested events is extremely racy. After years of
thinking about this and several failed attempts, I finally have a solution
to solve this puzzle.

The problem is that you need to atomically calculate the delta and then
update the time stamp you made the delta from, as well as then record it
into the buffer, all this while at any time an interrupt can come in and
do the same thing. This is easy to solve with heavy weight atomics, but that
would be detrimental to the performance of the ring buffer. The current
state of affairs sacrificed the time deltas for nested events for
performance.

The reason for previous failed attempts at solving this puzzle was because I
was trying to completely avoid slow atomic operations like cmpxchg. I final
came to the conclusion to always avoid cmpxchg is not possible, which is why
those previous attempts always failed. But it is possible to pick one path
(the most common case) and avoid cmpxchg in that path, which is the "fast
path". The most common case is that an event will not be interrupted and
have other events added into it. An event can detect if it has
interrupted another event, and for these cases we can make it the slow
path and use the heavy operations like cmpxchg.

One more player was added to the game that made this possible, and that is
the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59
bit time stamp. (Of course this breaks if a machine is running for more than
18 years without a reboot!).

There's barrier() placements around for being paranoid, even when they
are not needed because of other atomic functions near by. But those
should not hurt, as if they are not needed, they basically become a nop.

Note, this also makes the race window much smaller, which means there
are less slow paths to slow down the performance.

The basic idea is that there's two main paths taken.

 1) Not being interrupted between time stamps and reserving buffer space.
    In this case, the time stamps taken are true to the location in the
    buffer.

 2) Was interrupted by another path between taking time stamps and reserving
    buffer space.

The objective is to know what the delta is from the last reserved location
in the buffer.

As it is possible to detect if an event is interrupting another event before
reserving data, space is added to the length to be reserved to inject a full
time stamp along with the event being reserved.

When an event is not interrupted, the write stamp is always the time of the
last event written to the buffer.

In path 1, there's two sub paths we care about:

 a) The event did not interrupt another event.
 b) The event interrupted another event.

In case a, as the write stamp was read and known to be correct, the delta
between the current time stamp and the write stamp is the delta between the
current event and the previously recorded event.

In case b, extra space was reserved to just put the full time stamp into the
buffer. Which is done, as stated, in this path the time stamp taken is known
to match the location in the buffer.

In path 2, there's also two sub paths we care about:

 a) The event was not interrupted by another event since it reserved space
    on the buffer and re-reading the write stamp.
 b) The event was interrupted by another event.

In case a, the write stamp is that of the last event that interrupted this
event between taking the time stamps and reserving. As no event came in
after re-reading the write stamp, that event is known to be the time of the
event directly before this event and the delta can be the new time stamp and
the write stamp.

In case b, one or more events came in between reserving the event and
re-reading he write stamp. Since this event's buffer reservation is between
other events at this path, there's no way to know what the delta is. But
because an event interrupted this event after it started, its fine to just
give a zero delta, and take the same time stamp as the events that happened
within the event being recorded.

Here's the implementation of the design of this solution:

 All this is per cpu, and only needs to worry about nested events (not
 parallel events).

The players:

 write_tail: The index in the buffer where new events can be written to.
     It is incremented via local_add() to reserve space for a new event.

 before_stamp: A time stamp set by all events before reserving space.

 write_stamp: A time stamp updated by events after it has successfully
     reserved space.

	/* Save the current position of write */
 [A]	w = local_read(write_tail);
	barrier();
	/* Read both before and write stamps before touching anything */
	before = local_read(before_stamp);
	after = local_read(write_stamp);
	barrier();

	/*
	 * If before and after are the same, then this event is not
	 * interrupting a time update. If it is, then reserve space for adding
	 * a full time stamp (this can turn into a time extend which is
	 * just an extended time delta but fill up the extra space).
	 */
	if (after != before)
		abs = true;

	ts = clock();

	/* Now update the before_stamp (everyone does this!) */
 [B]	local_set(before_stamp, ts);

	/* Now reserve space on the buffer */
 [C]	write = local_add_return(len, write_tail);

	/* Set tail to be were this event's data is */
	tail = write - len;

 	if (w == tail) {

		/* Nothing interrupted this between A and C */
 [D]		local_set(write_stamp, ts);
		barrier();
 [E]		save_before = local_read(before_stamp);

 		if (!abs) {
			/* This did not interrupt a time update */
			delta = ts - after;
		} else {
			delta = ts; /* The full time stamp will be in use */
		}
		if (ts != save_before) {
			/* slow path - Was interrupted between C and E */
			/* The update to write_stamp could have overwritten the update to
			 * it by the interrupting event, but before and after should be
			 * the same for all completed top events */
			after = local_read(write_stamp);
			if (save_before > after)
				local_cmpxchg(write_stamp, after, save_before);
		}
	} else {
		/* slow path - Interrupted between A and C */

		after = local_read(write_stamp);
		temp_ts = clock();
		barrier();
 [F]		if (write == local_read(write_tail) && after < temp_ts) {
			/* This was not interrupted since C and F
			 * The last write_stamp is still valid for the previous event
			 * in the buffer. */
			delta = temp_ts - after;
			/* OK to keep this new time stamp */
			ts = temp_ts;
		} else {
			/* Interrupted between C and F
			 * Well, there's no use to try to know what the time stamp
			 * is for the previous event. Just set delta to zero and
			 * be the same time as that event that interrupted us before
			 * the reservation of the buffer. */

			delta = 0;
		}
		/* No need to use full timestamps here */
		abs = 0;
	}

Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home
Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org
Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 14:29:33 -04:00
Steven Rostedt (VMware)
7ef282e051 tracing: Move pipe reference to trace array instead of current_tracer
If a process has the trace_pipe open on a trace_array, the current tracer
for that trace array should not be changed. This was original enforced by a
global lock, but when instances were introduced, it was moved to the
current_trace. But this structure is shared by all instances, and a
trace_pipe is for a single instance. There's no reason that a process that
has trace_pipe open on one instance should prevent another instance from
changing its current tracer. Move the reference counter to the trace_array
instead.

This is marked as "Fixes" but is more of a clean up than a true fix.
Backport if you want, but its not critical.

Fixes: cf6ab6d914 ("tracing: Add ref count to tracer for when they are being read by pipe")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 14:29:33 -04:00
Steven Rostedt (VMware)
5da7cd11d0 x86/ftrace: Only have the builtin ftrace_regs_caller call direct hooks
If a direct hook is attached to a function that ftrace also has a function
attached to it, then it is required that the ftrace_ops_list_func() is used
to iterate over the registered ftrace callbacks. This will also include the
direct ftrace_ops helper, that tells ftrace_regs_caller where to return to
(the direct callback and not the function that called it).

As this direct helper is only to handle the case of ftrace callbacks
attached to the same function as the direct callback, the ftrace callback
allocated trampolines (used to only call them), should never be used to
return back to a direct callback.

Only copy the portion of the ftrace_regs_caller that will return back to
what called it, and not the portion that returns back to the direct caller.

The direct ftrace_ops must then pick the ftrace_regs_caller builtin function
as its own trampoline to ensure that it will never have one allocated for
it (which would not include the handling of direct callbacks).

Link: http://lkml.kernel.org/r/20200422162750.495903799@goodmis.org

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29 11:42:47 -04:00
Steven Rostedt (VMware)
c791cc4b1f tracing: Only allow trace_array_printk() to be used by instances
To prevent default "trace_printks()" from spamming the top level tracing
ring buffer, only allow trace instances to use trace_array_printk() (which
can be used without the trace_printk() start up warning).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29 09:01:02 -04:00
Linus Torvalds
91a9a90d04 Peter Zijlstra says:
The most anticipated fix in this pull request is probably the horrible build
 fix for the RANDSTRUCT fail that didn't make -rc2. Also included is the cleanup
 that removes those BUILD_BUG_ON()s and replaces it with ugly unions.
 
 Also included is the try_to_wake_up() race fix that was first triggered by
 Paul's RCU-torture runs, but was independently hit by Dave Chinner's fstest
 runs as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl74tMMACgkQEsHwGGHe
 VUqpAxAAnAiwPetkmCUn53wmv10oGC/vbnxprvNzoIANo9IFJYwKLYuRviT4r4KW
 0tEmpWtsy0CkVdCTpx4yXYUqtGswbjAvxSuwk8vR3bdtottMNJ77PPBKrywL3ymZ
 uQ0tpB/W9CFTOjKx4U/OyaK2Gf4mYzvuJSqhhTbopGf4H9SWflhepLZf0C4rhYa5
 tywch3etazAcNpq+dm31jKIVUkwULyJ4mXH2VDXo+jjl1A5g6h2UliS03e1/BChD
 hX78NRv7ezySdVVpLFhLVKCRdFFj6wIbLsx0yIQjw83dYhmDHK9iqN7m9+p4pZOr
 4qz/+eRYv+zZwWZP8IqOIAE4la1S/LToKEyxAehwl2sfIjhUXx68PvM/feWr8yfd
 z2CHEsI3Dn5XfM8FdPSA+JHE9IHwUyHrDRxcVGU7Nj/9s4L2DfxdrPl6qKGA3Tzm
 F7rK4vR5MNB8Sr7bzcCWV9FOsMNcXh2WThpZcsjfCUgwJza45N3HfocsXO5m4ShC
 FQ8RjE46Msd1WgIoslAkgQT7rFohe/sUKs5xVj4SwT/5i6lz55IGYmiV+hErrxU4
 ArSzUeOys/0EwzJX8PvxiElMq3btFW2XYV65XX5dIABt9IxgRvxHcUGPJDNvQKP7
 WdKVxRIzVXcfRiKUI05vLZU6yzfJuoAjvI1kyTYo64QIbeM7H6g=
 =EGOe
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:
 "The most anticipated fix in this pull request is probably the horrible
  build fix for the RANDSTRUCT fail that didn't make -rc2. Also included
  is the cleanup that removes those BUILD_BUG_ON()s and replaces it with
  ugly unions.

  Also included is the try_to_wake_up() race fix that was first
  triggered by Paul's RCU-torture runs, but was independently hit by
  Dave Chinner's fstest runs as well"

* tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cfs: change initial value of runnable_avg
  smp, irq_work: Continue smp_call_function*() and irq_work*() integration
  sched/core: s/WF_ON_RQ/WQ_ON_CPU/
  sched/core: Fix ttwu() race
  sched/core: Fix PI boosting between RT and DEADLINE tasks
  sched/deadline: Initialize ->dl_boosted
  sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption
  sched/core: Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build fail
2020-06-28 10:37:39 -07:00
Linus Torvalds
c141b30e99 Paul E. McKenney says:
A single commit that uses "arch_" atomic operations to avoid the
 instrumentation that comes with the non-"arch_" versions. In preparation
 for that commit, it also has another commit that makes these "arch_"
 atomic operations available to generic code.
 
 Without these commits, KCSAN uses can see pointless errors.
 
 Both from Peter Zijlstra.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl74qe4ACgkQEsHwGGHe
 VUruxhAApxHnsIX4IFm4cBaSMMsXCGpifM3EOd3S1PPqxWQxLfrDpc/SgW4hvJja
 y144m/HQVvHkO8DAqWaC5lNmILjZhZeR1ToRrtqsFVzedlORaXgFJQzojjOBBCWi
 kwtrqVDb4dw+RBQdj6hrknnsivdAlDVFHYCxQuBpNQ/NN4M9l0nwxPRVpTdcFtw0
 Yv6ttpDeo8/XJ12OwiFINWnQT7F1n6CoyvdH+zQayvP+2qK8sq3sYVN4DiTC2Jyk
 9YpnR9ubl4jGz78+l2IrhhHw0zcHutGy2OVMXMYYvqZVzcp7QCpXFCP7MY00R6Br
 1eyxzMJX3j9rxDcreNTFZQFqQsCSfla3SMJIHFT1PHiw2O1ZVXp4EUaHb6eCy/nb
 IMgRd37mRCQovE267+LmDMNovSbRXGFu/qhu7QPaKQizqfYTbAzGULbttHJr6P7i
 ciQRG6ZfpbqflsezlijmhDTXI/oK/prn5apo8g6IVAxVBINzpu01+xszpuOKdCg0
 CGliJRShIXwPCAPacq0aFtauRt3RVpbEWOXj3GZU4yof/8wnHOAPZ0/HmFeKmO+4
 BIaa7QASvYUfczVv/Fi0FKdU6c0jQGDCUxVi1XJpxNG0XSiayGEPyN4Y0wDNHuWg
 H+9MPAUhGoyDoMPRBjSKIVzNF7bLJ8VMe3GUBrFcJY+BVLhfXUE=
 =amVp
 -----END PGP SIGNATURE-----

Merge tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU-vs-KCSAN fixes from Borislav Petkov:
 "A single commit that uses "arch_" atomic operations to avoid the
  instrumentation that comes with the non-"arch_" versions.

  In preparation for that commit, it also has another commit that makes
  these "arch_" atomic operations available to generic code.

  Without these commits, KCSAN uses can see pointless errors"

* tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Fixup noinstr warnings
  locking/atomics: Provide the arch_atomic_ interface to generic code
2020-06-28 10:29:38 -07:00
Vincent Guittot
e21cf43406 sched/cfs: change initial value of runnable_avg
Some performance regression on reaim benchmark have been raised with
  commit 070f5e860e ("sched/fair: Take into account runnable_avg to classify group")

The problem comes from the init value of runnable_avg which is initialized
with max value. This can be a problem if the newly forked task is finally
a short task because the group of CPUs is wrongly set to overloaded and
tasks are pulled less agressively.

Set initial value of runnable_avg equals to util_avg to reflect that there
is no waiting time so far.

Fixes: 070f5e860e ("sched/fair: Take into account runnable_avg to classify group")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200624154422.29166-1-vincent.guittot@linaro.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra
8c4890d1c3 smp, irq_work: Continue smp_call_function*() and irq_work*() integration
Instead of relying on BUG_ON() to ensure the various data structures
line up, use a bunch of horrible unions to make it all automatic.

Much of the union magic is to ensure irq_work and smp_call_function do
not (yet) see the members of their respective data structures change
name.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20200622100825.844455025@infradead.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra
739f70b476 sched/core: s/WF_ON_RQ/WQ_ON_CPU/
Use a better name for this poorly named flag, to avoid confusion...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200622100825.785115830@infradead.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra
b6e13e8582 sched/core: Fix ttwu() race
Paul reported rcutorture occasionally hitting a NULL deref:

  sched_ttwu_pending()
    ttwu_do_wakeup()
      check_preempt_curr() := check_preempt_wakeup()
        find_matching_se()
          is_same_group()
            if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

  2ebb177175 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1, it really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existant
parents and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):

					X->cpu = 1
					rq(1)->curr = X

	CPU0				CPU1				CPU2

					// switch away from X
					LOCK rq(1)->lock
					smp_mb__after_spinlock
					dequeue_task(X)
					  X->on_rq = 9
					switch_to(Z)
					  X->on_cpu = 0
					UNLOCK rq(1)->lock

									// migrate X to cpu 0
									LOCK rq(1)->lock
									dequeue_task(X)
									set_task_cpu(X, 0)
									  X->cpu = 0
									UNLOCK rq(1)->lock

									LOCK rq(0)->lock
									enqueue_task(X)
									  X->on_rq = 1
									UNLOCK rq(0)->lock

	// switch to X
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	switch_to(X)
	  X->on_cpu = 1
	UNLOCK rq(0)->lock

	// X goes sleep
	X->state = TASK_UNINTERRUPTIBLE
	smp_mb();			// wake X
					ttwu()
					  LOCK X->pi_lock
					  smp_mb__after_spinlock

					  if (p->state)

					  cpu = X->cpu; // =? 1

					  smp_rmb()

	// X calls schedule()
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	dequeue_task(X)
	  X->on_rq = 0

					  if (p->on_rq)

					  smp_rmb();

					  if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

					  smp_cond_load_acquire(&p->on_cpu, !VAL)

					  cpu = select_task_rq(X, X->wake_cpu, ...)
					  if (X->cpu != cpu)
	switch_to(Y)
	  X->on_cpu = 0
	UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200622125649.GC576871@hirez.programming.kicks-ass.net
2020-06-28 17:01:20 +02:00
Juri Lelli
740797ce3a sched/core: Fix PI boosting between RT and DEADLINE tasks
syzbot reported the following warning:

 WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504

At deadline.c:628 we have:

 623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 624 {
 625 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 626 	struct rq *rq = rq_of_dl_rq(dl_rq);
 627
 628 	WARN_ON(dl_se->dl_boosted);
 629 	WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
        [...]
     }

Which means that setup_new_dl_entity() has been called on a task
currently boosted. This shouldn't happen though, as setup_new_dl_entity()
is only called when the 'dynamic' deadline of the new entity
is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
condition.

Digging through the PI code I noticed that what above might in fact happen
if an RT tasks blocks on an rt_mutex hold by a DEADLINE task. In the
first branch of boosting conditions we check only if a pi_task 'dynamic'
deadline is earlier than mutex holder's and in this case we set mutex
holder to be dl_boosted. However, since RT 'dynamic' deadlines are only
initialized if such tasks get boosted at some point (or if they become
DEADLINE of course), in general RT 'dynamic' deadlines are usually equal
to 0 and this verifies the aforementioned condition.

Fix it by checking that the potential donor task is actually (even if
temporary because in turn boosted) running at DEADLINE priority before
using its 'dynamic' deadline value.

Fixes: 2d3d891d33 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
2020-06-28 17:01:20 +02:00
Juri Lelli
ce9bc3b27f sched/deadline: Initialize ->dl_boosted
syzbot reported the following warning triggered via SYSC_sched_setattr():

  WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 setup_new_dl_entity /kernel/sched/deadline.c:594 [inline]
  WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_dl_entity /kernel/sched/deadline.c:1370 [inline]
  WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_task_dl+0x1c17/0x2ba0 /kernel/sched/deadline.c:1441

This happens because the ->dl_boosted flag is currently not initialized by
__dl_clear_params() (unlike the other flags) and setup_new_dl_entity()
rightfully complains about it.

Initialize dl_boosted to 0.

Fixes: 2d3d891d33 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+5ac8bac25f95e8b221e7@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20200617072919.818409-1-juri.lelli@redhat.com
2020-06-28 17:01:20 +02:00
Scott Wood
fd844ba9ae sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption
This function is concerned with the long-term CPU mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate-disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the CPU that the task was running on, then the mask update
would be lost.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmpin7@linutronix.de
2020-06-28 17:01:20 +02:00
Linus Torvalds
f05baa066d dma-mapping fixes for 5.8:
- fix dma coherent mmap in nommu (me)
  - more AMD SEV fallout (David Rientjes, me)
  - fix alignment in dma_common_*_remap (Eric Auger)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl72+VsLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMVaw//VgQbKUfTsuCZt+ZZqIY5nd6YajexoC+X051yC7/8
 YtdGqAa2RuutoHwUhTcqzvrSsCqthNCeeZ3yBUS/SQwyoQy3szrEwNXnRboNdwgq
 xebuTOra3MIRSWJzFHL+PNQjkaGSoQroSJHEeVZOUdYchE+sNh/pZxQoPU8ImcOe
 iVB+6nDJga+CpbKVi6oaGs8EISHtYkt1yHOeAhTxlqPkmP1tvsOZFgvMQBPCq4Rz
 QlqcVilDb0fPl2pnLy1LTbgAC8yPs7phrf9KBVUqCptfTLAv1nkwI9WpX8zFmkDo
 KapepEr9bkAHcq+gNcUOSiKr3K1bMF41numZ5zi6PnEJ/bHsPEotzwf05GrKY0Ci
 vMNpWL5QIcaMECe8Q8jrelgoDK0614vp8k7U+1CXmgpyF3lf5+zXwJyYLSgcf2PI
 2ryJnnib3jYORe80VVHc76CpX5Z5Ez6IaaDP/3rNsexLW/Ip3mhwqUDEYNCvMN+P
 qYJ8GrmqGAbMrhifvxVRL0ur73kIKE2s4l7xznd7p0Nj6ToAdMYnmrKUZEhMTPD9
 UcpzK9omgT51qAsByEggT97eDYzQSqYfh0OxAUJwML/8AXa7nJVdFo9ipHCVal6x
 tEuWpAMBe9YRBDaPUgu3vf8VNagv7YCzJmLnPFS7KvYJ0siw5r6ZxdXfkE2cG9o2
 DyI=
 =qAJQ
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.8-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix dma coherent mmap in nommu (me)

 - more AMD SEV fallout (David Rientjes, me)

 - fix alignment in dma_common_*_remap (Eric Auger)

* tag 'dma-mapping-5.8-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-remap: align the size in dma_common_*_remap()
  dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR
  dma-direct: add missing set_memory_decrypted() for coherent mapping
  dma-direct: check return value when encrypting or decrypting memory
  dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails
  dma-direct: always align allocation size in dma_direct_alloc_pages()
  dma-direct: mark __dma_direct_alloc_pages static
  dma-direct: re-enable mmap for !CONFIG_MMU
2020-06-27 13:06:22 -07:00
Linus Torvalds
6116dea80d kgdb patches for 5.8-rc3
The main change here is a fix for a number of unsafe interactions
 between kdb and the console system. The fixes are specific to kdb (pure
 kgdb debugging does not use the console system at all). On systems with
 an NMI then kdb, if it is enabled, must get messages to the user despite
 potentially running from some "difficult" calling contexts. These fixes
 avoid using the console system where we have been provided an
 alternative (safer) way to interact with the user and, if using the
 console system in unavoidable, use oops_in_progress for deadlock
 avoidance. These fixes also ensure kdb honours the console enable flag.
 
 Also included is a fix that wraps kgdb trap handling in an RCU read lock
 to avoids triggering diagnostic warnings. This is a wide lock scope but
 this is OK because kgdb is a stop-the-world debugger. When we stop the
 world we put all the CPUs into holding pens and this inhibits RCU update
 anyway.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl72C5AACgkQfOMlXTn3
 iKFfmA//SCJU7zJsrKTsr6+HJY+gIuwHm70aGCNIr3EjBgTZQHQYflG6msmMHTAX
 d4qnGSkfKzC8jYJrHPpX4eU3bnqYci6GnaT/N5p9YkTGHun+kYYTz3wLzZiWxKRg
 iE4QLEwjU/dGAYyRz0CKCTRNTLTG+R79HWLL2Wi5OQiNhYiPuFAgS/NSUjpnJIuf
 fmj8jSPP/7T/m0cEUWXbLwTfolEZLIa1heqtaJq4fAftPsAk5a5TZ0NugaxUPoo4
 YS06eASIZoVcDQiehVy+gH05FyEjJGXnkFtTkAoRL/yOERKLy0WMzFZAAh6NT4St
 16Hx3Nnw+7ds7Iq8jEIpM/XJo1d3haYvAQdzy6HakAOwp7vrD/CjF45wwju78woY
 Jq54Vjvaxjaw1vlJCVrAAjdj3bAHdufBeWrBGmYO8F1HSn9eNeLS7wWbq6lEhxNd
 ObXRUFwebzYpOT6DI2TdnDg/2+xAn2oXpzk4UK9I/Vbxew8R4lOPQm4vC0V3CTME
 cHXFGV3ncjXlVRKdMAmnYcN7pMY4NCdX5vGqC/djQRwKRV1Ve8jwUCFVKRAd4zio
 wHpCFziwSaz9giZJ5I831EKsvSj9DVoPPJFgoEXIzIWF3OS0qzP6UqO2HwJNbA+e
 W4laVRzdBcMuVVa+7XWYzdAhof0hNX0Ov78dyDMcX1MkOS02O7o=
 =ovnT
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb fixes from Daniel Thompson:
 "The main change here is a fix for a number of unsafe interactions
  between kdb and the console system. The fixes are specific to kdb
  (pure kgdb debugging does not use the console system at all). On
  systems with an NMI then kdb, if it is enabled, must get messages to
  the user despite potentially running from some "difficult" calling
  contexts. These fixes avoid using the console system where we have
  been provided an alternative (safer) way to interact with the user
  and, if using the console system in unavoidable, use oops_in_progress
  for deadlock avoidance. These fixes also ensure kdb honours the
  console enable flag.

  Also included is a fix that wraps kgdb trap handling in an RCU read
  lock to avoids triggering diagnostic warnings. This is a wide lock
  scope but this is OK because kgdb is a stop-the-world debugger. When
  we stop the world we put all the CPUs into holding pens and this
  inhibits RCU update anyway"

* tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: Avoid suspicious RCU usage warning
  kdb: Switch to use safer dbg_io_ops over console APIs
  kdb: Make kdb_printf() console handling more robust
  kdb: Check status of console prior to invoking handlers
  kdb: Re-factor kdb_printf() message write code
2020-06-27 08:53:49 -07:00
Linus Torvalds
ed3e00e7d6 Power management fixes for 5.8-rc3
- Make sure that the _TIF_POLLING_NRFLAG is clear before entering
    the last phase of suspend-to-idle to avoid wakeup issues on some
    x86 systems (Chen Yu, Rafael Wysocki).
 
  - Cover one more case in which the intel_pstate driver should let
    the platform firmware control the CPU frequency and refuse to
    load (Srinivas Pandruvada).
 
  - Add __init annotations to 2 functions in the power management
    core (Christophe JAILLET).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl72FGESHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxt44QAKk+ojYFCLVHz+mwniaNBRROegZrDPFS
 u9QiH4Qth5QdG6TFXJF2nhcZmkvyh/W3AXFj8px1XWanBfG8nhPb0c+wRHGIbACV
 /R7ykcQuj/h7ZlCjqevFAaUbCSw4Lt5oldIk+YyVUGrZH+uulyNOABm+cxFT6XaD
 KKnJ4hbbgnriQaMxmee+jYNphDsaTBhWWCkGj/V5x4DEyvWihkjf9skDEJ4/aP4O
 09Ug8FB1P0AxMTdaoKNfrIae27oPAb74HQOe92UbOMA/Q/k1snRr7vxGVak2/VyH
 /KHxCElM+F5F3kM6LMp4C9jtVMrcJq+1ABUmh0z2jBAw5MxGcdvjKihw6+W8Z9gC
 j7r4gDvIkXDGHlLSaWQC8otARs8gMmGdZJu+Tl8yzo7ZK7InFMAsNMBH/iOPzycQ
 9UMOpIkbDrOP2zeukULAFIuh1ow/dQopnY0Hyi0Gfezb1lTXe3rFzjPAnTRantyo
 z+c+o01pkbiiP+fBG3aUYjzkUv3wpm3cyBAlzr17wIUs2cfIo7WGMQuGQ8ZqoL0T
 QMpY5OxtzJJlvTvn9UF3oxlYbNFGq68fNnrdQ4u1zHm+K1P2mh2Dko6viHrceOyA
 DOMHuUP1Yrc7sneqGYkNbfhAfjdGeS4FTDJOpx9fgasTG1IYmk+hLOuKTi8Vttjj
 Fx6nxbo1sECm
 =Reud
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a recent regression that broke suspend-to-idle on some x86
  systems, fix the intel_pstate driver to correctly let the platform
  firmware control CPU performance in some cases and add __init
  annotations to a couple of functions.

  Specifics:

   - Make sure that the _TIF_POLLING_NRFLAG is clear before entering the
     last phase of suspend-to-idle to avoid wakeup issues on some x86
     systems (Chen Yu, Rafael Wysocki).

   - Cover one more case in which the intel_pstate driver should let the
     platform firmware control the CPU frequency and refuse to load
     (Srinivas Pandruvada).

   - Add __init annotations to 2 functions in the power management core
     (Christophe JAILLET)"

* tag 'pm-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle: Rearrange s2idle-specific idle state entry code
  PM: sleep: core: mark 2 functions as __init to save some memory
  cpufreq: intel_pstate: Add one more OOB control bit
  PM: s2idle: Clear _TIF_POLLING_NRFLAG before suspend to idle
2020-06-26 12:32:11 -07:00
Linus Torvalds
7c902e2730 Merge branch 'akpm' (patches from Andrew)
Merge misx fixes from Andrew Morton:
 "31 patches.

  Subsystems affected by this patch series: hotfixes, mm/pagealloc,
  kexec, ocfs2, lib, mm/slab, mm/slab, mm/slub, mm/swap, mm/pagemap,
  mm/vmalloc, mm/memcg, mm/gup, mm/thp, mm/vmscan, x86,
  mm/memory-hotplug, MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
  MAINTAINERS: update info for sparse
  mm/memory_hotplug.c: fix false softlockup during pfn range removal
  mm: remove vmalloc_exec
  arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page
  x86/hyperv: allocate the hypercall page with only read and execute bits
  mm/memory: fix IO cost for anonymous page
  mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages"
  mm: workingset: age nonresident information alongside anonymous pages
  doc: THP CoW fault no longer allocate THP
  docs: mm/gup: minor documentation update
  mm/memcontrol.c: prevent missed memory.low load tears
  mm/memcontrol.c: add missed css_put()
  mm: memcontrol: handle div0 crash race condition in memory.low
  mm/vmalloc.c: fix a warning while make xmldocs
  media: omap3isp: remove cacheflush.h
  make asm-generic/cacheflush.h more standalone
  mm/debug_vm_pgtable: fix build failure with powerpc 8xx
  mm/memory.c: properly pte_offset_map_lock/unlock in vm_insert_pages()
  mm: fix swap cache node allocation mask
  slub: cure list_slab_objects() from double fix
  ...
2020-06-26 12:19:36 -07:00
Douglas Anderson
440ab9e10e kgdb: Avoid suspicious RCU usage warning
At times when I'm using kgdb I see a splat on my console about
suspicious RCU usage.  I managed to come up with a case that could
reproduce this that looked like this:

  WARNING: suspicious RCU usage
  5.7.0-rc4+ #609 Not tainted
  -----------------------------
  kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!

  other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
  3 locks held by swapper/0/1:
   #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
   #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
   #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac

  stack backtrace:
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
  Hardware name: Google Cheza (rev3+) (DT)
  Call trace:
   dump_backtrace+0x0/0x1b8
   show_stack+0x1c/0x24
   dump_stack+0xd4/0x134
   lockdep_rcu_suspicious+0xf0/0x100
   find_task_by_pid_ns+0x5c/0x80
   getthread+0x8c/0xb0
   gdb_serial_stub+0x9d4/0xd04
   kgdb_cpu_enter+0x284/0x7ac
   kgdb_handle_exception+0x174/0x20c
   kgdb_brk_fn+0x24/0x30
   call_break_hook+0x6c/0x7c
   brk_handler+0x20/0x5c
   do_debug_exception+0x1c8/0x22c
   el1_sync_handler+0x3c/0xe4
   el1_sync+0x7c/0x100
   rpmh_rsc_probe+0x38/0x420
   platform_drv_probe+0x94/0xb4
   really_probe+0x134/0x300
   driver_probe_device+0x68/0x100
   __device_attach_driver+0x90/0xa8
   bus_for_each_drv+0x84/0xcc
   __device_attach+0xb4/0x13c
   device_initial_probe+0x18/0x20
   bus_probe_device+0x38/0x98
   device_add+0x38c/0x420

If I understand properly we should just be able to blanket kgdb under
one big RCU read lock and the problem should go away.  We'll add it to
the beast-of-a-function known as kgdb_cpu_enter().

With this I no longer get any splats and things seem to work fine.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200602154729.v2.1.I70e0d4fd46d5ed2aaf0c98a355e8e1b7a5bb7e4e@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-26 15:41:40 +01:00
Sumit Garg
5946d1f5b3 kdb: Switch to use safer dbg_io_ops over console APIs
In kgdb context, calling console handlers aren't safe due to locks used
in those handlers which could in turn lead to a deadlock. Although, using
oops_in_progress increases the chance to bypass locks in most console
handlers but it might not be sufficient enough in case a console uses
more locks (VT/TTY is good example).

Currently when a driver provides both polling I/O and a console then kdb
will output using the console. We can increase robustness by using the
currently active polling I/O driver (which should be lockless) instead
of the corresponding console. For several common cases (e.g. an
embedded system with a single serial port that is used both for console
output and debugger I/O) this will result in no console handler being
used.

In order to achieve this we need to reverse the order of preference to
use dbg_io_ops (uses polling I/O mode) over console APIs. So we just
store "struct console" that represents debugger I/O in dbg_io_ops and
while emitting kdb messages, skip console that matches dbg_io_ops
console in order to avoid duplicate messages. After this change,
"is_console" param becomes redundant and hence removed.

Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/1591264879-25920-5-git-send-email-sumit.garg@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-26 15:40:16 +01:00
Christoph Hellwig
7a0e27b2a0 mm: remove vmalloc_exec
Merge vmalloc_exec into its only caller.  Note that for !CONFIG_MMU
__vmalloc_node_range maps to __vmalloc, which directly clears the
__GFP_HIGHMEM added by the vmalloc_exec stub anyway.

Link: http://lkml.kernel.org/r/20200618064307.32739-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:38 -07:00
Lianbo Jiang
fd7af71be5 kexec: do not verify the signature without the lockdown or mandatory signature
Signature verification is an important security feature, to protect
system from being attacked with a kernel of unknown origin.  Kexec
rebooting is a way to replace the running kernel, hence need be secured
carefully.

In the current code of handling signature verification of kexec kernel,
the logic is very twisted.  It mixes signature verification, IMA
signature appraising and kexec lockdown.

If there is no KEXEC_SIG_FORCE, kexec kernel image doesn't have one of
signature, the supported crypto, and key, we don't think this is wrong,
Unless kexec lockdown is executed.  IMA is considered as another kind of
signature appraising method.

If kexec kernel image has signature/crypto/key, it has to go through the
signature verification and pass.  Otherwise it's seen as verification
failure, and won't be loaded.

Seems kexec kernel image with an unqualified signature is even worse
than those w/o signature at all, this sounds very unreasonable.  E.g.
If people get a unsigned kernel to load, or a kernel signed with expired
key, which one is more dangerous?

So, here, let's simplify the logic to improve code readability.  If the
KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature
verification is mandated.  Otherwise, we lift the bar for any kernel
image.

Link: http://lkml.kernel.org/r/20200602045952.27487-1-lijiang@redhat.com
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Matthew Garrett <mjg59@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:36 -07:00
Linus Torvalds
4a21185cda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Don't insert ESP trailer twice in IPSEC code, from Huy Nguyen.

 2) The default crypto algorithm selection in Kconfig for IPSEC is out
    of touch with modern reality, fix this up. From Eric Biggers.

 3) bpftool is missing an entry for BPF_MAP_TYPE_RINGBUF, from Andrii
    Nakryiko.

 4) Missing init of ->frame_sz in xdp_convert_zc_to_xdp_frame(), from
    Hangbin Liu.

 5) Adjust packet alignment handling in ax88179_178a driver to match
    what the hardware actually does. From Jeremy Kerr.

 6) register_netdevice can leak in the case one of the notifiers fail,
    from Yang Yingliang.

 7) Use after free in ip_tunnel_lookup(), from Taehee Yoo.

 8) VLAN checks in sja1105 DSA driver need adjustments, from Vladimir
    Oltean.

 9) tg3 driver can sleep forever when we get enough EEH errors, fix from
    David Christensen.

10) Missing {READ,WRITE}_ONCE() annotations in various Intel ethernet
    drivers, from Ciara Loftus.

11) Fix scanning loop break condition in of_mdiobus_register(), from
    Florian Fainelli.

12) MTU limit is incorrect in ibmveth driver, from Thomas Falcon.

13) Endianness fix in mlxsw, from Ido Schimmel.

14) Use after free in smsc95xx usbnet driver, from Tuomas Tynkkynen.

15) Missing bridge mrp configuration validation, from Horatiu Vultur.

16) Fix circular netns references in wireguard, from Jason A. Donenfeld.

17) PTP initialization on recovery is not done properly in qed driver,
    from Alexander Lobakin.

18) Endian conversion of L4 ports in filters of cxgb4 driver is wrong,
    from Rahul Lakkireddy.

19) Don't clear bound device TX queue of socket prematurely otherwise we
    get problems with ktls hw offloading, from Tariq Toukan.

20) ipset can do atomics on unaligned memory, fix from Russell King.

21) Align ethernet addresses properly in bridging code, from Thomas
    Martitz.

22) Don't advertise ipv4 addresses on SCTP sockets having ipv6only set,
    from Marcelo Ricardo Leitner.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (149 commits)
  rds: transport module should be auto loaded when transport is set
  sch_cake: fix a few style nits
  sch_cake: don't call diffserv parsing code when it is not needed
  sch_cake: don't try to reallocate or unshare skb unconditionally
  ethtool: fix error handling in linkstate_prepare_data()
  wil6210: account for napi_gro_receive never returning GRO_DROP
  hns: do not cast return value of napi_gro_receive to null
  socionext: account for napi_gro_receive never returning GRO_DROP
  wireguard: receive: account for napi_gro_receive never returning GRO_DROP
  vxlan: fix last fdb index during dump of fdb with nhid
  sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket
  tc-testing: avoid action cookies with odd length.
  bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  net: dsa: sja1105: fix tc-gate schedule with single element
  net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules
  net: dsa: sja1105: unconditionally free old gating config
  net: dsa: sja1105: move sja1105_compose_gating_subschedule at the top
  net: macb: free resources on failure path of at91ether_open()
  net: macb: call pm_runtime_put_sync on failure path
  ...
2020-06-25 18:27:40 -07:00
Linus Torvalds
42e9c85f5c tracing: Four small fixes
- Fixed a ringbuffer bug for nested events having time go backwards
  - Fix a config dependency for boot time tracing to depend on synthetic
    events instead of histograms.
  - Fix trigger format parsing to handle multiple spaces
  - Fix bootconfig to handle failures in multiple events
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXvUjBBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmLiAQD47/1T01ilYeXqJ+EG235aeQssvRa7
 RSmIAoMP+V6kHQD9G2RjnWkb3BcrdNk9zoi0LpnuMl95m5OuaMzE4PPO+ws=
 =Zbx8
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Four small fixes:

   - Fix a ringbuffer bug for nested events having time go backwards

   - Fix a config dependency for boot time tracing to depend on
     synthetic events instead of histograms.

   - Fix trigger format parsing to handle multiple spaces

   - Fix bootconfig to handle failures in multiple events"

* tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/boottime: Fix kprobe multiple events
  tracing: Fix event trigger to accept redundant spaces
  tracing/boot: Fix config dependency for synthedic event
  ring-buffer: Zero out time extend if it is nested and not absolute
2020-06-25 16:16:49 -07:00
Peter Zijlstra
b58e733fd7 rcu: Fixup noinstr warnings
A KCSAN build revealed we have explicit annoations through atomic_*()
usage, switch to arch_atomic_*() for the respective functions.

vmlinux.o: warning: objtool: rcu_nmi_exit()+0x4d: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x25: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x4f: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x2a: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: __rcu_is_watching()+0x25: call to __kcsan_check_access() leaves .noinstr.text section

Additionally, without the NOP in instrumentation_begin(), objtool would
not detect the lack of the 'else instrumentation_begin();' branch in
rcu_nmi_enter().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-25 08:24:32 -07:00
Rafael J. Wysocki
10e8b11eb3 cpuidle: Rearrange s2idle-specific idle state entry code
Implement call_cpuidle_s2idle() in analogy with call_cpuidle()
for the s2idle-specific idle state entry and invoke it from
cpuidle_idle_call() to make the s2idle-specific idle entry code
path look more similar to the "regular" idle entry one.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
2020-06-25 13:52:53 +02:00
Sumit Garg
2a78b85b70 kdb: Make kdb_printf() console handling more robust
While rounding up CPUs via NMIs, its possible that a rounded up CPU
maybe holding a console port lock leading to kgdb master CPU stuck in
a deadlock during invocation of console write operations. A similar
deadlock could also be possible while using synchronous breakpoints.

So in order to avoid such a deadlock, set oops_in_progress to encourage
the console drivers to disregard their internal spin locks: in the
current calling context the risk of deadlock is a bigger problem than
risks due to re-entering the console driver. We operate directly on
oops_in_progress rather than using bust_spinlocks() because the calls
bust_spinlocks() makes on exit are not appropriate for this calling
context.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-4-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:30 +01:00
Sumit Garg
e8857288bb kdb: Check status of console prior to invoking handlers
Check if a console is enabled prior to invoking corresponding write
handler.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-3-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:29 +01:00
Sumit Garg
9d71b344f8 kdb: Re-factor kdb_printf() message write code
Re-factor kdb_printf() message write code in order to avoid duplication
of code and thereby increase readability.

Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-2-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:29 +01:00