select_idle_sibling() has a special case for tasks woken up by a per-CPU
kthread where the selected CPU is the previous one. For asymmetric CPU
capacity systems, the assumption was that the wakee couldn't have a
bigger utilization during task placement than it had during its last
activation. That assumption did not account for uclamp.min, which can
change completely between two task activations and as a consequence
mandates applying the fitness criterion asym_fits_capacity(), even for
the exit path described above.
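A minimal sketch of that exit path with the fitness check applied (the
condition's shape follows mainline select_idle_sibling(); details may
differ):

  if (is_per_cpu_kthread(current) &&
      prev == smp_processor_id() &&
      this_rq()->nr_running <= 1 &&
      asym_fits_capacity(task_util, prev))
          return prev;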
Fixes: b4c9c9f156 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20211129173115.4006346-1-vincent.donnefort@arm.com
select_idle_sibling() has a special case for tasks woken up by a per-CPU
kthread, where the selected CPU is the previous one. However, the current
condition for this exit path is incomplete. A task can wake up from an
interrupt context (e.g. hrtimer) while a per-CPU kthread is running. Such
a scenario would spuriously trigger the special case described above.
Also, a recent change made the idle task behave like a regular per-CPU
kthread, making that situation more likely to happen
(is_per_cpu_kthread(swapper) is now true).
Checking for task context makes sure select_idle_sibling() will not
interpret a wake up from any other context as a wake up by a per-CPU
kthread.
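A hedged sketch of the tightened condition (per the mainline code; the
exact form may differ):

  if (is_per_cpu_kthread(current) &&
      in_task() &&                      /* reject wakeups from IRQ context */
      prev == smp_processor_id() &&
      this_rq()->nr_running <= 1 &&
      asym_fits_capacity(task_util, prev))
          return prev;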
Fixes: 52262ee567 ("sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20211201143450.479472-1-vincent.donnefort@arm.com
Commit d81ae8aac8 ("sched/uclamp: Fix initialization of struct
uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
match the woken up task's uclamp_max when the rq is idle.
The code was relying on rq->uclamp_max being initialized to zero, so on
first enqueue
  static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
                                      enum uclamp_id clamp_id)
  {
          ...
          if (uc_se->value > READ_ONCE(uc_rq->value))
                  WRITE_ONCE(uc_rq->value, uc_se->value);
  }
was actually resetting it. But since commit d81ae8aac8 changed the
default to 1024, this no longer works. And since rq->uclamp_flags is
also initialized to 0, neither the above code path nor uclamp_idle_reset()
updates rq->uclamp_max on the first wake up from idle.
This is only visible from the first wake up(s) until the first dequeue to
idle after enabling the static key. And it only matters if the
uclamp_max of this task is < 1024, since only then will its uclamp_max be
effectively ignored.
Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
ensure uclamp_idle_reset() is called which then will update the rq
uclamp_max value as expected.
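A sketch of the fix, assuming it lands in the rq's uclamp initialization
(the exact location may differ):

  static void init_uclamp_rq(struct rq *rq)
  {
          ...
          /* Start idle-flagged so the first enqueue runs uclamp_idle_reset() */
          rq->uclamp_flags = UCLAMP_FLAG_IDLE;
  }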
Fixes: d81ae8aac8 ("sched/uclamp: Fix initialization of struct uclamp_rq")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
__setup() callbacks expect 1 for success and 0 for failure. Correct the
usage here to reflect that.
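For illustration, a hedged sketch of the convention (the callback and
option names here are hypothetical):

  static int __init setup_example(char *str)
  {
          if (!str || parse_example_failed(str))  /* hypothetical parser */
                  return 0;       /* 0: option not handled */
          return 1;               /* 1: option consumed successfully */
  }
  __setup("example=", setup_example);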
Fixes: 826bfeb37b ("preempt/dynamic: Support dynamic preempt with preempt= boot option")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com
I completed the first batch of signal changes for v5.17 against
v5.16-rc1 before the SA_IMMUTABLE fixes were completed, which leaves
me with two lines of development that I want on my signal development
branch, both rooted at v5.16-rc1. Especially as I am hoping
to reach the point of being able to remove SA_IMMUTABLE.
Linus merged my SA_IMMUTABLE fixes as:
7af959b5d5 ("Merge branch 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
To avoid rebasing the development changes that are currently complete I am
merging the work I sent upstream to Linus to make my life simpler.
The SA_IMMUTABLE changes as they are described in Linus's merge commit.
Pull exit-vs-signal handling fixes from Eric Biederman:
"This is a small set of changes where debuggers were no longer able to
intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit
cleanups.
This is essentially the change you suggested with all of the i's
dotted and the t's crossed so that ptrace can intercept all of the
cases it has been able to intercept in the past, and all of the cases
that made it to exit without giving ptrace a chance still don't give
ptrace a chance"
* 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Replace force_fatal_sig with force_exit_sig when in doubt
signal: Don't always set SA_IMMUTABLE for forced signals
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The first commit cited below attempts to fix the off-by-one error that
appeared in some comparisons with an open range. Due to this error,
arithmetically equivalent pieces of code could get different verdicts
from the verifier, for example (pseudocode):
  // 1. Passes the verifier:
  if (data + 8 > data_end)
          return early
  read *(u64 *)data, i.e. [data; data+7]

  // 2. Rejected by the verifier (should still pass):
  if (data + 7 >= data_end)
          return early
  read *(u64 *)data, i.e. [data; data+7]
The attempted fix, however, shifts the range by one in the wrong
direction, so the bug not only remains, but such pieces of code also
start failing in the verifier:

  // 3. Rejected by the verifier, but the check is stricter than in #1.
  if (data + 8 >= data_end)
          return early
  read *(u64 *)data, i.e. [data; data+7]
The change performed by that fix converted an off-by-one bug into an
off-by-two one. The second commit cited below added the BPF selftests
written to ensure that code chunks like #3 are rejected; however,
they should be accepted.
This commit fixes the off-by-two error by adjusting new_range in the
right direction and fixes the tests by changing the range into the
one that should actually fail.
Fixes: fb2a311a31 ("bpf: fix off by one for range markings with L{T, E} patterns")
Fixes: b37242c773 ("bpf: add test cases to bpf selftests to cover all access tests")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
going to sleep. In that case the following scenario can happen:
  unbind_workers()                     wq_worker_sleeping()
  ----------------                     --------------------
                                       if (worker->flags & WORKER_NOT_RUNNING)
                                               return;
                                       //PREEMPTED by unbind_workers
  worker->flags |= WORKER_UNBOUND;
  [...]
  atomic_set(&pool->nr_running, 0);
                                       //resume to worker
                                       atomic_dec_and_test(&pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to
remain 0 until the pool ever gets rebound in case cpu_up() is called on
the target CPU in the future. But here the race leaves pool->nr_running
with a value of -1, triggering the following warning when the worker goes
idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also due to this incorrect "nr_running == -1", all sorts of hazards can
happen, starting with queued works being ignored because no workers are
woken at insert_work() time.
Fix this by re-checking the worker flags while pool->lock is held, as
sketched below.
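A hedged sketch of the recheck once pool->lock is held (surrounding
code elided and simplified):

  raw_spin_lock_irq(&pool->lock);

  /*
   * Recheck under pool->lock: unbind_workers() may have set
   * WORKER_UNBOUND and reset nr_running after the first check.
   */
  if (worker->flags & WORKER_NOT_RUNNING) {
          raw_spin_unlock_irq(&pool->lock);
          return;
  }

  if (atomic_dec_and_test(&pool->nr_running) &&
      !list_empty(&pool->worklist))
          wake_up_worker(pool);
  raw_spin_unlock_irq(&pool->lock);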
Fixes: b945efcdd0 ("sched: Remove pointless preemption disable in sched_submit_work()")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
waking up. In that case the following scenario can happen:
  unbind_workers()                     wq_worker_running()
  ----------------                     -------------------
                                       if (!(worker->flags & WORKER_NOT_RUNNING))
                                       //PREEMPTED by unbind_workers
  worker->flags |= WORKER_UNBOUND;
  [...]
  atomic_set(&pool->nr_running, 0);
                                       //resume to worker
                                       atomic_inc(&worker->pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to
remain 0 until the pool ever gets rebound in case cpu_up() is called on
the target CPU in the future. But here the race leaves pool->nr_running
with a value of 1, triggering the following warning when the worker goes
idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also due to this incorrect "nr_running == 1", further queued work may
end up not being served, because no worker is woken at work insert time.
This causes rcutorture writer stalls, for example.
Fix this by disabling preemption in the right place in
wq_worker_running(), as sketched below.
It's worth noting that if the worker migrates and runs concurrently with
unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update
due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.
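A hedged sketch of the fix (simplified; the mainline function carries
more detail):

  void wq_worker_running(struct task_struct *task)
  {
          struct worker *worker = kthread_data(task);

          if (!worker->sleeping)
                  return;

          /*
           * Keep preemption disabled between the flag test and the
           * increment so unbind_workers() cannot reset nr_running
           * in between.
           */
          preempt_disable();
          if (!(worker->flags & WORKER_NOT_RUNNING))
                  atomic_inc(&worker->pool->nr_running);
          preempt_enable();
          worker->sleeping = 0;
  }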
Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When the module registering its set is built-in, THIS_MODULE will be
NULL, hence we cannot return early when owner is NULL.
Fixes: 14f267d95f ("bpf: btf: Introduce helpers for dynamic BTF set registration")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211122144742.477787-3-memxor@gmail.com
Vinicius Costa Gomes reported [0] that the build fails when
CONFIG_DEBUG_INFO_BTF is enabled and CONFIG_BPF_SYSCALL is disabled.
This leads to btf.c not being compiled, and then no symbol being present
in vmlinux for the declarations in btf.h. Since BTF is not useful
without enabling the BPF subsystem, disallow this combination.
However, theoretically disabling both now could still fail, as the
symbols for the kfunc_btf_id_list variables are not available. This
isn't a problem as the compiler usually optimizes away the whole
register/unregister call, but at lower optimization levels it can fail
the build at the linking stage.
Fix that by adding dummy variables so that modules taking the address of
them still work, but the whole thing is a no-op.
[0]: https://lore.kernel.org/bpf/20211110205418.332403-1-vinicius.gomes@intel.com
Fixes: 14f267d95f ("bpf: btf: Introduce helpers for dynamic BTF set registration")
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211122144742.477787-2-memxor@gmail.com
Given a BPF program's BTF root type name, perform the following steps
(sketched in code further below):
. search in vmlinux candidate cache.
. if (present in cache and candidate list >= 1) return candidate list.
. do a linear search through kernel BTFs for possible candidates.
. regardless of number of candidates found populate vmlinux cache.
. if (candidate list >= 1) return candidate list.
. search in module candidate cache.
. if (present in cache) return candidate list (even if list is empty).
. do a linear search through BTFs of all kernel modules
collecting candidates from all of them.
. regardless of number of candidates found populate module cache.
. return candidate list.
Then wire the result into bpf_core_apply_relo_insn().
When a BPF program is trying to CO-RE-relocate a type that exists
neither in vmlinux BTF nor in module BTFs, these steps will perform
2 cache lookups when the cache is hit.
Note that the cache doesn't prevent abuse by a program that might have
lots of relocations that cannot be resolved. Hence cond_resched().
CO-RE in the kernel requires CAP_BPF, since BTF loading requires it.
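A hedged sketch of the lookup order above (helper names are
illustrative, not the kernel's):

  struct candidates *find_cands(const char *root_name)
  {
          struct candidates *c;

          /* vmlinux: a cached hit with >= 1 candidate wins */
          c = cache_lookup(vmlinux_cand_cache, root_name);
          if (c && c->cnt >= 1)
                  return c;
          c = linear_search_btf(vmlinux_btf, root_name);
          cache_populate(vmlinux_cand_cache, c);  /* even if empty */
          if (c->cnt >= 1)
                  return c;

          /* modules: any cached entry, even an empty one, is final */
          c = cache_lookup(module_cand_cache, root_name);
          if (c)
                  return c;
          c = linear_search_all_module_btfs(root_name);
          cache_populate(module_cand_cache, c);
          return c;
  }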
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-9-alexei.starovoitov@gmail.com
Make the BTF log size limit the same as the verifier log size limit.
Otherwise tools that progressively increase the log size and use the same
log for BTF loading and program loading will hit a hard-to-debug EINVAL.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-7-alexei.starovoitov@gmail.com
struct bpf_core_relo is generated by llvm and processed by libbpf.
It's a de-facto uapi.
With CO-RE in the kernel the struct bpf_core_relo becomes uapi de-jure.
Add an ability to pass a set of 'struct bpf_core_relo' to prog_load command
and let the kernel perform CO-RE relocations.
Note that struct bpf_line_info and struct bpf_func_info have the same
layout when passed from LLVM to libbpf and from libbpf to the kernel,
except that the "insn_off" field means "byte offset" when LLVM generates
it; libbpf then converts it to an "insn index" to pass to the kernel.
The struct bpf_core_relo's "insn_off" field is always a "byte offset".
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-6-alexei.starovoitov@gmail.com
Make relo_core.c compile both for the kernel and for the user space libbpf.
Note the patch is reducing BPF_CORE_SPEC_MAX_LEN from 64 to 32.
This is the maximum number of nested structs and arrays.
For example:
  struct sample {
          int a;
          struct {
                  int b[10];
          };
  };

  struct sample *s = ...;
  int *y = &s->b[5];
This field access is encoded as "0:1:0:5" (deref of s, member #1 which
is the anonymous struct, member #0 which is b, array index 5), and the
spec len is 4.
The follow up patch might bump it back to 64.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-4-alexei.starovoitov@gmail.com
Rename btf_member_bit_offset() and btf_member_bitfield_size() to
avoid conflicts with similarly named helpers in libbpf's btf.h.
Rename the kernel helpers, since libbpf helpers are part of uapi.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-3-alexei.starovoitov@gmail.com
getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.
task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime,
which can be updated at any time by the reader using vtime.
To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.
Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.
Meanwhile, in some rare cases, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.
If jiffies stops being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.
Here is a scenario where it matters:
0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
1) A stop machine callback is queued to execute somewhere.
2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
can still take IRQs.
3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
last_jiffies_update as a base. But last_jiffies_update hasn't been
updated for 2 jiffies since the timekeeper has interrupts disabled.
5) clockevents_program_event(), which relies on ktime_get(), observes
that the expiration is in the past and therefore programs the min
delta event on the clock.
6) The tick fires immediately, goto 3)
7) Tick storm; the nohz_full CPU is drowned and takes ages to reach
MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
Solve this by unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full that the unconditional call to ktime_get()
will not matter in practice.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
The minimum supported version of LLVM has been raised to 11.0.0, meaning
this check is always true, so it can be dropped.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
The 'kretprobe::data_size' is unsigned, thus it cannot be negative. But
if the user sets a large enough number (e.g. (size_t)-8), the result of
'data_size + sizeof(struct kretprobe_instance)' wraps around and becomes
smaller than sizeof(struct kretprobe_instance), or even zero. As a
result, the kretprobe_instance is allocated without enough memory, and
kretprobe accesses outside of the allocated memory.
To avoid this issue, introduce a maximum limit on
kretprobe::data_size. 4KB per instance should be OK.
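A hedged sketch of the check, assuming the 4KB limit described above
(the constant name is illustrative):

  #define KRETPROBE_MAX_DATA_SIZE 4096

  int register_kretprobe(struct kretprobe *rp)
  {
          ...
          /* Reject sizes that would overflow the instance allocation */
          if (rp->data_size > KRETPROBE_MAX_DATA_SIZE)
                  return -E2BIG;
          ...
  }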
Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: f47cd9b553 ("kprobes: kretprobe user entry-handler")
Reported-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When comparing two strings for the "onmatch" histogram trigger, fields
that are strings use string comparisons, which do not care about being
signed or not.
Do not fail to match two string fields if one is an unsigned char array
and the other a signed char array.
Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/
Cc: stable@vger.kernel.org
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Yafang Shao <laoar.shao@gmail.com>
Fixes: b05e89ae7c ("tracing: Accept different type for synthetic event fields")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
An extra newline is output by bpf_log() with the BPF_LOG_KERNEL level,
as shown below:
[ 52.095704] BPF:The function test_3 has 12 arguments. Too many.
[ 52.095704]
[ 52.096896] Error in parsing func ptr test_3 in struct bpf_dummy_ops
All bpf_log() calls now end with a newline, but not all
btf_verifier_log() calls do, so check whether the log message
has a trailing newline and add one if not.
Also there is no need to calculate the remaining user-space buffer size
for kernel log output, nor to truncate the output with '\0', which
vscnprintf() has already done, so only do these for the
user-space log output.
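A hedged sketch of the kernel-log path after the change (buffer names
are illustrative):

  n = vscnprintf(log->kbuf, sizeof(log->kbuf), fmt, args);
  if (log->level == BPF_LOG_KERNEL) {
          bool newline = n > 0 && log->kbuf[n - 1] == '\n';

          /* printk directly; append '\n' only when it is missing */
          pr_err("BPF: %s%s", log->kbuf, newline ? "" : "\n");
          return;
  }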
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211201073458.2731595-2-houtao1@huawei.com
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, but does not spell out the
consequences, which turn out to be quite mild. Therefore expand this
comment to explicitly say that the penalty for failing to nail down the
specified CPU is that the workqueue handler might find itself executing
on some other CPU.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
0Day/LKP observed that the refscale results fail to complete when larger
values of nrun (such as 300) are specified. The problem is that printk()
can accept at most a 1024-byte buffer. This commit therefore prints
the buffer whenever its length exceeds 800 bytes.
CC: Philip Li <philip.li@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There is only the one OOM error case in main_func(), so this commit
eliminates the errexit local variable in favor of a branch to cleanup
code.
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep
complains when a pi lock is held across that srcu_read_unlock().
Although this is a lockdep false positive (there is no other CPU to
complete the deadlock cycle), lockdep is what it is at the moment.
This commit therefore prevents rcutorture from holding pi lock across
a Tiny srcu_read_unlock().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, nested readers occur only when a timer handler interrupts a
reader. This is rare, and is thus insufficient testing of the transition
between nesting levels. This commit therefore causes rcutorture nested
readers to be the rule rather than the exception.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCUTORTURE_RDR_MASK is currently not the bit indicated by
RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than
that one. This is an accident waiting to happen, so this commit makes
RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the check_all_holdout_tasks_trace() function removes all tasks
marked with ->trc_reader_checked from the holdout list, including those
with IPIs pending. This means that the IPI handler might arrive at
a task that has already been removed from the list, which is at best
an accident waiting to happen.
This commit therefore avoids removing tasks with IPIs pending from
the holdout list. This in turn means that the "if" condition in the
for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always
evaluate to false, so a WARN_ON_ONCE() is added to check that.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tiny SRCU readers can appear at task level, but also in interrupt and
softirq handlers. Because Tiny SRCU is selected only in kernels built
with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace
period to start while there is a non-task-level SRCU reader executing.
This means that it does not make sense for __srcu_read_unlock() to awaken
the Tiny SRCU grace period, because that can only happen when the grace
period is waiting for one value of ->srcu_idx and __srcu_read_unlock()
is ending the last reader for some other value of ->srcu_idx. After all,
any such wakeup will be redundant.
Worse yet, in some cases, such wakeups generate lockdep splats:
======================================================
WARNING: possible circular locking dependency detected
5.15.0-rc1+ #3758 Not tainted
------------------------------------------------------
rcu_torture_rea/53 is trying to acquire lock:
ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at:
xa/0x30
but task is already holding lock:
ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
_extend+0x370/0x400
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x2f/0x50
try_to_wake_up+0x50/0x580
swake_up_locked.part.7+0xe/0x30
swake_up_one+0x22/0x30
rcutorture_one_extend+0x1b6/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_timer+0x1a/0x70
call_timer_fn+0xa6/0x230
run_timer_softirq+0x493/0x4c0
__do_softirq+0xc0/0x371
irq_exit+0x73/0x90
sysvec_apic_timer_interrupt+0x63/0x80
asm_sysvec_apic_timer_interrupt+0x12/0x20
default_idle+0xb/0x10
default_idle_call+0x5e/0x170
do_idle+0x18a/0x1f0
cpu_startup_entry+0xa/0x10
start_kernel+0x678/0x69f
secondary_startup_64_no_verify+0xc2/0xcb
-> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}:
__lock_acquire+0x130c/0x2440
lock_acquire+0xc2/0x270
_raw_spin_lock_irqsave+0x2f/0x50
swake_up_one+0xa/0x30
rcutorture_one_extend+0x387/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_reader+0xac/0x200
kthread+0x12d/0x150
ret_from_fork+0x22/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&p->pi_lock);
lock(srcu_ctl.srcu_wq.lock);
lock(&p->pi_lock);
lock(srcu_ctl.srcu_wq.lock);
*** DEADLOCK ***
1 lock held by rcu_torture_rea/53:
#0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
_extend+0x370/0x400
stack backtrace:
CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+
Hardware name: Red Hat KVM/RHEL-AV, BIOS
e_el8.5.0+746+bbd5d70c 04/01/2014
Call Trace:
check_noncircular+0xfe/0x110
? find_held_lock+0x2d/0x90
__lock_acquire+0x130c/0x2440
lock_acquire+0xc2/0x270
? swake_up_one+0xa/0x30
? find_held_lock+0x72/0x90
_raw_spin_lock_irqsave+0x2f/0x50
? swake_up_one+0xa/0x30
swake_up_one+0xa/0x30
rcutorture_one_extend+0x387/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_reader+0xac/0x200
? rcutorture_oom_notify+0xf0/0xf0
? __kthread_parkme+0x61/0x90
? rcu_torture_one_read+0x5d0/0x5d0
kthread+0x12d/0x150
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
This is a false positive because there is only one CPU, and both locks
are raw (non-preemptible) spinlocks. However, it is worthwhile getting
rid of the redundant wakeup, which has the side effect of breaking
the theoretical deadlock cycle. This commit therefore eliminates the
redundant wakeups.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not possible to use either GFP_ATOMIC or
GFP_NOWAIT in certain non-preemptible contexts, including on RT kernels
inside raw_spin_lock sections (see gfp.h and ab00db216c).
Fix it by using kasan_record_aux_stack_noalloc(), which instructs
stackdepot not to expand the stack storage via alloc_pages() when it
runs out.
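The change itself is then roughly a one-line substitution at the
affected call sites:

  - kasan_record_aux_stack(head);
  + kasan_record_aux_stack_noalloc(head);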
Jianwei Hu reported:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1
Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018
Call Trace:
show_stack+0x52/0x58
dump_stack+0xa1/0xd6
___might_sleep.cold+0x11c/0x12d
rt_spin_lock+0x3f/0xc0
rmqueue+0x100/0x1460
rmqueue+0x100/0x1460
mark_usage+0x1a0/0x1a0
ftrace_graph_ret_addr+0x2a/0xb0
rmqueue_pcplist.constprop.0+0x6a0/0x6a0
__kasan_check_read+0x11/0x20
__zone_watermark_ok+0x114/0x270
get_page_from_freelist+0x148/0x630
is_module_text_address+0x32/0xa0
__alloc_pages_nodemask+0x2f6/0x790
__alloc_pages_slowpath.constprop.0+0x12d0/0x12d0
create_prof_cpu_mask+0x30/0x30
alloc_pages_current+0xb1/0x150
stack_depot_save+0x39f/0x490
kasan_save_stack+0x42/0x50
kasan_save_stack+0x23/0x50
kasan_record_aux_stack+0xa9/0xc0
__call_rcu+0xff/0x9c0
call_rcu+0xe/0x10
put_object+0x53/0x70
__delete_object+0x7b/0x90
kmemleak_free+0x46/0x70
slab_free_freelist_hook+0xb4/0x160
kfree+0xe5/0x420
kfree_const+0x17/0x30
kobject_cleanup+0xaa/0x230
kobject_put+0x76/0x90
netdev_queue_update_kobjects+0x17d/0x1f0
... ...
ksys_write+0xd9/0x180
__x64_sys_write+0x42/0x50
do_syscall_64+0x38/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642
Fixes: 84109ab585 ("rcu: Record kvfree_call_rcu() call stack for KASAN")
Fixes: 26e760c9a7 ("rcu: kasan: record and print call_rcu() call stack")
Reported-by: Jianwei Hu <jianwei.hu@windriver.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the boost kthreads are created on systems with nohz_full CPUs,
the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD).
However, when rcu_boost_kthread_setaffinity() is called, the original
affinity will be changed and these kthreads can subsequently run on
nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity()
restrict these boost kthreads to housekeeping CPUs.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit replaces the obsolete and ambiguous macro in_irq() with its
shiny new in_hardirq() equivalent.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of
the rcu_needs_cpu() function. This commit therefore moves this function
from kernel/rcu/tree_plugin.h to kernel/rcu/tree.c.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
systems with RCU callbacks offloaded. In this situation, all that this
Kconfig option does is slow down idle entry/exit with an additional
always-taken early exit. If this is the only use case, then this
Kconfig option is nothing but an attractive nuisance that needs to go away.
This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
With the previous patch, there is an extra watchdog read in each retry.
Now the total number of clocksource reads is increased to 4 per iteration.
In order to avoid increasing the clock skew check overhead, the default
maximum number of retries is reduced from 3 to 2 to maintain the same 12
clocksource reads in the worst case.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Since commit db3a34e174 ("clocksource: Retry clock read if long delays
detected") and commit 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold"), it is found that tsc clocksource fallback to hpet can
sometimes happen on both Intel and AMD systems especially when they are
running stressful benchmarking workloads. Of the 23 systems tested with
a v5.14 kernel, 10 of them have switched to hpet clock source during
the test run.
The result of falling back to hpet is a drastic reduction of performance
when running benchmarks. For example, the fio performance tests can
drop up to 70% whereas the iperf3 performance can drop up to 80%.
4 hpet fallbacks happened during bootup. They were:
[ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
[ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
[ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
[ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
Other fallbacks happen when the systems were running stressful
benchmarks. For example:
[ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
[46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
Commit 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
changed the skew margin from 100us to 50us. I think this is too small
and can easily be exceeded when running some stressful workloads on a
thermally stressed system. So it is switched back to 100us.
Even a maximum skew margin of 100us may be too small for some systems
at boot time, especially if those systems are under thermal stress. To
eliminate the case that the large skew is due to the system being too
busy slowing down the reading of both the watchdog and the clocksource,
an extra consecutive read of the watchdog clock is done to check this.
The consecutive watchdog read delay is compared against
WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
the system is just too busy. A warning will be printed to the console
and the clock skew check is skipped for this round.
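A hedged sketch of the consecutive-read check (the structure here is
illustrative):

  /* Two back-to-back watchdog reads; a large gap means "too busy". */
  wd_now = watchdog->read(watchdog);
  wd_again = watchdog->read(watchdog);
  wd_delay = clocksource_cyc2ns(clocksource_delta(wd_again, wd_now,
                                                  watchdog->mask),
                                watchdog->mult, watchdog->shift);
  if (wd_delay > WATCHDOG_MAX_SKEW / 2) {
          pr_warn("timekeeping watchdog on CPU%d: watchdog read-back delay too long, skipping skew check\n",
                  smp_processor_id());
          return;         /* skip this round's skew check */
  }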
Fixes: db3a34e174 ("clocksource: Retry clock read if long delays detected")
Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The only usage of irq_generic_chip_ops is to pass its address to
irq_domain_add_linear() which takes a pointer to const struct
irq_domain_ops. Make it const to allow the compiler to put it in
read-only memory.
[ tglx: Fixed subject prefix ]
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20211130214043.1257585-1-rikard.falkeborn@gmail.com
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.
To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.
Convert them all to the new flag accessor helpers.
The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in
set_nr_if_polling() is left as-is for clarity.
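A sketch of the pattern, assuming the read_thread_flags() snapshot
helper introduced by this series (the work function is illustrative):

  /* Snapshot once; remote setters can't change the value under us. */
  unsigned long flags = read_thread_flags();

  if (flags & (_TIF_SIGPENDING | _TIF_NEED_RESCHED))
          do_pending_work(flags);         /* illustrative helper */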
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.
To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.
Convert them all to the new flag accessor helpers.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211129130653.2037928-3-mark.rutland@arm.com
This patch adds the kernel-side and API changes for a new helper
function, bpf_loop:
  long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
                u64 flags);

  where long (*callback_fn)(u32 index, void *ctx);
bpf_loop invokes the "callback_fn" **nr_loops** times or until the
callback_fn returns 1. The callback_fn can only return 0 or 1, and
this is enforced by the verifier. The callback_fn index is zero-indexed.
A few things to note:
~ The "u64 flags" parameter is currently unused but is included in
case a future use case for it arises.
~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_loop calls but the program must
still adhere to the verifier constraint on its stack depth (the stack
depth cannot exceed MAX_BPF_STACK)
~ Recursive callback_fns do not pass the verifier, due to the call stack
for these being too deep.
~ The next patch will include the tests and benchmark
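A hedged usage sketch from the BPF program side (the section name and
program shape are illustrative):

  static long sum_cb(u32 index, void *ctx)
  {
          long *sum = ctx;

          *sum += index;
          return 0;       /* 0: continue, 1: stop the loop early */
  }

  SEC("fentry/do_nanosleep")
  int sum_prog(void *ctx)
  {
          long sum = 0;

          bpf_loop(100, sum_cb, &sum, 0); /* flags must currently be 0 */
          return 0;
  }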
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com
The eBPF name has completely taken over from "internal BPF" in general
usage for the actual eBPF representation, with BPF used for any general
in-kernel use.
Prune all remaining references to "internal BPF".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de
The comment stating that the prog_free helper frees the program is
not exactly useful, so just remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-3-hch@lst.de
css_alloc() needs the parent css, while cgroup_css() gets the current
cgroup's css. So we are getting the wrong css during
cgroup_init_subsys().
Fortunately, cgrp_dfl_root.cgrp's css is not set yet, so the value we
pass to css_alloc() is NULL anyway.
Let's pass NULL directly during init, since we know there is no parent
yet.
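The change in cgroup_init_subsys() is then roughly:

  - css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
  + css = ss->css_alloc(NULL);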
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Just use the disk attached to the request_queue instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix to ensure that there is no stale KASAN shadow
state left on the idle task's stack when a CPU is brought up after it
was brought down before"
* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/scs: Reset task stack state in bringup_cpu()
Merge tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"A single fix for perf to prevent it from sending SIGTRAP to another
task from a trace point event as it's not possible to deliver a
synchronous signal to a different task from there"
* tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Ignore sigtrap for tracepoints destined for other tasks
Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two regression fixes for reader writer semaphores:
- Plug a race in the lock handoff which is caused by inconsistency of
the reader and writer path and can lead to corruption of the
underlying counter.
- down_read_trylock() is suboptimal when the lock is contended and
multiple readers trylock concurrently. That's due to the initial
value being read non-atomically, which results in at least two
compare exchange loops. Making the initial readout atomic reduces
this significantly: with 40 readers, by 11% in a benchmark which
enforces contention on mmap_sem"
* tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Optimize down_read_trylock() under highly contended case
locking/rwsem: Make handoff bit handling more consistent
Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull another tracing fix from Steven Rostedt:
"Fix the fix of pid filtering
The setting of the pid filtering flag tested the "trace only this pid"
case twice, and ignored the "trace everything but this pid" case.
The 5.15 kernel does things a little differently due to the new sparse
pid mask introduced in 5.16, and as the bug was discovered running the
5.15 kernel, and the first fix was initially done for that kernel,
that fix handled both cases (only pid and all but pid), but the
forward port to 5.16 created this bug"
* tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Test the 'Do not trace this pid' case in create event
When creating a new event (via a module, kprobe, eprobe, etc), the
descriptors that are created must add flags for pid filtering if an
instance has pid filtering enabled, as the flags are used at the time the
event is executed to know if pid filtering should be done or not.
The "Only trace this pid" case was added, but a cut and paste error made
that case checked twice, instead of checking the "Trace all but this pid"
case.
Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/
Fixes: 6cb206508b ("tracing: Check pid filtering when creating events")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two fixes to event pid filtering:
- Make sure newly created events reflect the current state of pid
filtering
- Take pid filtering into account when recording trigger events.
(Also clean up the if statement to be cleaner)"
* tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix pid filtering when triggers are attached
tracing: Check pid filtering when creating events
If an event is filtered by pid and a trigger that requires processing
of the event is attached to it, the discard portion does not take
pid filtering into account, and the event will then be recorded when
it should not have been.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address three issues in the intel_pstate driver and fix two
problems related to hibernation.
Specifics:
- Make intel_pstate work correctly on Ice Lake server systems with
out-of-band performance control enabled (Adamos Ttofari).
- Fix EPP handling in intel_pstate during CPU offline and online in
the active mode (Rafael Wysocki).
- Make intel_pstate support ITMT on asymmetric systems with
overclocking enabled (Srinivas Pandruvada).
- Fix hibernation image saving when using the user space interface
based on the snapshot special device file (Evan Green).
- Make the hibernation code release the snapshot block device using
the same mode that was used when acquiring it (Thomas Zeitlhofer)"
* tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix snapshot partial write lengths
PM: hibernate: use correct mode for swsusp_close()
cpufreq: intel_pstate: ITMT support for overclocked system
cpufreq: intel_pstate: Fix active mode offline/online EPP handling
cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
When pid filtering is activated in an instance, all of the event trace
files for that instance have the PID_FILTER flag set. This determines
whether or not pid filtering needs to be done on the event, otherwise the
event is executed as normal.
If pid filtering is enabled when an event is created (via a dynamic event
or modules), its flag is not updated to reflect the current state, and the
events are not filtered properly.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that all architectures have a working futex implementation in any
configuration, remove the runtime detection code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20211026100432.1730393-2-arnd@kernel.org
snapshot_write() is inappropriately limiting the amount of data that can
be written in cases where a partial page has already been written. For
example, one would expect to be able to write 1 byte, then 4095 bytes to
the snapshot device, and have both of those complete fully (since now
we're aligned to a page again). But what ends up happening is we write 1
byte, then only 4094 of the 4095 bytes complete successfully.
The reason is that simple_write_to_buffer()'s second argument is the
total size of the buffer, not the size of the buffer minus the offset.
Since simple_write_to_buffer() accounts for the offset in its
implementation, snapshot_write() can just pass the full page size
directly down.
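The fix is then roughly a one-line change (the buffer variable here is
illustrative):

  - res = simple_write_to_buffer(buffer, PAGE_SIZE - pg_offp,
  -                              &pg_offp, buf, count);
  + res = simple_write_to_buffer(buffer, PAGE_SIZE,
  +                              &pg_offp, buf, count);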
Signed-off-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 39fbef4b0f ("PM: hibernate: Get block device exclusively in
swsusp_check()") changed the opening mode of the block device to
(FMODE_READ | FMODE_EXCL).
In the corresponding calls to swsusp_close(), the mode is still just
FMODE_READ which triggers the warning in blkdev_flush_mapping() on
resume from hibernate.
So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
device.
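The fix is then roughly the matching one-line change at the close
sites:

  - swsusp_close(FMODE_READ);
  + swsusp_close(FMODE_READ | FMODE_EXCL);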
Fixes: 39fbef4b0f ("PM: hibernate: Get block device exclusively in swsusp_check()")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To hot unplug a CPU, the idle task on that CPU calls a few layers of C
code before finally leaving the kernel. When KASAN is in use, poisoned
shadow is left around for each of the active stack frames. When shadow
call stacks (SCS) are in use, the task's saved SCS SP is left pointing
at an arbitrary point within the task's shadow call stack.
When a CPU is offlined and then onlined back into the kernel, this stale
state can adversely affect execution. Stale KASAN shadow can alias new
stackframes and result in bogus KASAN warnings. A stale SCS SP is
effectively a memory leak, and prevents a portion of the shadow call
stack being used. Across a number of hotplug cycles the idle task's
entire shadow call stack can become unusable.
We previously fixed the KASAN issue in commit:
e1b77c9298 ("sched/kasan: remove stale KASAN poison after hotplug")
... by removing any stale KASAN stack poison immediately prior to
onlining a CPU.
Subsequently in commit:
f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
... the refactoring left the KASAN and SCS cleanup in one-time idle
thread initialization code rather than something invoked prior to each
CPU being onlined, breaking both as above.
We fixed SCS (but not KASAN) in commit:
63acd42c0d ("sched/scs: Reset the shadow stack when idle_task_exit")
... but as this runs in the context of the idle task being offlined it's
potentially fragile.
To fix these consistently and more robustly, reset the SCS SP and KASAN
shadow of a CPU's idle task immediately before we online that CPU in
bringup_cpu(). This ensures the idle task always has a consistent state
when it is running, and removes the need to do so when exiting an idle
task.
Whenever any thread is created, dup_task_struct() will give the task a
stack which is free of KASAN shadow, and initialize the task's SCS SP,
so there's no need to specially initialize either for the idle thread
within init_idle(), as this was only necessary to handle hotplug cycles.
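In sketch form the fix looks roughly like this (scs_task_reset() and
kasan_unpoison_task_stack() are the upstream helper names; the rest of
bringup_cpu() is elided, so this is not a compilable unit):
```c
static int bringup_cpu(unsigned int cpu)
{
	struct task_struct *idle = idle_thread_get(cpu);

	/*
	 * Reset stale stack state from the last time this CPU was online,
	 * before the idle task runs again.
	 */
	scs_task_reset(idle);			/* rewind SCS SP to the base */
	kasan_unpoison_task_stack(idle);	/* drop stale stack poison */

	/* ... the rest of the bringup sequence ... */
	return 0;
}
```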
I've tested this on arm64 with:
* gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
* clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
... offlining and onlining CPUS with:
| while true; do
| for C in /sys/devices/system/cpu/cpu*/online; do
| echo 0 > $C;
| echo 1 > $C;
| done
| done
Fixes: f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
Add the missing 'tu' variable initialization in the probes loop,
otherwise the head 'tu' is used instead of the added probes.
Link: https://lkml.kernel.org/r/20211123142801.182530-1-jolsa@kernel.org
Cc: stable@vger.kernel.org
Fixes: 99c9a923e9 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
cpuacct.stat shows user time based on raw, tick-based counters of random
precision. Use cputime_adjust() to scale these values against the
total runtime accounted by the scheduler, like we already do
for user/system times in /proc/<pid>/stat.
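The scaling idea as a small userspace sketch; scale_cputime() here is a
hypothetical helper mirroring what cputime_adjust() does, not the kernel code:
```c
#include <stdint.h>
#include <stdio.h>

/* Scale noisy tick-based utime/stime samples so that their sum matches
 * the precise runtime tracked by the scheduler. */
static void scale_cputime(uint64_t utime_ticks, uint64_t stime_ticks,
			  uint64_t sum_exec_runtime,
			  uint64_t *ut, uint64_t *st)
{
	uint64_t total = utime_ticks + stime_ticks;

	if (total == 0) {
		*ut = 0;
		*st = sum_exec_runtime;
		return;
	}
	*ut = (uint64_t)((unsigned __int128)sum_exec_runtime * utime_ticks / total);
	*st = sum_exec_runtime - *ut;
}

int main(void)
{
	uint64_t ut, st;

	/* 3 user ticks vs 1 system tick, 40ms of precisely accounted time */
	scale_cputime(3, 1, 40000000ULL, &ut, &st);
	printf("user=%lluns sys=%lluns\n",
	       (unsigned long long)ut, (unsigned long long)st);
	return 0;
}
```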
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211115164607.23784-4-arbn@yandex-team.com
cpuacct has 2 different ways of accounting and showing user
and system times.
The first one uses cpuacct_account_field() to account times
and the cpuacct.stat file to expose them. This one seems to work ok.
The second one uses the cpuacct_charge() function for accounting and the
set of cpuacct.usage* files to show times. Despite some attempts to
fix it in the past, it still doesn't work. Sometimes, while running a KVM
guest, cpuacct_charge() accounts most of the guest time as
system time. This doesn't match the user & system times shown in
cpuacct.stat or /proc/<pid>/stat.
Demonstration:
# git clone https://github.com/aryabinin/kvmsample
# make
# mkdir /sys/fs/cgroup/cpuacct/test
# echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
# ./kvmsample &
# for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
1976535645
2979839428
3979832704
4983603153
5983604157
Use the cpustats accounted in cpuacct_account_field() as the source
of user/sys times for the cpuacct.usage* files. Make cpuacct_charge()
account only total execution time.
Fixes: d740037fac ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211115164607.23784-3-arbn@yandex-team.com
Replace the fatal BUG_ON() with the safer WARN_ON_ONCE() in cpuacct_cpuusage_read().
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211115164607.23784-2-arbn@yandex-team.com
cpuacct.stat in no-root cgroups shows user time without guest time
included in it. This doesn't match the user time shown in root
cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat
in the same way.
Make account_guest_time() add user time to the cgroup's cpustat to
fix this.
Fixes: ef12fefabf ("cpuacct: add per-cgroup utime/stime statistics")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com
syzbot reported that the warning in perf_sigtrap() fires, saying that
the event's task does not match current:
| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| Modules linked in:
| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| ...
| Call Trace:
| <IRQ>
| irq_work_single+0x106/0x220 kernel/irq_work.c:211
| irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
| irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
| __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
| sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
| </IRQ>
| <TASK>
| asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
| ...
| coredump_task_exit kernel/exit.c:371 [inline]
| do_exit+0x1865/0x25c0 kernel/exit.c:771
| do_group_exit+0xe7/0x290 kernel/exit.c:929
| get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
| arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
| handle_signal_work kernel/entry/common.c:148 [inline]
| exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
| exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
| __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
| syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
| do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
| entry_SYSCALL_64_after_hwframe+0x44/0xae
This shouldn't happen on x86, which has arch_irq_work_raise().
The test program sets up a perf event with sigtrap set to fire on the
'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().
This happened because the 'sched_wakeup' tracepoint also takes a task
argument passed on to perf_tp_event(), which is used to deliver the
event to that other task.
Since we cannot deliver synchronous signals to other tasks, skip an event if
perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
set, which will avoid ever entering perf_sigtrap() for such events.
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
We found that a process with 10 thousand threads hit a regression
going from Linux-v4.14 to Linux-v5.4. It is a kind of workload
which sometimes concurrently allocates lots of memory in different
threads. In this case, we see down_read_trylock() show up as a
high hotspot. Therefore, we suppose that rwsem has had a regression
at least since Linux-v5.4. In order to easily debug this problem, we
wrote a simple benchmark to create a similar situation, like the
following.
```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <pthread.h>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

volatile int mutex;

void trigger(int cpu, char* ptr, std::size_t sz)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

	while (mutex);

	for (std::size_t i = 0; i < sz; i += 4096) {
		*ptr = '\0';
		ptr += 4096;
	}
}

int main(int argc, char* argv[])
{
	std::size_t sz = 100;

	if (argc > 1)
		sz = atoi(argv[1]);

	auto nproc = std::thread::hardware_concurrency();
	std::vector<std::thread> thr;
	sz <<= 30;
	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
	assert(ptr != MAP_FAILED);
	char* cptr = static_cast<char*>(ptr);
	auto run = sz / nproc;
	run = (run >> 12) << 12;
	mutex = 1;

	for (auto i = 0U; i < nproc; ++i) {
		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
		cptr += run;
	}

	rusage usage_start;
	getrusage(RUSAGE_SELF, &usage_start);
	auto start = std::chrono::system_clock::now();
	mutex = 0;

	for (auto& t : thr)
		t.join();

	rusage usage_end;
	getrusage(RUSAGE_SELF, &usage_end);
	auto end = std::chrono::system_clock::now();
	timeval utime;
	timeval stime;
	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
	printf("real: %lu\n",
	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
	       start).count());

	return 0;
}
```
The above program simply creates `nproc` threads, each of which tries
to touch memory (trigger page faults) on a different CPU. We then see a
profile similar to the following in `perf top`.
25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
8.61% [kernel] [k] clear_page_erms
3.89% [kernel] [k] __do_page_fault
The hottest instruction in down_read_trylock(), accounting for about
92%, is the cmpxchg shown below.
91.89 │ lock cmpxchg %rdx,(%rdi)
Since the problem was found by migrating from Linux-v4.14 to Linux-v5.4,
we easily found that commit ddb20d1d3a ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But that is not always
true for the mmap lock, which can be contended by thousands of threads.
As a result, most threads need to run "cmpxchg" at least twice to
acquire the lock. Atomic operations have higher overhead than
non-atomic instructions, which caused the regression.
By using the above benchmark, the real execution times on an x86-64 system
before and after the patch were:

                    Before Patch    After Patch
   # of Threads         real            real       reduced by
   ------------       --------        --------     ----------
         1             65,373          65,206        ~0.0%
         4             15,467          15,378        ~0.5%
        40              6,214           5,528       ~11.0%
For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended the lock, the larger the speedup.
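A simplified userspace rendition of the two trylock flavors with C11 atomics;
READER_BIAS and WRITER_MASK are stand-ins and do not reflect the kernel's
actual count layout:
```c
#include <stdatomic.h>
#include <stdbool.h>

#define READER_BIAS	(1L << 8)
#define WRITER_MASK	0x7L	/* writer/waiter/handoff bits, simplified */

/* ddb20d1d3a style: optimistically start the cmpxchg from 0.  With many
 * concurrent readers the first cmpxchg nearly always fails, so the
 * contended case pays for at least two atomic operations. */
static bool trylock_assume_zero(_Atomic long *cnt)
{
	long c = 0;

	do {
		if (atomic_compare_exchange_weak(cnt, &c, c + READER_BIAS))
			return true;
	} while (!(c & WRITER_MASK));
	return false;
}

/* Patched style: read the current count with a plain load first, so the
 * common contended-by-readers case usually succeeds with one cmpxchg. */
static bool trylock_read_first(_Atomic long *cnt)
{
	long c = atomic_load_explicit(cnt, memory_order_relaxed);

	while (!(c & WRITER_MASK)) {
		if (atomic_compare_exchange_weak(cnt, &c, c + READER_BIAS))
			return true;
		/* on failure, c holds the fresh value; loop and retry */
	}
	return false;
}

int main(void)
{
	_Atomic long count = 0;

	return !(trylock_assume_zero(&count) && trylock_read_first(&count));
}
```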
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
There are some inconsistencies in the way that the handoff bit is
handled for readers and writers that can lead to a race condition.
Firstly, when a queue head writer sets the handoff bit, it will clear
it when the writer is killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.
Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially leave the handoff bit set with an empty wait queue.
Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation; this leads to a race condition where RWSEM_FLAG_HANDOFF
can be subtracted twice, corrupting ->count.
To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.
The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty, to avoid a possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.
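The extracted helper might look roughly like this (a simplified sketch based
on the description above):
```c
/* Sketch: waiter/handoff bits are cleared atomically, and only when the
 * wait list becomes empty, under the wait_lock. */
static void rwsem_del_waiter(struct rw_semaphore *sem,
			     struct rwsem_waiter *waiter)
{
	lockdep_assert_held(&sem->wait_lock);
	list_del(&waiter->list);

	if (likely(!list_empty(&sem->wait_list)))
		return;

	atomic_long_andnot(RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS,
			   &sem->count);
}
```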
While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.
Fixes: 4f23dbc1e6 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
The security_task_getsecid_subj() LSM hook invites misuse by allowing
callers to specify a task even though the hook is only safe when the
current task is referenced. Fix this by removing the task_struct
argument to the hook, requiring LSM implementations to use the
current task. While we are changing the hook declaration we also
rename the function to security_current_getsecid_subj() in an effort
to reinforce that the hook captures the subjective credentials of the
current task and not an arbitrary task on the system.
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Merge tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix double free in destroy_hist_field
- Harden memset() of trace_iterator structure
- Do not warn in trace printk check when test buffer fills up
* tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't use out-of-sync va_list in event printing
tracing: Use memset_startat() to zero struct trace_iterator
tracing/histogram: Fix UAF in destroy_hist_field()
Pull exit-vs-signal handling fixes from Eric Biederman:
"This is a small set of changes where debuggers were no longer able to
intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit
cleanups.
This is essentially the change you suggested, with all of the i's dotted
and the t's crossed, so that ptrace can intercept all of the cases it
has been able to intercept in the past, and all of the cases that made it
to exit without giving ptrace a chance still don't give ptrace a
chance"
* 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Replace force_fatal_sig with force_exit_sig when in doubt
signal: Don't always set SA_IMMUTABLE for forced signals
Recently, SA_IMMUTABLE was added to prevent SECCOMP_RET_KILL and similar
signals from being changed before they are delivered.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future it should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Recently, SA_IMMUTABLE was added to prevent SECCOMP_RET_KILL and similar
signals from being changed before they are delivered.
Unfortunately this broke debuggers[1][2] which reasonably expect to be
able to trap synchronous SIGTRAP and SIGSEGV even when the target
process is not configured to handle those signals.
Update force_sig_to_task to support both the case when we can allow
the debugger to intercept and possibly ignore the signal and the case
when it is not safe to let userspace know about the signal until the
process has exited.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: stable@vger.kernel.org
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Link: https://lkml.kernel.org/r/877dd5qfw5.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If trace_seq becomes full, trace_seq_vprintf() no longer consumes
arguments from va_list, making va_list out of sync with format
processing by trace_check_vprintf().
This causes va_arg() in trace_check_vprintf() to return the wrong
positional argument, which results in a WARN_ON_ONCE() hit.
ftrace_stress_test from LTP triggers this situation.
Fix it by explicitly avoiding further use of the va_list at the point
when its consistency can no longer be guaranteed.
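The failure mode can be reproduced in plain C; this is a contrived sketch,
not the tracing code:
```c
#include <stdarg.h>
#include <stdio.h>

/* If the consumer stops pulling arguments (here: because its buffer is
 * "full") while the caller keeps walking the format string, a later
 * va_arg() returns the wrong positional argument. */
static void consume(int buffer_full, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	if (!buffer_full)
		(void)va_arg(ap, int);	/* consume the first argument */
	/* Caller logic now assumes the first argument was consumed: */
	printf("second argument: %d\n", va_arg(ap, int));
	va_end(ap);
}

int main(void)
{
	consume(0, "%d %d", 1, 2);	/* in sync: prints 2 */
	consume(1, "%d %d", 1, 2);	/* out of sync: prints 1 */
	return 0;
}
```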
Link: https://lkml.kernel.org/r/20211118145516.13219-1-nikita.yushchenko@virtuozzo.com
Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() to avoid confusing memset() about writing beyond
the target struct member.
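memset_startat() itself lives in include/linux/string.h; here is a simplified
userspace rendition of the idea (assuming GNU C typeof), with struct iter as
a stand-in for trace_iterator:
```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Zero everything from 'member' to the end of *obj, making the
 * cross-field write explicit instead of an over-long memset(). */
#define memset_startat(obj, v, member)					\
	memset((char *)(obj) + offsetof(typeof(*(obj)), member), (v),	\
	       sizeof(*(obj)) - offsetof(typeof(*(obj)), member))

struct iter {
	void *private;	/* preserved across the reset */
	int seq;	/* zeroed from here to the end of the struct */
	long pos;
};

int main(void)
{
	struct iter it = { .private = &it, .seq = 42, .pos = 7 };

	memset_startat(&it, 0, seq);
	printf("private=%p seq=%d pos=%ld\n", it.private, it.seq, it.pos);
	return 0;
}
```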
Link: https://lkml.kernel.org/r/20211118202217.1285588-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, mac80211.
Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing
progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback, fix
missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to
ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks"
* tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits)
ptp: ocp: Fix a couple NULL vs IS_ERR() checks
net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()
net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound
ipv6: check return value of ipv6_skip_exthdr
e100: fix device suspend/resume
devlink: Don't throw an error if flash notification sent before devlink visible
page_pool: Revert "page_pool: disable dma mapping support..."
ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port()
octeontx2-af: debugfs: don't corrupt user memory
NFC: add NCI_UNREG flag to eliminate the race
NFC: reorder the logic in nfc_{un,}register_device
NFC: reorganize the functions in nci_request
tipc: check for null after calling kmemdup
i40e: Fix display error code in dmesg
i40e: Fix creation of first queue by omitting it if is not power of two
i40e: Fix warning message and call stack during rmmod i40e driver
i40e: Fix ping is lost after configuring ADq on VF
i40e: Fix changing previously set num_queue_pairs for PFs
i40e: Fix NULL ptr dereference on VSI filter sync
i40e: Fix correct max_pkt_size on VF RX queue
...
Calling destroy_hist_field() on an expression will recursively free
any operands associated with the expression. If during expression
parsing the operands of the expression are already set when an error
is encountered, there is no need to explicitly free the operands. Doing
so will result in destroy_hist_field() being called twice for the
operands and lead to a use-after-free (UAF) error.
If the operands are associated with the expression, only call
destroy_hist_field() on the expression since the operands will be
recursively freed.
Link: https://lore.kernel.org/all/CAHk-=wgcrEbFgkw9720H3tW-AhHOoEKhYwZinYJw4FpzSaJ6_Q@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211118011542.1420131-1-kaleshsingh@google.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 8b5d46fd7a ("tracing/histogram: Optimize division by constants")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fixes from Petr Mladek:
- Try to flush backtraces from other CPUs also on the local one. This
was a regression caused by printk_safe buffers removal.
- Remove header dependency warning.
* tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: Remove printk.h inclusion in percpu.h
printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces
Kyle Huey <me@kylehuey.com> writes:
> rr, a userspace record and replay debugger[0], uses the recorded register
> state at PTRACE_EVENT_EXIT to find the point in time at which to cease
> executing the program during replay.
>
> If a SIGKILL races with processing another signal in get_signal, it is
> possible for the kernel to decline to notify the tracer of the original
> signal. But if the original signal had a handler, the kernel proceeds
> with setting up a signal handler frame as if the tracer had chosen to
> deliver the signal unmodified to the tracee. When the kernel goes to
> execute the signal handler that it has now modified the stack and registers
> for, it will discover the pending SIGKILL, and terminate the tracee
> without executing the handler. When PTRACE_EVENT_EXIT is delivered to
> the tracer, however, the effects of handler setup will be visible to
> the tracer.
>
> Because rr (the tracer) was never notified of the signal, it is not aware
> that a signal handler frame was set up and expects the state of the program
> at PTRACE_EVENT_EXIT to be a state that will be reconstructed naturally
> by allowing the program to execute from the last event. When that fails
> to happen during replay, rr will assert and die.
>
> The following patches add an explicit check for a newly pending SIGKILL
> after the ptracer has been notified and the siglock has been reacquired.
> If this happens, we stop processing the current signal and proceed
> immediately to handling the SIGKILL. This makes the state reported at
> PTRACE_EVENT_EXIT the unmodified state of the program, and also avoids the
> work to set up a signal handler frame that will never be used.
>
> [0] https://rr-project.org/
The problem is that while the traced process makes it into ptrace_stop,
the tracee is killed before the tracer manages to wait for the
tracee and discover which signal was about to be delivered.
More generally, the problem is that while siglock was dropped, a signal
with process-wide effect was short-circuit delivered to the entire
process, killing it, but the process continues trying to deliver another
signal.
In general it is impossible to avoid all cases where work is performed
after the process has been killed. In particular if the process is
killed after get_signal returns the code will simply not know it has
been killed until after delivering the signal frame to userspace.
On the other hand when the code has already discovered the process
has been killed and taken user space visible action that shows
the kernel knows the process has been killed, it is just silly
to then write the signal frame to the user space stack.
Instead of being silly, detect that the process has been killed
in ptrace_signal and requeue the signal so the code can pretend
it was simply never dequeued for delivery.
To test whether the process has been killed, I use fatal_signal_pending
rather than signal_group_exit to match the test in signal_pending_state,
which is used in schedule(), which is where ptrace_stop detects that the
process has been killed.
Requeuing the signal so the code can pretend it was simply never
dequeued improves the user space visible behavior that has been
present since ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4").
Kyle Huey verified this change in behavior and confirmed that it makes rr happy.
Reported-by: Kyle Huey <khuey@kylehuey.com>
Reported-by: Marko Mäkelä <marko.makela@mariadb.com>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87tugcd5p2.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In the event that a tracer changes which signal needs to be delivered
and that signal is currently blocked then the signal needs to be
requeued for later delivery.
With the advent of CLONE_THREAD the kernel has 2 signal queues per
task. The per process queue and the per task queue. Update the code
so that if the signal is removed from the per process queue it is
requeued on the per process queue. This is necessary to make it
appear the signal was never dequeued.
The rr debugger reasonably believes that the state of the process from
the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated
by simply letting a process run. If a SIGKILL interrupts a ptrace_stop
this is not true today.
So return signals to their original queue in ptrace_signal so that
signals that are not delivered appear like they were never dequeued.
Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Recently while investigating a problem with rr and signals I noticed
that siglock is dropped in ptrace_signal and get_signal does not jump
to relock.
Looking further to see if the problem exists anywhere else, I see that
do_signal_stop also returns if signal_group_exit is true. I believe
that test can now never be true, but it is a bit hard to trace
through and be certain.
Testing signal_group_exit is not expensive, so move the test for
signal_group_exit into the for loop inside of get_signal to ensure
the test is never skipped improperly.
This has been a potential problem since the test for signal_group_exit
was added.
Fixes: 35634ffa17 ("signal: Always notice exiting tasks")
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/875yssekcd.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use static_call to optimize perf's guest callbacks on arm64 and x86,
which are now the only architectures that define the callbacks. Use
DEFINE_STATIC_CALL_RET0 as the default/NULL for all guest callbacks, as
the callback semantics are that a return value '0' means "not in guest".
static_call obviously avoids the overhead of CONFIG_RETPOLINE=y, but is
also advantageous versus other solutions, e.g. per-cpu callbacks, in that
a per-cpu memory load is not needed to detect the !guest case.
Based on code from Peter and Like.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-10-seanjc@google.com
Introduce GUEST_PERF_EVENTS and require architectures to select it to
allow registering and using guest callbacks in perf. This will hopefully
make it more difficult for new architectures to add useless "support" for
guest callbacks, e.g. via copy+paste.
Stubbing out the helpers has the happy bonus of avoiding a load of
perf_guest_cbs when GUEST_PERF_EVENTS=n on arm64/x86.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-9-seanjc@google.com
Drop the 'int' return value from the perf (un)register callbacks helpers
and stop pretending perf can support multiple callbacks. The 'int'
returns are not future proofing anything as none of the callers take
action on an error. It's also not obvious that there will ever be
co-tenant hypervisors, and if there are, that allowing multiple callbacks
to be registered is desirable or even correct.
Opportunistically rename callbacks=>cbs in the affected declarations to
match their definitions.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-5-seanjc@google.com
Protect perf_guest_cbs with RCU to fix multiple possible errors. Luckily,
all paths that read perf_guest_cbs already require RCU protection, e.g. to
protect the callback chains, so only the direct perf_guest_cbs touchpoints
need to be modified.
Bug #1 is a simple lack of WRITE_ONCE/READ_ONCE behavior to ensure
perf_guest_cbs isn't reloaded between a !NULL check and a dereference.
Fixed via the READ_ONCE() in rcu_dereference().
Bug #2 is that on weakly-ordered architectures, updates to the callbacks
themselves are not guaranteed to be visible before the pointer is made
visible to readers. Fixed by the smp_store_release() in
rcu_assign_pointer() when the new pointer is non-NULL.
Bug #3 is that, because the callbacks are global, it's possible for
readers to run in parallel with an unregister, and thus a module
implementing the callbacks can be unloaded while readers are in flight,
resulting in a use-after-free. Fixed by a synchronize_rcu() call when
unregistering callbacks.
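In sketch form, the register/unregister side looks roughly like this
(simplified fragment; on the read side, rcu_read_lock() pairs with
rcu_dereference(perf_guest_cbs)):
```c
static struct perf_guest_info_callbacks __rcu *perf_guest_cbs;

void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	if (WARN_ON_ONCE(rcu_access_pointer(perf_guest_cbs)))
		return;

	rcu_assign_pointer(perf_guest_cbs, cbs);  /* release-ordered publish */
}

void perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	if (WARN_ON_ONCE(rcu_access_pointer(perf_guest_cbs) != cbs))
		return;

	rcu_assign_pointer(perf_guest_cbs, NULL);
	synchronize_rcu();	/* readers finish before the module can vanish */
}
```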
Bug #1 escaped notice because it's extremely unlikely a compiler will
reload perf_guest_cbs in this sequence. perf_guest_cbs does get reloaded
for future derefs, e.g. for ->is_user_mode(), but the ->is_in_guest()
guard all but guarantees the consumer will win the race, e.g. to nullify
perf_guest_cbs, KVM has to completely exit the guest and tear down
all VMs before KVM starts its module unload / unregister sequence. This
also makes it all but impossible to encounter bug #3.
Bug #2 has not been a problem because all architectures that register
callbacks are strongly ordered and/or have a static set of callbacks.
But with help, unloading kvm_intel can trigger bug #1, e.g. wrapping
perf_guest_cbs with READ_ONCE in perf_misc_flags() while spamming
kvm_intel module load/unload leads to:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:perf_misc_flags+0x1c/0x70
Call Trace:
perf_prepare_sample+0x53/0x6b0
perf_event_output_forward+0x67/0x160
__perf_event_overflow+0x52/0xf0
handle_pmi_common+0x207/0x300
intel_pmu_handle_irq+0xcf/0x410
perf_event_nmi_handler+0x28/0x50
nmi_handle+0xc7/0x260
default_do_nmi+0x6b/0x170
exc_nmi+0x103/0x130
asm_exc_nmi+0x76/0xbf
Fixes: 39447b386c ("perf: Enhance perf to allow for guest statistic collection from host")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211111020738.2512932-2-seanjc@google.com
We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.
A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.
The code is confused about memstall && running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.
case PSI_MEM_FULL:
- return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+ return unlikely(tasks[NR_MEMSTALL] &&
+ tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.
Signed-off-by: Brian Chen <brianchen118@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
Adds accounting for "forced idle" time, which is time where a cookie'd
task forces its SMT sibling to idle, despite the presence of runnable
tasks.
Forced idle time is one means to measure the cost of enabling core
scheduling (i.e. the capacity lost due to the need to force idle).
Forced idle time is attributed to the thread responsible for causing
the forced idle.
A few details:
- Forced idle time is displayed via /proc/PID/sched. It also requires
that schedstats is enabled.
- Forced idle is only accounted when a sibling hyperthread is held
idle despite the presence of runnable tasks. No time is charged if
a sibling is idle but has no runnable tasks.
- Tasks with 0 cookie are never charged forced idle.
- For SMT > 2, we scale the amount of forced idle charged based on the
number of forced idle siblings. Additionally, we split the time up and
evenly charge it to all running tasks, as each is equally responsible
for the forced idle.
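The SMT > 2 split described in the last point might look like the following
sketch, with assumed arithmetic rather than the exact upstream code:
```c
/* Scale the forced-idle delta by the number of idle siblings and split
 * it evenly across the running tasks that are jointly responsible. */
static inline u64 forceidle_charge_per_task(u64 delta,
					    unsigned int nr_forced_idle,
					    unsigned int nr_running)
{
	if (!nr_forced_idle || !nr_running)
		return 0;

	return delta * nr_forced_idle / nr_running;
}
```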
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com
Add the missing SPDX license header to
include/linux/psi.h
include/linux/psi_types.h
kernel/sched/psi.c
Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/1635133586-84611-2-git-send-email-liuxp11@chinatelecom.cn
In a comment in the function psi_task_switch(), the same line appears
twice:
...
* runtime state, the cgroup that contains both tasks
* runtime state, the cgroup that contains both tasks
...
Remove the duplicated line.
Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/1635133586-84611-1-git-send-email-liuxp11@chinatelecom.cn
mutex_acquire_nest() expects a pointer, pass the pointer.
Fixes: 12235da8c8 ("kernel/locking: Add context to ww_mutex_trylock()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211104122706.frk52zxbjorso2kv@linutronix.de
Daniel Borkmann says:
====================
pull-request: bpf 2021-11-16
We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).
The main changes are:
1) Fix pruning regression where the verifier became overly conservative,
rejecting previously accepted programs, from Alexei Starovoitov and Lorenz Bauer.
2) Fix verifier TOCTOU bug when using read-only map's values as constant
scalars during verification, from Daniel Borkmann.
3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.
4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.
5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
programs due to deadlock possibilities, from Dmitrii Banshchikov.
6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.
7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.
8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
udp: Validate checksum in udp_read_sock()
bpf: Fix toctou on read-only map's constant scalar tracking
samples/bpf: Fix build error due to -isystem removal
selftests/bpf: Add tests for restricted helpers
bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
libbpf: Perform map fd cleanup for gen_loader in case of error
samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
tools/runqslower: Fix cross-build
samples/bpf: Fix summary per-sec stats in xdp_sample_user
selftests/bpf: Check map in map pruning
bpf: Fix inner map state pruning regression.
xsk: Fix crash on double free in buffer pool
====================
Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the current code, the actual max tail call count is 33 which is greater
than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
We can see the historical evolution from commit 04fd61ab36 ("bpf: allow
bpf programs to tail-call other bpf programs") and commit f9dabe016b
("bpf: Undo off-by-one in interpreter tail call count limit"). In order
to avoid changing existing behavior, the actual limit is 33 now, this is
reasonable.
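As a hedged illustration of the arithmetic (not the interpreter's exact
code), the old '>' test against 32 and the new '>=' test against 33 permit
the same 33 tail calls, but the new constant reflects it honestly:
```c
#include <stdio.h>

int main(void)
{
	int cnt, calls;

	/* Old scheme: MAX_TAIL_CALL_CNT == 32, check uses '>'. */
	for (cnt = 0, calls = 0; !(cnt > 32); cnt++)
		calls++;
	printf("old limit allows %d tail calls\n", calls);	/* 33 */

	/* New scheme: MAX_TAIL_CALL_CNT == 33, check uses '>='. */
	for (cnt = 0, calls = 0; !(cnt >= 33); cnt++)
		calls++;
	printf("new limit allows %d tail calls\n", calls);	/* 33 */
	return 0;
}
```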
After commit 874be05f52 ("bpf, tests: Add tail call test suite"), we can
see that a failing testcase exists.
On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
On some archs:
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
Although the above failed testcase has been fixed in commit 18935a72eb
("bpf/tests: Fix error in tail call limit tests"), it would still be good
to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
more readable.
The 32-bit x86 JIT was using a limit of 32; fix the wrong comments and the
limit to 33 tail calls to match the updated MAX_TAIL_CALL_CNT constant. For
the mips64 JIT, use "ori" instead of "addiu" as suggested by Johan Almbladh.
For the riscv JIT, use RV_REG_TCC directly to save one register move as
suggested by Björn Töpel. For the other implementations there are no
functional changes: the current limit of 33 does not change, the new value
of MAX_TAIL_CALL_CNT now reflects the actual max tail call count, and the
related tail call testcases in the test_bpf module and selftests work well
for both the interpreter and the JIT.
Here are the test results on x86_64:
# uname -m
x86_64
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
# rmmod test_bpf
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
# rmmod test_bpf
# ./test_progs -t tailcalls
#142 tailcalls:OK
Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
Commit a23740ec43 ("bpf: Track contents of read-only maps as scalars") is
checking whether maps are read-only both from BPF program side and user space
side, and then, given their content is constant, reading out their data via
map->ops->map_direct_value_addr() which is then subsequently used as known
scalar value for the register, that is, it is marked as __mark_reg_known()
with the read value at verification time. Before a23740ec43, the register
content was marked as an unknown scalar so the verifier could not make any
assumptions about the map content.
The current implementation however is prone to a TOCTOU race, meaning, the
value read as known scalar for the register is not guaranteed to be exactly
the same at a later point when the program is executed, and as such, the
prior made assumptions of the verifier with regards to the program will be
invalid which can cause issues such as OOB access, etc.
While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false for the map given the map value needs to be populated, e.g. for global
data sections. Once complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. For the rest
of the lifetime of the map, this freeze one-time trigger cannot be undone
anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
cmd calls which would update/delete map entries will be rejected with -EPERM
since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
means that pending update/delete map entries must still complete before this
guarantee is given. This corner case is not an issue for loaders since they
create and prepare such program private map in successive steps.
However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
used to expand the competition interval, so that map_update_elem() can modify
the contents of the map after map_freeze() and bpf_prog_load() were executed.
This works, because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the time where we copy key/value from the user buffer and
this already passed the FMODE_CAN_WRITE capability test given at that time the
map was not "frozen". Then, the main thread performs the map_freeze() and
bpf_prog_load(), and once that had completed successfully, the other thread
is woken up to complete the pending map_update_elem() which then changes the
map content. For ii) the idea of the batched update is similar, meaning, when
there are a large number of updates to be processed, it can increase the
competition interval between the two. It is therefore possible in practice to
modify the contents of the map after executing map_freeze() and bpf_prog_load().
One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc9702273e ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2 ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
last pending writes have completed via bpf_map_write_active() test. Once the
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
only then we are really guaranteed to use the map's data as known constants.
For map->frozen being set and pending writes in process of still being completed
we fall back to marking that register as unknown scalar so we don't end up
making assumptions about it. With this, both TOCTOU reproducers from i) and
ii) are fixed.
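In sketch form (helper names as in the fix; struct layout and locking elided):
```c
/* Updaters bump the counter for the whole duration of an update/delete
 * (including batched flavors), pairing inc on entry with dec on exit. */
static void bpf_map_write_active_inc(struct bpf_map *map)
{
	atomic64_inc(&map->writecnt);
}

static void bpf_map_write_active_dec(struct bpf_map *map)
{
	atomic64_dec(&map->writecnt);
}

static bool bpf_map_write_active(const struct bpf_map *map)
{
	return atomic64_read(&map->writecnt) != 0;
}

/* Verifier side: a frozen map's values count as known constants only
 * once no writes are still in flight; otherwise fall back to treating
 * the register as an unknown scalar. */
static bool bpf_map_is_rdonly(const struct bpf_map *map)
{
	return (map->map_flags & BPF_F_RDONLY_PROG) &&
	       map->frozen && !bpf_map_write_active(map);
}
```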
Note that the map->writecnt has been converted into an atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations in syscall side
would not be possible due to then causing everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fba1 ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.
Fixes: a23740ec43 ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is a race between updaters and flushers (a flush can possibly miss
the latest update(s)). This is expected, as explained in the
cgroup_rstat_updated() comment; also add a machine-readable annotation so
that KCSAN results aren't noisy.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/r/CACkBjsbPVdkub=e-E-p1WBOLxS515ith-53SFdmFHWV_QMo40w@mail.gmail.com
Suggested-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-11-15
We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).
The main changes are:
1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.
2) Big batch of libbpf improvements including various fixes, future proofing APIs,
and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.
3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.
4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
ensure exception handling still works, from Russell King and Alan Maguire.
5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
file such as shared library, from Song Liu.
6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
updating documentation and bash-completion among others, from Quentin Monnet.
7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.
8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.
9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpftool: Use libbpf_get_error() to check error
bpftool: Fix mixed indentation in documentation
bpftool: Update the lists of names for maps and prog-attach types
bpftool: Fix indent in option lists in the documentation
bpftool: Remove inclusion of utilities.mak from Makefiles
bpftool: Fix memory leak in prog_dump()
selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
selftests/bpf: Fix an unused-but-set-variable compiler warning
bpf: Introduce btf_tracing_ids
bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
bpftool: Enable libbpf's strict mode by default
docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
selftests/bpf: Clarify llvm dependency with btf_tag selftest
selftests/bpf: Add a C test for btf_type_tag
selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
selftests/bpf: Test libbpf API function btf__add_type_tag()
bpftool: Support BTF_KIND_TYPE_TAG
libbpf: Support BTF_KIND_TYPE_TAG
...
====================
Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'trace-v5.16-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Update to tracing histogram variable string copy
A fix to only copy the size of the field to the histogram string did
not take into account that the size can be larger than the storage"
* tag 'trace-v5.16-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add length protection to histogram string copies
The string copies to the histogram storage have a max size of 256 bytes
(defined by MAX_FILTER_STR_VAL). Only the string size of the event field
needs to be copied to the event storage, but never more than the event
storage can hold. Although nothing should be bigger than 256 bytes, there's
no protection against overwriting the storage if one day something is.
Copy no more than the destination size, and enforce it.
Also turn MAX_FILTER_STR_VAL into an unsigned int, so that the min()
comparison of the string sizes is between comparable types.
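A minimal sketch of the bounded copy, assuming both sizes are unsigned so
min() compares like types; the function name is illustrative:

#include <linux/minmax.h>
#include <linux/string.h>

static void hist_copy_string(char *dst, unsigned int dst_size,
                             const char *src, unsigned int field_size)
{
    /* Copy at most the event field size, but never more than the
     * destination storage can hold. */
    unsigned int n = min(dst_size, field_size);

    strscpy(dst, src, n);    /* always NUL-terminates within n bytes */
}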
Link: https://lore.kernel.org/all/CAHk-=wjREUihCGrtRBwfX47y_KrLCGjiq3t6QtoNJpmVrAEb1w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211114132834.183429a4@rorschach.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 63f84ae6b8 ("tracing/histogram: Do not copy the fixed-size char array field over the field size")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix for POSIX CPU timers to address a problem where POSIX CPU
timer delivery stops working for a new child task because
copy_process() copies state information which is only valid for the
parent task"
* tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
Merge tag 'irq-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for the interrupt subsystem
Core code:
- A regression fix for the Open Firmware interrupt mapping code where
a interrupt controller property in a node caused a map property in
the same node to be ignored.
Interrupt chip drivers:
- Workaround a limitation in SiFive PLIC interrupt chip which
silently ignores an EOI when the interrupt line is masked.
- Provide the missing mask/unmask implementation for the CSKY MP
interrupt controller.
PCI/MSI:
- Prevent a use after free when PCI/MSI interrupts are released by
destroying the sysfs entries before freeing the memory which is
accessed in the sysfs show() function.
- Implement a mask quirk for the Nvidia ION AHCI chip which does not
advertise masking capability despite implementing it. Even worse
the chip comes out of reset with all MSI entries masked, which due
to the missing masking capability never get unmasked.
- Move the check which prevents accessing the MSI[X] masking for XEN
back into the low level accessors. The recent consolidation missed
that these accessors can be invoked from places which do not have
that check, which broke XEN. Move them back to the original place
instead of sprinkling tons of these checks all over the code"
* tag 'irq-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
of/irq: Don't ignore interrupt-controller when interrupt-map failed
irqchip/sifive-plic: Fixup EOI failed when masked
irqchip/csky-mpintc: Fixup mask/unmask implementation
PCI/MSI: Destroy sysfs before freeing entries
PCI: Add MSI masking quirk for Nvidia ION AHCI
PCI/MSI: Deal with devices lying about their MSI mask capability
PCI/MSI: Move non-mask check back into low level accessors
Merge tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Avoid touching ~100 config files in order to be able to select the
preemption model
- clear cluster CPU masks too, on the CPU unplug path
- prevent use-after-free in cfs
- Prevent a race condition when updating CPU cache domains
- Factor out common shared part of smp_prepare_cpus() into a common
helper which can be called by both baremetal and Xen, in order to fix
booting of Xen PV guests
* tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt: Restore preemption model selection configs
arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology()
sched/fair: Prevent dead task groups from regaining cfs_rq's
sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
x86/smp: Factor out parts of native_smp_prepare_cpus()
Merge tag 'perf_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Prevent unintentional page sharing by checking whether a page
reference to a PMU samples page has been acquired properly before
that
- Make sure the LBR_SELECT MSR is saved/restored too
- Reset the LBR_SELECT MSR when resetting the LBR PMU to clear any
residual data left
* tag 'perf_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Avoid put_page() when GUP fails
perf/x86/vlbr: Add c->flags to vlbr event constraints
perf/x86/lbr: Reset LBR_SELECT during vlbr reset
Merge tag 'trace-v5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three tracing fixes:
- Make local osnoise_instances static
- Copy just actual size of histogram strings
- Properly check missing operands in histogram expressions"
* tag 'trace-v5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/histogram: Fix check for missing operands in an expression
tracing/histogram: Do not copy the fixed-size char array field over the field size
tracing/osnoise: Make osnoise_instances static
If a binary operation is detected while parsing an expression string,
the operand strings are deduced by splitting the expression string at
the position of the detected binary operator. Both operand strings are
sub-strings (possibly empty) of the expression string, but they will
never be NULL.
Currently a NULL check is used to detect missing operands; fix this by
checking for empty strings instead.
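A minimal sketch of the corrected check, under the assumption stated
above that operands may be empty but are never NULL:

/* Operands come from splitting the expression at the operator, so they
 * are always valid pointers; a missing operand shows up as "". */
static bool operand_missing(const char *operand)
{
    return operand[0] == '\0';    /* was: operand == NULL, never true */
}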
Link: https://lkml.kernel.org/r/20211112191324.1302505-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9710b2f341 ("tracing: Fix operator precedence for hist triggers expression")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Do not copy the fixed-size char array field of the events over
the field size. The histogram treats a char array as a string, and
there are two types of char array in an event: fixed-size and
dynamic string. The dynamic string (__data_loc) field must be
null terminated, but the fixed-size char array field may not
be (it is not a string, just data).
In that case, the histogram can copy data past the end of the field.
Use the original field size for fixed-size char array fields to keep
the histogram from reading beyond the field.
Link: https://lkml.kernel.org/r/163673292822.195747.3696966210526410250.stgit@devnote2
Fixes: 02205a6752 (tracing: Add support for 'field variables')
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull input updates from Dmitry Torokhov:
"Just one new driver (Cypress StreetFighter touchkey), and no input
core changes this time.
Plus various fixes and enhancements to existing drivers"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
Input: iforce - fix control-message timeout
Input: wacom_i2c - use macros for the bit masks
Input: ili210x - reduce sample period to 15ms
Input: ili210x - improve polled sample spacing
Input: ili210x - special case ili251x sample read out
Input: elantench - fix misreporting trackpoint coordinates
Input: synaptics-rmi4 - Fix device hierarchy
Input: i8042 - Add quirk for Fujitsu Lifebook T725
Input: cap11xx - add support for cap1206
Input: remove unused header <linux/input/cy8ctmg110_pdata.h>
Input: ili210x - add ili251x firmware update support
Input: ili210x - export ili251x version details via sysfs
Input: ili210x - use resolution from ili251x firmware
Input: pm8941-pwrkey - respect reboot_mode for warm reset
reboot: export symbol 'reboot_mode'
Input: max77693-haptic - drop unneeded MODULE_ALIAS
Input: cpcap-pwrbutton - do not set input parent explicitly
Input: max8925_onkey - don't mark comment as kernel-doc
Input: ads7846 - do not attempt IRQ workaround when deferring probe
Input: ads7846 - use input_set_capability()
...
Similar to btf_sock_ids, btf_tracing_ids provides BTF IDs for task_struct,
file, and vm_area_struct via an easy-to-understand format like
btf_tracing_ids[BTF_TRACING_TYPE_[TASK|FILE|VMA]].
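A minimal usage sketch; the array and enum names follow the text above,
while the helper itself is illustrative:

#include <linux/btf_ids.h>

/* Compare a candidate BTF ID against the cached task_struct ID. */
static bool btf_id_is_task(u32 btf_id)
{
    return btf_id == btf_tracing_ids[BTF_TRACING_TYPE_TASK];
}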
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211112150243.1270987-3-songliubraving@fb.com
Introduction of map_uid made two lookups from an outer map distinct.
That distinction is only necessary when the inner map has an embedded timer.
Otherwise it makes verifier state pruning conservative,
which causes complex programs to hit the 1M insn_processed limit.
Tighten the map_uid logic to apply only to inner maps with timers.
Fixes: 3e8ce29850 ("bpf: Prevent pointer mismatch in bpf_timer_init.")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/CACAyw99hVEJFoiBH_ZGyy=+oO-jyydoz6v1DeKPKs2HVsUH28w@mail.gmail.com
Link: https://lore.kernel.org/bpf/20211110172556.20754-1-alexei.starovoitov@gmail.com
LLVM patches ([1] for clang, [2] and [3] for the BPF backend)
added support for btf_type_tag attributes. This patch
adds the corresponding kernel support.
The main motivation for btf_type_tag is to bring kernel
annotations like __user and __rcu to BTF. With such information
available in BTF, the BPF verifier can detect misuse
and reject the program. For example, for a __user tagged pointer,
developers can then use a proper helper like bpf_probe_read_user()
to read the data.
BTF_KIND_TYPE_TAG may also be useful for other tracing
facilities: instead of requiring the user to specify whether an
address is a kernel or user address, the kernel can detect it
by itself from BTF.
[1] https://reviews.llvm.org/D111199
[2] https://reviews.llvm.org/D113222
[3] https://reviews.llvm.org/D113496
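A minimal sketch of how such a tag is spelled in source, assuming a
compiler with the cited LLVM support; the struct is illustrative:

#ifndef __has_attribute
# define __has_attribute(x) 0
#endif

#if __has_attribute(btf_type_tag)
# define __user __attribute__((btf_type_tag("user")))
#else
# define __user
#endif

struct example_ctx {
    int __user *uptr;    /* verifier can insist on bpf_probe_read_user() */
    int *kptr;           /* ordinary kernel pointer */
};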
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211112012609.1505032-1-yhs@fb.com
Merge tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"This contains initialization fixups, testing improvements, addition of
instruction pointer to data-race reports, and scoped data-race checks"
* tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: selftest: Cleanup and add missing __init
kcsan: Move ctx to start of argument list
kcsan: Support reporting scoped read-write access type
kcsan: Start stack trace with explicit location if provided
kcsan: Save instruction pointer for scoped accesses
kcsan: Add ability to pass instruction pointer of access to reporting
kcsan: test: Fix flaky test case
kcsan: test: Use kunit_skip() to skip tests
kcsan: test: Defer kcsan_test_init() after kunit initialization
Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different from the
tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the
workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit
operations to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs"
* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
selftests/net: udpgso_bench_rx: fix port argument
net: wwan: iosm: fix compilation warning
cxgb4: fix eeprom len when diagnostics not implemented
net: fix premature exit from NAPI state polling in napi_disable()
net/smc: fix sk_refcnt underflow on linkdown and fallback
net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
gve: fix unmatched u64_stats_update_end()
net: ethernet: lantiq_etop: Fix compilation error
selftests: forwarding: Fix packet matching in mirroring selftests
vsock: prevent unnecessary refcnt inc for nonblocking connect
net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
net: stmmac: allow a tc-taprio base-time of zero
selftests: net: test_vxlan_under_vrf: fix HV connectivity test
net: hns3: allow configure ETS bandwidth of all TCs
net: hns3: remove check VF uc mac exist when set by PF
net: hns3: fix some mac statistics is always 0 in device version V2
net: hns3: fix kernel crash when unload VF while it is being reset
net: hns3: sync rx ring head in echo common pull
net: hns3: fix pfc packet number incorrect after querying pfc parameters
...
PEBS PERF_SAMPLE_PHYS_ADDR events use perf_virt_to_phys() to convert PMU
sampled virtual addresses to physical using get_user_page_fast_only()
and page_to_phys().
Some get_user_page_fast_only() error cases return false, indicating no
page reference, but still initialize the output page pointer with an
unreferenced page. In these error cases perf_virt_to_phys() calls
put_page(). This causes page reference count underflow, which can lead
to unintentional page sharing.
Fix perf_virt_to_phys() to only put_page() if get_user_page_fast_only()
returns a referenced page.
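A minimal sketch of the corrected pattern; the wrapper name is
illustrative, the API calls are the ones named above:

#include <linux/mm.h>    /* get_user_page_fast_only(), put_page() */

static u64 sample_virt_to_phys(u64 virt)
{
    struct page *p = NULL;
    u64 phys_addr = 0;

    /* A page reference is held only when the call returns true,
     * so only then is put_page() legitimate. */
    if (get_user_page_fast_only(virt, 0, &p)) {
        phys_addr = page_to_phys(p) + virt % PAGE_SIZE;
        put_page(p);
    }
    return phys_addr;
}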
Fixes: fc7ce9c74c ("perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR")
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211111021814.757086-1-gthelen@google.com
Commit c597bfddc9 ("sched: Provide Kconfig support for default dynamic
preempt mode") changed the selectable config names for the preemption
model. This means a config file must now select
CONFIG_PREEMPT_BEHAVIOUR=y
rather than
CONFIG_PREEMPT=y
to get a preemptible kernel. This means all arch config files would need to
be updated - right now they'll all end up with the default
CONFIG_PREEMPT_NONE_BEHAVIOUR.
Rather than touch a good hundred of config files, restore usage of
CONFIG_PREEMPT{_NONE, _VOLUNTARY}. Make them configure:
o The build-time preemption model when !PREEMPT_DYNAMIC
o The default boot-time preemption model when PREEMPT_DYNAMIC
Add siblings of those configs with the _BUILD suffix to unconditionally
designate the build-time preemption model (PREEMPT_DYNAMIC is built with
the "highest" preemption model it supports, aka PREEMPT). Downstream
configs should by now all be depending / selected by CONFIG_PREEMPTION
rather than CONFIG_PREEMPT, so only a few sites need patching up.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20211110202448.4054153-2-valentin.schneider@arm.com
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we have
live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that could happen.
His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.
Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.
When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.
To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:
CPU1:                                        CPU2:
  :                                            timer IRQ:
  :                                              do_sched_cfs_period_timer():
  :                                                :
  :                                                distribute_cfs_runtime():
  :                                                  rcu_read_lock();
  :                                                  :
  :                                                  unthrottle_cfs_rq():
sched_offline_group():                                 :
  :                                                    walk_tg_tree_from(…,tg_unthrottle_up,…):
  list_del_rcu(&tg->list);                               :
(1)  :                                                   list_for_each_entry_rcu(child, &parent->children, siblings)
  :                                                        :
(2)  list_del_rcu(&tg->siblings);                          :
  :                                                        tg_unthrottle_up():
  unregister_fair_sched_group():                             struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
    :                                                        :
    list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);                   :
    :                                                        :
    :                                                        if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
(3)    :                                                       list_add_leaf_cfs_rq(cfs_rq);
  :                                                          :
(4)  :                                                       rcu_read_unlock();
CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.
To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.
On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.
This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.
[1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
[2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/
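A minimal sketch of that teardown ordering with illustrative names; the
actual patch wires this up through RCU callbacks rather than a single
synchronous function:

static void dying_tg_teardown(struct task_group *tg)
{
    sched_offline_group(tg);          /* list_del_rcu() tg from the tree */
    synchronize_rcu();                /* let walk_tg_tree_from() walkers drain */
    unregister_fair_sched_group(tg);  /* safe: no new walker sees tg */
    synchronize_rcu();                /* keep print_cfs_stats() readers safe */
    kfree(tg);
}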
Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Nothing protects the access to the per_cpu variable sd_llc_id. When testing
the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
update_top_cache_domain(). One scenario being:
CPU1                                  CPU2
==================================================================
per_cpu(sd_llc_id, CPUX) => 0
                                      partition_sched_domains_locked()
                                        detach_destroy_domains()
cpus_share_cache(CPUX, CPUX)            update_top_cache_domain(CPUX)
  per_cpu(sd_llc_id, CPUX) => 0
                                          per_cpu(sd_llc_id, CPUX) = CPUX
                                            per_cpu(sd_llc_id, CPUX) => CPUX
  return false
ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
is a warning triggered from ttwu_queue_wakelist().
Avoid such a race in cpus_share_cache() by always returning true when
this_cpu == that_cpu.
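A minimal sketch of the resulting check; apart from the early return the
body is simplified:

bool cpus_share_cache(int this_cpu, int that_cpu)
{
    /* Same CPU trivially shares cache with itself; don't read the
     * per_cpu data that update_top_cache_domain() may be rewriting. */
    if (this_cpu == that_cpu)
        return true;

    return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}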
Fixes: 518cd62341 ("sched: Only queue remote wakeups when crossing cache boundaries")
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com
The recent rework of PCI/MSI[X] masking moved the non-mask checks from the
low level accessors into the higher level mask/unmask functions.
This missed the fact that these accessors can be invoked from other places
as well. The missing checks break XEN-PV, which sets pci_msi_ignore_mask,
and also violate the virtual MSI-X and the msi_attrib.maskbit protections.
Instead of sprinkling checks all over the place, lift them back into the
low level accessor functions. To avoid checking three different conditions
combine them into one property of msi_desc::msi_attrib.
[ josef: Fixed the missed conversion in the core code ]
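A minimal sketch of the combined property, with the accessor body elided;
the field name follows the text above, details may differ in-tree:

static inline void pci_msi_update_mask_sketch(struct msi_desc *desc, u32 set)
{
    /* One precomputed flag folding XEN's pci_msi_ignore_mask, virtual
     * MSI-X, and msi_attrib.maskbit into a single check. */
    if (!desc->msi_attrib.can_mask)
        return;
    /* ... perform the config-space or MSI-X table write ... */
}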
Fixes: fcacdfbef5 ("PCI/MSI: Provide a new set of mask and unmask functions")
Reported-by: Josef Johansson <josef@oderland.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Josef Johansson <josef@oderland.se>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: stable@vger.kernel.org
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency was found in force_fatal_sig and
force_sig_seccomp, where ptrace or sigaction could prevent the delivery
of the signal. I have added a change introducing SA_IMMUTABLE, which
makes it impossible to interrupt the delivery of those signals and
allows backporting to fix force_sig_seccomp.
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
Merge tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull prctl updates from Christian Brauner:
"This contains the missing prctl uapi pieces for PR_SCHED_CORE.
In order to activate core scheduling the caller is expected to specify
the scope of the new core scheduling domain.
For example, passing 2 in the 4th argument of
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>, 2, 0);
would indicate that the new core scheduling domain encompasses all
tasks in the process group of <pid>. Specifying 0 would only create a
core scheduling domain for the thread identified by <pid>, and 1 would
encompass the whole thread-group of <pid>.
Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID,
and PIDTYPE_PGID. A first version tried to expose those values
directly to which I objected because:
- PIDTYPE_* is an enum that is kernel internal which we should not
expose to userspace directly.
- PIDTYPE_* indicates what a given struct pid is used for it doesn't
express a scope.
But what the 4th argument of PR_SCHED_CORE prctl() expresses is the
scope of the operation, i.e. the scope of the core scheduling domain
at creation time. So Eugene's patch now simply introduces three new
defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP,
and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what
happens.
This has been on the mailing list for quite a while with all relevant
scheduler folks Cced. I announced multiple times that I'd pick this up
if I didn't see or hear of anyone else doing it. None of this touches
proper scheduler code but only concerns uapi so I think this is fine.
With core scheduling being quite common now for vm managers (e.g.
moving individual vcpu threads into their own core scheduling domain)
and container managers (e.g. moving the init process into its own core
scheduling domain and letting all created children inherit it) having
to rely on raw numbers passed as the 4th argument in prctl() is a bit
annoying and everyone is starting to come up with their own defines"
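A minimal usage sketch with the new defines; values mirror the text
(0 = thread, 1 = thread-group, 2 = process-group), with fallback defines
guarded for older headers:

#include <sys/types.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE                      62
# define PR_SCHED_CORE_CREATE               1
# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP  2
#endif

/* Create one core-scheduling domain covering <pid>'s process group. */
static int core_sched_isolate_pgrp(pid_t pid)
{
    return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid,
                 PR_SCHED_CORE_SCOPE_PROCESS_GROUP, 0);
}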
* tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
"Various places in the kernel have picked up pidfds.
The two most recent additions have probably been the ability to use
pidfds in bpf maps and the usage of pidfds in mm-based syscalls such
as process_mrelease() and process_madvise().
The same pattern to turn a pidfd into a struct task exists in two
places. One of those places used PIDTYPE_TGID while the other one used
PIDTYPE_PID even though it is clearly documented in all pidfd-helpers
that pidfds __currently__ only refer to thread-group leaders (subject
to change in the future if need be).
This isn't a bug per se but has the potential to be one if we allow
pidfds to refer to individual threads. If that happens we want to
audit all codepaths that make use of them to ensure they can deal with
pidfds referring to individual threads.
This adds a simple helper to turn a pidfd into a struct task making it
easy to grep for such places. Plus, it gets rid of code-duplication"
* tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
mm: use pidfd_get_task()
pid: add pidfd_get_task() helper
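A minimal sketch of the helper's shape as the text implies it, plus a
caller pattern; the error-handling details are illustrative:

struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags);

static int operate_on_pidfd(int pidfd)
{
    unsigned int flags;
    struct task_struct *task = pidfd_get_task(pidfd, &flags);

    if (IS_ERR(task))
        return PTR_ERR(task);
    /* ... process_madvise()/process_mrelease()-style work on task ... */
    put_task_struct(task);
    return 0;
}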
We can't call unregister_ftrace_function under ftrace_lock.
Link: https://lkml.kernel.org/r/20211109114217.1645296-1-jolsa@kernel.org
Fixes: ed29271894 ("ftrace/direct: Do not disable when switching direct callers")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The resetting of the entire ring buffer used to simply go through and reset
each individual CPU buffer, each with its own protection and synchronization.
But this was very slow, due to performing a synchronization for each CPU.
The code was reshuffled to do one disabling of all CPU buffers, followed
by a single RCU synchronization, and then the resetting of each of the CPU
buffers. But unfortunately, the mutex that prevented multiple concurrent
resets of the buffer was not moved to the upper function, and nothing
protects against that.
Take the ring buffer mutex around the global reset.
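A simplified sketch of the fixed flow; the helper and field names follow
the ring buffer code, but details are elided:

void ring_buffer_reset_online_cpus(struct trace_buffer *buffer)
{
    int cpu;

    mutex_lock(&buffer->mutex);    /* was only taken per-CPU before */

    for_each_online_buffer_cpu(buffer, cpu)
        atomic_inc(&buffer->buffers[cpu]->resize_disabled);

    synchronize_rcu();             /* one synchronization for all CPUs */

    for_each_online_buffer_cpu(buffer, cpu)
        reset_disabled_cpu_buffer(buffer->buffers[cpu]);

    mutex_unlock(&buffer->mutex);
}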
Cc: stable@vger.kernel.org
Fixes: b23d7a5f4a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
printk from NMI context relies on irq work being raised on the local CPU
to print to console. This can be a problem if the NMI was raised by a
lockup detector to print lockup stack and regs, because the CPU may not
enable irqs (because it is locked up).
Introduce printk_trigger_flush(), which can be called from another CPU
to try to get those messages to the console; call it where
printk_safe_flush() was previously called.
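A minimal sketch of the new hook and an illustrative call site replacing
the old printk_safe_flush() call:

void printk_trigger_flush(void);

static void report_remote_lockup(int stalled_cpu)
{
    pr_emerg("CPU %d appears stuck; kicking console output\n", stalled_cpu);
    printk_trigger_flush();    /* raises irq_work from a live CPU */
}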
Fixes: 93d102f094 ("printk: remove safe buffers")
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211107045116.1754411-1-npiggin@gmail.com
Merge tag 'dma-mapping-5.16' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
"Just a small set of changes this time. The request dma_direct_alloc
cleanups are still under review and haven't made the cut.
Summary:
- convert sparc32 to the generic dma-direct code
- use bitmap_zalloc (Christophe JAILLET)"
* tag 'dma-mapping-5.16' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: use 'bitmap_zalloc()' when applicable
sparc32: use DMA_DIRECT_REMAP
sparc32: remove dma_make_coherent
sparc32: remove the call to dma_make_coherent in arch_dma_free
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible.
On the one hand, the virtio-mem driver adds/removes whole memory blocks,
creating/removing busy IORESOURCE_SYSTEM_RAM resources, on the other
hand, it logically (un)plugs memory inside added memory blocks,
dynamically either exposing them to the buddy or hiding them from the
buddy and marking them PG_offline.
In contrast to physical devices, like a DIMM, the virtio-mem driver is
required to actually make any of the device-provided memory usable,
because it performs the handshake with the hypervisor. virtio-mem
memory cannot simply be accessed via /dev/mem without a driver.
There is no safe way to:
a) Access plugged memory blocks via /dev/mem, as they might contain
unplugged holes or might get silently unplugged by the virtio-mem
driver and consequently turned inaccessible.
b) Access unplugged memory blocks via /dev/mem because the virtio-mem
driver is required to make them actually accessible first.
The virtio-spec states that unplugged memory blocks MUST NOT be written,
and only selected unplugged memory blocks MAY be read. We want to make
sure, this is the case in sane environments -- where the virtio-mem driver
was loaded.
We want to make sure that in a sane environment, nobody "accidentally"
accesses unplugged memory inside the device-managed region. For example,
a user might spot a memory region in /proc/iomem and try accessing it via
/dev/mem via gdb, or dump it via something else. By the time the mmap()
happens, the memory might already have been removed by the virtio-mem
driver silently: the mmap() would succeed and user space might
accidentally access unplugged memory.
So once the driver was loaded and detected the device along the
device-managed region, we just want to disallow any access via /dev/mem to
it.
In an ideal world, we would mark the whole region as busy ("owned by a
driver") and exclude it; however, that would be wrong, as we don't really
have actual system RAM at these ranges added to Linux ("busy system RAM").
Instead, we want to mark such ranges as "not actual busy system RAM but
still soft-reserved and prepared by a driver for future use."
Let's teach iomem_is_exclusive() to reject access to any range with
"IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
if "iomem=relaxed" is set. Introduce EXCLUSIVE_SYSTEM_RAM to make it
easier for applicable drivers to depend on this setting in their Kconfig.
For now, there are no applicable ranges and we'll modify virtio-mem next
to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
creates to contain all actual busy system RAM added via
add_memory_driver_managed().
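A minimal sketch of the check, assuming the Kconfig gate named above; the
helper itself is illustrative:

static bool range_is_exclusive_sysram(const struct resource *res)
{
    if (!IS_ENABLED(CONFIG_EXCLUSIVE_SYSTEM_RAM))
        return false;

    /* Not necessarily busy yet, but driver-prepared: refuse /dev/mem
     * access even when "iomem=relaxed" is set. */
    return (res->flags & IORESOURCE_SYSTEM_RAM) == IORESOURCE_SYSTEM_RAM &&
           (res->flags & IORESOURCE_EXCLUSIVE);
}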
Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "virtio-mem: disallow mapping virtio-mem memory via /dev/mem", v5.
Let's add the basic infrastructure to exclude some physical memory regions
marked as "IORESOURCE_SYSTEM_RAM" completely from /dev/mem access, even
though they are not marked IORESOURCE_BUSY and even though "iomem=relaxed"
is set. Reuse IORESOURCE_EXCLUSIVE for that purpose instead of adding
new flags to express something similar to "soft-busy" or "not busy yet,
but already prepared by a driver and not to be mapped by user space".
Use it for virtio-mem, to disallow mapping any virtio-mem memory via
/dev/mem to user space after the virtio-mem driver was loaded.
This patch (of 3):
We end up traversing subtrees of ranges we are not interested in; let's
optimize this case, skipping such subtrees, cleaning up the function a
bit.
For example, in the following configuration (/proc/iomem):
00000000-00000fff : Reserved
00001000-00057fff : System RAM
00058000-00058fff : Reserved
00059000-0009cfff : System RAM
0009d000-000fffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c3fff : PCI Bus 0000:00
000c4000-000c7fff : PCI Bus 0000:00
000c8000-000cbfff : PCI Bus 0000:00
000cc000-000cffff : PCI Bus 0000:00
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000e3fff : PCI Bus 0000:00
000e4000-000e7fff : PCI Bus 0000:00
000e8000-000ebfff : PCI Bus 0000:00
000ec000-000effff : PCI Bus 0000:00
000f0000-000fffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-3fffffff : System RAM
40000000-403fffff : Reserved
40000000-403fffff : pnp 00:00
40400000-80a79fff : System RAM
...
We don't have to look at any children of "0009d000-000fffff : Reserved"
if we can just skip these 15 items directly because the parent range is
not of interest.
Link: https://lkml.kernel.org/r/20210920142856.17758-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210920142856.17758-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kcov code mixes local_irq_save() and spin_lock() in
kcov_remote_{start|end}(). This creates a warning on PREEMPT_RT because
local_irq_save() disables interrupts and spin_lock_t is turned into a
sleeping lock which can not be acquired in a section with disabled
interrupts.
The kcov_remote_lock is used to synchronize the access to the hash-list
kcov_remote_map. The local_irq_save() block protects access to the
per-CPU data kcov_percpu_data.
There is no compelling reason to change the lock type to raw_spin_lock_t
to make it work with local_irq_save(). Changing it would require moving
the memory allocation (in kcov_remote_add()) and deallocation outside
of the locked section.
Adding an unlimited amount of entries to the hashlist will increase the
IRQ-off time during lookup. It could be argued that this is debug code
and the latency does not matter. There is, however, no need to do so,
and avoiding it allows this facility to be used in an RT-enabled build.
Using a local_lock_t instead of local_irq_save() has the benefit of adding
a protection scope within the source, which makes it obvious what is
protected. On a !PREEMPT_RT && !LOCKDEP build the local_lock_irqsave()
maps directly to local_irq_save(), so there is no overhead at runtime.
Replace the local_irq_save() section with a local_lock_t.
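A minimal sketch of the pattern; the per-CPU structure layout is
illustrative:

#include <linux/local_lock.h>

struct kcov_pcpu {
    local_lock_t lock;
    void *remote_area;
};

static DEFINE_PER_CPU(struct kcov_pcpu, kcov_pcpu_data) = {
    .lock = INIT_LOCAL_LOCK(lock),
};

static void kcov_remote_section(void)
{
    unsigned long flags;

    /* Named scope instead of a bare local_irq_save(); compiles down
     * to local_irq_save() on !PREEMPT_RT && !LOCKDEP builds. */
    local_lock_irqsave(&kcov_pcpu_data.lock, flags);
    /* ... access this CPU's kcov state, take kcov_remote_lock ... */
    local_unlock_irqrestore(&kcov_pcpu_data.lock, flags);
}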
Link: https://lkml.kernel.org/r/20210923164741.1859522-6-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20210830172627.267989-6-bigeasy@linutronix.de
Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcov_remote_start() may need to allocate memory in the in_task() case
(otherwise per-CPU memory has been pre-allocated) and therefore requires
enabled interrupts.
Interrupts are enabled before checking whether the allocation is
required, so if no allocation is needed, interrupts are needlessly
enabled and disabled again.
Enable interrupts only if a memory allocation is actually performed.
Link: https://lkml.kernel.org/r/20210923164741.1859522-5-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20210830172627.267989-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During boot kcov allocates per-CPU memory which is used later if remote/
softirq processing is enabled.
Allocate the per-CPU memory on the CPU local node to avoid cross node
memory access.
Link: https://lkml.kernel.org/r/20210923164741.1859522-4-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20210830172627.267989-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use swap() instead of reimplementing it.
Link: https://lkml.kernel.org/r/20210909022046.8151-1-ran.xiaokai@zte.com.cn
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
core_kernel_text() should check the gate area, as it is part of the
kernel text range; use is_kernel_text() in core_kernel_text() to cover it.
Link: https://lkml.kernel.org/r/20210930071143.63410-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_kernel_inittext() and init_kernel_text() have the same
functionality; keep just is_kernel_inittext(), move it into
sections.h, then update all the callers.
Link: https://lkml.kernel.org/r/20210930071143.63410-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move core_kernel_data() into sections.h, rename it to
is_kernel_core_data(), and make it return a bool value, then update
all the callers.
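A sketch of the renamed helper, covering .data and .bss via the usual
section markers (illustrative, not the literal hunk):
static inline bool is_kernel_core_data(unsigned long addr)
{
	if (addr >= (unsigned long)_sdata && addr < (unsigned long)_edata)
		return true;

	return addr >= (unsigned long)__bss_start &&
	       addr < (unsigned long)__bss_stop;
}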
Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "sections: Unify kernel sections range check and use", v4.
There are three header files (kallsyms.h, kernel.h and sections.h)
which include kernel sections range checks; let's clean them up and
unify them.
1. clean up the arch-specific text/data checks and fix the address
boundary check in kallsyms.h
2. move all the basic/core kernel range check functions into sections.h
3. update all the callers, and use the helpers in sections.h to
simplify the code
After this series, we have 5 APIs for kernel sections range checks in
sections.h:
* is_kernel_rodata() --- already in sections.h
* is_kernel_core_data() --- comes from core_kernel_data() in kernel.h
* is_kernel_inittext() --- comes from kernel.h and kallsyms.h
* __is_kernel_text() --- new internal helper
* __is_kernel() --- new internal helper
Note: the last two helpers should not be used directly; consider using
the corresponding functions in kallsyms.h instead.
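For reference, the two internal helpers might look roughly like this
(sketch based on the standard section markers):
static inline bool __is_kernel_text(unsigned long addr)
{
	return addr >= (unsigned long)_stext &&
	       addr < (unsigned long)_etext;
}

static inline bool __is_kernel(unsigned long addr)
{
	return addr >= (unsigned long)_stext &&
	       addr < (unsigned long)_end;
}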
This patch (of 11):
Remove the arch-specific text and data checks: since commit 4ba66a9760
("arch: remove blackfin port"), no architecture needs them anymore.
Link: https://lkml.kernel.org/r/20210930071143.63410-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210930071143.63410-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A single patch this cycle. We replace some open-coded routines to
classify task states with the scheduler's own function to do this.
Alongside the obvious benefits of removing funky code and aligning
more exactly with the scheduler's task classification, this also
fixes a long standing compiler warning by removing the open-coded
routines that generated the warning.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmGJA90ACgkQfOMlXTn3
iKFJcw/9FLy94FS+y/F8xpTU89d2j92f8q1mxS9g7ToDzDOiIPLyNazbX+4PsXVQ
FGgLpTqzNZX3+D5eAnOA/BNwWXtsvdxpsNnkY5ZCsVY5kZ0zsBYHe1O5CM2TbcMg
bOvJRQVI/FjydlrwqxIz9gAD7FmT/QvyecIbHZm/zFiCxdQwZy3rFwREd5ENsjoG
wumCCCH8Gh/afi9Pu3ZKHoZggNy/gmtSP3h3wmyoQneVFIJ4Vw5J61GFCvMPD+pN
wuAXWpuzWaND5IPTr4aZMKHNSaxqADQoEpNWxkgRh0cNL4NGBKsdLZMcqTTiyWww
TJSDtQKqocQB99eouwzQoA8SBsZwRvKRf/33QUXrWCAjl5YRK+9fSd8+dEf9Zd0o
A3sh99ecmHXknY6K2uO7NFjUPLSA/QeMGBzNx9lt7RoL+14tjqZkrAvWXooZzBY3
j39gwI1kSplmmCSoXwoW3AFVcCLJcGzE9qh0NUmZgt3kv8K1SUo3gxotKs8KwKj/
xVozOokmZV2ZuCTf8oIw7ntLwIFjiUaYBE7JY+c8mT8VWCbs5ztyOb11I35YYT0V
InXMDICLxZBD85eNOHyPC0fAud5emfboHl5GSUxo2hPgrRKuBmqElGtxG9CC8DLR
SItPjKfrYI1CJd4uoFX54nC3GmwLVSAq3xDwpYsN4A4lLbJJytc=
=vIfI
-----END PGP SIGNATURE-----
Merge tag 'kgdb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson:
"A single patch this cycle.
We replace some open-coded routines to classify task states with the
scheduler's own function to do this. Alongside the obvious benefits of
removing funky code and aligning more exactly with the scheduler's
task classification, this also fixes a long standing compiler warning
by removing the open-coded routines that generated the warning"
* tag 'kgdb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Adopt scheduler's task classification
As requested by Jessica I'm stepping in to help with modules
maintenance. This is my first pull request to you.
I've collected only two patches for modules for the 5.16-rc1 merge
window. These patches are from Shuah Khan, who debugged some corner
case errors with modules. The error messages are improved for
elf_validity_check(). While doing this work, a corner case fix was
spotted in validate_section_offset() due to a possible overflow bug
on 64-bit. The impact of this fix is low given it just limits
module section headers to placement within the 32-bit boundary, and
we obviously don't have insane module sizes. Even if a specially
crafted module is constructed, later checks would invalidate the
module right away.
I've let this sit through 0-day testing since October 15th with no
issues found.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmGFrvcSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinhFAP/1BBXuM/vevC1IdZaEU4M8pg07NOpkZt
PYJc8CxWKTtEg5hrLJMqOexXGwvAg/nq28IFWvUKh3bGtEghPyrQu6+I4mXsjnjJ
t9/AO+BOYU14DJGDAYEuReNsaAcyeRooHLriuUaNvhhaN9q+v+FRyBWNphmA6Tz7
VkCtmCNMFJZlhd9Cu4jOZpJe6CIe9gZ0czYfRshAl/3ZRSQjYaddtbYf1Cs8Vwah
by4o2YyvctrRzeOj/Fy+kbqZw2St39nZ5fKYwijRn1ZwHRQo6NQqrlMeS8rI0LgG
1YwWgNWO1FjaPzyIFcAhk2bUF2TxEf5/eVpXn2qXHnmVZ55oBPP/O7Th0/5OK9gD
utOMbO1nqBLBXUyX/1dO/UT36XcrqtUP0Y9VgjIvj9n8Y82RGYmBScH/TOU1f7A7
sH56sW9/3YvIOe8AShBHJ7IKqZXU0inIGasFYwKKm2pAOLtajaC9Sr5fqVbuyfNF
J2+nXipVzjI0f9SGTqmE41jynFGln6nfd1pgOOiysg9ZqxieINB0J8l0OHe6fZz/
zU4TehXZHE9DApP8D+rVpP0ltwR2YWs2u0zRqHr/0GEWYH00JZu2ymDR13W7izSp
KiiveBxhwBpewgV5cyua8TDyeKhn3mEJFNmijlaq4yq1P2oKeWTQRDRZjwUP8EZY
s16oV+BW7Kp+
=Evek
-----END PGP SIGNATURE-----
Merge tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"As requested by Jessica I'm stepping in to help with modules
maintenance. This is my first pull request to you.
I've collected only two patches for modules for the 5.16-rc1 merge
window. These patches are from Shuah Khan, who debugged some corner
case errors with modules. The error messages are improved for
elf_validity_check(). While doing this work, a corner case fix was
spotted in validate_section_offset() due to a possible overflow bug on
64-bit. The impact of this fix is low given it just limits module
section headers to placement within the 32-bit boundary, and we
obviously don't have insane module sizes. Even if a specially crafted
module is constructed, later checks would invalidate the module right
away.
I've let this sit through 0-day testing since October 15th with no
issues found"
* tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: change to print useful messages from elf_validity_check()
module: fix validate_section_offset() overflow bug on 64-bit
In some profiler use cases, it is necessary to map an address to the
backing file, e.g., a shared library. The bpf_find_vma helper provides
a flexible way to achieve this. bpf_find_vma maps an address of a task
to the vma (vm_area_struct) for this address, and feeds the vma to a
callback BPF function. The callback function is necessary here, as we
need to ensure mmap_sem is unlocked.
It is necessary to lock mmap_sem for find_vma. To lock and unlock
mmap_sem safely when irqs are disabled, we use the same mechanism as
stackmap with build_id. Specifically, when irqs are disabled, the
unlock is deferred to an irq_work. Refactor stackmap.c so that the
irq_work is shared between bpf_find_vma and the stackmap helpers.
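A minimal BPF-program-side usage sketch (names are illustrative and
not from the patch; assumes vmlinux BTF and libbpf conventions):
/* data passed to and filled in by the callback */
struct cb_ctx {
	__u64 inode;
};

/* invoked with the task's mmap lock held; returning 0 continues */
static long find_vma_cb(struct task_struct *task,
			struct vm_area_struct *vma, void *data)
{
	struct cb_ctx *ctx = data;

	if (vma->vm_file)
		ctx->inode = vma->vm_file->f_inode->i_ino;
	return 0;
}

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *pctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cb_ctx ctx = {};

	bpf_find_vma(task, PT_REGS_IP(&pctx->regs), find_vma_cb, &ctx, 0);
	return 0;
}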
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211105232330.1936330-2-songliubraving@fb.com
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmGFN6IACgkQnJ2qBz9k
QNkfYwgA1w5x/CsN2IMZdx6FTuZFgbOvQpBMTry8iuOPKK3UyIkZaUirTVLKR0cm
k3QbBR9/vTfQTNg5weuFJcbPZZaCXKEvlPGvDh+pumMbfTkMwL3FADweNBoZ3PzO
EiRrV45AbRgSMOzsfURzCz1T53Gd8fYM3pXxmNXG+bnE7+Ea+heKgor8/jFc4U3w
kAKZTfyCiheo7KxVhFGnkGI3ZhIbnbZne4seY/CE4qtv7/bmBE7bhGpmv8LT5FUn
h/JBDLjFU0fzJpplXE6n/VHXeGaUwb8adnYpzojWQ0lLYFrMIZFQ0KkDK6PNwmJF
MKWGqRxDkf54oeWuEAJ9t4/OorqM9A==
=ltE7
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Support for reporting filesystem errors through fanotify so that
system health monitoring daemons can watch for these and act instead
of scraping system logs"
* tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (34 commits)
samples: remove duplicate include in fs-monitor.c
samples: Fix warning in fsnotify sample
docs: Fix formatting of literal sections in fanotify docs
samples: Make fs-monitor depend on libc and headers
docs: Document the FAN_FS_ERROR event
samples: Add fs error monitoring example
ext4: Send notifications on error
fanotify: Allow users to request FAN_FS_ERROR events
fanotify: Emit generic error info for error event
fanotify: Report fid info for file related file system errors
fanotify: WARN_ON against too large file handles
fanotify: Add helpers to decide whether to report FID/DFID
fanotify: Wrap object_fh inline space in a creator macro
fanotify: Support merging of error events
fanotify: Support enqueueing of error events
fanotify: Pre-allocate pool of error events
fanotify: Reserve UAPI bits for FAN_FS_ERROR
fsnotify: Support FS_ERROR event type
fanotify: Require fid_mode for any non-fd event
fanotify: Encode empty file handle when no inode is provided
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmGFXBkUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vx6Tg/7BsGWm8f+uw/mr9lLm47q2mc4XyoO
7bR9KDp5NM84W/8ZOU7dqqqsnY0ddrSOLBRyhJJYMW3SwJd1y1ajTBsL1Ujqv+eN
z+JUFmhq4Laqm4k6Spc9CEJE+Ol5P6gGUtxLYo6PM2R0VxnSs/rDxctT5i7YOpCi
COJ+NVT/mc/by2loz1kLTSR9GgtBBgd+Y8UA33GFbHKssROw02L0OI3wffp81Oba
EhMGPoD+0FndAniDw+vaOSoO+YaBuTfbM92T/O00mND69Fj1PWgmNWZz7gAVgsXb
3RrNENUFxgw6CDt7LZWB8OyT04iXe0R2kJs+PA9gigFCGbypwbd/Nbz5M7e9HUTR
ray+1EpZib6+nIksQBL2mX8nmtyHMcLiM57TOEhq0+ECDO640MiRm8t0FIG/1E8v
3ZYd9w20o/NxlFNXHxxpZ3D/osGH5ocyF5c5m1rfB4RGRwztZGL172LWCB0Ezz9r
eHB8sWxylxuhrH+hp2BzQjyddg7rbF+RA4AVfcQSxUpyV01hoRocKqknoDATVeLH
664nJIINFxKJFwfuL3E6OhrInNe1LnAhCZsHHqbS+NNQFgvPRznbixBeLkI9dMf5
Yf6vpsWO7ur8lHHbRndZubVu8nxklXTU7B/w+C11sq6k9LLRJSHzanr3Fn9WA80x
sznCxwUvbTCu1r0=
=nsMh
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Conserve IRQs by setting up portdrv IRQs only when there are users
(Jan Kiszka)
- Rework and simplify _OSC negotiation for control of PCIe features
(Joerg Roedel)
- Remove struct pci_dev.driver pointer since it's redundant with the
struct device.driver pointer (Uwe Kleine-König)
Resource management:
- Coalesce contiguous host bridge apertures from _CRS to accommodate
BARs that cover more than one aperture (Kai-Heng Feng)
Sysfs:
- Check CAP_SYS_ADMIN before parsing user input (Krzysztof
Wilczyński)
- Return -EINVAL consistently from "store" functions (Krzysztof
Wilczyński)
- Use sysfs_emit() in endpoint "show" functions to avoid buffer
overruns (Kunihiko Hayashi)
PCIe native device hotplug:
- Ignore Link Down/Up caused by resets during error recovery so
endpoint drivers can remain bound to the device (Lukas Wunner)
Virtualization:
- Avoid bus resets on Atheros QCA6174, where they hang the device
(Ingmar Klein)
- Work around Pericom PI7C9X2G switch packet drop erratum by using
store and forward mode instead of cut-through (Nathan Rossi)
- Avoid trying to enable AtomicOps on VFs; the PF setting applies to
all VFs (Selvin Xavier)
MSI:
- Document that /sys/bus/pci/devices/.../irq contains the legacy INTx
interrupt or the IRQ of the first MSI (not MSI-X) vector (Barry
Song)
VPD:
- Add pci_read_vpd_any() and pci_write_vpd_any() to access anywhere
in the possible VPD space; use these to simplify the cxgb3 driver
(Heiner Kallweit)
Peer-to-peer DMA:
- Add (not subtract) the bus offset when calculating DMA address
(Wang Lu)
ASPM:
- Re-enable LTR at Downstream Ports so they don't report Unsupported
Requests when reset or hot-added devices send LTR messages
(Mingchuang Qiao)
Apple PCIe controller driver:
- Add driver for Apple M1 PCIe controller (Alyssa Rosenzweig, Marc
Zyngier)
Cadence PCIe controller driver:
- Return success when probe succeeds instead of falling into error
path (Li Chen)
HiSilicon Kirin PCIe controller driver:
- Reorganize PHY logic and add support for external PHY drivers
(Mauro Carvalho Chehab)
- Support PERST# GPIOs for HiKey970 external PEX 8606 bridge (Mauro
Carvalho Chehab)
- Add Kirin 970 support (Mauro Carvalho Chehab)
- Make driver removable (Mauro Carvalho Chehab)
Intel VMD host bridge driver:
- If IOMMU supports interrupt remapping, leave VMD MSI-X remapping
enabled (Adrian Huang)
- Number each controller so we can tell them apart in
/proc/interrupts (Chunguang Xu)
- Avoid building on UML because VMD depends on x86 bare metal APIs
(Johannes Berg)
Marvell Aardvark PCIe controller driver:
- Define macros for PCI_EXP_DEVCTL_PAYLOAD_* (Pali Rohár)
- Set Max Payload Size to 512 bytes per Marvell spec (Pali Rohár)
- Downgrade PIO Response Status messages to debug level (Marek Behún)
- Preserve CRS SV (Config Request Retry Software Visibility) bit in
emulated Root Control register (Pali Rohár)
- Fix issue in configuring reference clock (Pali Rohár)
- Don't clear status bits for masked interrupts (Pali Rohár)
- Don't mask unused interrupts (Pali Rohár)
- Avoid code repetition in advk_pcie_rd_conf() (Marek Behún)
- Retry config accesses on CRS response (Pali Rohár)
- Simplify emulated Root Capabilities initialization (Pali Rohár)
- Fix several link training issues (Pali Rohár)
- Fix link-up checking via LTSSM (Pali Rohár)
- Fix reporting of Data Link Layer Link Active (Pali Rohár)
- Fix emulation of W1C bits (Marek Behún)
- Fix MSI domain .alloc() method to return zero on success (Marek
Behún)
- Read entire 16-bit MSI vector in MSI handler, not just low 8 bits
(Marek Behún)
- Clear Root Port I/O Space, Memory Space, and Bus Master Enable bits
at startup; PCI core will set those as necessary (Pali Rohár)
- When operating as a Root Port, set class code to "PCI Bridge"
instead of the default "Mass Storage Controller" (Pali Rohár)
- Add emulation for PCI_BRIDGE_CTL_BUS_RESET since aardvark doesn't
implement this per spec (Pali Rohár)
- Add emulation of option ROM BAR since aardvark doesn't implement
this per spec (Pali Rohár)
MediaTek MT7621 PCIe controller driver:
- Add MediaTek MT7621 PCIe host controller driver and DT binding
(Sergio Paracuellos)
Qualcomm PCIe controller driver:
- Add SC8180x compatible string (Bjorn Andersson)
- Add endpoint controller driver and DT binding (Manivannan
Sadhasivam)
- Restructure to use of_device_get_match_data() (Prasad Malisetty)
- Add SC7280-specific pcie_1_pipe_clk_src handling (Prasad Malisetty)
Renesas R-Car PCIe controller driver:
- Remove unnecessary includes (Geert Uytterhoeven)
Rockchip DesignWare PCIe controller driver:
- Add DT binding (Simon Xue)
Socionext UniPhier Pro5 controller driver:
- Serialize INTx masking/unmasking (Kunihiko Hayashi)
Synopsys DesignWare PCIe controller driver:
- Run dwc .host_init() method before registering MSI interrupt
handler so we can deal with pending interrupts left by bootloader
(Bjorn Andersson)
- Clean up Kconfig dependencies (Andy Shevchenko)
- Export symbols to allow more modular drivers (Luca Ceresoli)
TI DRA7xx PCIe controller driver:
- Allow host and endpoint drivers to be modules (Luca Ceresoli)
- Enable external clock if present (Luca Ceresoli)
TI J721E PCIe driver:
- Disable PHY when probe fails after initializing it (Christophe
JAILLET)
MicroSemi Switchtec management driver:
- Return error to application when command execution fails because an
out-of-band reset has cleared the device BARs, Memory Space Enable,
etc (Kelvin Cao)
- Fix MRPC error status handling issue (Kelvin Cao)
- Mask out other bits when reading the management VEP instance ID
(Kelvin Cao)
- Return EOPNOTSUPP instead of ENOTSUPP from sysfs show functions
(Kelvin Cao)
- Add check of event support (Logan Gunthorpe)
Miscellaneous:
- Remove unused pci_pool wrappers, which have been replaced by
dma_pool (Cai Huoqing)
- Use 'unsigned int' instead of bare 'unsigned' (Krzysztof
Wilczyński)
- Use kstrtobool() directly, sans strtobool() wrapper (Krzysztof
Wilczyński)
- Fix some sscanf(), sprintf() format mismatches (Krzysztof
Wilczyński)
- Update PCI subsystem information in MAINTAINERS (Krzysztof
Wilczyński)
- Correct some misspellings (Krzysztof Wilczyński)"
* tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (137 commits)
PCI: Add ACS quirk for Pericom PI7C9X2G switches
PCI: apple: Configure RID to SID mapper on device addition
iommu/dart: Exclude MSI doorbell from PCIe device IOVA range
PCI: apple: Implement MSI support
PCI: apple: Add INTx and per-port interrupt support
PCI: kirin: Allow removing the driver
PCI: kirin: De-init the dwc driver
PCI: kirin: Disable clkreq during poweroff sequence
PCI: kirin: Move the power-off code to a common routine
PCI: kirin: Add power_off support for Kirin 960 PHY
PCI: kirin: Allow building it as a module
PCI: kirin: Add MODULE_* macros
PCI: kirin: Add Kirin 970 compatible
PCI: kirin: Support PERST# GPIOs for HiKey970 external PEX 8606 bridge
PCI: apple: Set up reference clocks when probing
PCI: apple: Add initial hardware bring-up
PCI: of: Allow matching of an interrupt-map local to a PCI device
of/irq: Allow matching of an interrupt-map local to an interrupt controller
irqdomain: Make of_phandle_args_to_fwspec() generally available
PCI: Do not enable AtomicOps on VFs
...
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
filter_irq_stacks() has little to do with the stackdepot
implementation, except that it is usually used by stackdepot users
(such as KASAN) to trim the stack trace.
However, filter_irq_stacks() itself is not useful without a stack trace
as obtained by stack_trace_save() and friends.
Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new
users of filter_irq_stacks() do not have to start depending on
STACKDEPOT only for filter_irq_stacks().
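Typical call sequence in a stackdepot user (illustrative):
unsigned long entries[64];
unsigned int nr_entries;
depot_stack_handle_t handle;

nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
/* drop everything past the IRQ entry point before deduplicating */
nr_entries = filter_irq_stacks(entries, nr_entries);
handle = stack_depot_save(entries, nr_entries, GFP_NOWAIT);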
Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
indicating that we're dealing with a memory region that is never
indicated in the firmware-provided memory map, but always detected and
added by a driver.
Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
memory regions like ordinary MEMBLOCK_NONE memory regions -- for
example, when selecting memory regions to add to the vmcore for dumping
in the crashkernel via for_each_mem_range().
However, especially kexec_file is not supposed to select such memblocks
via for_each_free_mem_range() / for_each_free_mem_range_reverse() to
place kexec images, similar to how we handle
IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK.
We'll make sure that memory hotplug code sets the flag where applicable
(IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
support.
Note that kexec *must not* indicate this memory to the second kernel and
*must not* place kexec-images on this memory. Let's add a comment to
kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
locate_mem_hole_callback() for kexec_walk_resources().
Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
semantics:
MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
firmware-provided memory map and added to the system early during
boot; kexec *has to* indicate this memory to the second kernel and
can place kexec-images on this memory. After memory hotunplug,
kexec has to be re-armed. We mostly ignore this flag when
"movable_node" is not set on the kernel command line, because
then we're told to not care about hotunpluggability of such
memory regions.
MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
the firmware-provided memory map; this memory is always detected
and added to the system by a driver; memory might not actually be
physically hotunpluggable. kexec *must not* indicate this memory to
the second kernel and *must not* place kexec-images on this memory.
Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Shahab Vahedi <shahab@synopsys.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename memblock_free_ptr() to memblock_free() and use memblock_free()
when freeing a virtual pointer so that memblock_free() will be a
counterpart of memblock_alloc()
The callers are updated with the below semantic patch and manual
addition of (void *) casting to pointers that are represented by
unsigned long variables.
@@
identifier vaddr;
expression size;
@@
(
- memblock_phys_free(__pa(vaddr), size);
+ memblock_free(vaddr, size);
|
- memblock_free_ptr(vaddr, size);
+ memblock_free(vaddr, size);
)
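A resulting caller then reads, for instance (illustrative, not a hunk
from the patch):
void *buf = memblock_alloc(SZ_4K, SMP_CACHE_BYTES);

if (buf)
	memblock_free(buf, SZ_4K);	/* virtual counterpart of memblock_alloc() */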
[sfr@canb.auug.org.au: fixup]
Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since memblock_free() operates on a physical range, make its name
reflect it and rename it to memblock_phys_free(), so it will be a
logical counterpart to memblock_phys_alloc().
The callers are updated with the below semantic patch:
@@
expression addr;
expression size;
@@
- memblock_free(addr, size);
+ memblock_phys_free(addr, size);
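A usage sketch of the renamed physical-range pair (illustrative):
phys_addr_t pa = memblock_phys_alloc(SZ_4K, SMP_CACHE_BYTES);

if (pa)
	memblock_phys_free(pa, SZ_4K);	/* physical range, matching the alloc */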
Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memblock_free_early_nid() is unused and memblock_free_early() is an
alias for memblock_free().
Replace calls to memblock_free_early() with calls to memblock_free() and
remove memblock_free_early() and memblock_free_early_nid().
Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7a5da02de8 ("locking/lockdep: check for freed initmem in
static_obj()") added arch_is_kernel_initmem_freed() which is supposed to
report whether an object is part of already freed init memory.
For the time being, the generic version of
arch_is_kernel_initmem_freed() always reports 'false', although
free_initmem() is generically called on all architectures.
Therefore, change the generic version of arch_is_kernel_initmem_freed()
to check whether free_initmem() has been called. If so, then check if a
given address falls into init memory.
To ease the use of system_state, move the function out of line into
its only caller, which is lockdep.c.
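The generic version then looks roughly like this (sketch; helper names
assumed from asm-generic/sections.h):
static inline int arch_is_kernel_initmem_freed(unsigned long addr)
{
	/* free_initmem() has not run yet */
	if (system_state < SYSTEM_FREEING_INIT)
		return 0;

	return init_section_contains((void *)addr, 1);
}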
Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
core_kernel_text() considers that, until system_state is at least
SYSTEM_RUNNING, init memory is valid.
But init memory is freed a few lines before setting SYSTEM_RUNNING, so
we have a small period of time when core_kernel_text() is wrong.
Create an intermediate system state called SYSTEM_FREEING_INIT that is
set before starting freeing init memory, and use it in
core_kernel_text() to report init memory invalid earlier.
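core_kernel_text() can then drop init memory as soon as freeing
starts; roughly (sketch, the init-text helper name varies across these
patches):
int notrace core_kernel_text(unsigned long addr)
{
	if (is_kernel_text(addr))
		return 1;

	if (system_state < SYSTEM_FREEING_INIT &&
	    is_kernel_inittext(addr))
		return 1;

	return 0;
}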
Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There was a report that starting an Ubuntu in docker while using cpuset
to bind it to movable nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory node in normal usage) will fail due
to memory allocation failure, and then OOM is involved and many other
innocent processes got killed.
It can be reproduced with command:
$ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
(where node 4 is a movable node)
runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
Call Trace:
dump_stack+0x6b/0x88
dump_header+0x4a/0x1e2
oom_kill_process.cold+0xb/0x10
out_of_memory.part.0+0xaf/0x230
out_of_memory+0x3d/0x80
__alloc_pages_slowpath.constprop.0+0x954/0xa20
__alloc_pages_nodemask+0x2d3/0x300
pipe_write+0x322/0x590
new_sync_write+0x196/0x1b0
vfs_write+0x1c3/0x1f0
ksys_write+0xa7/0xe0
do_syscall_64+0x52/0xd0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mem-Info:
active_anon:392832 inactive_anon:182 isolated_anon:0
active_file:68130 inactive_file:151527 isolated_file:0
unevictable:2701 dirty:0 writeback:7
slab_reclaimable:51418 slab_unreclaimable:116300
mapped:45825 shmem:735 pagetables:2540 bounce:0
free:159849484 free_pcp:73 free_cma:0
Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The reason is that in this case, the target cpuset nodes only have a
movable zone, while the creation of an OS instance in docker sometimes
needs to allocate memory in non-movable zones (dma/dma32/normal), e.g.
with GFP_HIGHUSER, and the cpuset limit forbids the allocation; the
out-of-memory killer then gets involved even when both normal and
movable nodes have plenty of free memory.
The OOM killer cannot help resolve the situation, as there is no
usable memory for the request within the cpuset scope. The only
reasonable measure to take is to fail the allocation right away and
have the caller deal with it.
So add a check for cases like this in the slowpath of allocation, and
bail out early, returning NULL for the allocation.
As page allocation is one of the hottest paths in the kernel, a check
there would hurt all users with a sane cpuset configuration, so add a
static branch check and detect the abnormal config in the cpuset
memory binding setup so that the extra check cost in page allocation
is not paid by everyone.
[thanks to Michal Hocko and David Rientjes for suggesting not handling
it inside OOM code, adding the cpuset check, refining comments]
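The shape of the added slowpath check, assuming a
cpusets_insane_config() static-branch helper (sketch, not the literal
diff):
/* in __alloc_pages_slowpath(), before entering reclaim/OOM */
if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
	struct zoneref *z;

	z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx,
				 &cpuset_current_mems_allowed);
	if (!z->zone)
		goto nopage;	/* no usable zone in this cpuset: fail fast */
}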
Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fix NUMA without SMP".
SuperH is the only architecture which still supports NUMA without SMP,
for good reasons (various memories scattered around the address space,
each with varying latencies).
This series fixes two build errors due to variables and functions used
by the NUMA code being provided by SMP-only source files or sections.
This patch (of 2):
If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'
Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.
Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
Fixes: a55c7454a8 ("sched/topology: Improve load balancing on AMD EPYC systems")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Rich Felker <dalias@libc.org>
Cc: Gon Solo <gonsolo@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable mm->total_vm could be accessed concurrently during
mmapping and system accounting, as noticed by KCSAN:
BUG: KCSAN: data-race in __acct_update_integrals / mmap_region
read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
mmap_region+0x6dc/0x1400
do_mmap+0x794/0xca0
vm_mmap_pgoff+0xdf/0x150
ksys_mmap_pgoff+0xe1/0x380
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
__acct_update_integrals+0x187/0x1d0
acct_account_cputime+0x3c/0x40
update_process_times+0x5c/0x150
tick_sched_timer+0x184/0x210
__run_hrtimer+0x119/0x3b0
hrtimer_interrupt+0x350/0xaa0
__sysvec_apic_timer_interrupt+0x7b/0x220
asm_call_irq_on_stack+0x12/0x20
sysvec_apic_timer_interrupt+0x4d/0x80
asm_sysvec_apic_timer_interrupt+0x12/0x20
smp_call_function_single+0x192/0x2b0
perf_install_in_context+0x29b/0x4a0
__se_sys_perf_event_open+0x1a98/0x2550
__x64_sys_perf_event_open+0x63/0x70
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on:
CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
vm_stat_account(), which is called by mmap_region(), increases
total_vm, and __acct_update_integrals() may read total_vm at the same
time. This causes a data race which leads to undefined behaviour. To
avoid the potentially bad read/write, volatile accesses and a barrier
are both used so the behaviour is well defined.
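In kernel terms that means marking the racing accesses with
READ_ONCE()/WRITE_ONCE(); a sketch of the two sides (the accounting
field name is illustrative):
/* writer side, vm_stat_account() */
WRITE_ONCE(mm->total_vm, mm->total_vm + npages);

/* reader side, __acct_update_integrals() */
tsk->acct_vm_mem1 += delta * READ_ONCE(tsk->mm->total_vm) >> 10;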
Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shuah Khan reported:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
| kblockd_mod_delayed_work_on()
| mod_delayed_work_on()
| __queue_delayed_work()
| __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
| insert_work()
| kasan_record_aux_stack()
| kasan_save_stack()
| stack_depot_save()
| alloc_pages()
| __alloc_pages()
| get_page_from_freelist()
| rm_queue()
| rm_queue_pcplist()
| local_lock_irqsave(&pagesets.lock, flags);
| [ BUG: Invalid wait context triggered ]
The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not even possible to use either GFP_ATOMIC
or GFP_NOWAIT in certain non-preemptive contexts, including
raw_spin_locks (see gfp.h and commit ab00db216c).
Fix it by using kasan_record_aux_stack_noalloc(), which instructs
stackdepot not to expand the stack storage via alloc_pages() if it
runs out of space.
While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit).
For frequent calls from the same location, it therefore becomes
extremely unlikely that kasan_record_aux_stack_noalloc() fails.
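The noalloc variant simply pins the can_alloc flag (sketch; internal
helper name assumed), and callers in non-preemptible contexts, such as
insert_work(), switch to it:
void kasan_record_aux_stack_noalloc(void *addr)
{
	__kasan_record_aux_stack(addr, false /* can_alloc */);
}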
Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes an out-of-bounds access issue when jit-ing the
bpf_pseudo_func insn (i.e. ld_imm64 with src_reg == BPF_PSEUDO_FUNC).
In jit_subprog(), it currently reuses the subprog index cached in
insn[1].imm. This subprog index is an index into a few arrays related
to subprogs. For example, in jit_subprog(), it is an index into the
newly allocated 'struct bpf_prog **func' array.
The subprog index was cached in insn[1].imm after add_subprog(). However,
this could become outdated (and too big in this case) if some subprogs
are completely removed during dead code elimination (in
adjust_subprog_starts_after_remove). The cached index in insn[1].imm
is not updated accordingly, causing an out-of-bounds access in the
later jit_subprog().
Unlike the bpf_pseudo_'func' insn, the current bpf_pseudo_'call' insn
handles the DCE properly by calling find_subprog(insn->imm) to
figure out the index instead of caching the subprog index.
The existing bpf_adj_branches() will adjust the insn->imm
whenever an insn is added or removed.
Instead of having two ways of handling the subprog index,
this patch makes bpf_pseudo_func work more like
bpf_pseudo_call.
The first change is to stop caching the subprog index result
in insn[1].imm after add_subprog(). The verification
process will use find_subprog(insn->imm) to figure
out the subprog index.
The second change is in bpf_adj_branches(): have it
adjust the insn->imm for the bpf_pseudo_func insn as well
whenever an insn is added or removed.
The third change is in jit_subprog(). Like the bpf_pseudo_call handling,
bpf_pseudo_func temporarily stores the find_subprog() result
in insn->off. This is fine because the prog's insns have been finalized
at this point. insn->off will be reset back to 0 later to avoid
confusing the userspace prog dump tool.
Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211106014014.651018-1-kafai@fb.com
elf_validity_check() checks ELF headers for errors and ELF spec
compliance, and if any of them fail it returns -ENOEXEC from all of
these error paths. Almost none of them print any message.
When elf_validity_check() returns an error, load_module() prints an
error message without the error code. It is hard to determine why the
module ELF structure is invalid, even if load_module() printed the
error code, since it is -ENOEXEC in all of these cases.
Change to print useful error messages from elf_validity_check() to
clearly say what went wrong and why the ELF validity checks failed.
Remove the load_module() error message which is no longer needed.
This patch includes changes to fix build warnings on 32-bit platforms:
warning: format '%llu' expects argument of type 'long long unsigned int',
but argument 3 has type 'Elf32_Off' {aka 'unsigned int'}
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
validate_section_offset() uses an unsigned long local variable to
add/store shdr->sh_offset and shdr->sh_size on all platforms.
unsigned long is too short when sh_offset is Elf64_Off, which is the
case for 64-bit ELF headers.
Without this fix applied we were limiting modules to having their
section headers placed within the 32-bit boundary (4 GiB) instead of
64 bits when on 64-bit architectures (which allows for up to
16,777,216 TiB). In practice this just meant we were limiting module
sections to below 4 GiB even on 64-bit systems. This should not really
affect any real-world use case, as modules these days should obviously
never exceed 1 GiB in size overall.
A specially crafted invalid module might manage to skip validation in
validate_section_offset() due to this mistake, but in such a case no
impact is observed through code inspection, given that the correct
data types are used for the copy of the module when needed in
move_module() when the section type is not SHT_NOBITS (which indicates
the section occupies no space in the file).
Fix the overflow problem using the right size local variable when
CONFIG_64BIT is defined.
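The fix boils down to using a 64-bit accumulator on 64-bit; a sketch
of the shape (treat as illustrative, not the literal hunk):
static int validate_section_offset(struct load_info *info, Elf_Shdr *shdr)
{
#if defined(CONFIG_64BIT)
	unsigned long long secend;
#else
	unsigned long secend;
#endif

	secend = shdr->sh_offset + shdr->sh_size;
	if (secend < shdr->sh_offset || secend > info->len)
		return -ENOEXEC;

	return 0;
}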
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
[mcgrof: expand commit log with possible impact if not applied]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Add support for the per-port interrupt controller that deals with both INTx
signalling and management interrupts.
This allows the Link-up/Link-down interrupts to be wired, allowing the
bring-up to be synchronised (and provide debug information). The framework
can further be used to handle the rest of the per port events if and when
necessary.
Likewise, INTx signalling is implemented so that end-points can actually be
used.
Link: https://lore.kernel.org/r/20210929163847.2807812-7-maz@kernel.org
Link: https://lore.kernel.org/r/20211004150552.3844830-1-maz@kernel.org
Tested-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
of_phandle_args_to_fwspec() can be generally useful to code that
extracts a DT of_phandle and builds an irq_fwspec to use the
hierarchical irqdomain API.
Make it visible to the rest of the kernel, including modules.
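A usage sketch, assuming a previously parsed struct of_phandle_args
named args and an irq_domain pointer named domain (illustrative):
struct irq_fwspec fwspec;
int virq;

of_phandle_args_to_fwspec(args.np, args.args, args.args_count, &fwspec);
virq = irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &fwspec);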
Link: https://lore.kernel.org/r/20210929163847.2807812-2-maz@kernel.org
Tested-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Including:
- Intel IOMMU Updates from Lu Baolu:
- Dump DMAR translation structure when DMA fault occurs
- An optimization in the page table manipulation code
- Use second level for GPA->HPA translation
- Various cleanups
- Arm SMMU Updates from Will
- Minor optimisations to SMMUv3 command creation and submission
- Numerous new compatible strings for Qualcomm SMMUv2 implementations
- Fixes for the SWIOTLB based implementation of dma-iommu code for
untrusted devices
- Add support for r8a779a0 to the Renesas IOMMU driver and DT matching
code for r8a77980
- A couple of cleanups and fixes for the Apple DART IOMMU driver
- Make use of generic report_iommu_fault() interface in the AMD IOMMU
driver
- Various smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmGD6NQACgkQK/BELZcB
GuOSfg/9FKXl5ym86BP3tAS1fREKH7p59JRGZrrIR89NyHAcEUjtNG3YLPao+YxU
3CDgLkru+vlDpYY54QoyqcY5FgIHT3Cna/Cdk4zekRmSO/14gHp47jtZRheOUzLF
rvwfaplcbbtT8akpsVFzvw8YpQLGSDiDQSl7xL2+40Z9hiYX/gS9Af+PH98tAXsa
yZKZj6gU+JXM58VihO3M7umyE06tovyBaYgcsBZtbf66bGc0ySu+fe75UVWbueRt
Z8jwqa7TUfVXiYC8h+LqtGET6gtzNSsxAU3VllRe7Brf6K8i/yaRs/TO2Hp83d7/
q/fcK3vNQ5v3aDNci/DjBB8SEySzCmRz/9ocCOCx8ByuRp+5lwVRPPq3WcUMtsZY
QpYo9Fk7luFz2Gj5LObKAVBvOoeBZ5Km3oPs4HVmQ6epxn/rVckJDnJnVSLJuATq
tSZC2heRfFlg1dT6WFaynCTP2RI1LlNEdKhHirV6L368rSjmF0ZdQxdTpHULsHr1
yMjqL21OfcSkLW91rvfb3g68EsIwDbCPGTOlQWZLmAtwOWtHSCLPgwwEG7WefZbH
yaslpmlUTOurUnFmpxlfLicy5sqsBL2ASzGJkEKrgunw82Ke96zzkRzi+9j9HeS6
g0AyIWMi1cUAjONVUZtV4yjImXh63HIPiKx730a9teodusoxm+Q=
=waUR
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Intel IOMMU Updates from Lu Baolu:
- Dump DMAR translation structure when DMA fault occurs
- An optimization in the page table manipulation code
- Use second level for GPA->HPA translation
- Various cleanups
- Arm SMMU Updates from Will
- Minor optimisations to SMMUv3 command creation and submission
- Numerous new compatible strings for Qualcomm SMMUv2 implementations
- Fixes for the SWIOTLB based implementation of dma-iommu code for
untrusted devices
- Add support for r8a779a0 to the Renesas IOMMU driver and DT matching
code for r8a77980
- A couple of cleanups and fixes for the Apple DART IOMMU driver
- Make use of generic report_iommu_fault() interface in the AMD IOMMU
driver
- Various smaller fixes and cleanups
* tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (35 commits)
iommu/dma: Fix incorrect error return on iommu deferred attach
iommu/dart: Initialize DART_STREAMS_ENABLE
iommu/dma: Use kvcalloc() instead of kvzalloc()
iommu/tegra-smmu: Use devm_bitmap_zalloc when applicable
iommu/dart: Use kmemdup instead of kzalloc and memcpy
iommu/vt-d: Avoid duplicate removing in __domain_mapping()
iommu/vt-d: Convert the return type of first_pte_in_page to bool
iommu/vt-d: Clean up unused PASID updating functions
iommu/vt-d: Delete dev_has_feat callback
iommu/vt-d: Use second level for GPA->HPA translation
iommu/vt-d: Check FL and SL capability sanity in scalable mode
iommu/vt-d: Remove duplicate identity domain flag
iommu/vt-d: Dump DMAR translation structure when DMA fault occurs
iommu/vt-d: Do not falsely log intel_iommu is unsupported kernel option
iommu/arm-smmu-qcom: Request direct mapping for modem device
iommu: arm-smmu-qcom: Add compatible for QCM2290
dt-bindings: arm-smmu: Add compatible for QCM2290 SoC
iommu/arm-smmu-qcom: Add SM6350 SMMU compatible
dt-bindings: arm-smmu: Add compatible for SM6350 SoC
iommu/arm-smmu-v3: Properly handle the return value of arm_smmu_cmdq_build_cmd()
...
Pull per signal_struct coredumps from Eric Biederman:
"Current coredumps are mixed up with the exit code, the signal handling
code, and the ptrace code making coredumps much more complicated than
necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up,
making it easier to follow what is happening in ptrace_stop. Then
cleans up the exec interactions with coredumps. Then cleans up the
coredump interactions with exit. Finally the coredump interactions
with the signal handling code is cleaned up.
The first and last changes are bug fixes for minor bugs.
I believe the fact that vfork followed by execve can kill the process
that called vfork if exec fails is sufficient justification to change
the userspace visible behavior.
In previous discussions some of these changes were organized
differently and individually appeared to make the code base worse. As
currently written I believe they all stand on their own as cleanups
and bug fixes.
Which means that even if the worst should happen and the last change
needs to be reverted for some unimaginable reason, the code base will
still be improved.
If the worst does not happen, there are more cleanups that can be
made. Signals that generate coredumps can easily become eligible for
short circuit delivery in complete_signal. The entire rendezvous for
generating a coredump can move into get_signal. The function
force_sig_info_to_task can be written in a way that does not modify
the signal handling state of the target task (because coredumps are
eligible for short circuit delivery). Many of these future cleanups
can be done another way, but nothing so cleanly as if coredumps become
per signal_struct"
* 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
coredump: Limit coredumps to a single thread group
coredump: Don't perform any cleanups before dumping core
exit: Factor coredump_exit_mm out of exit_mm
exec: Check for a pending fatal signal instead of core_state
ptrace: Remove the unnecessary arguments from arch_ptrace_stop
signal: Remove the bogus sigkill_pending in ptrace_stop
As Andy pointed out, there are races between force_sig_info_to_task
and sigaction[1] when forcing a signal to a task. As Kees
discovered[2], ptrace is also able to change these signals.
In the case of seccomp killing a process with a signal, it is a
security violation to allow the signal to be caught or manipulated.
Solve this problem by introducing a new flag SA_IMMUTABLE that
prevents sigaction and ptrace from modifying these forced signals.
This flag is carefully made kernel internal so that no new ABI is
introduced.
Longer term I think this can be solved by guaranteeing short circuit
delivery of signals in this case. Unfortunately, reliable and
guaranteed short circuit delivery of these signals is still a ways off
from being implemented, tested, and merged, so I have implemented a
much simpler alternative for now.
[1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com
[2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook
Cc: stable@vger.kernel.org
Fixes: 307d522f5e ("signal/seccomp: Refactor seccomp signal and coredump generation")
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently kdb contains some open-coded routines to generate a summary
character for each task. This code currently issues warnings, is
almost certainly broken and won't make sense to any kernel dev who
has ever used /proc to examine task states.
Fix both the warning and the potential for confusion by adopting the
scheduler's task classification. Whilst doing this we also simplify the
filtering by using mask strings directly (which means we don't have to
guess all the characters the scheduler might give us).
Unfortunately we can't quite match the scheduler classification completely.
We add four extra states: - for idle loops and i, m and s for sleeping
system daemons (which means kthreads in one of the I, M and S states).
These extra states are used to manage the filters for tools to make the
output of ps and bta less noisy.
Note: The Fixes below is the last point the original dubious code was
moved; it was not introduced by that patch. However it gives us
the last point to which this patch can be easily backported.
Happily that should be enough to cover the introduction of
CONFIG_WERROR!
Fixes: 2f064a59a1 ("sched: Change task_struct::state")
Link: https://lore.kernel.org/r/20211102173158.3315227-1-daniel.thompson@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
- osnoise and timerlat updates that will work with the RTLA tool (Real-Time
Linux Analysis). Specifically it disconnects the work load (threads
that look for latency) from the tracing instances attached to them,
allowing for more than one instance to retrieve data from the work load.
- Optimization on division in the trace histogram trigger code to use shift
and multiply when possible. Also added documentation.
- Fix prototype to my_direct_func in direct ftrace trampoline sample code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYKWXxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqJEAP9czpSZ/nFvDjxdGHZAcKKXCFWbGcK5
IF2cHDDwxXjZ/gD+NnpRhR1JPfA55fO52DUJPn2cOU5xOsP6DmJxu6mwDg0=
=AKVv
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- osnoise and timerlat updates that will work with the RTLA tool
(Real-Time Linux Analysis).
Specifically it disconnects the work load (threads that look for
latency) from the tracing instances attached to them, allowing for
more than one instance to retrieve data from the work load.
- Optimization on division in the trace histogram trigger code to use
shift and multiply when possible. Also added documentation.
- Fix prototype to my_direct_func in direct ftrace trampoline sample
code.
* tag 'trace-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/samples: Add missing prototype for my_direct_func
tracing/selftests: Add tests for hist trigger expression parsing
tracing/histogram: Document hist trigger variables
tracing/histogram: Update division by 0 documentation
tracing/histogram: Optimize division by constants
tracing/osnoise: Remove PREEMPT_RT ifdefs from inside functions
tracing/osnoise: Remove STACKTRACE ifdefs from inside functions
tracing/osnoise: Allow multiple instances of the same tracer
tracing/osnoise: Remove TIMERLAT ifdefs from inside functions
tracing/osnoise: Support a list of trace_array *tr
tracing/osnoise: Use start/stop_per_cpu_kthreads() on osnoise_cpus_write()
tracing/osnoise: Split workload start from the tracer start
tracing/osnoise: Improve comments about barrier need for NMI callbacks
tracing/osnoise: Do not follow tracing_cpumask
Below is a simplified case from a report in bcc [0]:
r4 = 20
*(u32 *)(r10 -4) = r4
*(u32 *)(r10 -8) = r4 /* r4 state is tracked */
r4 = *(u64 *)(r10 -8) /* Read more than the tracked 32bit scalar.
* verifier rejects as 'corrupted spill memory'.
*/
After commit 354e8f1970 ("bpf: Support <8-byte scalar spill and refill"),
the 8-byte aligned 32-bit spill is also tracked by the verifier and the
register state is stored.
However, if 8 bytes are read from the stack instead of the tracked 4-byte
scalar, the verifier currently rejects the program as "corrupted spill
memory". This patch fixes this case by allowing the read but marking the
destination register as unknown.
Also note that, if the prog is trying to corrupt/leak an earlier spilled
pointer by spilling another <8-byte register on top, this has already
been rejected in check_stack_write_fixed_off().
[0] https://github.com/iovisor/bcc/pull/3683
Fixes: 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
Reported-by: Hengqi Chen <hengqi.chen@gmail.com>
Reported-by: Yonghong Song <yhs@gmail.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211102064535.316018-1-kafai@fb.com
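For reference, a hypothetical restricted-C equivalent of the bytecode
above (variable names are illustrative); the final wide read covers the
tracked 32-bit spill:

__u32 val = 20;
__u32 slot[2];

slot[1] = val;		/* *(u32 *)(r10 -4) = r4 */
slot[0] = val;		/* *(u32 *)(r10 -8) = r4, spill tracked */

/* 8-byte read over the tracked 4-byte scalar: previously rejected
 * as "corrupted spill memory", now allowed with the destination
 * register marked as unknown.
 */
__u64 wide = *(__u64 *)slot;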
Merge tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These make the power management of PCI devices with ACPI companions
more straightforwad, add support for inefficient operating performance
points to the Energy model and make cpufreq handle them as
appropriate, rearrange the handling of cpuidle during system PM
transitions, update a few cpufreq drivers and intel_idle, fix assorded
issues and clean up code in multiple places.
Specifics:
- Add support for inefficient operating performance points to the
Energy Model and modify cpufreq to use them properly (Vincent
Donnefort).
- Rearrange the DTPM framework code to simplify it and make it easier
to follow (Daniel Lezcano).
- Fix power initialization in DTPM (Daniel Lezcano).
- Add CPU load consideration when estimating the instantaneous power
consumption in DTPM (Daniel Lezcano).
- Fix cpu->pstate.turbo_freq initialization in intel_pstate (Zhang
Rui).
- Make intel_pstate process HWP Guaranteed change notifications from
the processor (Srinivas Pandruvada).
- Fix typo in cpufreq.h (Rafael Wysocki).
- Fix tegra driver to handle BPMP errors properly (Mikko Perttunen).
- Fix the parameter usage of the newly added perf-domain API (Hector
Yuan).
- Minor cleanups to cppc, vexpress and s3c244x drivers (Han Wang,
Guenter Roeck, and Arnd Bergmann).
- Fix kobject memory leaks in cpuidle error paths (Anel
Orazgaliyeva).
- Make intel_idle enable interrupts before entering C1 on some Xeon
processor models (Artem Bityutskiy).
- Clean up hib_wait_io() (Falla Coulibaly).
- Fix sparse warnings in hibernation-related code (Anders Roxell).
- Use vzalloc() and kzalloc() instead of their open-coded equivalents
in hibernation-related code (Cai Huoqing).
- Prevent user space from crashing the kernel by attempting to
restore the system state from a swap partition in use (Ye Bin).
- Do not let "syscore" devices runtime-suspend during system PM
transitions (Rafael Wysocki).
- Do not pause cpuidle in the suspend-to-idle path (Rafael Wysocki).
- Pause cpuidle later and resume it earlier during system PM
transitions (Rafael Wysocki).
- Make system suspend code use valid_state() consistently (Rafael
Wysocki).
- Add support for enabling wakeup IRQs after invoking the
->runtime_suspend() callback and make two drivers use it (Chunfeng
Yun).
- Make the association of ACPI device objects with PCI devices more
straightforward and simplify the code doing that for all devices in
general (Rafael Wysocki).
- Eliminate struct pci_platform_pm_ops and handle both of its
users (PCI and Intel MID) directly in the PCI bus code (Rafael
Wysocki).
- Simplify and clarify ACPI PCI device PM helpers (Rafael Wysocki).
- Fix ordering of operations in pci_back_from_sleep() (Rafael
Wysocki).
- Make exynos-ppmu use hyphens in DT properties (Krzysztof
Kozlowski).
- Simplify parsing event-type from DT in exynos-ppmu (Krzysztof
Kozlowski).
- Strengthen check for freq_table in devfreq (Samuel Holland)"
* tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (49 commits)
cpufreq: Fix parameter in parse_perf_domain()
usb: mtu3: enable wake-up interrupt after runtime_suspend called
usb: xhci-mtk: enable wake-up interrupt after runtime_suspend called
PM / wakeirq: support enabling wake-up irq after runtime_suspend called
PM / devfreq: Strengthen check for freq_table
devfreq: exynos-ppmu: simplify parsing event-type from DT
devfreq: exynos-ppmu: use node names with hyphens
cpufreq: intel_pstate: Fix cpu->pstate.turbo_freq initialization
PM: suspend: Use valid_state() consistently
PM: sleep: Pause cpuidle later and resume it earlier during system transitions
PM: suspend: Do not pause cpuidle in the suspend-to-idle path
PM: sleep: Do not let "syscore" devices runtime-suspend during system transitions
PM: hibernate: Get block device exclusively in swsusp_check()
powercap/drivers/dtpm: Fix power limit initialization
powercap/drivers/dtpm: Scale the power with the load
powercap/drivers/dtpm: Use container_of instead of a private data field
powercap/drivers/dtpm: Simplify the dtpm table
powercap/drivers/dtpm: Encapsulate even more the code
PM: hibernate: swap: Use vzalloc() and kzalloc()
PM: hibernate: fix sparse warnings
...
Pull ucount cleanups from Eric Biederman:
"While working on the ucount fixes a for v5.15 a number of cleanups
suggested themselves.
Little things like not testing for NULL when a pointer can not be NULL
and wrapping atomic_add_negative with a more descriptive name, so that
people reading the code can more quickly understand what is going on"
* 'ucount-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Use atomic_long_sub_return for clarity
ucounts: Add get_ucounts_or_wrap for clarity
ucounts: Remove unnecessary test for NULL ucount in get_ucounts
ucounts: In set_cred_ucounts assume new->ucounts is non-NULL
Pull cgroup updates from Tejun Heo:
- The misc controller now reports allocation rejections through
misc.events instead of printking
- cgroup_mutex usage is reduced to improve scalability of some
operations
- vhost helper threads are now assigned to the right cgroup on cgroup2
- Bug fixes
* 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
cgroup: Fix rootcg cpu.stat guest double counting
cgroup: no need for cgroup_mutex for /proc/cgroups
cgroup: remove cgroup_mutex from cgroupstats_build
cgroup: reduce dependency on cgroup_mutex
cgroup: cgroup-v1: do not exclude cgrp_dfl_root
cgroup: Make rebind_subsystems() disable v2 controllers all at once
docs/cgroup: add entry for misc.events
misc_cgroup: remove error log to avoid log flood
misc_cgroup: introduce misc.events to count failures
Pull workqueue updates from Tejun Heo:
"Nothing too interesting. An optimization to short-circuit noop cpumask
updates, debug dump code reorg, and doc update"
* 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: doc: Call out the non-reentrance conditions
workqueue: Introduce show_one_worker_pool and show_one_workqueue.
workqueue: make sysfs of unbound kworker cpumask more clever
Merge Energy Model and power capping updates for 5.16-rc1:
- Add support for inefficient operating performance points to the
Energy Model and modify cpufreq to use them properly (Vincent
Donnefort).
- Rearrange the DTPM framework code to simplify it and make it easier
to follow (Daniel Lezcano).
- Fix power initialization in DTPM (Daniel Lezcano).
- Add CPU load consideration when estimating the instantaneous power
consumption in DTPM (Daniel Lezcano).
* pm-em:
cpufreq: mediatek-hw: Fix cpufreq_table_find_index_dl() call
PM: EM: Mark inefficiencies in CPUFreq
cpufreq: Use CPUFREQ_RELATION_E in DVFS governors
cpufreq: Introducing CPUFREQ_RELATION_E
cpufreq: Add an interface to mark inefficient frequencies
cpufreq: Make policy min/max hard requirements
PM: EM: Allow skipping inefficient states
PM: EM: Extend em_perf_domain with a flag field
PM: EM: Mark inefficient states
PM: EM: Fix inefficient states detection
* powercap:
powercap/drivers/dtpm: Fix power limit initialization
powercap/drivers/dtpm: Scale the power with the load
powercap/drivers/dtpm: Use container_of instead of a private data field
powercap/drivers/dtpm: Simplify the dtpm table
powercap/drivers/dtpm: Encapsulate even more the code
Merge updates related to system sleep for 5.16-rc1:
- Clean up hib_wait_io() (Falla Coulibaly).
- Fix sparse warnings in hibernation-related code (Anders Roxell).
- Use vzalloc() and kzalloc() instead of their open-coded
equivalents in hibernation-related code (Cai Huoqing).
- Prevent user space from crashing the kernel by attempting to
restore the system state from a swap partition in use (Ye Bin).
- Do not let "syscore" devices runtime-suspend during system PM
transitions (Rafael Wysocki).
- Do not pause cpuidle in the suspend-to-idle path (Rafael Wysocki).
- Pause cpuidle later and resume it earlier during system PM
transitions (Rafael Wysocki).
- Make system suspend code use valid_state() consistently (Rafael
Wysocki).
- Add support for enabling wakeup IRQs after invoking the
->runtime_suspend() callback and make two drivers use it (Chunfeng
Yun).
* pm-sleep:
usb: mtu3: enable wake-up interrupt after runtime_suspend called
usb: xhci-mtk: enable wake-up interrupt after runtime_suspend called
PM / wakeirq: support enabling wake-up irq after runtime_suspend called
PM: suspend: Use valid_state() consistently
PM: sleep: Pause cpuidle later and resume it earlier during system transitions
PM: suspend: Do not pause cpuidle in the suspend-to-idle path
PM: sleep: Do not let "syscore" devices runtime-suspend during system transitions
PM: hibernate: Get block device exclusively in swsusp_check()
PM: hibernate: swap: Use vzalloc() and kzalloc()
PM: hibernate: fix sparse warnings
Revert "PM: sleep: Do not assume that "mem" is always present"
PM: hibernate: Remove blk_status_to_errno in hib_wait_io
PM: sleep: Do not assume that "mem" is always present
Merge tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Extend %pGp print format to print hex value of the page flags
- Use kvmalloc instead of kmalloc to allocate devkmsg buffers
- Misc cleanup and warning fixes
* tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
vsprintf: Update %pGp documentation about that it prints hex value
lib/vsprintf.c: Amend static asserts for format specifier flags
vsprintf: Make %pGp print the hex value
test_printf: Append strings more efficiently
test_printf: Remove custom appending of '|'
test_printf: Remove separate page_flags variable
test_printf: Make pft array const
ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK
printk: use gnu_printf format attribute for printk_sprint()
printk: avoid -Wsometimes-uninitialized warning
printk: use kvmalloc instead of kmalloc for devkmsg_user
Merge tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Remove socket skb caches
- Add a SO_RESERVE_MEM socket op to forward allocate buffer space and
avoid memory accounting overhead on each message sent
- Introduce managed neighbor entries - added by control plane and
resolved by the kernel for use in acceleration paths (BPF / XDP
right now, HW offload users will benefit as well)
- Make neighbor eviction on link down controllable by userspace to
work around WiFi networks with bad roaming implementations
- vrf: Rework interaction with netfilter/conntrack
- fq_codel: implement L4S style ce_threshold_ect1 marking
- sch: Eliminate unnecessary RCU waits in mini_qdisc_pair_swap()
BPF:
- Add support for new btf kind BTF_KIND_TAG, arbitrary type tagging
as implemented in LLVM14
- Introduce bpf_get_branch_snapshot() to capture Last Branch Records
- Implement variadic trace_printk helper
- Add a new Bloomfilter map type
- Track <8-byte scalar spill and refill
- Access hw timestamp through BPF's __sk_buff
- Disallow unprivileged BPF by default
- Document BPF licensing
Netfilter:
- Introduce egress hook for looking at raw outgoing packets
- Allow matching on and modifying inner headers / payload data
- Add NFT_META_IFTYPE to match on the interface type either from
ingress or egress
Protocols:
- Multi-Path TCP:
- increase default max additional subflows to 2
- rework forward memory allocation
- add getsockopts: MPTCP_INFO, MPTCP_TCPINFO, MPTCP_SUBFLOW_ADDRS
- MCTP flow support allowing lower layer drivers to configure msg
muxing as needed
- Automatic Multicast Tunneling (AMT) driver based on RFC7450
- HSR: support for the redbox supervision frames (IEC-62439-3:2018)
- Support for the ip6ip6 encapsulation of IOAM
- Netlink interface for CAN-FD's Transmitter Delay Compensation
- Support SMC-Rv2 eliminating the current same-subnet restriction, by
exploiting the UDP encapsulation feature of RoCE adapters
- TLS: add SM4 GCM/CCM crypto support
- Bluetooth: initial support for link quality and audio/codec offload
Driver APIs:
- Add a batched interface for RX buffer allocation in AF_XDP buffer
pool
- ethtool: Add ability to control transceiver modules' power mode
- phy: Introduce supported interfaces bitmap to express MAC
capabilities and simplify PHY code
- Drop rtnl_lock from DSA .port_fdb_{add,del} callbacks
New drivers:
- WiFi driver for Realtek 8852AE 802.11ax devices (rtw89)
- Ethernet driver for ASIX AX88796C SPI device (ax88796c)
Drivers:
- Broadcom PHYs
- support 72165, 7712 16nm PHYs
- support IDDQ-SR for additional power savings
- PHY support for QCA8081, QCA9561 PHYs
- NXP DPAA2: support for IRQ coalescing
- NXP Ethernet (enetc): support for software TCP segmentation
- Renesas Ethernet (ravb) - support DMAC and EMAC blocks of
Gigabit-capable IP found on RZ/G2L SoC
- Intel 100G Ethernet
- support for eswitch offload of TC/OvS flow API, including
offload of GRE, VxLAN, Geneve tunneling
- support application device queues - ability to assign Rx and Tx
queues to application threads
- PTP and PPS (pulse-per-second) extensions
- Broadcom Ethernet (bnxt)
- devlink health reporting and device reload extensions
- Mellanox Ethernet (mlx5)
- offload macvlan interfaces
- support HW offload of TC rules involving OVS internal ports
- support HW-GRO and header/data split
- support application device queues
- Marvell OcteonTx2:
- add XDP support for PF
- add PTP support for VF
- Qualcomm Ethernet switch (qca8k): support for QCA8328
- Realtek Ethernet DSA switch (rtl8366rb)
- support bridge offload
- support STP, fast aging, disabling address learning
- support for Realtek RTL8365MB-VC, a 4+1 port 10M/100M/1GE switch
- Mellanox Ethernet/IB switch (mlxsw)
- multi-level qdisc hierarchy offload (e.g. RED, prio and shaping)
- offload root TBF qdisc as port shaper
- support multiple routing interface MAC address prefixes
- support for IP-in-IP with IPv6 underlay
- MediaTek WiFi (mt76)
- mt7921 - ASPM, 6GHz, SDIO and testmode support
- mt7915 - LED and TWT support
- Qualcomm WiFi (ath11k)
- include channel rx and tx time in survey dump statistics
- support for 80P80 and 160 MHz bandwidths
- support channel 2 in 6 GHz band
- spectral scan support for QCN9074
- support for rx decapsulation offload (data frames in 802.3
format)
- Qualcomm phone SoC WiFi (wcn36xx)
- enable Idle Mode Power Save (IMPS) to reduce power consumption
during idle
- Bluetooth driver support for MediaTek MT7922 and MT7921
- Enable support for AOSP Bluetooth extension in Qualcomm WCN399x and
Realtek 8822C/8852A
- Microsoft vNIC driver (mana)
- support hibernation and kexec
- Google vNIC driver (gve)
- support for jumbo frames
- implement Rx page reuse
Refactor:
- Make all writes to netdev->dev_addr go through helpers, so that we can
add this address to the address rbtree and handle the updates
- Various TCP cleanups and optimizations including improvements to
CPU cache use
- Simplify the gnet_stats, Qdisc stats' handling and remove
qdisc->running sequence counter
- Driver changes and API updates to address devlink locking
deficiencies"
* tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2122 commits)
Revert "net: avoid double accounting for pure zerocopy skbs"
selftests: net: add arp_ndisc_evict_nocarrier
net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter
net: arp: introduce arp_evict_nocarrier sysctl parameter
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c
net: avoid double accounting for pure zerocopy skbs
tcp: rename sk_wmem_free_skb
netdevsim: fix uninit value in nsim_drv_configure_vfs()
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
...
copy_process currently copies task_struct.posix_cputimers_work as-is. If a
timer interrupt arrives while handling clone and before dup_task_struct
completes then the child task will have:
1. posix_cputimers_work.scheduled = true
2. posix_cputimers_work.work queued.
copy_process clears task_struct.task_works, so (2) will have no effect and
posix_cpu_timers_work will never run (not to mention it doesn't make sense
for two tasks to share a common linked list).
Since posix_cpu_timers_work never runs, posix_cputimers_work.scheduled is
never cleared. Since scheduled is set, future timer interrupts will skip
scheduling work, with the ultimate result that the task will never receive
timer expirations.
Together, the complete flow is:
1. Task 1 calls clone(), enters kernel.
2. Timer interrupt fires, schedules task work on Task 1.
2a. task_struct.posix_cputimers_work.scheduled = true
2b. task_struct.posix_cputimers_work.work added to
task_struct.task_works.
3. dup_task_struct() copies Task 1 to Task 2.
4. copy_process() clears task_struct.task_works for Task 2.
5. Future timer interrupts on Task 2 see
task_struct.posix_cputimers_work.scheduled = true and skip scheduling
work.
Fix this by explicitly clearing the contents of
task_struct.posix_cputimers_work in copy_process(). This state was never
meant to be shared or inherited across tasks in the first place.
Fixes: 1fb497dd00 ("posix-cpu-timers: Provide mechanisms to defer timer handling to task_work")
Reported-by: Rhys Hiltner <rhys@justin.tv>
Signed-off-by: Michael Pratt <mpratt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211101210615.716522-1-mpratt@google.com
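A minimal sketch of such a reset, called from copy_process() on the child
task (the helper name and exact placement are assumptions based on the
description above):

static void clear_posix_cputimers_work(struct task_struct *p)
{
	/* The child must not inherit the parent's queued task_work. */
	memset(&p->posix_cputimers_work.work, 0,
	       sizeof(p->posix_cputimers_work.work));
	init_task_work(&p->posix_cputimers_work.work,
		       posix_cpu_timers_work);
	/* Let future timer interrupts schedule expiry work again. */
	p->posix_cputimers_work.scheduled = false;
}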
Merge tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Add some additional audit logging to capture the openat2() syscall
open_how struct info.
Previous variations of the open()/openat() syscalls allowed audit
admins to inspect the syscall args to get the information contained in
the new open_how struct used in openat2()"
* tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: return early if the filter rule has a lower priority
audit: add OPENAT2 record to list "how" info
audit: add support for the openat2 syscall
audit: replace magic audit syscall class numbers with macros
lsm_audit: avoid overloading the "key" audit field
audit: Convert to SPDX identifier
audit: rename struct node to struct audit_node to prevent future name collisions
Merge tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Add LSM/SELinux/Smack controls and auditing for io-uring.
As usual, the individual commit descriptions have more detail, but we
were basically missing two things which we're adding here:
+ establishment of a proper audit context so that auditing of
io-uring ops works similarly to how it does for syscalls (with
some io-uring additions because io-uring ops are *not* syscalls)
+ additional LSM hooks to enable access control points for some of
the more unusual io-uring features, e.g. credential overrides.
The additional audit callouts and LSM hooks were done in conjunction
with the io-uring folks, based on conversations and RFC patches
earlier in the year.
- Fixup the binder credential handling so that the proper credentials
are used in the LSM hooks; the commit description and the code
comment which is removed in these patches are helpful to understand
the background and why this is the proper fix.
- Enable SELinux genfscon policy support for securityfs, allowing
improved SELinux filesystem labeling for other subsystems which make
use of securityfs, e.g. IMA.
* tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
security: Return xattr name from security_dentry_init_security()
selinux: fix a sock regression in selinux_ip_postroute_compat()
binder: use cred instead of task for getsecid
binder: use cred instead of task for selinux checks
binder: use euid from cred instead of using task
LSM: Avoid warnings about potentially unused hook variables
selinux: fix all of the W=1 build warnings
selinux: make better use of the nf_hook_state passed to the NF hooks
selinux: fix race condition when computing ocontext SIDs
selinux: remove unneeded ipv6 hook wrappers
selinux: remove the SELinux lockdown implementation
selinux: enable genfscon labeling for securityfs
Smack: Brutalist io_uring support
selinux: add support for the io_uring access controls
lsm,io_uring: add LSM hooks to io_uring
io_uring: convert io_uring to the secure anon inode interface
fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()
audit: add filtering for io_uring records
audit,io_uring,io-wq: add some basic audit support to io_uring
audit: prepare audit_context for use in calling contexts beyond syscalls
This pull request contains the following branches:
fixes.2021.10.07a: Miscellaneous fixes.
scftorture.2021.09.16a: smp_call_function torture-test updates, most
notably better checking of module parameters.
tasks.2021.09.15a: Tasks-trace RCU updates that fix a number of rare
but important race-condition bugs.
torture.2021.09.13b: Other torture-test updates, most notably
better checking of module parameters. In addition, rcutorture
may now be run on CONFIG_PREEMPT_RT kernels.
torturescript.2021.09.16a: Torture-test scripting updates, most notably
specifying the new CONFIG_KCSAN_STRICT kconfig option rather
than maintaining an ever-changing list of individual KCSAN
kconfig options.
Merge tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Miscellaneous fixes
- Torture-test updates for smp_call_function(), most notably improved
checking of module parameters.
- Tasks-trace RCU updates that fix a number of rare but important
race-condition bugs.
- Other torture-test updates, most notably better checking of module
parameters. In addition, rcutorture may once again be run on
CONFIG_PREEMPT_RT kernels.
- Torture-test scripting updates, most notably specifying the new
CONFIG_KCSAN_STRICT kconfig option rather than maintaining an
ever-changing list of individual KCSAN kconfig options.
* tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits)
rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
rcu: Always inline rcu_dynticks_task*_{enter,exit}()
torture: Make kvm-remote.sh print size of downloaded tarball
torture: Allot 1G of memory for scftorture runs
tools/rcu: Add an extract-stall script
scftorture: Warn on individual scf_torture_init() error conditions
scftorture: Count reschedule IPIs
scftorture: Account for weight_resched when checking for all zeroes
scftorture: Shut down if nonsensical arguments given
scftorture: Allow zero weight to exclude an smp_call_function*() category
rcu: Avoid unneeded function call in rcu_read_unlock()
rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives
rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace
rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
rcu-tasks: Fix s/instruction/instructions/ typo in comment
...
Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non-root access to
tracefs in a controlled fashion, and using groups is probably the
safest.
- Bootconfig memory management updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) to be
batched together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context
rather than if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-11-01
We've added 181 non-merge commits during the last 28 day(s) which contain
a total of 280 files changed, 11791 insertions(+), 5879 deletions(-).
The main changes are:
1) Fix bpf verifier propagation of 64-bit bounds, from Alexei.
2) Parallelize bpf test_progs, from Yucong and Andrii.
3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus.
4) Improve bpf selftests on s390, from Ilya.
5) bloomfilter bpf map type, from Joanne.
6) Big improvements to JIT tests especially on Mips, from Johan.
7) Support kernel module function calls from bpf, from Kumar.
8) Support typeless and weak ksym in light skeleton, from Kumar.
9) Disallow unprivileged bpf by default, from Pawan.
10) BTF_KIND_DECL_TAG support, from Yonghong.
11) Various bpftool cleanups, from Quentin.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits)
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
bpf: Factor out helpers for ctx access checking
bpf: Factor out a helper to prepare trampoline for struct_ops prog
selftests, bpf: Fix broken riscv build
riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h
tools, build: Add RISC-V to HOSTARCH parsing
riscv, bpf: Increase the maximum number of iterations
selftests, bpf: Add one test for sockmap with strparser
selftests, bpf: Fix test_txmsg_ingress_parser error
...
====================
Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to the unsigned bounds propagation, fix the signed bounds.
The 'Fixes' tag is a hint. There is no security bug here.
The verifier was too conservative.
Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211101222153.78759-2-alexei.starovoitov@gmail.com
Before this fix:
166: (b5) if r2 <= 0x1 goto pc+22
from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0xffffffff))
After this fix:
166: (b5) if r2 <= 0x1 goto pc+22
from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0x1))
While processing BPF_JLE, reg_set_min_max() would set true_reg->umax_value = 1
and call __reg_combine_64_into_32(true_reg).
Without the fix it would not pass the condition:
if (__reg64_bound_u32(reg->umin_value) && __reg64_bound_u32(reg->umax_value))
since umin_value == 0 at this point.
Before commit 10bf4e8316 the umin was incorrectly ignored.
Commit 10bf4e8316 fixed the correctness issue, but pessimized the
propagation of 64-bit min/max into the 32-bit min/max and corresponding var_off.
Fixes: 10bf4e8316 ("bpf: Fix propagation of 32 bit unsigned bounds from 64 bit bounds")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211101222153.78759-1-alexei.starovoitov@gmail.com
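To make the failing condition concrete, a sketch of the predicate change
implied by the description above (the exact patch may differ):
__reg64_bound_u32() must accept umin_value == 0 so that valid 64-bit
bounds still flow into the 32-bit state.

static bool __reg64_bound_u32(u64 a)
{
	/* Sketch: rejecting a == 0 (i.e. a > 0 && a <= U32_MAX) is
	 * what blocked the propagation whenever umin_value was 0.
	 */
	return a <= U32_MAX;
}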
If the divisor is a constant, use specific division functions to
avoid extra branches when the trigger is hit.
If the divisor is constant but not a power of 2, the division can be
replaced with a multiplication and a shift in the following case:
Let X = dividend and Y = divisor.
Choose Z = some power of 2. If Y <= Z, then:
X / Y = (X * (Z / Y)) / Z
(Z / Y) is a constant (mult) which is calculated at parse time, so:
X / Y = (X * mult) / Z
The division by Z can be replaced by a shift since Z is a power of 2:
X / Y = (X * mult) >> shift
As long as X < Z, the results will not be off by more than 1.
Link: https://lkml.kernel.org/r/20211029232410.3494196-1-kaleshsingh@google.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
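As a standalone illustration of the transformation (a user-space sketch
with hypothetical names, not the hist trigger code itself):

#include <stdint.h>

#define DIV_SHIFT	32
#define DIV_Z		(1ULL << DIV_SHIFT)	/* Z, a power of 2, Y <= Z */

/* Computed once at parse time: mult = Z / Y. */
static uint64_t div_mult(uint64_t y)
{
	return DIV_Z / y;
}

/* Trigger-hit path: X / Y ~= (X * mult) >> shift, off by at most 1
 * as long as X < Z.
 */
static uint64_t div_apply(uint64_t x, uint64_t mult)
{
	return (x * mult) >> DIV_SHIFT;
}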
Merge tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull thread_info update to move 'cpu' back from task_struct from Kees Cook:
"Cross-architecture update to move task_struct::cpu back into
thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch
maintainers.
Quoting Ard Biesheuvel:
'Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that
define raw_smp_processor_id() in terms of this field, as it
requires linux/sched.h to be included, which causes a lot of pain
in terms of circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from)
for all architectures that enable THREAD_INFO_IN_TASK, addressing
the header soup issue as well as some pointless differences in the
implementations of task_cpu() and set_task_cpu()'"
* tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
riscv: rely on core code to keep thread_info::cpu updated
powerpc: smp: remove hack to obtain offset of task_struct::cpu
sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y
powerpc: add CPU field to struct thread_info
s390: add CPU field to struct thread_info
x86: add CPU field to struct thread_info
arm64: add CPU field to struct thread_info
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's the usual summary below, but the highlights are support for
the Armv8.6 timer extensions, KASAN support for asymmetric MTE, the
ability to kexec() with the MMU enabled and a second attempt at
switching to the generic pfn_valid() implementation.
Summary:
- Support for the Armv8.6 timer extensions, including a
self-synchronising view of the system registers to elide some
expensive ISB instructions.
- Exception table cleanup and rework so that the fixup handlers
appear correctly in backtraces.
- A handful of miscellaneous changes, the main one being selection of
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK.
- More mm and pgtable cleanups.
- KASAN support for "asymmetric" MTE, where tag faults are reported
synchronously for loads (via an exception) and asynchronously for
stores (via a register).
- Support for leaving the MMU enabled during kexec relocation, which
significantly speeds up the operation.
- Minor improvements to our perf PMU drivers.
- Improvements to the compat vDSO build system, particularly when
building with LLVM=1.
- Preparatory work for handling some Coresight TRBE tracing errata.
- Cleanup and refactoring of the SVE code to pave the way for SME
support in future.
- Ensure SCS pages are unpoisoned immediately prior to freeing them
when KASAN is enabled for the vmalloc area.
- Try moving to the generic pfn_valid() implementation again now that
the DMA mapping issue from last time has been resolved.
- Numerous improvements and additions to our FPSIMD and SVE
selftests"
[ armv8.6 timer updates were in a shared branch and already came in
through -tip in the timer pull - Linus ]
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits)
arm64: Select POSIX_CPU_TIMERS_TASK_WORK
arm64: Document boot requirements for FEAT_SME_FA64
arm64/sve: Fix warnings when SVE is disabled
arm64/sve: Add stub for sve_max_virtualisable_vl()
arm64: errata: Add detection for TRBE write to out-of-range
arm64: errata: Add workaround for TSB flush failures
arm64: errata: Add detection for TRBE overwrite in FILL mode
arm64: Add Neoverse-N2, Cortex-A710 CPU part definition
selftests: arm64: Factor out utility functions for assembly FP tests
arm64: vmlinux.lds.S: remove `.fixup` section
arm64: extable: add load_unaligned_zeropad() handler
arm64: extable: add a dedicated uaccess handler
arm64: extable: add `type` and `data` fields
arm64: extable: use `ex` for `exception_table_entry`
arm64: extable: make fixup_exception() return bool
arm64: extable: consolidate definitions
arm64: gpr-num: support W registers
arm64: factor out GPR numbering helpers
arm64: kvm: use kvm_exception_table_entry
arm64: lib: __arch_copy_to_user(): fold fixups into body
...
Merge tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic confidential computing updates from Borislav Petkov:
"Add an interface called cc_platform_has() which is supposed to be used
by confidential computing solutions to query different aspects of the
system.
The intent behind it is to unify testing of such aspects instead of
having each confidential computing solution add its own set of tests
to code paths in the kernel, leading to an unwieldy mess"
* tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_active() with cc_platform_has()
x86/sme: Replace occurrences of sme_active() with cc_platform_has()
powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
x86/sev: Add an x86 version of cc_platform_has()
arch/cc: Introduce a function to check for confidential computing features
x86/ioremap: Selectively build arch override encryption functions
Remove the CONFIG_PREEMPT_RT #ifdefs from inside functions, avoiding
compilation problems in the future.
Link: https://lkml.kernel.org/r/37ee0881b033cdc513efc84ebea26cf77880c8c2.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Remove the CONFIG_STACKTRACE #ifdefs from inside functions, avoiding
compilation problems in the future.
Link: https://lkml.kernel.org/r/3465cca2f28e1ba602a1fc8bdb28d12950b5226e.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the user can start only one instance of the
timerlat/osnoise tracers, and the tracers cannot run in parallel.
As a starting point for adding more flexibility, let's allow the same
tracer to run on different trace instances. The workload will start
when the first trace_array (instance) is registered and stop when the
last instance is unregistered.
So, while this patch allows the same tracer to run in multiple
instances (e.g., two instances running osnoise), it still does not
allow instances of timerlat and osnoise in parallel (e.g., one
timerlat and one osnoise). That is because the osnoise: events behave
differently depending on which tracer is enabled (osnoise or
timerlat). Enabling the parallel usage of these two tracers is on my
TODO list.
Link: https://lkml.kernel.org/r/38c8f14b613492a4f3f938d9d3bf0b063b72f0f0.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Remove the CONFIG_TIMERLAT_TRACER #ifdefs from inside functions,
avoiding compilation problems in the future.
Link: https://lkml.kernel.org/r/8245abb5a112d249f5da6c1df499244ad9e647bc.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
osnoise/timerlat were built to run as a single instance, and for this
a single variable is enough to store the current struct trace_array
*tr with information about the tracing instance. This is done via
the *osnoise_trace variable. A trace_array represents a trace instance.
In preparation for supporting multiple instances, replace the
*osnoise_trace variable with an RCU-protected list of instances.
The operations that refer to an instance now propagate to all
elements of the list (all instances).
Also, replace the osnoise_busy variable with a check for whether the
list has elements (busy).
No functional change is expected with this patch, i.e., only one
instance is still allowed.
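A minimal sketch of the resulting pattern (structure and helper names
are illustrative, following the description above rather than the
exact source):
struct osnoise_instance {
	struct list_head	list;
	struct trace_array	*tr;
};
static LIST_HEAD(osnoise_instances);

/* the old osnoise_busy boolean becomes "does the list have elements?" */
static bool osnoise_has_registered_instances(void)
{
	return !!list_first_or_null_rcu(&osnoise_instances,
					struct osnoise_instance, list);
}

/* operations that used *osnoise_trace now walk all instances */
static void osnoise_report_sketch(u64 sample)
{
	struct osnoise_instance *inst;

	rcu_read_lock();
	list_for_each_entry_rcu(inst, &osnoise_instances, list) {
		/* record the sample in this instance's buffer */
		trace_array_printk_buf(inst->tr->array_buffer.buffer,
				       _THIS_IP_, "sample: %llu\n", sample);
	}
	rcu_read_unlock();
}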
Link: https://lkml.kernel.org/r/91d006e889b9a5d1ff258fe6077f021ae3f26372.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When writing a new CPU mask via osnoise/cpus, if the tracer is running,
the workload is restarted to follow the new cpumask. The restart is
currently done using osnoise_workload_start/stop(), which disables the
workload *and* the instrumentation. However, disabling the
instrumentation is not necessary.
Calling start/stop_per_cpu_kthreads() is enough to apply the new
osnoise/cpus config.
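A sketch of that tail end of the cpus-write path (locking and error
handling omitted; the function and mask variable here are illustrative
stand-ins):
/* illustrative stand-in for the tracer's global cpumask */
static struct cpumask osnoise_cpumask;

static void osnoise_apply_cpus(const struct cpumask *new_mask, bool running)
{
	if (running)
		stop_per_cpu_kthreads();	/* workload only */

	cpumask_copy(&osnoise_cpumask, new_mask);

	if (running)
		start_per_cpu_kthreads();	/* hooks stay enabled */
}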
Link: https://lkml.kernel.org/r/ee633e82867c5b88851aa6040522a799c0034486.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation for supporting multiple trace instances, create
workload start/stop specific functions.
No functional change.
Link: https://lkml.kernel.org/r/74b090971e9acdd13625be1c28ef3270d2275e77.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
trace_osnoise_callback_enabled is used by ftrace_nmi_enter/exit()
to know when to call the NMI callback. The barriers are used to
avoid having the callbacks enabled before the data is reset during
start, and to avoid touching the values after the tracer is stopped.
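An illustrative sketch of that ordering (the reset/read helpers are
hypothetical; the flag name is from the commit):
static void osnoise_start_sketch(void)
{
	osnoise_reset_data();			/* hypothetical reset helper */
	barrier();				/* data is reset before ... */
	trace_osnoise_callback_enabled = true;	/* ... NMIs see the flag */
}

static void osnoise_stop_sketch(void)
{
	trace_osnoise_callback_enabled = false;	/* no new NMI callbacks ... */
	barrier();
	osnoise_read_data();			/* ... before the final read-out */
}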
Link: https://lkml.kernel.org/r/a413b8f14aa9312fbd1ba99f96225a8aed831053.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation for supporting multiple instances, decouple the
osnoise/timerlat workload from the instance-specific tracing_cpumask.
Different instances can have conflicting cpumasks, making osnoise
workload management needlessly complex. Osnoise already has its own
global cpumask.
I also thought about using the first instance's mask, but the
"first" instance could be removed before the others.
This also fixes the problem that changing the tracing_cpumask was not
restarting the trace.
Link: https://lkml.kernel.org/r/169a71bcc919ce3ab53ae6f9ca5cde57fffaf9c6.1635702894.git.bristot@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: linux-rt-users@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch makes two changes in the kernel bloom filter map
implementation:
1) Change the names of the map-ops functions to include a
"bloom_map" prefix.
As Martin pointed out on a previous patchset, having generic
map-ops names may be confusing in tracing and in perf-report.
2) Drop the "& 0xF" when getting nr_hash_funcs, since we
already ascertain that no bits in map_extra beyond the
first 4 can be set.
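A sketch of why the mask became redundant (condensed; surrounding
allocation code omitted):
	/* earlier in the map allocation path: reject extra bits up front */
	if (attr->map_extra & ~0xF)
		return ERR_PTR(-EINVAL);

	/* later: the "& 0xF" mask on this read is now redundant */
	nr_hash_funcs = attr->map_extra;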
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211029224909.1721024-2-joannekoong@fb.com
Currently the testing of BPF STRUCT_OPS depends on the specific bpf
implementation of tcp_congestion_ops, but that cannot cover all
basic functionalities (e.g., return value handling), so introduce
a dummy BPF STRUCT_OPS for testing purposes.
Loading a bpf_dummy_ops implementation from userspace is prohibited,
and its only purpose is to run a BPF_PROG_TYPE_STRUCT_OPS program
through bpf(BPF_PROG_TEST_RUN). For now, programs for test_1() and
test_2() are supported. The following three cases are exercised in
bpf_dummy_struct_ops_test_run():
(1) test and check the value returned from the state arg in
test_1(state)
The content of state is copied from a userspace pointer and copied
back after calling test_1(state). The user pointer is saved in a u64
array and the array address is passed through ctx_in.
(2) test and check the return value of test_1(NULL)
Just simulates the case in which an invalid input argument is passed.
(3) test passing multiple arguments in test_2(state, ...)
5 arguments are passed through ctx_in in the form of a u64 array. The
first element of the array is the userspace pointer of state and the
other 4 arguments follow.
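A userspace-side sketch of case (1), assuming a loaded struct_ops prog
fd; bpf_prog_test_run_opts() and LIBBPF_OPTS are current libbpf
spellings and may differ from the selftest's own helpers:
#include <bpf/bpf.h>

struct bpf_dummy_ops_state { int val; };

static int run_test_1(int prog_fd)
{
	struct bpf_dummy_ops_state state = { .val = 0 };
	/* the user pointer travels in a u64 array via ctx_in */
	__u64 args[1] = { (unsigned long)&state };
	LIBBPF_OPTS(bpf_test_run_opts, opts,
		.ctx_in = args,
		.ctx_size_in = sizeof(args),
	);
	int err = bpf_prog_test_run_opts(prog_fd, &opts);

	/* state was copied in, passed to test_1(), and copied back */
	return err ?: opts.retval;
}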
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-4-houtao1@huawei.com
Factor out two helpers to check the read access of ctx for raw
tracepoints and BTF functions. bpf_tracing_ctx_access() is used to
check that read access to an argument is valid, and
bpf_tracing_btf_ctx_access() additionally checks whether the BTF type
of the argument is valid. bpf_tracing_btf_ctx_access() will be used by
the following patch.
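A sketch of the two helpers as described (close to the in-tree
versions in include/linux/bpf.h, which may differ in detail):
static inline bool bpf_tracing_ctx_access(int off, int size,
					  enum bpf_access_type type)
{
	/* ctx is an array of up to MAX_BPF_FUNC_ARGS u64 arguments */
	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
		return false;
	if (type != BPF_READ)
		return false;
	if (off % size != 0)
		return false;
	return true;
}

static inline bool bpf_tracing_btf_ctx_access(int off, int size,
					      enum bpf_access_type type,
					      const struct bpf_prog *prog,
					      struct bpf_insn_access_aux *info)
{
	if (!bpf_tracing_ctx_access(off, size, type))
		return false;
	/* additionally validate the BTF type of the argument */
	return btf_ctx_access(off, size, type, prog, info);
}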
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-3-houtao1@huawei.com
Factor out a helper, bpf_struct_ops_prepare_trampoline(), to prepare
the trampoline for a BPF_PROG_TYPE_STRUCT_OPS prog. It will be used by
the .test_run callback in a following patch.
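A sketch of the factored-out helper (signature follows the commit;
flags handling is elided here and the in-tree version may differ): it
wires a single struct_ops prog in as an fentry prog and hands it to
the arch trampoline builder:
int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_progs *tprogs,
				      struct bpf_prog *prog,
				      const struct btf_func_model *model,
				      void *image, void *image_end)
{
	tprogs[BPF_TRAMP_FENTRY].progs[0] = prog;
	tprogs[BPF_TRAMP_FENTRY].nr_progs = 1;

	/* flags elided; the real helper also accounts for the prog's
	 * return-value semantics
	 */
	return arch_prepare_bpf_trampoline(NULL, image, image_end,
					   model, 0, tprogs, NULL);
}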
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-2-houtao1@huawei.com
Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
- Cleanup of extable fixup handling to be more robust, which in turn
allows making the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from
explicit error codes to a boolean fail/success, as that's all the
calling code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX
support:
- Disentangle the public header maze and especially remove the
misnamed kitchen-sink internal.h, which despite its name is
included all over the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows the buffer to be sized dynamically at
runtime, by flipping the pointer to the buffer container from the
default container embedded in task_struct::thread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU-related code
into the FPU core, which reduces the number of exports and avoids
adding even more exports when AMX has to be supported in KVM.
This also removes duplicated code which was, of course,
unnecessarily different and incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new
fpstate container and just switching the buffer pointer from the
user space buffer to the KVM guest buffer when entering
vcpu_run() and flipping it back when leaving the function. This
cuts the memory requirements of a vCPU for FPU buffers in half
and avoids pointless memory copy operations.
This also solves the so far unresolved problem of adding AMX
support, because the current FPU buffer handling of KVM inflicted
a circular dependency between adding AMX support to the core and
to KVM. With the new scheme of switching fpstate, AMX support can
be added to the core code without affecting KVM.
- Replace various variables with proper data structures so the
extra information required for adding dynamically enabled FPU
features (AMX) can be added in one place.
- Add AMX (Advanced Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Sapphire Rapids Xeon CPUs. The feature comes with an extra MSR
(MSR_XFD) which allows trapping the (first) use of an AMX-related
instruction, which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the extra
8K or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
(an illustrative userspace sketch of this interface follows the
list below)
Permission is granted per process, inherited on fork() and
cleared on exec(). The permission policy of the kernel is
restricted to sigaltstack size validation, but the syscall
obviously allows further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2)
which takes granted permissions and the potentially resulting
larger signal frame into account. This mechanism can also be used
to enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support
was added.
3) Exception handling for #NM traps to catch first use of an
extended feature via a new cause MSR. If the exception was caused
by the use of such a feature, the handler checks permission for
that feature. If permission has not been granted, the handler
sends a SIGILL like the #UD handler would do if the feature had
been disabled in XCR0. If permission has been granted, then a new
fpstate which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler
sends SIGSEGV to the task. That's not elegant, but unavoidable as
the other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused
by unexpected memory allocation failures is not a fundamentally
new concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is
disarmed for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with
the same lifetime rules as the FPU register state itself. The
mechanism is keyed off with a static key which is disabled by
default, so !AMX equipped CPUs have zero overhead. On AMX enabled
CPUs the overhead is limited by comparing the task's XFD value
with a per-CPU shadow variable to avoid redundant MSR writes. In
case of switching from an AMX-using task to a non-AMX-using task,
or vice versa, the extra MSR write is obviously inevitable.
All other places which need to be aware of the variable feature
sets and resulting variable sizes are not affected at all, because
they retrieve the information (feature set, sizes) unconditionally
from the fpstate properties.
6) Enable the new AMX states
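As the illustrative userspace sketch of the arch_prctl() interface
from point 1 (the ARCH_* and XFEATURE_XTILEDATA values are reproduced
here for illustration; on kernels with this series they come from
<asm/prctl.h>):
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef ARCH_GET_XCOMP_SUPP
#define ARCH_GET_XCOMP_SUPP	0x1021
#define ARCH_GET_XCOMP_PERM	0x1022
#define ARCH_REQ_XCOMP_PERM	0x1023
#endif
#define XFEATURE_XTILEDATA	18	/* AMX tile data state component */

int main(void)
{
	unsigned long supp = 0, perm = 0;

	/* which XSTATE features the kernel supports (like XGETBV(0)) */
	syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &supp);
	if (!(supp & (1UL << XFEATURE_XTILEDATA)))
		return 1;	/* no AMX on this system */

	/* per-process permission: inherited on fork(), cleared on exec() */
	if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
		return 1;

	syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &perm);
	printf("AMX permitted: %s\n",
	       (perm & (1UL << XFEATURE_XTILEDATA)) ? "yes" : "no");
	return 0;
}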
Note, this is relatively new code despite the fact that AMX support
has been in the works for more than a year now.
The big refactoring of the FPU code, which allowed a proper
integration, was started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which
has not been caught in review and testing right away was restricted
to AMX enabled systems, which is completely irrelevant for anyone
outside Intel and their early access program. There might be dragons
lurking as usual, but so far the fine grained refactoring has held up
and any as yet undetected fallout is bisectable and should be
easily addressable before the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity
to follow the rapid development of this closely, which provides the
confidence level required to offer this rather large update for
inclusion into 5.16-rc1.
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
Documentation/x86: Add documentation for using dynamic XSTATE features
x86/fpu: Include vmalloc.h for vzalloc()
selftests/x86/amx: Add context switch test
selftests/x86/amx: Add test cases for AMX state management
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu: Add XFD handling for dynamic states
x86/fpu: Calculate the default sizes independently
x86/fpu/amx: Define AMX state components and have it used for boot-time checks
x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
x86/fpu/xstate: Add fpstate_realloc()/free()
x86/fpu/xstate: Add XFD #NM handler
x86/fpu: Update XFD state where required
x86/fpu: Add sanity checks for XFD
x86/fpu: Add XFD state to fpstate
x86/msr-index: Add MSRs for XFD
x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
x86/fpu: Reset permission and fpstate on exec()
x86/fpu: Prepare fpu_clone() for dynamically enabled features
x86/fpu/signal: Prepare for variable sigframe length
x86/signal: Use fpu::__state_user_size for sigalt stack validation
...
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case the symbol is not resolvable.
- Make wchan() more robust and work with all kinds of unwinders by
enforcing that the task stays blocked while unwinding is in
progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to the RT and DL scheduling classes and
add statistics for bandwidth burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
Merge tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Thomas Gleixner:
- Improve retpoline code patching by separating it from alternatives,
which reduces the memory footprint and allows better optimizations
in the actual runtime patching.
- Add proper retpoline support for x86/BPF
- Address noinstr warnings in x86/kvm, lockdep and paravirtualization
code
- Add support to handle pv_ops indirect calls in the noinstr analysis
- Classify symbols upfront and cache the result to avoid redundant
str*cmp() invocations.
- Add a CFI hash to reduce memory consumption, which also reduces
runtime on an allyesconfig by ~50%
- Adjust XEN code to make objtool handling more robust and, as a side
effect, to prevent text fragmentation due to placement of the
hypercall page.
* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
bpf,x86: Respect X86_FEATURE_RETPOLINE*
bpf,x86: Simplify computing label offsets
x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
x86/alternative: Add debug prints to apply_retpolines()
x86/alternative: Try inline spectre_v2=retpoline,amd
x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
x86/alternative: Implement .retpoline_sites support
x86/retpoline: Create a retpoline thunk array
x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
x86/asm: Fixup odd GEN-for-each-reg.h usage
x86/asm: Fix register order
x86/retpoline: Remove unused replacement symbols
objtool,x86: Replace alternatives with .retpoline_sites
objtool: Shrink struct instruction
objtool: Explicitly avoid self modifying code in .altinstr_replacement
objtool: Classify symbols
objtool: Support pv_ops indirect calls for noinstr
x86/xen: Rework the xen_{cpu,irq,mmu}_ops arrays
x86/xen: Mark xen_force_evtchn_callback() noinstr
x86/xen: Make irq_disable() noinstr
...
Merge tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
- Move futex code into kernel/futex/ and split up the kitchen sink
into separate files to make integration of sys_futex_waitv()
simpler.
- Add a new sys_futex_waitv() syscall which allows waiting on
multiple futexes.
The main use case is emulating Windows' WaitForMultipleObjects,
which allows Wine to improve the performance of Windows games.
Native Linux games can also benefit from this interface, as this
is a common wait pattern for this kind of application.
- Add context to ww_mutex_trylock() to provide a path for i915 to
rework their eviction code step by step without making lockdep
upset until the final steps of rework are completed. It's also
useful for regulator and TTM to avoid dropping locks in the
non-contended path.
- Lockdep and might_sleep() cleanups and improvements
- A few improvements for the RT substitutions.
- The usual small improvements and cleanups.
* tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
locking: Remove spin_lock_flags() etc
locking/rwsem: Fix comments about reader optimistic lock stealing conditions
locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
locking/rwsem: Disable preemption for spinning region
docs: futex: Fix kernel-doc references
futex: Fix PREEMPT_RT build
futex2: Documentation: Document sys_futex_waitv() uAPI
selftests: futex: Test sys_futex_waitv() wouldblock
selftests: futex: Test sys_futex_waitv() timeout
selftests: futex: Add sys_futex_waitv() test
futex,arm: Wire up sys_futex_waitv()
futex,x86: Wire up sys_futex_waitv()
futex: Implement sys_futex_waitv()
futex: Simplify double_lock_hb()
futex: Split out wait/wake
futex: Split out requeue
futex: Rename mark_wake_futex()
futex: Rename: match_futex()
futex: Rename: hb_waiter_{inc,dec,pending}()
futex: Split out PI futex
...
Merge tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
"Core:
- Allow ftrace to instrument parts of the perf core code
- Add a new mem_hops field to perf_mem_data_src which allows
representing intra-node/package or inter-node/off-package details,
to prepare for next-generation systems which have more hierarchy
within the node/package level.
Tools:
- Update for the new mem_hops field in perf_mem_data_src
Arch:
- A set of constraints fixes for the Intel uncore PMU
- The usual set of small fixes and improvements for x86 and PPC"
* tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
tools/perf: Add mem_hops field in perf_mem_data_src structure
perf: Add mem_hops field in perf_mem_data_src structure
perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line
perf/core: Allow ftrace for functions in kernel/event/core.c
perf/x86: Add new event for AUX output counter index
perf/x86: Add compiler barrier after updating BTS
perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
perf/x86/intel/uncore: Fix invalid unit check
perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
Merge tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core changes:
- Prevent a potential deadlock when initial priority is assigned to a
newly created interrupt thread. A recent change to plug a race
between cpuset and __sched_setscheduler() introduced a new lock
dependency which is now triggered. Break the lock dependency chain
by moving the priority assignment to the thread function.
- A couple of small updates to make the irq core RT safe.
- Confine the irq_cpu_online/offline() API to its only remaining
unfixable user, Cavium Octeon, so that it does not grow new usage.
- A small documentation update
Driver changes:
- A large cross architecture rework to move irq_enter/exit() into the
architecture code to make addressing the NOHZ_FULL/RCU issues
simpler.
- The obligatory new irq chip driver for Microchip EIC
- Modularize a few irq chip drivers
- Expand usage of devm_*() helpers throughout the driver code
- The usual small fixes and improvements all over the place"
* tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
MIPS: irq: Avoid an unused-variable error
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
...
In commit 324bda9e6c5a ("bpf: multi program support for cgroup+bpf"),
the cgroup_bpf_*() functions were called from kernel/bpf/syscall.c,
but now they are only used in kernel/bpf/cgroup.c, so move these
functions to kernel/bpf/cgroup.c, like cgroup_bpf_replace().
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In account_guest_time() in kernel/sched/cputime.c, guest time is
attributed to both CPUTIME_NICE and CPUTIME_USER, in addition to
CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding
both when calculating usage results in double counting any guest time
at the rootcg.
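A simplified sketch of that attribution (condensed from the
description above; not the exact source):
	/* account_guest_time(): guest time lands in two buckets at once */
	if (task_nice(p) > 0) {
		cpustat[CPUTIME_NICE]       += cputime;
		cpustat[CPUTIME_GUEST_NICE] += cputime;
	} else {
		cpustat[CPUTIME_USER]  += cputime;
		cpustat[CPUTIME_GUEST] += cputime;
	}

	/*
	 * So USER + NICE already include all guest time; a usage sum
	 * must not add GUEST/GUEST_NICE on top of them.
	 */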
Fixes: 936f2a70f2 ("cgroup: add cpu.stat file to root cgroup")
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahiro, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
Disabling unprivileged BPF helps prevent unprivileged users from
creating the conditions required for potential speculative execution
side-channel attacks on unmitigated affected hardware.
A deep dive on such attacks and the current mitigations is available
here [0].
Sync with what many distros are already applying, and disable
unprivileged BPF by default. An admin can enable it at runtime, if
necessary, as described in 08389d8882 ("bpf: Add kconfig knob for
disabling unpriv bpf by default").
[0] "BPF and Spectre: Mitigating transient execution attacks", Daniel Borkmann, eBPF Summit '21
https://ebpf.io/summit-2021-slides/eBPF_Summit_2021-Keynote-Daniel_Borkmann-BPF_and_Spectre.pdf
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/bpf/0ace9ce3f97656d5f62d11093ad7ee81190c3c25.1635535215.git.pawan.kumar.gupta@linux.intel.com
Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache
Pull memory folios from Matthew Wilcox:
"Add memory folios, a new type to represent either order-0 pages or the
head page of a compound page. This should be enough infrastructure to
support filesystems converting from pages to folios.
The point of all this churn is to allow filesystems and the page cache
to manage memory in larger chunks than PAGE_SIZE. The original plan
was to use compound pages like THP does, but I ran into problems with
some functions expecting only a head page while others expect the
precise page containing a particular byte.
The folio type allows a function to declare that it's expecting only a
head page. Almost incidentally, this allows us to remove various calls
to VM_BUG_ON(PageTail(page)) and compound_head().
This converts just parts of the core MM and the page cache. For 5.17,
we intend to convert various filesystems (XFS and AFS are ready; other
filesystems may make it) and also convert more of the MM and page
cache to folios. For 5.18, multi-page folios should be ready.
The multi-page folios offer some improvement to some workloads. The
80% win is real, but appears to be an artificial benchmark (postgres
startup, which isn't a serious workload). Real workloads (eg building
the kernel, running postgres in a steady state, etc) seem to benefit
between 0-10%. I haven't heard of any performance losses as a result
of this series. Nobody has done any serious performance tuning; I
imagine that tweaking the readahead algorithm could provide some more
interesting wins. There are also other places where we could choose to
create large folios and currently do not, such as writes that are
larger than PAGE_SIZE.
I'd like to thank all my reviewers who've offered review/ack tags:
Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
Babka, William Kucharski, Yu Zhao and Zi Yan.
I'd also like to thank those who gave feedback I incorporated but
haven't offered up review tags for this part of the series: Nick
Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
Hugh Dickins, and probably a few others who I forget"
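As a minimal sketch of the type distinction (page_folio() is part of
this series; the callee names here are illustrative):
/* a folio parameter declares "never a tail page" at the type level */
void mark_folio_dirty_sketch(struct folio *folio);

void mark_page_dirty_sketch(struct page *page)
{
	/*
	 * One explicit conversion at the boundary replaces scattered
	 * compound_head() calls and VM_BUG_ON(PageTail(page)) checks.
	 */
	mark_folio_dirty_sketch(page_folio(page));
}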
* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
mm/writeback: Add folio_write_one
mm/filemap: Add FGP_STABLE
mm/filemap: Add filemap_get_folio
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Add filemap_alloc_folio
mm/page_alloc: Add folio allocation functions
mm/lru: Add folio_add_lru()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm: Add folio_evictable()
mm/workingset: Convert workingset_refault() to take a folio
mm/filemap: Add readahead_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add i_blocks_per_folio()
mm/writeback: Add folio_redirty_for_writepage()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add filemap_dirty_folio()
...
update_next_balance() uses sd->last_balance, which is not modified by
load_balance(), so we can merge the two calls in one place.
No functional change.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-6-vincent.guittot@linaro.org
With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on sd->max_newidle_lb_cost to abort
newidle_balance.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
Decay max_newidle_lb_cost only when it has not been updated for a
while, and ensure that a recently changed value is not decayed.
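A sketch of the resulting update/decay rule (field names follow this
series; the exact decay constant may differ, roughly 1% per second):
	if (cost > sd->max_newidle_lb_cost) {
		/* a new maximum also (re)starts the hold-off period */
		sd->max_newidle_lb_cost = cost;
		sd->last_decay_max_lb_cost = jiffies;
	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
		/* decay only once the value has been stable for a while */
		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
		sd->last_decay_max_lb_cost = jiffies;
	}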
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
In newidle_balance(), the scheduler skips load balancing on the newly
idle CPU when the first sd of this_rq satisfies:
this_rq->avg_idle < sd->max_newidle_lb_cost
Doing a costly call to update_blocked_averages() is not useful and
simply adds overhead when this condition is true.
Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.
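A condensed sketch of the early bail-out (simplified; in the series
the check is folded into the surrounding conditions):
	/* in newidle_balance(), before any expensive work: */
	struct sched_domain *sd;

	rcu_read_lock();
	sd = rcu_dereference_check_sched_domain(this_rq->sd);
	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
		rcu_read_unlock();
		goto out;	/* skip update_blocked_averages() */
	}
	rcu_read_unlock();

	update_blocked_averages(this_cpu);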
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-3-vincent.guittot@linaro.org
The time spent updating the blocked load can be significant depending on
the complexity of the cgroup hierarchy. Take this time into account in
the cost of the 1st load balance of a newly idle CPU.
Also reduce the number of calls to sched_clock_cpu() and track more of
the actual work.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-2-vincent.guittot@linaro.org
parisc, ia64 and powerpc32 are the only remaining architectures that
provide custom arch_{spin,read,write}_lock_flags() functions, which are
meant to re-enable interrupts while waiting for a spinlock.
However, none of these can actually run into this codepath, because
it is only called on architectures without CONFIG_GENERIC_LOCKBREAK,
or when CONFIG_DEBUG_LOCK_ALLOC is set without CONFIG_LOCKDEP, and none
of those combinations are possible on the three architectures.
Going back in the git history, it appears that arch/mn10300 may have
been able to run into this code path, but there is a good chance that
it never worked. On the architectures that still exist, it was
already impossible to hit back in 2008 after the introduction of
CONFIG_GENERIC_LOCKBREAK, and possibly earlier.
As this is all dead code, just remove it and the helper functions built
around it. For arch/ia64, the inline asm could be cleaned up, but
it seems safer to leave it untouched.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Link: https://lore.kernel.org/r/20211022120058.1031690-1-arnd@kernel.org
Use force_fatal_sig instead of calling do_exit directly. This ensures
the ordinary signal handling path gets invoked, core dumps as
appropriate get created, and for multi-threaded processes all of the
threads are terminated, not just a single thread.
When asked, Gabriel Krisman Bertazi <krisman@collabora.com> said [1]:
> ebiederm@xmission.com (Eric W. Biederman) asked:
>
> > Why does do_syscall_user_dispatch call do_exit(SIGSEGV) and
> > do_exit(SIGSYS) instead of force_sig(SIGSEGV) and force_sig(SIGSYS)?
> >
> > Looking at the code these cases are not expected to happen, so I would
> > be surprised if userspace depends on any particular behaviour on the
> > failure path so I think we can change this.
>
> Hi Eric,
>
> There is not really a good reason, and the use case that originated the
> feature doesn't rely on it.
>
> Unless I'm missing yet another problem and others correct me, I think
> it makes sense to change it as you described.
>
> > Is using do_exit in this way something you copied from seccomp?
>
> I'm not sure, it's been a while, but I think it might be just that. The
> first prototype of SUD was implemented as a seccomp mode.
If at some point it becomes interesting we could relax
"force_fatal_sig(SIGSEGV)" to instead say
"force_sig_fault(SIGSEGV, SEGV_MAPERR, sd->selector)".
I avoid doing that in this patch to avoid making it possible
to catch currently uncatchable signals.
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
[1] https://lkml.kernel.org/r/87mtr6gdvi.fsf@collabora.com
Link: https://lkml.kernel.org/r/20211020174406.17889-14-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Add a simple helper force_fatal_sig that causes a signal to be
delivered to a process as if the signal handler was set to SIG_DFL.
Reimplement force_sigsegv based upon this new helper. This fixes
force_sigsegv so that when it forces the default signal handler
to be used the code now forces the signal to be unblocked as well.
Reusing the tested logic in force_sig_info_to_task that was built for
force_sig_seccomp this makes the implementation trivial.
This is interesting both because it makes force_sigsegv simpler and
because there are a couple of buggy places in the kernel that call
do_exit(SIGILL) or do_exit(SIGSYS) because there is no straightforward
way today for those places to simply force the exit of a
process with the chosen signal. Creating force_fatal_sig allows
those places to be implemented with normal signal exits.
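A sketch of the intended call-site conversion (illustrative):
/* before: kills only the current thread and skips the signal path */
do_exit(SIGSYS);
/* after: normal fatal-signal exit; whole process, coredump as usual */
force_fatal_sig(SIGSYS);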
Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
In 2009 Oleg reworked[1] the kernel threads so that it is not
necessary to call do_exit if you are not using kthread_stop(). Remove
the explicit calls of do_exit and complete_and_exit (with a NULL
completion) that were previously necessary.
[1] 63706172f3 ("kthreads: rework kthread_stop()")
Link: https://lkml.kernel.org/r/20211020174406.17889-12-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Merge tag 'trace-v5.15-rc6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing comment fixes from Steven Rostedt:
- Some bots have informed me that some of the ftrace functions
kernel-doc has formatting issues.
- Also, fix my snake instinct.
* tag 'trace-v5.15-rc6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix misspelling of "missing"
ftrace: Fix kernel-doc formatting issues
Some functions had kernel-doc that used a comma instead of a hash to
separate the function name from the one line description.
Also, the "ftrace_is_dead()" had an incomplete description.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'irqchip-5.16' into irq/core
Merge irqchip updates for Linux 5.16 from Marc Zyngier:
- A large cross-arch rework to move irq_enter()/irq_exit() into
the arch code, and removing it from the generic irq code.
Thanks to Mark Rutland for the huge effort!
- A few irqchip drivers are made modular (broadcom, meson), because
that's apparently a thing...
- A new driver for the Microchip External Interrupt Controller
- The irq_cpu_offline()/irq_cpu_online() API is now deprecated and
can only be selected on the Cavium Octeon platform. Once this
platform is removed, the API will be removed at the same time.
- A sprinkle of devm_* helpers, as people seem to love that.
- The usual spattering of small fixes and minor improvements.
* tag 'irqchip-5.16': (912 commits)
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
MIPS: irq: Avoid an unused-variable error
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
...
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211029083332.3680101-1-maz@kernel.org
This helper allows us to get the address of a kernel symbol from inside
a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
relocate typeless ksym vars.
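A hedged usage sketch from a BPF_PROG_TYPE_SYSCALL program (the helper
name and signature follow this patch's description; treat the details as
assumptions):
__u64 addr = 0;
/* name_sz includes the NUL terminator */
int err = bpf_kallsyms_lookup_name("jiffies", sizeof("jiffies"), 0, &addr);
/* on success, addr holds the symbol address for the relocation */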
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com
This patch adds the kernel-side changes for the implementation of
a bpf bloom filter map.
The bloom filter map supports peek (determining whether an element
is present in the map) and push (adding an element to the map)
operations. These operations are exposed to userspace applications
through the already existing syscalls in the following way:
BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_UPDATE_ELEM -> push
The bloom filter map does not have keys, only values. In light of
this, the bloom filter map's API matches that of queue stack maps:
user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
APIs to query or add an element to the bloom filter map. When the
bloom filter map is created, it must be created with a key_size of 0.
For updates, the user will pass in the element to add to the map
as the value, with a NULL key. For lookups, the user will pass in the
element to query in the map as the value, with a NULL key. In the
verifier layer, this requires us to modify the argument type of
a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE;
as well, in the syscall layer, we need to copy over the user value
so that in bpf_map_peek_elem, we know which specific value to query.
A few things to please take note of:
* If there are any concurrent lookups + updates, the user is
responsible for synchronizing this to ensure no false negative lookups
occur.
* The number of hashes to use for the bloom filter is configurable from
userspace. If no number is specified, the default used will be 5 hash
functions. The benchmarks later in this patchset can help compare the
performance of using different numbers of hashes on different entry
sizes. In general, using more hashes decreases both the false positive
rate and the speed of a lookup.
* Deleting an element in the bloom filter map is not supported.
* The bloom filter map may be used as an inner map.
* The "max_entries" size that is specified at map creation time is used
to approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries
into the bloom filter than "max_entries", they may do so but they should
be aware that this may lead to a higher false positive rate.
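For illustration, a userspace usage sketch via libbpf's generic syscall
wrappers (the flag choice here is an assumption):
__u64 value = 42;
/* push: key must be NULL, the element travels as the value */
bpf_map_update_elem(map_fd, NULL, &value, BPF_ANY);
/* peek: 0 if the element is (probably) present, -ENOENT otherwise */
int err = bpf_map_lookup_elem(map_fd, NULL, &value);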
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com
Merge tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from WiFi (mac80211), and BPF.
Current release - regressions:
- skb_expand_head: adjust skb->truesize to fix socket memory
accounting
- mptcp: fix corrupt receiver key in MPC + data + checksum
Previous releases - regressions:
- multicast: calculate csum of looped-back and forwarded packets
- cgroup: fix memory leak caused by missing cgroup_bpf_offline
- cfg80211: fix management registrations locking, prevent list
corruption
- cfg80211: correct false positive in bridge/4addr mode check
- tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing
previous verdict
Previous releases - always broken:
- sctp: enhancements for the verification tag, prevent attackers from
killing SCTP sessions
- tipc: fix size validations for the MSG_CRYPTO type
- mac80211: mesh: fix HE operation element length check, prevent out
of bound access
- tls: fix sign of socket errors, prevent positive error codes being
reported from read()/write()
- cfg80211: scan: extend RCU protection in
cfg80211_add_nontrans_list()
- implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for
sockets in a BPF sockmap
- bpf: fix potential race in tail call compatibility check resulting
in two operations which would make the map incompatible succeeding
- bpf: prevent increasing bpf_jit_limit above max
- bpf: fix error usage of map_fd and fdget() in generic batch update
- phy: ethtool: lock the phy for consistency of results
- prevent infinite while loop in skb_tx_hash() when Tx races with
driver reconfiguring the queue <> traffic class mapping
- usbnet: fixes for bad HW conjured by syzbot
- xen: stop tx queues during live migration, prevent UAF
- net-sysfs: initialize uid and gid before calling
net_ns_get_ownership
- mlxsw: prevent Rx stalls under memory pressure"
* tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
Revert "net: hns3: fix pause config problem after autoneg disabled"
mptcp: fix corrupt receiver key in MPC + data + checksum
riscv, bpf: Fix potential NULL dereference
octeontx2-af: Fix possible null pointer dereference.
octeontx2-af: Display all enabled PF VF rsrc_alloc entries.
octeontx2-af: Check whether ipolicers exists
net: ethernet: microchip: lan743x: Fix skb allocation failure
net/tls: Fix flipped sign in async_wait.err assignment
net/tls: Fix flipped sign in tls_err_abort() calls
net/smc: Correct spelling mistake to TCPF_SYN_RECV
net/smc: Fix smc_link->llc_testlink_time overflow
nfp: bpf: relax prog rejection for mtu check through max_pkt_offset
vmxnet3: do not stop tx queues after netif_device_detach()
r8169: Add device 10ec:8162 to driver r8169
ptp: Document the PTP_CLK_MAGIC ioctl number
usbnet: fix error return code in usbnet_probe()
net: hns3: adjust string spaces of some parameters of tx bd info in debugfs
net: hns3: expand buffer len for some debugfs command
net: hns3: add more string spaces for dumping packets number of queue info in debugfs
net: hns3: fix data endian problem of some functions of debugfs
...
Merge tag 'trace-v5.15-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Do not WARN when attaching event probe to non-existent event
If the user tries to attach an event probe (eprobe) to an event that
does not exist, it will trigger a warning. There's an error check that
only expects memory issues; otherwise it is considered a bug. But
changes in the code to move around the locking made it possible for it
to error out if the user attempts to attach to an event that does not
exist, returning -ENODEV. As this path can be caused by user space
putting in a bad value, do not trigger a WARN"
* tag 'trace-v5.15-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Do not warn when connecting eprobe to non existing event
* irq/irq_cpu_offline:
: .
: Make irq_cpu_{on,off}line() a deprecated kernel API, and only
: enable it for some obscure Cavium platform after having
: moved all the other users away from it.
:
: Next step, drop the platform itself.
: .
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
Signed-off-by: Marc Zyngier <maz@kernel.org>
* irq/remove-handle-domain-irq-20211026:
: Large rework of the architecture entry code from Mark Rutland.
: From the cover letter:
:
: <quote>
: The handle_domain_{irq,nmi}() functions were originally intended as a
: convenience, but recent rework to entry code across the kernel tree has
: demonstrated that they cause more pain than they're worth and prevent
: architectures from being able to write robust entry code.
:
: This series reworks the irq code to remove them, handling the necessary
: entry work consistently in entry code (be it architectural or generic).
: </quote>
MIPS: irq: Avoid an unused-variable error
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
irq: mips: stop (ab)using handle_domain_irq()
irq: mips: simplify bcm6345_l1_irq_handle()
irq: mips: avoid nested irq_enter()
Signed-off-by: Marc Zyngier <maz@kernel.org>
When the syscall trace points are not configured in, the kselftests for
ftrace will try to attach an event probe (eprobe) to one of the system
call trace points. This triggered a WARNING, because the failure path
only expects to see memory issues. But this is not the only failure. The
user may attempt to attach to a non-existent event, and the kernel must
not warn about it.
Link: https://lkml.kernel.org/r/20211027120854.0680aa0f@gandalf.local.home
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 316580b69d ("u64_stats: provide u64_stats_t type")
fixed possible load/store tearing on 64bit arches.
For instance, the following C code
stats->nsecs += sched_clock() - start;
could be rightfully implemented like this by a compiler,
confusing concurrent readers a lot:
stats->nsecs += sched_clock();
// arbitrary delay
stats->nsecs -= start;
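A sketch of the tear-free pattern (an approximation of the fix, assuming
the u64_stats_t helpers from the quoted commit):
/* compute the delta once, then publish it in a single operation */
u64 delta = sched_clock() - start;
u64_stats_update_begin(&stats->syncp);
u64_stats_add(&stats->nsecs, delta);
u64_stats_update_end(&stats->syncp);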
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026214133.3114279-4-eric.dumazet@gmail.com
It seems update_prog_stats() suffers from the same issue fixed
in the prior patch:
As it can run while interrupts are enabled, it could
be re-entered and the u64_stats syncp could be mangled.
Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026214133.3114279-3-eric.dumazet@gmail.com
If the perf buffer isn't large enough, provide a hint about how large it
needs to be for whatever is running.
Link: https://lkml.kernel.org/r/20210831043723.13481-1-robbat2@gentoo.org
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With CONFIG_DEBUG_PREEMPT we observed reports like:
BUG: using smp_processor_id() in preemptible
caller is perf_ftrace_function_call+0x6f/0x2e0
CPU: 1 PID: 680 Comm: a.out Not tainted
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xcf
check_preemption_disabled+0x104/0x110
? optimize_nops.isra.7+0x230/0x230
? text_poke_bp_batch+0x9f/0x310
perf_ftrace_function_call+0x6f/0x2e0
...
__text_poke+0x5/0x620
text_poke_bp_batch+0x9f/0x310
This tells us that the CPU could change after the task is preempted,
so a CPU check done before preemption becomes invalid.
Since ftrace_test_recursion_trylock() now disables preemption, this
patch just does the checking after trylock() to address the issue.
Link: https://lkml.kernel.org/r/54880691-5fe2-33e7-d12f-1fa6136f5183@linux.alibaba.com
CC: Steven Rostedt <rostedt@goodmis.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the documentation explains, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() were supposed to disable and
enable preemption properly; however, this work is currently done
outside of the functions, where callers could miss it by mistake.
And since the internal use of trace_test_and_set_recursion()
and trace_clear_recursion() also requires preemption to be
disabled, we can just merge the logic.
This patch makes sure that preemption has been disabled when
trace_test_and_set_recursion() returns a bit >= 0, and that
trace_clear_recursion() will re-enable preemption if it was
previously enabled.
Link: https://lkml.kernel.org/r/13bde807-779c-aa4c-0672-20515ae365ea@linux.alibaba.com
CC: Petr Mladek <pmladek@suse.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
[ Removed extra line in comment - SDR ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Clarify argument names and contract for fsnotify_create() and
fsnotify_mkdir() to reflect the anomaly of kernfs, which leaves dentries
negative after mkdir/create.
Remove the WARN_ON(!inode) in audit code that was added by the Fixes
commit under the wrong assumption that dentries cannot be negative after
mkdir/create.
Fixes: aa93bdc550 ("fsnotify: use helpers to access data by data_type")
Link: https://lore.kernel.org/linux-fsdevel/87mtp5yz0q.fsf@collabora.com/
Link: https://lore.kernel.org/r/20211025192746.66445-4-krisman@collabora.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
'dma_mem->bitmap' is a bitmap. So use 'bitmap_zalloc()' to simplify the
code, improve the semantics, and avoid some open-coded arithmetic in
allocator arguments.
Also change the corresponding 'kfree()' into 'bitmap_free()' to keep
consistency.
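The conversion shape, sketched (illustrative; 'pages' stands in for the
bit count):
/* before: open-coded size arithmetic */
dma_mem->bitmap = kzalloc(BITS_TO_LONGS(pages) * sizeof(long), GFP_KERNEL);
kfree(dma_mem->bitmap);
/* after: dedicated bitmap allocator */
dma_mem->bitmap = bitmap_zalloc(pages, GFP_KERNEL);
bitmap_free(dma_mem->bitmap);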
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Division is a slow operation. If the divisor is a power of 2, use a
shift instead.
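A minimal sketch of the optimization (illustrative; helpers from
linux/log2.h and linux/math64.h):
static u64 div_by_maybe_pow2(u64 val, u64 div)
{
	/* val / div == val >> ilog2(div) when div is a power of two */
	if (is_power_of_2(div))
		return val >> ilog2(div);
	return div_u64(val, div);
}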
Results were obtained using Android's version of perf (simpleperf[1]) as
described below:
1. hist_field_div() is modified to call 2 test functions:
test_hist_field_div_[not]_optimized(); passing them the
same args. Use noinline and volatile to ensure these are
not optimized out by the compiler.
2. Create a hist event trigger that uses division:
events/kmem/rss_stat$ echo 'hist:keys=common_pid:x=size/<divisor>'
>> trigger
events/kmem/rss_stat$ echo 'hist:keys=common_pid:vals=$x'
>> trigger
3. Run Android's lmkd_test[2] to generate rss_stat events, and
record CPU samples with Android's simpleperf:
simpleperf record -a --exclude-perf --post-unwind=yes -m 16384 -g
-f 2000 -o perf.data
== Results ==
Divisor is a power of 2 (divisor == 32):
test_hist_field_div_not_optimized | 8,717,091 cpu-cycles
test_hist_field_div_optimized | 1,643,137 cpu-cycles
If the divisor is a power of 2, the optimized version is ~5.3x faster.
Divisor is not a power of 2 (divisor == 33):
test_hist_field_div_not_optimized | 4,444,324 cpu-cycles
test_hist_field_div_optimized | 5,497,958 cpu-cycles
If the divisor is not a power of 2, as expected, the optimized version is
slightly slower (~24% slower).
[1] https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md
[2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp
Link: https://lkml.kernel.org/r/20211025200852.3002369-7-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If both operands of a hist trigger expression are constants, convert the
expression to a constant. This optimization avoids having to perform the
same calculation multiple times and also saves on memory since the
merged constants are represented by a single struct hist_field instead
of multiple.
Link: https://lkml.kernel.org/r/20211025200852.3002369-6-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The '-' in .sym-offset can confuse the hist trigger arithmetic
expression parsing. Simplify the handling of this by replacing the
'sym-offset' with 'symXoffset'. This allows us to correctly evaluate
expressions where the user may have inadvertently added a .sym-offset
modifier to one of the operands in an expression, instead of bailing
out. In this case the .sym-offset has no effect on the evaluation of the
expression. The only valid use of the .sym-offset is as a hist key
modifier.
Link: https://lkml.kernel.org/r/20211025200852.3002369-5-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The current histogram expression evaluation logic evaluates the
expression from right to left. This can lead to incorrect results
if the operations are not associative (as is the case for the subtraction
and the now-added division operators).
e.g. 16-8-4-2 should be 2 not 10 --> 16-8-4-2 = ((16-8)-4)-2
64/8/4/2 should be 1 not 16 --> 64/8/4/2 = ((64/8)/4)/2
Division and multiplication are currently limited to single operation
expression due to operator precedence support not yet implemented.
Rework the expression parsing to support the correct evaluation of
expressions containing operators of different precedences; and fix
the associativity error by evaluating expressions with operators of
the same precedence from left to right.
Examples:
(1) echo 'hist:keys=common_pid:a=8,b=4,c=2,d=1,w=$a-$b-$c-$d' \
>> event/trigger
(2) echo 'hist:keys=common_pid:x=$a/$b/3/2' >> event/trigger
(3) echo 'hist:keys=common_pid:y=$a+10/$c*1024' >> event/trigger
(4) echo 'hist:keys=common_pid:z=$a/$b+$c*$d' >> event/trigger
Link: https://lkml.kernel.org/r/20211025200852.3002369-4-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adds basic support for division and multiplication operations for
hist trigger variable expressions.
For simplicity this patch only supports division and multiplication
for a single operation expression (e.g. x=$a/$b), as currently
expressions are always evaluated right to left. This can lead to some
incorrect results:
e.g. echo 'hist:keys=common_pid:x=8-4-2' >> event/trigger
8-4-2 should evaluate to 2, i.e. (8-4)-2,
but currently x evaluates to 6, i.e. 8-(4-2).
Multiplication and division in sub-expressions will work correctly, once
correct operator precedence support is added (See next patch in this
series).
For the undefined case of division by 0, the histogram expression
evaluates to (u64)(-1). Since this cannot be detected when the
expression is created, it is the responsibility of the user to be
aware and account for this possibility.
Examples:
echo 'hist:keys=common_pid:a=8,b=4,x=$a/$b' \
>> event/trigger
echo 'hist:keys=common_pid:y=5*$b' \
>> event/trigger
Link: https://lkml.kernel.org/r/20211025200852.3002369-3-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently hist trigger expressions don't support the use of numeric
literals:
e.g. echo 'hist:keys=common_pid:x=$y-1234'
--> is not valid expression syntax
Having the ability to use numeric constants in hist triggers supports
a wider range of expressions for creating variables.
Add support for creating trace event histogram variables from numeric
literals.
e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger
A negative numeric constant is created using the unary minus operator
(parentheses are required).
e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger
Constants can be used with division/multiplication (added in the
next patch in this series) to implement granularity filters for frequent
trace events. For instance we can limit emitting the rss_stat
trace event to when there is a 512KB cross over in the rss size:
# Create a synthetic event to monitor instead of the high frequency
# rss_stat event
echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
int member; long size' >> tracing/synthetic_events
# Create a hist trigger that emits the synthetic rss_stat_throttled
# event only when the rss size crosses a 512KB boundary.
echo 'hist:keys=mm_id,member:bucket=size/0x80000:onchange($bucket)
.rss_stat_throttled(mm_id,curr,member,size)'
>> events/kmem/rss_stat/trigger
A use case for using constants with addition/subtraction is not yet
known, but for completeness the use of constants is supported for all
operators.
Link: https://lkml.kernel.org/r/20211025200852.3002369-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf 2021-10-26
We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).
The main changes are:
1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.
2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.
3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.
4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.
5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.
6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.
7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix potential race in tail call compatibility check
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
selftests/bpf: Use recv_timeout() instead of retries
net: Implement ->sock_is_readable() for UDP and AF_UNIX
skmsg: Extract and reuse sk_msg_is_readable()
net: Rename ->stream_memory_read to ->sock_is_readable
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
bpf: Prevent increasing bpf_jit_limit above max
bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================
Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since config KPROBES_SANITY_TEST is in lib/Kconfig.debug, it is better to
move test_kprobes.c into lib/, just like other similar tests found in lib/.
Link: https://lkml.kernel.org/r/1635213091-24387-4-git-send-email-yangtiezhu@loongson.cn
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a test case for stacktrace from a kretprobe handler and
from nested kretprobe handlers.
This test checks both the stack trace inside a kretprobe handler
and the stack trace from pt_regs. Those stack traces must include
the actual function return address instead of the kretprobe trampoline.
The nested kretprobe stacktrace test checks whether the unwinder
can correctly unwind the call frame on the stack which has been
modified by the kretprobe.
Since the stacktrace on kretprobe is correctly fixed only on x86,
this introduces a meta kconfig ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
which tells the user whether the stacktrace on kretprobe is correct.
The test results will be shown like below;
TAP version 14
1..1
# Subtest: kprobes_test
1..6
ok 1 - test_kprobe
ok 2 - test_kprobes
ok 3 - test_kretprobe
ok 4 - test_kretprobes
ok 5 - test_stacktrace_on_kretprobe
ok 6 - test_stacktrace_on_nested_kretprobe
# kprobes_test: pass:6 fail:0 skip:0 total:6
# Totals: pass:6 fail:0 skip:0 total:6
ok 1 - kprobes_test
Link: https://lkml.kernel.org/r/163516211244.604541.18350507860972214415.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Lorenzo noticed that the code testing for program type compatibility of
tail call maps is potentially racy in that two threads could encounter a
map with an unset type simultaneously and both return true even though they
are inserting incompatible programs.
The race window is quite small, but artificially enlarging it by adding a
usleep_range() inside the check in bpf_prog_array_compatible() makes it
trivial to trigger from userspace with a program that does, essentially:
map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
pid = fork();
if (pid) {
key = 0;
value = xdp_fd;
} else {
key = 1;
value = tc_fd;
}
err = bpf_map_update_elem(map_fd, &key, &value, 0);
While the race window is small, it has potentially serious ramifications in
that triggering it would allow a BPF program to tail call to a program of a
different type. So let's get rid of it by protecting the update with a
spinlock. The commit in the Fixes tag is the last commit that touches the
code in question.
v2:
- Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
v3:
- Put lock and the members it protects into an embedded 'owner' struct (Daniel)
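A sketch of the resulting locked check (an approximation based on the
changelog; field names assumed from the v3 note above):
bool compatible;

spin_lock(&map->owner.lock);
if (!map->owner.type) {
	/* first program inserted claims the map's type */
	map->owner.type  = prog->type;
	map->owner.jited = prog->jited;
	compatible = true;
} else {
	compatible = map->owner.type == prog->type &&
		     map->owner.jited == prog->jited;
}
spin_unlock(&map->owner.lock);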
Fixes: 3324b584b6 ("ebpf: misc core cleanup")
Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
Make valid_state() check if the ->enter callback is present in
suspend_ops (only PM_SUSPEND_TO_IDLE can be valid otherwise) and
make sleep_state_supported() call valid_state() consistently to
validate the states other than PM_SUSPEND_TO_IDLE.
While at it, clean up the comment in valid_state().
No expected functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 8651f97bd9 ("PM / cpuidle: System resume hang fix with
cpuidle"), which introduced cpuidle pausing during system suspend,
did that to work around a platform firmware issue causing systems
to hang during resume if CPUs were allowed to enter idle states
in the system suspend and resume code paths.
However, pausing cpuidle before the last phase of suspending
devices is the source of an otherwise arbitrary difference between
the suspend-to-idle path and other system suspend variants, so it is
cleaner to do that later, before taking secondary CPUs offline (it
is still safer to take secondary CPUs offline with cpuidle paused,
though).
Modify the code accordingly, but in order to avoid code duplication,
introduce new wrapper functions, pm_sleep_disable_secondary_cpus()
and pm_sleep_enable_secondary_cpus(), to combine cpuidle_pause()
and cpuidle_resume(), respectively, with the handling of secondary
CPUs during system-wide transitions to sleep states.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
It is pointless to pause cpuidle in the suspend-to-idle path,
because it is going to be resumed in the same path later and
pausing it does not serve any particular purpose in that case.
Rework the code to avoid doing that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
The sparse tool complains as follows:
kernel/trace/trace_hwlat.c:82:27: warning: symbol 'hwlat_single_cpu_data' was not declared. Should it be static?
kernel/trace/trace_hwlat.c:83:1: warning: symbol '__pcpu_scope_hwlat_per_cpu_data' was not declared. Should it be static?
These symbols are not used outside of trace_hwlat.c, so this commit
marks them static.
Link: https://lkml.kernel.org/r/20211021035225.1050685-1-bobo.shaobowang@huawei.com
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
irq_cpu_{on,off}line() are now only used by the Octeon platform.
Make their use conditional on this platform being enabled, and
otherwise hidden away.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Link: https://lore.kernel.org/r/20211021170414.3341522-4-maz@kernel.org
Now that entry code handles IRQ entry (including setting the IRQ regs)
before calling irqchip code, irqchip code can safely call
generic_handle_domain_irq(), and there's no functional reason for it to
call handle_domain_irq().
Let's cement this split of responsibility and remove handle_domain_irq()
entirely, updating irqchip drivers to call generic_handle_domain_irq().
For consistency, handle_domain_nmi() is similarly removed and replaced
with a generic_handle_domain_nmi() function which also does not perform
any entry logic.
Previously handle_domain_{irq,nmi}() had a WARN_ON() which would fire
when they were called in an inappropriate context. So that we can
identify similar issues going forward, similar WARN_ON_ONCE() logic is
added to the generic_handle_*() functions, and comments are updated for
clarity and consistency.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Now that all users of CONFIG_HANDLE_DOMAIN_IRQ perform the irq entry
work themselves, we can remove the legacy
CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY behaviour.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
New x86 FPU features will be very large, requiring ~10k of stack in
signal handlers. These new features require a new approach called
"dynamic features".
The kernel currently tries to ensure that altstacks are reasonably
sized. Right now, on x86, sys_sigaltstack() requires a size of >=2k.
However, that 2k is a constant. Simply raising that 2k requirement
to >10k for the new features would break existing apps which have a
compiled-in size of 2k.
Instead of universally enforcing a larger stack, prohibit a process from
using dynamic features without properly-sized altstacks. This must be
enforced in two places:
* A dynamic feature can not be enabled without a large-enough altstack
for each process thread.
* Once a dynamic feature is enabled, any request to install a too-small
altstack will be rejected.
The dynamic feature enabling code must examine each thread in a
process to ensure that the altstacks are large enough. Add a new lock
(sigaltstack_lock()) to ensure that threads can not race and change
their altstack after being examined.
Add the infrastructure in form of a config option and provide empty
stubs for architectures which do not need dynamic altstack size checks.
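For architectures without dynamic altstack size checks, the stubs are
trivial; a sketch of their likely shape (an assumption, not the exact
patch):
#ifndef CONFIG_DYNAMIC_SIGFRAME
static inline void sigaltstack_lock(void) { }
static inline void sigaltstack_unlock(void) { }
/* without dynamic features, any altstack size the arch allows is fine */
static inline bool sigaltstack_size_valid(size_t size) { return true; }
#endif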
This implementation will be fleshed out for x86 in a future patch called
x86/arch_prctl: Add controls for dynamic XSTATE components
[dhansen: commit message. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-2-chang.seok.bae@intel.com
Since "54357f0c9149 tracing: Add migrate-disabled counter to tracing
output," the migrate disabled field is also printed in the !PREEMPR_RT
kernel config. While this information was added to the vast majority of
tracers, osnoise and timerlat were not updated (because they are new
tracers).
Fix timerlat header by adding the information about migrate disabled.
Link: https://lkml.kernel.org/r/bc0c234ab49946cdd63effa6584e1d5e8662cb44.1634308385.git.bristot@kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: 54357f0c91 ("tracing: Add migrate-disabled counter to tracing output.")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since "54357f0c9149 tracing: Add migrate-disabled counter to tracing
output," the migrate disabled field is also printed in the !PREEMPR_RT
kernel config. While this information was added to the vast majority of
tracers, osnoise and timerlat were not updated (because they are new
tracers).
Fix osnoise header by adding the information about migrate disabled.
Link: https://lkml.kernel.org/r/9cb3d54e29e0588dbba12e81486bd8a09adcd8ca.1634308385.git.bristot@kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: 54357f0c91 ("tracing: Add migrate-disabled counter to tracing output.")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It is useful to trace functions in kernel/events/core.c. Allow ftrace
for them by removing $(CC_FLAGS_FTRACE) from the Makefile.
Link: https://lkml.kernel.org/r/20211006210732.2826289-1-songliubraving@fb.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This symbol is not used outside of ftrace.c, so mark it static.
Fixes the following sparse warning:
kernel/trace/ftrace.c:579:5: warning: symbol 'ftrace_profile_pages_init'
was not declared. Should it be static?
Link: https://lkml.kernel.org/r/1634640534-18280-1-git-send-email-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: cafb168a1c ("tracing: make the function profiler per cpu")
Signed-off-by: chongjiapeng <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
On real systems, the cgroup hierarchies are set up early and just
once by the node controller, so, other than the number of cgroups, all
information in /proc/cgroups remains the same for the system's uptime.
Let's remove the cgroup_mutex usage on reading /proc/cgroups. There is a
chance of an inconsistent number of cgroups for co-mounted cgroups while
printing the information from /proc/cgroups, but that is not a big
issue. In addition, /proc/cgroups is a v1-specific interface, so the
dependency on it should reduce over time.
The main motivation for removing the cgroup_mutex from /proc/cgroups is
to reduce the avenues of its contention. On our fleet, we have observed
buggy applications hammering on /proc/cgroups and drastically slowing
down the node controller on the system, which has many negative
consequences for other workloads running on the system.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The function cgroupstats_build extracts the cgroup from the kernfs_node's
priv pointer, which is an RCU pointer. So there is no need to grab
cgroup_mutex. Just get a reference on the cgroup before using it and
remove the cgroup_mutex altogether.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently cgroup_get_from_path() and cgroup_get_from_id() grab
cgroup_mutex before traversing the default hierarchy to find the
kernfs_node corresponding to the path/id and then extract the linked
cgroup. Since cgroup_mutex is still held, it is guaranteed that the
cgroup will be alive and the reference can be taken on it.
However, a similar guarantee can be provided without depending on the
cgroup_mutex, potentially reducing the avenues of cgroup_mutex
contention. The kernfs_node's priv pointer is an RCU-protected pointer,
and with just the RCU read lock we can grab a reference on the cgroup
without the cgroup_mutex. So, remove cgroup_mutex from them.
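A rough sketch of the lockless pattern (helper names here are
illustrative assumptions):
rcu_read_lock();
cgrp = rcu_dereference(kn->priv);
/* the cgroup may be dying; only a tryget confirms it is alive */
if (cgrp && !cgroup_tryget(cgrp))
	cgrp = NULL;
rcu_read_unlock();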
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Going forward we want architecture/entry code to perform all the
necessary work to enter/exit IRQ context, with irqchip code merely
handling the mapping of the interrupt to any handler(s). Among other
reasons, this is necessary to consistently fix some longstanding issues
with the ordering of lockdep/RCU/tracing instrumentation which many
architectures get wrong today in their entry code.
Importantly, rcu_irq_{enter,exit}() must be called precisely once per
IRQ exception, so that rcu_is_cpu_rrupt_from_idle() can correctly
identify when an interrupt was taken from an idle context which must be
explicitly preempted. Currently handle_domain_irq() calls
rcu_irq_{enter,exit}() via irq_{enter,exit}(), but entry code needs to
be able to call rcu_irq_{enter,exit}() earlier for correct ordering
across lockdep/RCU/tracing updates for sequences such as:
lockdep_hardirqs_off(CALLER_ADDR0);
rcu_irq_enter();
trace_hardirqs_off_finish();
To permit each architecture to be converted to the new style in turn,
this patch adds a new CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY selected by all
current users of HANDLE_DOMAIN_IRQ, which gates the existing behaviour.
When CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY is not selected,
handle_domain_irq() requires entry code to perform the
irq_{enter,exit}() work, with an explicit check for this matching the
style of handle_domain_nmi().
Subsequent patches will:
1) Add the necessary IRQ entry accounting to each architecture in turn,
dropping CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY from that architecture's
Kconfig.
2) Remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY once it is no longer
selected.
3) Convert irqchip drivers to consistently use
generic_handle_domain_irq() rather than handle_domain_irq().
4) Remove handle_domain_irq() and CONFIG_HANDLE_DOMAIN_IRQ.
... which should leave us with a clear split of responsibility across the
entry and irqchip code, making it possible to perform additional
cleanups and fixes for the aforementioned longstanding issues with entry
code.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Several architectures select GENERIC_IRQ_MULTI_HANDLER and branch to
handle_arch_irq() without performing any entry accounting.
Add a generic wrapper to handle the common irqentry work when invoking
handle_arch_irq(). Where an architecture needs to perform some entry
accounting itself, it will need to invoke handle_arch_irq() itself.
In subsequent patches it will become the responsibility of the entry code
to set the irq regs when entering an IRQ (rather than deferring this to
an irqchip handler), so generic_handle_arch_irq() is made to set the irq
regs now. This can be redundant in some cases, but is never harmful as
saving/restoring the old regs nests safely.
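Based on the description above, the wrapper's shape is roughly as follows
(a sketch; treat the details as assumptions):
asmlinkage void noinstr generic_handle_arch_irq(struct pt_regs *regs)
{
	struct pt_regs *old_regs;

	irq_enter();
	old_regs = set_irq_regs(regs);	/* nests safely, as noted above */
	handle_arch_irq(regs);
	set_irq_regs(old_regs);
	irq_exit();
}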
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Guo Ren <guoren@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
There are no modular users of handle_irq_desc(). Remove the export
before we gain any.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
There's no need for handle_domain_{irq,nmi}() to open-code the NULL
check performed by handle_irq_desc(), nor the resolution of the desc
performed by generic_handle_domain_irq().
Use generic_handle_domain_irq() directly, as this is functionally
equivalent and clearer. At the same time, delete the stale comments,
which are no longer helpful.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Merge tag 'sched_urgent_for_v5.15_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"Reset clang's Shadow Call Stack on hotplug to prevent it from
overflowing"
* tag 'sched_urgent_for_v5.15_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/scs: Reset the shadow stack when idle_task_exit
When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
the command as below:
$mount -t cgroup -o none,name=foo cgroup cgroup/
$umount cgroup/
unreferenced object 0xc3585c40 (size 64):
comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
hex dump (first 32 bytes):
01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(.........
00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC......
backtrace:
[<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
[<1f03679c>] cgroup_setup_root+0x174/0x37c
[<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
[<f85b12fd>] vfs_get_tree+0x24/0x108
[<f55aec5c>] path_mount+0x384/0x988
[<e2d5e9cd>] do_mount+0x64/0x9c
[<208c9cfe>] sys_mount+0xfc/0x1f4
[<06dd06e0>] ret_fast_syscall+0x0/0x48
[<a8308cb3>] 0xbeb4daa8
This is because, since commit 2b0d3d3e4f ("percpu_ref: reduce
memory footprint of percpu_ref in fast path"), root_cgrp->bpf.refcnt.data
is allocated by percpu_ref_init() in cgroup_bpf_inherit(), which
is called by cgroup_setup_root() when mounting, but is not freed along with
root_cgrp when unmounting. Adding cgroup_bpf_offline(), which calls
percpu_ref_kill(), to cgroup_kill_sb() frees root_cgrp->bpf.refcnt.data on
the umount path.
This patch also fixes commit 4bfc0bb2c6 ("bpf: decouple the lifetime
of cgroup_bpf from cgroup itself"): a cgroup_bpf_offline() call is needed
to clean up the resources that are allocated by cgroup_bpf_inherit()
in cgroup_setup_root().
Inside cgroup_bpf_offline(), cgroup_get() is called at the beginning, and
cgroup_put() is called at the end of cgroup_bpf_release(), which is
triggered by cgroup_bpf_offline(). So cgroup_bpf_offline() keeps the
cgroup's refcount balanced.
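A minimal sketch of the umount-path change (the surrounding refcount
handling in cgroup_kill_sb() is abbreviated and an assumption, not the
verbatim patch):

  static void cgroup_kill_sb(struct super_block *sb)
  {
          struct kernfs_root *kf_root = kernfs_root_from_sb(sb);
          struct cgroup_root *root = cgroup_root_from_kf(kf_root);

          /* ... existing refcount handling for root->cgrp ... */

          /*
           * Tear down cgroup_bpf state: percpu_ref_kill() on bpf.refcnt
           * eventually frees root_cgrp->bpf.refcnt.data, while the
           * internal cgroup_get()/cgroup_put() pair keeps the cgroup's
           * refcount balanced.
           */
          cgroup_bpf_offline(&root->cgrp);
          kernfs_kill_sb(sb);
  }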
Fixes: 2b0d3d3e4f ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com
1. The ufd in generic_map_update_batch() should be read from batch.map_fd;
2. A call to fdget() should be followed by a symmetric call to fdput().
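A compressed sketch of the two fixes (the per-element update loop is
elided, so this is illustrative rather than the verbatim patch):

  int generic_map_update_batch(struct bpf_map *map, const union bpf_attr *attr,
                               union bpf_attr __user *uattr)
  {
          int ufd = attr->batch.map_fd;   /* (1) read the fd from batch.map_fd */
          struct fd f;
          int err = 0;

          f = fdget(ufd);                 /* (2) every fdget() ... */
          /* ... copy keys/values from user space and update the map ... */
          fdput(f);                       /* ... gets a symmetric fdput() */
          return err;
  }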
Fixes: aa2e93b8e5 ("bpf: Add generic support for update and delete batch ops")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211019032934.1210517-1-xukuohai@huawei.com
As reported by syzbot and experienced by Pavel, using cpus_read_lock()
in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem
and possibly others).
Instead, shrink the preempt disable region by iterating all CPUs and
checking the online status for each individual CPU while having
preemption disabled.
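A sketch of the reworked loop (close to the fix as described, but hedged):

  void wake_up_all_idle_cpus(void)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  preempt_disable();      /* small preempt-off window per CPU */
                  if (cpu != smp_processor_id() && cpu_online(cpu))
                          wake_up_if_idle(cpu);
                  preempt_enable();
          }
  }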
Fixes: 8850cb663b ("sched: Simplify wake_up_*idle*()")
Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com
Reported-by: Pavel Machek <pavel@ucw.cz>
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Pull ucounts fixes from Eric Biederman:
"There has been one very hard to track down bug in the ucount code that
we have been tracking since roughly v5.14 was released. Alex managed
to find a reliable reproducer a few days ago and then I was able to
instrument the code and figure out what the issue was.
It turns out the sigqueue_alloc() single-atomic-operation optimization
did not play nicely with the multiple-level rlimits of ucounts. Either
sigqueue_alloc() or sigqueue_free() could be operating on multiple
levels and trigger the conditions for the optimization on more than one
level at the same time.
To deal with that situation I have introduced inc_rlimit_get_ucounts
and dec_rlimit_put_ucounts that just focus on the optimization and
the rlimit and ucount changes.
While looking into the big bug I found a couple of other little issues,
so I am including those fixes here as well.
When I have time I would very much like to dig into process ownership
of the shared signal queue and see if we could pick a single owner for
the entire queue so that all of the rlimits can count to that owner.
That should entirely remove the need to call get_ucounts and
put_ucounts in sigqueue_alloc and sigqueue_free. It is difficult
because Linux unlike POSIX supports setuid that works on a single
thread"
* 'ucount-fixes-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring
ucounts: Proper error handling in set_cred_ucounts
ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_creds
ucounts: Fix signal ucount refcounting
This stat is currently printed in the verifier log and not stored
anywhere. To ease consumption of this data, add a field to bpf_prog_aux
so it can be exposed via BPF_OBJ_GET_INFO_BY_FD and fdinfo.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211020074818.1017682-2-davemarchevsky@fb.com
The helper is used in tracing programs to cast a socket
pointer to a unix_sock pointer.
The return value could be NULL if the casting is illegal.
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211021134752.1223426-2-hengqi.chen@gmail.com
This converts the kprobes testcases to use the kunit framework.
It adds a dependency on CONFIG_KUNIT, and the output will change
to TAP:
TAP version 14
1..1
# Subtest: kprobes_test
1..4
random: crng init done
ok 1 - test_kprobe
ok 2 - test_kprobes
ok 3 - test_kretprobe
ok 4 - test_kretprobes
ok 1 - kprobes_test
Note that the kprobes testcases are no longer run immediately after
kprobes initialization, but as a late initcall when kunit is
initialized. kprobes itself is initialized with an early initcall,
so the order is still correct.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
clang started warning about excessive stack usage in
hist_trigger_print_key()
kernel/trace/trace_events_hist.c:4723:13: error: stack frame size (1336) exceeds limit (1024) in function 'hist_trigger_print_key' [-Werror,-Wframe-larger-than]
The problem is that there are two 512-byte arrays on the stack if
hist_trigger_stacktrace_print() gets inlined. I don't think this has
changed in the past five years, but something probably changed the
inlining decisions made by the compiler, so the problem is now made
more obvious.
Rather than printing the symbol names into separate buffers, it
seems we can simply use the special %ps format string modifier
to print the pointers symbolically and get rid of both buffers.
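For illustration, the transformation is roughly the following (the buffer
and entry names here are placeholders, not the exact code):

  /* before: format each address into a 512-byte stack buffer */
  char str[KSYM_SYMBOL_LEN];

  sprint_symbol(str, stacktrace_entries[i]);
  seq_printf(m, "%s\n", str);

  /* after: let the printf core resolve the symbol, no buffer needed */
  seq_printf(m, "%ps\n", (void *)stacktrace_entries[i]);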
Marking hist_trigger_stacktrace_print() as noinline would be a simpler
way of avoiding the warning, but that would not address the
excessive stack usage.
Link: https://lkml.kernel.org/r/20211019153337.294790-1-arnd@kernel.org
Fixes: 69a0200c2e ("tracing: Add hist trigger support for stacktraces as keys")
Link: https://lore.kernel.org/all/20211015095704.49a99859@gandalf.local.home/
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently to switch a set of "multi" direct trampolines from one
trampoline to another, a full shutdown of the current set needs to be
done, followed by an update to what trampoline the direct callers would
call, and then re-enabling the callers. This leaves a time when the
functions will not be calling anything, and events may be missed.
Instead, use a trick so that all the functions with direct trampolines
attached always call either the new or the old trampoline while the
switch is happening. To do this, first attach a "dummy" callback via
ftrace to all the functions that the current direct trampoline is attached
to. This will cause the functions to call the "list func" instead of the
direct trampoline. The list function will call the direct trampoline
"helper" that will set the function it should call as it returns back to
the ftrace trampoline.
At this moment, the direct caller descriptor can safely update the direct
call trampoline. The list function will pick either the new or old
function (depending on the memory coherency model of the architecture).
Now removing the dummy function from each of the locations of the direct
trampoline callers will put back the direct call, but now to the new
trampoline.
A better visual is:
[ Changing direct call from my_direct_1 to my_direct_2 ]
<traced_func>:
call my_direct_1
||||||||||||||||||||
vvvvvvvvvvvvvvvvvvvv
<traced_func>:
call ftrace_caller
<ftrace_caller>:
[..]
call ftrace_ops_list_func
ftrace_ops_list_func()
{
	ops->func() -> direct_helper -> set rax to my_direct_1 or my_direct_2
}
call rax (to either my_direct_1 or my_direct_2)
||||||||||||||||||||
vvvvvvvvvvvvvvvvvvvv
<traced_func>:
call my_direct_2
Link: https://lore.kernel.org/all/20211014162819.5c85618b@gandalf.local.home/
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add an interface to modify the registered direct function
for an ftrace_ops. Add the following function:
modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
The function changes the currently registered direct
function for all attached functions.
Link: https://lkml.kernel.org/r/20211008091336.33616-8-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add an interface to register multiple direct functions
within a single call. Add the following functions:
register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
unregister_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
register_ftrace_direct_multi() registers the direct function (addr)
for all functions in the ops filter. The ops filter can be updated
beforehand with ftrace_set_filter_ip() calls.
None of the requested functions may already have a direct function
registered, otherwise register_ftrace_direct_multi() will fail.
unregister_ftrace_direct_multi() unregisters the ops-related direct
functions. A usage sketch follows.
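A hypothetical usage sketch (func_a, func_b and my_tramp are illustrative
names, not part of this patch):

  static struct ftrace_ops direct_ops;

  /* populate the ops filter with the functions to attach to */
  ftrace_set_filter_ip(&direct_ops, (unsigned long)func_a, 0, 0);
  ftrace_set_filter_ip(&direct_ops, (unsigned long)func_b, 0, 0);

  /* attach my_tramp to every function in the filter */
  ret = register_ftrace_direct_multi(&direct_ops, (unsigned long)my_tramp);
  ...
  /* detach again */
  unregister_ftrace_direct_multi(&direct_ops, (unsigned long)my_tramp);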
Link: https://lkml.kernel.org/r/20211008091336.33616-7-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Factor out the code that adds an (ip, addr) tuple to the direct_functions
hash into a new ftrace_add_rec_direct() function. It will be used in
the following patches.
Link: https://lkml.kernel.org/r/20211008091336.33616-6-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's a new test in trace_selftest_startup_function_graph() that
requires ftrace args support and does some tricks with dynamic tracing.
Although this code checks HAVE_DYNAMIC_FTRACE_WITH_ARGS, it fails to
check DYNAMIC_FTRACE, and the kernel fails to build due to that
dependency.
Also, only define the prototype of trace_direct_tramp() if it is used.
Link: https://lkml.kernel.org/r/20211021134357.7f48e173@gandalf.local.home
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Replace vmalloc()/memset() with vzalloc() and kmalloc()/memset() with
kzalloc() to simplify the code.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When building the kernel with sparse enabled ('C=1') the following
warning shows up:
kernel/power/swap.c:390:29: warning: incorrect type in assignment (different base types)
kernel/power/swap.c:390:29: expected int ret
kernel/power/swap.c:390:29: got restricted blk_status_t
This is because hib_wait_io() returns a 'blk_status_t', which is
a bitwise u8. Commit 5416da01ff ("PM: hibernate: Remove
blk_status_to_errno in hib_wait_io") seems to have mixed up the return
type, while commit 4e4cbee93d ("block: switch bios to blk_status_t")
actually broke the behaviour by returning the wrong type.
Rework hib_wait_io() so that it returns an 'int' instead of a
'blk_status_t', making sure to call blk_status_to_errno(hb->error)
so that an errno value is returned.
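The reworked function then becomes, roughly:

  static int hib_wait_io(struct hib_bio_batch *hb)
  {
          /* wait for all in-flight bios of the batch to complete */
          wait_event(hb->wait, atomic_read(&hb->count) == 0);
          return blk_status_to_errno(hb->error);
  }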
Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Fixes: 5416da01ff ("PM: hibernate: Remove blk_status_to_errno in hib_wait_io")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a selftest checking that a direct trampoline can
co-exist with the graph tracer on the same function.
This is supported with the CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
config option, which is defined only for x86_64 for now.
Link: https://lkml.kernel.org/r/20211008091336.33616-5-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We don't need a special hook for the graph tracer entry point;
instead we can use the graph_ops::func function to install
the return_hooker.
This moves the graph tracing setup _before_ the direct
trampoline prepares the stack, so the return_hooker will
be called when the direct trampoline is finished.
This simplifies the code, because we don't need to take the
direct trampoline setup into account when preparing the graph
tracer hook, and we can allow the function graph tracer on entries
registered with a direct trampoline.
Link: https://lkml.kernel.org/r/20211008091336.33616-4-jolsa@kernel.org
[fixed compile error reported by kernel test robot <lkp@intel.com>]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to build drivers/irqchip/irq-bcm7120-l2.c as a module which
references irq_gc_noop(), we need to export it towards modules.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211020184859.2705451-10-f.fainelli@gmail.com
In order to allow drivers/irqchip/irq-brcmstb-l2.c to be built as a
module we need to export: irq_gc_unmask_enable_reg() and
irq_gc_mask_disable_reg().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211020184859.2705451-8-f.fainelli@gmail.com
kernel/bpf/preload/Makefile was recently updated to have it install
libbpf's headers locally instead of pulling them from tools/lib/bpf. But
two items still need to be addressed.
First, the local .gitignore file was not adjusted to ignore the files
generated in the new kernel/bpf/preload/libbpf output directory.
Second, the "clean-files" target is now incorrect. The old artefacts
names were not removed from the target, while the new ones were added
incorrectly. This is because "clean-files" expects names relative to
$(obj), but we passed the absolute path instead. This results in the
output and header-destination directories for libbpf (and their
contents) not being removed from kernel/bpf/preload on "make clean" from
the root of the repository.
This commit fixes both issues. Note that $(userprogs) need not be added
to "clean-files", because the cleaning infrastructure already accounts
for it.
Cleaning the files properly also prevents make from printing the
following message, for builds coming after a "make clean":
"make[4]: Nothing to be done for 'install_headers'."
v2: Simplify the "clean-files" target.
Fixes: bf60791741 ("bpf: preload: Install libbpf headers when building")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211020094647.15564-1-quentin@isovalent.com
Currently show_workqueue_state() shows the state of all workqueues and of
all worker pools. In certain cases we may need to dump the state of only a
specific workqueue or worker pool. For example, in destroy_workqueue() we
only need to show the state of the workqueue that is getting destroyed.
So rename show_workqueue_state() to show_all_workqueues() (to signify that
it dumps the state of all busy workqueues) and divide it into more
granular functions (show_one_workqueue() and show_one_worker_pool()) that
show the states of individual workqueues and worker pools, and can be used
in cases such as the one mentioned above.
Also, as mentioned earlier, make destroy_workqueue() dump data pertaining
only to the workqueue that is being destroyed, and make users of the
earlier interface (show_workqueue_state()) use the new interface
(show_all_workqueues()).
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"One small audit patch to add a pointer NULL check"
* tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: fix possible null-pointer dereference in audit_filter_rules
Merge tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Recursion fix for tracing.
While cleaning up some of the tracing recursion protection logic, I
discovered a scenario that the current design would miss, and would
allow an infinite recursion. Removing an optimization trick that
opened the hole fixes the issue and cleans up the code as well"
* tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have all levels of checks prevent recursion
Consolidate the various helpers into a single blk_flush_plug() helper that
takes a blk_plug and the from_schedule boolean, and switch all callsites
to call it directly. Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Decrement ucounts using atomic_long_sub_return to make it
clear the point is for the ucount to decrease.
Not a big deal but it should make it easier to catch bugs.
Suggested-by: Yu Zhao <yuzhao@google.com>
Link: https://lkml.kernel.org/r/87k0iaqkqj.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a helper function get_ucounts_or_wrap that is a trivial
wrapper around atomic_add_negative, that makes it clear
how atomic_add_negative is used in the context of ucounts.
Link: https://lkml.kernel.org/r/87pms2qkr9.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All of the callers of get_ucounts are passed a non-NULL value, so stop
handling a NULL ucounts pointer in get_ucounts.
It is guaranteed that every valid, fully formed cred that is passed to
commit_creds contains a non-NULL ucounts pointer. This in turn
guarantees that current_ucounts() never returns NULL.
The call of get_ucounts in user_shm_lock is always passed
current_ucounts().
The call of get_ucounts in mqueue_get_inode is always passed
current_ucounts().
The call of get_ucounts in inc_rlimit_get_ucounts is always
passed iter, after iter has been verified to be non-NULL.
The call of get_ucounts in key_change_session_keyring is always passed
current_ucounts().
The call of get_ucounts in prepare_cred is always passed
current_ucounts().
The call of get_ucounts in prepare_kernel_cred is always
passed task->cred->ucounts or init_cred->ucounts which
being on tasks are guaranteed to have a non-NULL ucounts
field.
Link: https://lkml.kernel.org/r/87v91uqksg.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Any cred that is destined for use by commit_creds must have a non-NULL
cred->ucounts field. Only during credential construction is a NULL
cred->ucounts valid. Only abort_creds, put_cred, and put_cred_rcu
need to deal with a cred with a NULL ucount. As set_cred_ucounts is
none of those cases, don't confuse people by handling something that
cannot happen.
Link: https://lkml.kernel.org/r/871r4irzds.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Setting cred->ucounts in cred_alloc_blank does not make sense. The
uid and user_ns are deliberately not set in cred_alloc_blank but
instead the setting is delayed until key_change_session_keyring.
So move dealing with ucounts into key_change_session_keyring as well.
Unfortunately that movement of get_ucounts adds a new failure mode to
key_change_session_keyring. I do not see anything stopping the parent
process from calling setuid and changing the relevant part of its
cred while keyctl_session_to_parent is running, making it fundamentally
necessary to call get_ucounts in key_change_session_keyring. Which
means that the new failure mode cannot be avoided.
A failure of key_change_session_keyring results in a single-threaded
parent keeping its existing credentials. Which results in the parent
process not being able to access the session keyring and whichever
keys are in the new keyring.
Further, get_ucounts is only expected to fail if the number of bits in
the reference count for the structure is too few.
Since the code has no other way to report the failure of get_ucounts
and because such failures are not expected to be common add a WARN_ONCE
to report this problem to userspace.
Between the WARN_ONCE and the parent process not having access to
the keys in the new session keyring I expect any failure of get_ucounts
will be noticed and reported and we can find another way to handle this
condition. (Possibly by just making ucounts->count an atomic_long_t).
Cc: stable@vger.kernel.org
Fixes: 905ae01c4a ("Add a reference to ucounts for each cred")
Link: https://lkml.kernel.org/r/7k0ias0uf.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that there are three different instances of doing the addition trick
to the preempt_count() and NMI_MASK, HARDIRQ_MASK and SOFTIRQ_OFFSET
macros, it deserves a helper function defined in the preempt.h header.
Add the interrupt_context_level() helper and replace the three instances
that do that logic with it.
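A sketch of the helper, mirroring the addition logic described above
(hedged against the exact header layout):

  static __always_inline unsigned char interrupt_context_level(void)
  {
          unsigned long pc = preempt_count();
          unsigned char level = 0;

          /* each test adds one level for the context we are inside of */
          level += !!(pc & (NMI_MASK));
          level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
          level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

          return level;
  }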
Link: https://lore.kernel.org/all/20211015142541.4badd8a9@gandalf.local.home/
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of having branches that adds noise to the branch prediction, use
the addition logic to set the bit for the level of interrupt context that
the state is currently in. This copies the logic from perf's
get_recursion_context() function.
Link: https://lore.kernel.org/all/20211015161702.GF174703@worktop.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If CONFIG_CFI_CLANG=y, attempting to read an event histogram will cause
the kernel to panic due to a failed CFI check.
1. echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger
2. cat events/sched/sched_switch/hist
3. kernel panics on attempting to read hist
This happens because the sort() function expects a generic
int (*)(const void *, const void *) pointer for the compare function.
To prevent this CFI failure, change the tracing map cmp_entries_*
function signatures to match it, as sketched below.
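A sketch of the adjusted comparator (the duplicate-check variant; field
names follow the tracing map structures, but this is not the verbatim
patch):

  /* sort() expects: int (*cmp)(const void *, const void *) */
  static int cmp_entries_dup(const void *A, const void *B)
  {
          const struct tracing_map_sort_entry *a, *b;

          a = *(const struct tracing_map_sort_entry **)A;
          b = *(const struct tracing_map_sort_entry **)B;

          return memcmp(a->key, b->key, a->elt->map->key_size) ? 1 : 0;
  }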
Also, fix the build error reported by the kernel test robot [1].
[1] https://lore.kernel.org/r/202110141140.zzi4dRh4-lkp@intel.com/
Link: https://lkml.kernel.org/r/20211014045217.3265162-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In an effort to enable -Wcast-function-type in the top-level Makefile to
support Control Flow Integrity builds, all function casts need to be
removed.
This means that ftrace_ops_list_func() can no longer be defined as
ftrace_ops_no_ops(). The reason for ftrace_ops_no_ops() is to use that when
an architecture calls ftrace_ops_list_func() with only two parameters
(called from assembly). And to make sure there are no C side effects,
those archs call ftrace_ops_no_ops(), which only has two parameters,
while ftrace_ops_list_func() has four parameters.
Instead of a typecast, use vmlinux.lds.h to define ftrace_ops_list_func() to
arch_ftrace_ops_list_func() that will define the proper set of parameters.
Link: https://lore.kernel.org/r/20200614070154.6039-1-oscar.carter@gmx.com
Link: https://lkml.kernel.org/r/20200617165616.52241bde@oasis.local.home
Link: https://lore.kernel.org/all/20211005053922.GA702049@embeddedor/
Requested-by: Oscar Carter <oscar.carter@gmx.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The helper function returns a pointer that in the failure case encodes
an error in the struct btf pointer. The current code led to a Coverity
warning about the use of the invalid pointer:
*** CID 1507963: Memory - illegal accesses (USE_AFTER_FREE)
/kernel/bpf/verifier.c: 1788 in find_kfunc_desc_btf()
1782 return ERR_PTR(-EINVAL);
1783 }
1784
1785 kfunc_btf = __find_kfunc_desc_btf(env, offset, btf_modp);
1786 if (IS_ERR_OR_NULL(kfunc_btf)) {
1787 verbose(env, "cannot find module BTF for func_id %u\n", func_id);
>>> CID 1507963: Memory - illegal accesses (USE_AFTER_FREE)
>>> Using freed pointer "kfunc_btf".
1788 return kfunc_btf ?: ERR_PTR(-ENOENT);
1789 }
1790 return kfunc_btf;
1791 }
1792 return btf_vmlinux ?: ERR_PTR(-ENOENT);
1793 }
Daniel suggested the use of ERR_CAST so that the intended use is clear
to Coverity, but on closer look it seems that we never return NULL from
the helper. Andrii noted that since __find_kfunc_desc_btf already logs
errors for all cases except btf_get_by_fd, it is much easier to add
logging for that and remove the IS_ERR check altogether, returning
directly from it.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211009040900.803436-1-memxor@gmail.com
Revert commit bfcc1e67ff ("PM: sleep: Do not assume that "mem" is
always present"), because it breaks compatibility with user space
utilities assuming that "mem" will always be present in
/sys/power/state.
Fixes: bfcc1e67ff ("PM: sleep: Do not assume that "mem" is always present")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some unfriendly components, such as dpdk, write the same mask to the
unbound kworker cpumask again and again. Every time such a write
happens, some work is queued to CPUs, even though the mask is the
same as the original one.
So fix it by returning success and doing nothing if the new cpumask
is equal to the old one, as sketched below.
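A minimal sketch of the early return, assuming it lands in
workqueue_set_unbound_cpumask() before the new mask is applied:

  cpumask_and(cpumask, cpumask, cpu_possible_mask);
  if (cpumask_empty(cpumask))
          return -EINVAL;

  /* nothing changed: succeed without requeueing any work */
  if (cpumask_equal(cpumask, wq_unbound_cpumask))
          return 0;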
Signed-off-by: Mengen Sun <mengensun@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of leaking the ucounts in new if alloc_ucounts fails, store
the result of alloc_ucounts into a temporary variable, which is later
assigned to new->ucounts.
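A sketch of the pattern (the early-return checks around it are elided):

  struct ucounts *new_ucounts;

  new_ucounts = alloc_ucounts(new->user_ns, new->euid);
  if (!new_ucounts)
          return -EAGAIN; /* new->ucounts is left untouched, nothing leaks */

  new->ucounts = new_ucounts;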
Cc: stable@vger.kernel.org
Fixes: 905ae01c4a ("Add a reference to ucounts for each cred")
Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds
is to change which rlimit counter is used to track a process when the
credentials changes.
Use the same test for both to guarantee the tracking is correct.
Cc: stable@vger.kernel.org
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Commit f1a0a376ca ("sched/core: Initialize the idle task with
preemption disabled") removed the init_idle() call from
idle_thread_get(). This was the sole call-path on hotplug that resets
the Shadow Call Stack (scs) Stack Pointer (sp).
Not resetting the scs-sp leads to scs overflow after enough hotplug
cycles. Therefore add an explicit scs_task_reset() to the hotplug code
to make sure the scs-sp does get reset on hotplug.
Fixes: f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
Signed-off-by: Woody Lin <woodylin@google.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
After commit 617f3ef951 ("locking/rwsem: Remove reader
optimistic spinning"), readers no longer support optimistic spinning,
so there is no need to check the condition that the OSQ is empty.
Also, add an unlikely() for the max-reader wakeup check in the loop.
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-4-yanfei.xu@windriver.com
preempt_disable/enable() is equivalent to an RCU read-side critical
section, and the spinning code in mutex and rwsem already ensures that
preemption is disabled. So remove the unnecessary rcu_read_lock/unlock
to save some cycles in hot code paths.
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-2-yanfei.xu@windriver.com
The spinning region rwsem_spin_on_owner() should not be preempted;
however, rwsem_down_write_slowpath() invokes it without disabling
preemption. Fix it by adding a pair of preempt_disable/enable() calls,
as sketched below.
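A sketch of the hunk in rwsem_down_write_slowpath() (the surrounding
handoff logic is abbreviated):

  enum owner_state owner_state;

  preempt_disable();      /* the owner-spinning region must not be preempted */
  owner_state = rwsem_spin_on_owner(sem);
  preempt_enable();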
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
[peterz: Fix CONFIG_RWSEM_SPIN_ON_OWNER=n build]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-3-yanfei.xu@windriver.com
Mike reported that rcuwait went walk-about and is causing failures on
the PREEMPT_RT builds, restore it.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
We only need to call it to resolve the blk_status_t -> errno mapping for
tracing, so move the conversion into the tracepoints that are not called
at all when tracing isn't enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Patch set [1] introduced BTF_KIND_TAG to allow tagging
declarations for struct/union, struct/union fields, vars, funcs
and func arguments, and these tags will be encoded into
dwarf. They are also encoded into btf by llvm for the bpf target.
After BTF_KIND_TAG is introduced, we intended to use it
for kernel __user attributes. But kernel __user is actually
a type attribute. Upstream and internal discussion showed
it is not a good idea to mix declaration attribute and
type attribute. So we proposed to introduce btf_type_tag
as a type attribute and existing btf_tag renamed to
btf_decl_tag ([2]).
This patch renames BTF_KIND_TAG to BTF_KIND_DECL_TAG and some
other declarations with *_tag to *_decl_tag to make it clear
the tag is for declarations. In the future, BTF_KIND_TYPE_TAG
might be introduced per [3].
[1] https://lore.kernel.org/bpf/20210914223004.244411-1-yhs@fb.com/
[2] https://reviews.llvm.org/D111588
[3] https://reviews.llvm.org/D111199
Fixes: b5ea834dde ("bpf: Support for new btf kind BTF_KIND_TAG")
Fixes: 5b84bd1036 ("libbpf: Add support for BTF_KIND_TAG")
Fixes: 5c07f2fec0 ("bpftool: Add support for BTF_KIND_TAG")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211012164838.3345699-1-yhs@fb.com
It is not necessary for the audit_filter_rules() function to check
audit fields of a rule with a lower priority, and if we did,
there might be some unintended effects, such as ctx->ppid
being changed unexpectedly, so return early if the rule has
a lower priority.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
[PM: slight tweak to the subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Fix possible null-pointer dereference in audit_filter_rules.
audit_filter_rules() error: we previously assumed 'ctx' could be null
Cc: stable@vger.kernel.org
Fixes: bf361231c2 ("audit: add saddr_fam filter field")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" unsafe.
The recursion logic is done in hot paths like the function tracer. Thus,
any code executed causes noticeable overhead. Thus, tricks are done to try
to limit the amount of code executed. This included the recursion testing
logic.
Having recursion testing is important, as there are many paths that can
end up in an infinite recursion cycle when tracing every function in the
kernel. Thus protection is needed to prevent that from happening.
Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits are used to know which context one is in (normal,
softirq, irq and NMI). If a recursion occurs in the same level, it is
prevented*.
Then there are infrastructure levels of recursion as well. When more than
one callback is attached to the same function to trace, it calls a loop
function to iterate over all the callbacks. Both the callbacks and the
loop function have recursion protection. The callbacks use the
"ftrace_test_recursion_trylock()" which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.
If an architecture does not implement all the features supported by ftrace
then the callbacks are never called directly, and the loop function is
called instead, which will implement the features of ftrace.
Since both the loop function and the callbacks do recursion protection, it
seemed unnecessary to do it in both locations. Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits. Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.
This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before going to the callbacks. But this is not true for
architectures that support all the ftrace features. That's because the
loop function could be called due to two callbacks attached to the same
function, but then a recursion function inside the callback could be
called that does not share any other callback, and it will be called
directly.
i.e.
traced_function_1: [ more than one callback tracing it ]
  call loop_func
loop_func:
  trace_recursion set internal bit
  call callback
callback:
  trace_recursion [ skipped because internal bit is set, return 0 ]
  call traced_function_2
traced_function_2: [ only traced by above callback ]
  call callback
callback:
  trace_recursion [ skipped because internal bit is set, return 0 ]
  call traced_function_2
  [ wash, rinse, repeat, BOOM! out of shampoo! ]
Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
call for all functions.
Since we want to encourage architectures to implement all ftrace features,
having them slow down due to this extra logic may encourage the
maintainers to update to the latest ftrace features. And because this
logic is only safe for them, remove it completely.
[*] There is one layer of recursion that is allowed, and that is to allow
for the transition between interrupt context (normal -> softirq ->
irq -> NMI), because a trace may occur before the context update is
visible to the trace recursion logic.
Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
Cc: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: stable@vger.kernel.org
Fixes: edc15cafcb ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In commit fda31c5029 ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and the struct user_struct. Unfortunately that
optimization does not work in the obvious way when moved to nested
rlimits. The problem is that the last decrement of the per user
namespace per user sigpending counter might also be the last decrement
of the sigpending counter in the parent user namespace as well. Which
means that simply freeing the leaf ucount in __free_sigqueue is not
enough.
Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.
The new function inc_rlimit_get_ucounts returns 0 on failure to
increment the ucount. This is different from inc_rlimit_ucounts, which
increments the ucounts and returns LONG_MAX if the ucount counter has
exceeded its maximum or has wrapped (to indicate the counter needs to
be decremented).
I wish we had a single user to account all pending signals to across
all of the threads of a process, so this complexity was not necessary.
Cc: stable@vger.kernel.org
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
Tested-by: Yu Zhao <yuzhao@google.com>
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Only core.c needs blkdev.h, so move the #include statement there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Various files have acquired spurious includes of <linux/blkdev.h> over
time. Remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert __add_to_page_cache_locked() into __filemap_add_folio().
Add an assertion to it that (for !hugetlbfs), the folio is naturally
aligned within the file. Move the prototype from mm.h to pagemap.h.
Convert add_to_page_cache_lru() into filemap_add_folio(). Add a
compatibility wrapper for unconverted callers.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Mapping something twice should be possible as long as
DMA_ATTR_SKIP_CPU_SYNC is passed to the (strictly speaking) second
relevant mapping operation that attempts to map the same thing. So don't
issue a warning if this condition is met in add_dma_entry(), as sketched
below.
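A sketch of the check in add_dma_entry(), assuming attrs is now passed
down to it:

  rc = active_cacheline_insert(entry);
  if (rc == -ENOMEM) {
          pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
          global_disable = true;
  } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
          /* only warn when the overlap was not explicitly requested */
          err_printk(entry->dev, entry,
                     "cacheline tracking EEXIST, overlapping mappings aren't supported\n");
  }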
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Some drivers like Qualcomm pm8941-pwrkey need to access 'reboot_mode'
for triggering reboot between cold and warm mode. Export the symbol, so
that drivers built as module can still access the symbol.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Link: https://lore.kernel.org/r/20210714095850.27185-2-shawn.guo@linaro.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Merge tag 'trace-v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Tracing fixes for 5.15:
- Fix defined-but-not-used warning/error for osnoise function
- Fix memory leak in event probe
- Fix memblock leak in bootconfig
- Fix the API of event probes to be like kprobes
- Added test to check removal of event probe API
- Fix recordmcount.pl for nds32 failed build
* tag 'trace-v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
nds32/ftrace: Fix Error: invalid operands (*UND* and *UND* sections) for `^'
selftests/ftrace: Update test for more eprobe removal process
tracing: Fix event probe removal from dynamic events
tracing: Fix missing * in comment block
bootconfig: init: Fix memblock leak in xbc_make_cmdline()
tracing: Fix memory leak in eprobe_register()
tracing: Fix missing osnoise tracer on max_latency
It is useful to trace functions in kernel/events/core.c. Allow ftrace for
them by removing $(CC_FLAGS_FTRACE) from Makefile.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006210732.2826289-1-songliubraving@fb.com
PEBS-via-PT records contain a mask of applicable counters. To identify
which event belongs to which counter, a side-band event is needed. Until
now, there has been no side-band event, and consequently users were limited
to using a single event.
Add such a side-band event. Note the event is optimised to output only
when the counter index changes for an event. That works only so long as
all PEBS-via-PT events are scheduled together, which they are for a
recording session because they are in a single group.
Also no attribute bit is used to select the new event, so a new
kernel is not compatible with older perf tools. The assumption
being that PEBS-via-PT is sufficiently esoteric that users will not
be troubled by this.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210907163903.11820-2-adrian.hunter@intel.com
On PREEMPT_RT most items are processed as LAZY via softirq context.
Avoid spin-waiting for them because irq_work_sync() could have higher
priority and not allow the irq-work to be completed.
Wait additionally for !IRQ_WORK_HARD_IRQ irq_work items on PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-5-bigeasy@linutronix.de
The irq_work callback is invoked in hard IRQ context. By default all
callbacks are scheduled for invocation right away (if supported by
the architecture) except for the ones marked IRQ_WORK_LAZY, which are
delayed until the next timer-tick.
While looking over the callbacks, some of them may acquire locks
(spinlock_t, rwlock_t) which are transformed into sleeping locks on
PREEMPT_RT and must not be acquired in hard IRQ context.
Changing the locks into locks which could be acquired in this context
will lead to other problems such as increased latencies if everything
in the chain has IRQ-off locks. This will not solve all the issues as
one callback has been noticed which invoked kref_put() and its callback
invokes kfree() and this can not be invoked in hardirq context.
Some callbacks are required to be invoked in hardirq context even on
PREEMPT_RT to work properly. This includes for instance the NO_HZ
callback which needs to be able to observe the idle context.
The callbacks which require to be run in hardirq have already been
marked. Use this information to split the callbacks onto the two lists
on PREEMPT_RT:
- lazy_list
Work items which are not marked with IRQ_WORK_HARD_IRQ will be added
to this list. Callbacks on this list will be invoked from a per-CPU
thread.
The handler here may acquire sleeping locks such as spinlock_t and
invoke kfree().
- raised_list
Work items which are marked with IRQ_WORK_HARD_IRQ will be added to
this list. They will be invoked in hardirq context and must not
acquire any sleeping locks.
The wake up of the per-CPU thread occurs from irq_work handler/
hardirq context. The thread runs with lowest RT priority to ensure it
runs before any SCHED_OTHER tasks do.
[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a
hard and soft variant. Collected fixes over time from Steven
Rostedt and Mike Galbraith. Move to per-CPU threads instead of
softirq as suggested by PeterZ.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211007092646.uhshe3ut2wkrcfzv@linutronix.de
irq_work triggers an interrupt instantly if supported by the
architecture. Otherwise the work will be processed on the next timer
tick. In the worst case irq_work_sync() could spin up to a jiffy.
irq_work_sync() is usually used in a tear-down context which is fully
preemptible. Based on review, irq_work_sync() is invoked from preemptible
context and there is one waiter at a time. This qualifies it to use
rcuwait for synchronisation.
Let irq_work_sync() synchronize with rcuwait if the architecture
processes irqwork via the timer tick.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-3-bigeasy@linutronix.de
The push-IPI logic for RT tasks expects to be invoked from hardirq
context. One reason is that a RT task on the remote CPU would block the
softirq processing on PREEMPT_RT and so avoid pulling / balancing the RT
tasks as intended.
Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de
The compilers can't deal with obvious DCE vs that warning, resulting
in code like:
	if (0) {
		struct sched_statistics *stats;

		stats = __schedstats_from_se(se);
		...
	}
triggering the warning. Kill the warning to make the robots stop
reporting this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lkml.kernel.org/r/YWWPLnaZGybHsTkv@hirez.programming.kicks-ass.net
Having a stable wchan means the process must be blocked and must stay
that way while the stack unwinding is performed.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
tools/testing/selftests/net/ioam6.sh
7b1700e009 ("selftests: net: modify IOAM tests for undef bits")
bf77b1400a ("selftests: net: Test for the IOAM encapsulation with IPv6")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The number of system calls making use of pidfds is constantly
increasing. Some of those new system calls duplicate the code to turn a
pidfd into the task_struct it refers to. Give them a simple helper for
this, as sketched below.
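A sketch of such a helper, close to what the description implies but
hedged against the exact signature:

  struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags)
  {
          unsigned int f_flags;
          struct task_struct *task;
          struct pid *pid;

          pid = pidfd_get_pid(pidfd, &f_flags);
          if (IS_ERR(pid))
                  return ERR_CAST(pid);

          task = get_pid_task(pid, PIDTYPE_TGID);
          put_pid(pid);
          if (!task)
                  return ERR_PTR(-ESRCH);

          *flags = f_flags;
          return task;
  }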
Link: https://lore.kernel.org/r/20211004125050.1153693-2-christian.brauner@ubuntu.com
Link: https://lore.kernel.org/r/20211011133245.1703103-2-brauner@kernel.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Bobrowski <repnop@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Reviewed-by: Matthew Bobrowski <repnop@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
There is a small race between copy_process() and sched_fork()
where child->sched_task_group points to an already freed pointer.
        parent doing fork()      | someone moving the parent
                                 | to another cgroup
  -------------------------------+-------------------------------
  copy_process()
    + dup_task_struct()<1>
                                   parent move to another cgroup,
                                   and free the old cgroup. <2>
    + sched_fork()
      + __set_task_cpu()<3>
      + task_fork_fair()
        + sched_slice()<4>
In the worst case, this bug can lead to "use-after-free" and
cause panic as shown above:
(1) parent copy its sched_task_group to child at <1>;
(2) someone move the parent to another cgroup and free the old
cgroup at <2>;
(3) the sched_task_group and cfs_rq that belong to the old cgroup
will be accessed at <3> and <4>, which cause a panic:
[] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0
[] Oops: 0000 [#1] SMP PTI
[] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1
[] RIP: 0010:sched_slice+0x84/0xc0
[] Call Trace:
[] task_fork_fair+0x81/0x120
[] sched_fork+0x132/0x240
[] copy_process.part.5+0x675/0x20e0
[] ? __handle_mm_fault+0x63f/0x690
[] _do_fork+0xcd/0x3b0
[] do_syscall_64+0x5d/0x1d0
[] entry_SYSCALL_64_after_hwframe+0x65/0xca
[] RIP: 0033:0x7f04418cd7e1
Between cgroup_can_fork() and cgroup_post_fork(), the cgroup
membership and thus sched_task_group can't change. So update the child's
sched_task_group at sched_post_fork() and move task_fork() and
__set_task_cpu() (where the sched_task_group is accessed) from
sched_fork() to sched_post_fork().
Fixes: 8323f26ce3 ("sched: Fix race in task_group")
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com
numa_distance in cpu_attach_domain() was introduced in
commit b5b217346d ("sched/topology: Warn when NUMA diameter > 2")
to warn the user when NUMA diameter > 2, as we would misrepresent
the scheduler topology structures at that time. This was
fixed by Barry in commit 585b6d2723 ("sched/topology: fix the issue
groups don't span domain->span for NUMA diameter > 2"), and
numa_distance is unused now. So remove it.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210915063158.80639-1-yangyicong@hisilicon.com
Fix a few comments to help understand them better.
Signed-off-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20211004105706.3669-4-bharata@amd.com
numa_group::fault_cpus is actually a pointer to the region
in numa_group::faults[] where NUMA_CPU stats are located.
Remove this redundant member and use numa_group::faults[NUMA_CPU]
directly like it is done for similar per-process numa fault stats.
There is no functionality change due to this commit.
Signed-off-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20211004105706.3669-3-bharata@amd.com
While allocating group fault stats, task_numa_group()
is using the hard-coded number 4. Replace it with
NR_NUMA_HINT_FAULT_STATS.
No functionality change in this commit.
Signed-off-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20211004105706.3669-2-bharata@amd.com
Make sure to prod idle CPUs so they call klp_update_patch_state().
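A sketch of the idea (its exact placement in the KLP transition code is
an assumption):

  int cpu;

  /* kick idle CPUs so they pass through the idle loop and
     observe/complete the live-patch transition */
  for_each_possible_cpu(cpu)
          wake_up_if_idle(cpu);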
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929151723.162004989@infradead.org
When an event probe is to be removed via the API that created it via
dynamic events, an -ENOENT error is returned.
This is because the removal of the event probe does not expect to see the
event system and name that the event probe is attached to, even though
that's part of the API used to create it, and the removal of probes is
supposed to use the same API as their creation.
In fact, the removal is not consistent with the kprobes and uprobes
removal. Fix that by allowing various ways to remove the eprobe.
The eprobe is created with:
e:[GROUP/]NAME SYSTEM/EVENT [OPTIONS]
Have it get removed by echoing in the following into dynamic_events:
# Remove all eprobes with NAME
echo '-:NAME' >> dynamic_events
# Remove a specific eprobe
echo '-:GROUP/NAME' >> dynamic_events
echo '-:GROUP/NAME SYSTEM/EVENT' >> dynamic_events
echo '-:NAME SYSTEM/EVENT' >> dynamic_events
echo '-:GROUP/NAME SYSTEM/EVENT OPTIONS' >> dynamic_events
echo '-:NAME SYSTEM/EVENT OPTIONS' >> dynamic_events
Link: https://lkml.kernel.org/r/20211012081925.0e19cc4f@gandalf.local.home
Link: https://lkml.kernel.org/r/20211013205533.630722129@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Replace the obsolete and ambiguous macro in_irq() with the new
macro in_hardirq().
Link: https://lkml.kernel.org/r/20210930000342.6016-1-changbin.du@gmail.com
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull cgroup fixes from Tejun Heo:
"All documentation / comment updates"
* 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroupv2, docs: fix misinformation in "device controller" section
cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem
docs/cgroup: remove some duplicate words
Pull workqueue fixes from Tejun Heo:
"One patch to add a missing __printf annotation and the other to enable
deferred printing for debug dumps to avoid deadlocks when triggered
from some contexts (e.g. console drivers)"
* 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix state-dump console deadlock
workqueue: annotate alloc_workqueue() as printf
Console drivers often queue work while holding locks also taken in their
console write paths, something which can lead to deadlocks on SMP when
dumping workqueue state (e.g. sysrq-t or on suspend failures).
For serial console drivers this could look like:
CPU0                                CPU1
----                                ----
show_workqueue_state();
  lock(&pool->lock);                <IRQ>
                                      lock(&port->lock);
                                      schedule_work();
                                        lock(&pool->lock);
  printk();
    lock(console_owner);
                                      lock(&port->lock);
where workqueues are, for example, used to push data to the line
discipline, process break signals and handle modem-status changes. Line
disciplines and serdev drivers can also queue work on write-wakeup
notifications, etc.
Reworking every console driver to avoid queuing work while holding locks
also taken in their write paths would complicate drivers and is neither
desirable nor feasible.
Instead use the deferred-printk mechanism to avoid printing while
holding pool locks when dumping workqueue state.
Note that there are a few WARN_ON() assertions in the workqueue code
which could potentially also trigger a deadlock. Hopefully the ongoing
printk rework will provide a general solution for this eventually.
This was originally reported after a lockdep splat when executing
sysrq-t with the imx serial driver.
Fixes: 3494fc3084 ("workqueue: dump workqueues on sysrq-t")
Cc: stable@vger.kernel.org # 4.0
Reported-by: Fabio Estevam <festevam@denx.de>
Tested-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=0000000048cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138
It is a false-positive warning, due to broken logic in debug_dma_map_sg().
check_for_illegal_area() checks for overlay of sg elements with kernel text
or rodata. It is called with sg_dma_len(s) instead of s->length as
parameter. After the call to ->map_sg(), sg_dma_len() will contain the
length of possibly combined sg elements in the DMA address space, and not
the individual sg element length, which would be s->length.
The check will then use the physical start address of an sg element, and
add the DMA length for the overlap check, which could result in the false
warning, because the DMA length can be larger than the actual single sg
element length.
In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if
any of them were combined in ->map_sg().
Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").
While at it, as suggested by Robin Murphy, also move check_for_stack()
inside the new loop, as it is similarly concerned with validating the
individual sg elements.
Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schaefer@linux.ibm.com
Fixes: 884d05970b ("dma-debug: use sg_dma_len accessor")
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
htmldocs began producing the following warnings:
kernel/dma/mapping.c:256: WARNING: Definition list ends without a
blank line; unexpected unindent.
kernel/dma/mapping.c:257: WARNING: Bullet list ends without a blank
line; unexpected unindent.
Reformatting the list without hyphens fixes the warnings and produces
both a readable text and HTML output.
Fixes: fffe3cc8c2 ("dma-mapping: allow map_sg() ops to return negative error code")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Most architectures use an empty ftrace_dyn_arch_init(); introduce a weak
common ftrace_dyn_arch_init() to clean them up.
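The common weak definition is trivial; a sketch:

int __weak ftrace_dyn_arch_init(void)
{
	return 0;
}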
Link: https://lkml.kernel.org/r/20210909090216.1955240-1-o451686892@gmail.com
Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Acked-by: Helge Deller <deller@gmx.de> (parisc)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When building the files in the tracefs file system, do not by default set
any permissions for OTH (other). This will make it easier for admins who
want to define a group for accessing tracefs and not having to first
disable all the permission bits for "other" in the file system.
As tracing can leak sensitive information, it should never allow all
users access by default. An admin can still set the permission bits for
others to have access, which may be useful for creating a honeypot and
seeing who takes advantage of it and roots the machine.
Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With "make install", bpftool installs its binary and its bash completion
file. Usually, this is what we want. But a few components in the kernel
repository (namely, BPF iterators and selftests) also install bpftool
locally before using it. In such a case, bash completion is not
necessary and is just a useless build artifact.
Let's add an "install-bin" target to bpftool, to offer a way to install
the binary only.
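Usage sketch:
	make -C tools/bpf/bpftool install-bin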
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211007194438.34443-13-quentin@isovalent.com
API headers from libbpf should not be accessed directly from the
library's source directory. Instead, they should be exported with "make
install_headers". Let's make sure that bpf/preload/iterators/Makefile
installs the headers properly when building.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211007194438.34443-8-quentin@isovalent.com
API headers from libbpf should not be accessed directly from the
library's source directory. Instead, they should be exported with "make
install_headers". Let's make sure that bpf/preload/Makefile installs the
headers properly when building.
Note that we declare an additional dependency for iterators/iterators.o:
having $(LIBBPF_A) as a dependency to "$(obj)/bpf_preload_umd" is not
sufficient, as it makes it required only at the linking step. But we
need libbpf to be compiled, and in particular its headers to be
exported, before we attempt to compile iterators.o. The issue would not
occur before this commit, because libbpf's headers were not exported and
were always available under tools/lib/bpf.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211007194438.34443-7-quentin@isovalent.com
Today, when a signal whose default behavior is to generate a core dump
is delivered with a handler of SIG_DFL, not only that process but every
process that shares the mm is killed.
In the case of vfork this looks like a real-world problem. Consider
the following well-defined sequence:
	if (vfork() == 0) {
		execve(...);
		_exit(EXIT_FAILURE);
	}
If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).
Similarly, if the execve fails after the point of no return, the kernel
delivers SIGSEGV, which will kill both the exec'ing process and, because
the mm is shared, the process that called vfork as well.
As far as I can tell this behavior is a violation of people's
reasonable expectations and of POSIX, and it is unnecessarily fragile
when the system is low on memory.
Solve this by making a userspace visible change to only kill a single
process/thread group. This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening. With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.
To accomplish this, move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultaneously.
In zap_threads remove the work to kill anything except for the current
thread group.
v2: Remove core_state from the VM_BUG_ON_MM print to fix
compile failure when CONFIG_DEBUG_VM is enabled.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
[1] a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x36: call to __kasan_check_read() leaves .noinstr.text section
noinstr code cannot contain atomic_*() functions because they are
explicitly annotated; use arch_atomic_*() instead.
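A sketch of the pattern (the function shape and field names follow the RCU
code and are assumptions):

static noinstr unsigned long rcu_dynticks_inc(int incby)
{
	/* arch_ variant: not instrumented, so safe in .noinstr.text */
	return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}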
Fixes: 2be57f7328 ("rcu: Weaken ->dynticks accesses and updates")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU managed to grow a few noinstr violations:
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section
Fix them by adding __always_inline to the relevant trivial functions.
Also replace the noinstr with __always_inline for the existing
rcu_dynticks_task_*() functions, since noinstr would force them to be
noinline, even when empty, which seems silly.
Fixes: 7d0c9c50c5 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Merge tag 'net-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from xfrm, bpf, netfilter, and wireless.
Current release - regressions:
- xfrm: fix XFRM_MSG_MAPPING ABI breakage caused by inserting a new
value in the middle of an enum
- unix: fix an issue in unix_shutdown causing the other end
read/write failures
- phy: mdio: fix memory leak
Current release - new code bugs:
- mlx5e: improve MQPRIO resiliency against bad configs
Previous releases - regressions:
- bpf: fix integer overflow leading to OOB access in map element
pre-allocation
- stmmac: dwmac-rk: fix ethernet on rk3399 based devices
- netfilter: conntrack: fix boot failure with
nf_conntrack.enable_hooks=1
- brcmfmac: revert using ISO3166 country code and 0 rev as fallback
- i40e: fix freeing of uninitialized misc IRQ vector
- iavf: fix double unlock of crit_lock
Previous releases - always broken:
- bpf, arm: fix register clobbering in div/mod implementation
- netfilter: nf_tables: correct issues in netlink rule change event
notifications
- dsa: tag_dsa: fix mask for trunked packets
- usb: r8152: don't resubmit rx immediately to avoid soft lockup on
device unplug
- i40e: fix endless loop under rtnl if FW fails to correctly respond
to capability query
- mlx5e: fix rx checksum offload coexistence with ipsec offload
- mlx5: force round second at 1PPS out start time and allow it only
in supported clock modes
- phy: pcs: xpcs: fix incorrect CL37 AN sequence, EEE disable
sequence
Misc:
- xfrm: slightly rejig the new policy uAPI to make it less cryptic"
* tag 'net-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
net: prefer socket bound to interface when not in VRF
iavf: fix double unlock of crit_lock
i40e: Fix freeing of uninitialized misc IRQ vector
i40e: fix endless loop under rtnl
dt-bindings: net: dsa: marvell: fix compatible in example
ionic: move filter sync_needed bit set
gve: report 64bit tx_bytes counter from gve_handle_report_stats()
gve: fix gve_get_stats()
rtnetlink: fix if_nlmsg_stats_size() under estimation
gve: Properly handle errors in gve_assign_qpl
gve: Avoid freeing NULL pointer
gve: Correct available tx qpl check
unix: Fix an issue in unix_shutdown causing the other end read/write failures
net: stmmac: trigger PCS EEE to turn off on link down
net: pcs: xpcs: fix incorrect steps on disable EEE
netlink: annotate data races around nlk->bound
net: pcs: xpcs: fix incorrect CL37 AN sequence
net: sfp: Fix typo in state machine debug string
net/sched: sch_taprio: properly cancel timer from taprio_destroy()
net: bridge: fix under estimation in br_get_linkxstats_size()
...
Daniel Borkmann says:
====================
pull-request: bpf 2021-10-07
We've added 7 non-merge commits during the last 8 day(s) which contain
a total of 8 files changed, 38 insertions(+), 21 deletions(-).
The main changes are:
1) Fix ARM BPF JIT to preserve caller-saved regs for DIV/MOD JIT-internal
helper call, from Johan Almbladh.
2) Fix integer overflow in BPF stack map element size calculation when
used with preallocation, from Tatsuhiko Yasumatsu.
3) Fix an AF_UNIX regression due to added BPF sockmap support related
to shutdown handling, from Jiang Wang.
4) Fix a segfault in libbpf when generating light skeletons from objects
without BTF, from Kumar Kartikeya Dwivedi.
5) Fix a libbpf memory leak in strset to free the actual struct strset
itself, from Andrii Nakryiko.
6) Dual-license bpf_insn.h similarly as we did for libbpf and bpftool,
with ACKs from all contributors, from Luca Boccassi.
====================
Link: https://lore.kernel.org/r/20211007135010.21143-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The upper and lower variables are set as linked lists to add into the
sparse array. If they are still NULL after the needed allocations are
done, then there is nothing to add. But they need to be initialized to
NULL for this to work.
Link: https://lore.kernel.org/all/221bc7ba-a475-1cb9-1bbe-730bb9c2d448@canonical.com/
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The compiler warns when the data are actually unused:
kernel/trace/trace.c:1712:13: error: ‘trace_create_maxlat_file’ defined but not used [-Werror=unused-function]
1712 | static void trace_create_maxlat_file(struct trace_array *tr,
| ^~~~~~~~~~~~~~~~~~~~~~~~
[Why]
With CONFIG_HWLAT_TRACER=n, CONFIG_TRACER_MAX_TRACE=n and
CONFIG_OSNOISE_TRACER=y, gcc reports the warning above.
[How]
Currently, trace_create_maxlat_file() only takes effect when
CONFIG_HWLAT_TRACER=y or CONFIG_TRACER_MAX_TRACE=y. After the addition of
the osnoise tracer, it also needs to take effect for CONFIG_OSNOISE_TRACER=y.
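A simplified sketch of the guard (the in-tree condition also involves
CONFIG_FSNOTIFY, omitted here):

#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) || \
    defined(CONFIG_OSNOISE_TRACER)
static void trace_create_maxlat_file(struct trace_array *tr,
				     struct dentry *d_tracer);
#endif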
Link: https://lore.kernel.org/all/c1d9e328-ad7c-920b-6c24-9e1598a6421c@infradead.org/
Link: https://lkml.kernel.org/r/20210922025122.3268022-1-liu.yun@linux.dev
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Simplify and make wake_up_if_idle() more robust; also don't iterate
the whole machine with preempt_disable() in its caller:
wake_up_all_idle_cpus().
This prepares for another wake_up_if_idle() user that needs a full
do_idle() cycle.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.769328779@infradead.org
Instead of frobbing around with scheduler internals, use the shiny new
task_call_func() interface.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.709906138@infradead.org
Give try_invoke_on_locked_down_task() a saner name and have it return
an int so that the caller might distinguish between different reasons
of failure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org
Clarify and tighten try_invoke_on_locked_down_task().
Basically the function calls @func under task_rq_lock(), except it
avoids taking rq->lock when possible.
This makes calling @func unconditional (the function will get renamed
in a later patch to remove the try).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.589323576@infradead.org
Add support to wait on multiple futexes. This is the interface
implemented by this syscall:
futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
	    unsigned int flags, struct timespec *timeout, clockid_t clockid)

struct futex_waitv {
	__u64 val;
	__u64 uaddr;
	__u32 flags;
	__u32 __reserved;
};
Given an array of struct futex_waitv, wait on each uaddr. The thread
wakes if a futex_wake() is performed at any uaddr. The syscall returns
immediately if any waiter has *uaddr != val. *timeout is an optional
absolute timeout value for the operation. This syscall supports only
64bit sized timeout structs. The flags argument of the syscall should be
empty, but it can be used for future extensions. Flags for shared
futexes, sizes, etc. should be used on the individual flags of each
waiter.
__reserved is used for explicit padding and should be 0, but it might be
used for future extensions. If userspace uses 32-bit pointers, it should
make sure to explicitly cast them when assigning to waitv::uaddr.
The syscall returns the array index of one of the woken futexes. There is
no information about how many were woken, or any particular attribute of
the returned one (whether it was the first woken, whether it has the
smallest index, ...).
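A hypothetical userspace usage sketch (assumes v5.16+ uapi headers that
export struct futex_waitv, FUTEX_32 and __NR_futex_waitv; glibc provides
no wrapper, so the raw syscall is used):

	#include <linux/futex.h>	/* struct futex_waitv, FUTEX_32 */
	#include <stdint.h>
	#include <sys/syscall.h>	/* __NR_futex_waitv */
	#include <time.h>
	#include <unistd.h>

	uint32_t f1 = 0, f2 = 0;
	struct futex_waitv waiters[2] = {
		{ .val = 0, .uaddr = (uintptr_t)&f1, .flags = FUTEX_32 },
		{ .val = 0, .uaddr = (uintptr_t)&f2, .flags = FUTEX_32 },
	};

	/* Blocks until one futex is woken; returns that waiter's index. */
	long idx = syscall(__NR_futex_waitv, waiters, 2, 0, NULL, CLOCK_MONOTONIC);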
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210923171111.300673-17-andrealmeid@collabora.com
We need to make sure that all requeue operations take the hash bucket
locks in the same order to avoid deadlock. Simplify the current
double_lock_hb implementation by making sure hb1 is always the
"smallest" bucket to avoid extra checks.
[André: Add commit description]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-16-andrealmeid@collabora.com
Move the wait/wake bits into their own file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-15-andrealmeid@collabora.com
Move all the requeue bits into their own file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-14-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename:
s/mark_wake_futex/futex_wake_mark/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-13-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename:
s/match_futex/futex_match/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-12-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/hb_waiters_/futex_&/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-11-andrealmeid@collabora.com
Move the PI futex implementation into its own file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-10-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/\<\([^_ ]*\)_futex_value_locked/futex_\1_value_locked/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-9-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename:
s/hash_futex/futex_hash/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-8-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename:
s/__unqueue_futex/__futex_unqueue/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-7-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/queue_\(un\)*lock/futex_q_\1lock/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-6-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/futex_wait_queue_me/futex_wait_queue/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-5-andrealmeid@collabora.com
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/\<\(__\)*\(un\)*queue_me/\1futex_\2queue/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-4-andrealmeid@collabora.com
Put the syscalls in their own little file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-3-andrealmeid@collabora.com
In preparation for the splitup.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-2-andrealmeid@collabora.com
Instead of a full barrier around the RMW insn, micro-optimize
for weakly ordered archs such that we only provide the required
ACQUIRE semantics when taking the read lock.
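A sketch of the trylock fast path with the weaker ordering (the function
shape is assumed from the description):

static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
{
	int r;

	for (r = atomic_read(&rwb->readers); r < 0;) {
		/* A successful cmpxchg only needs ACQUIRE semantics here,
		 * not a fully ordered RMW: */
		if (likely(atomic_try_cmpxchg_acquire(&rwb->readers, &r, r + 1)))
			return 1;
	}
	return 0;
}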
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20210920052031.54220-2-dave@stgolabs.net
Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
before PTRACE_EVENT_EXIT, and before any cleanup work for a task
happens. This ensures that an accurate copy of the process can be
captured in the coredump as no cleanup for the process happens before
the coredump completes. This also ensures that PTRACE_EVENT_EXIT
will not be visited by any thread until the coredump is complete.
Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
coredump_task_exit can be recognized and ignored in zap_process.
Now that all of the coredumping happens before exit_mm, remove the code
to test for a coredump in progress from mm_release.
Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
The other tests in may_ptrace_stop all concern avoiding stopping
during a coredump. These tests are no longer necessary as it is now
guaranteed that fatal_signal_pending will be set if the code enters
ptrace_stop during a coredump. The code in ptrace_stop is guaranteed
not to stop if fatal_signal_pending returns true.
Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
ptrace_stop without fatal_signal_pending being true, as signals are
dequeued in get_signal before calling do_exit. This is no longer
an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
until after the coredump completes.
Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Separate the coredump logic from the ordinary exit_mm logic
by moving the coredump logic out of exit_mm into its own
function coredump_exit_mm.
Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
exit_code and a siginfo structure. Neither argument is used by any of
the implementations so just remove the unneeded arguments.
The two architectures that implement arch_ptrace_stop are ia64 and
sparc. Both architectures flush their register stacks before a
ptrace stop so that all of the register information can be accessed
by debuggers.
As the question of whether a register stack needs to be flushed is
independent of why ptrace is stopping, not needing arguments makes sense.
Cc: David Miller <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The existence of sigkill_pending is a little silly as it is
functionally a duplicate of fatal_signal_pending that is used in
exactly one place.
Checking for pending fatal signals and returning early in ptrace_stop
is actively harmful. It causes the ptrace_stop called by
ptrace_signal to return early before setting current->exit_code.
Later, when ptrace_signal reads the signal number from
current->exit_code, the value is undefined, making it unpredictable
what will happen.
Instead rely on the fact that schedule will not sleep if there is a
pending signal that can awaken a task.
Removing the explicit sigkill_pending test fixes ptrace_signal
when ptrace_stop does not stop, because current->exit_code is always
set to signr.
Cc: stable@vger.kernel.org
Fixes: 3d749b9e67 ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
Fixes: 1a669c2f16 ("Add arch_ptrace_stop")
Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When !SCHEDSTATS, schedstat_enabled() is an unconditional 0 and the
whole block doesn't exist; however, GCC figures the scoped variable
'stats' is unused and complains about it.
Upgrade the warning from -Wunused-variable to -Wunused-but-set-variable
by writing it in two statements. This fixes the build because the new
warning is only emitted at W=1.
Given that whole if(0) {} thing, I don't feel motivated to change
things overly much and quite strongly feel this is the compiler being
daft.
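A sketch of the two-statement form (an unused 'stats' then only trips
-Wunused-but-set-variable under W=1 rather than -Wunused-variable):

	struct sched_statistics *stats;

	stats = __schedstats_from_se(se);	/* may end up unused if !SCHEDSTATS */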
Fixes: cb3e971c435d ("sched: Make struct sched_statistics independent of fair sched class")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
This adds selftests that test the success and failure paths for module
kfuncs (in presence of invalid kfunc calls) for both libbpf and
gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can
add module BTF ID set from bpf_testmod.
This also introduces a couple of test cases to verifier selftests for
validating whether we get an error or not depending on if invalid kfunc
call remains after elimination of unreachable instructions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com
This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementations set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.
This lifts the restriction that they be compiled as built-in objects;
they can now be loaded as modules if required. Also modify
Makefile.modfinal to call resolve_btfids for each module.
Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.
See following commits for background on use of:
CONFIG_X86 ifdef:
569c484f99 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)
CONFIG_DYNAMIC_FTRACE ifdef:
7aae231ac9 (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com
This adds helpers for registering btf_id_set from modules and the
bpf_check_mod_kfunc_call callback that can be used to look them up.
With in-kernel sets, the way this is supposed to work is that the
in-kernel callback looks up within the in-kernel kfunc whitelist, and
then defers to the dynamic BTF set lookup if it doesn't find the BTF id.
If there is no in-kernel BTF id set, this callback can be used directly.
Also fix includes for btf.h and bpfptr.h so that they can included in
isolation. This is in preparation for their usage in tcp_bbr, tcp_cubic
and tcp_dctcp modules in the next patch.
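A sketch of the intended module-side usage, built on the helpers this
series adds (the set and list names here are illustrative):

BTF_SET_START(bpf_testmod_kfunc_ids)
BTF_ID(func, bpf_testmod_test_mod_kfunc)
BTF_SET_END(bpf_testmod_kfunc_ids)

static DEFINE_KFUNC_BTF_ID_SET(&bpf_testmod_kfunc_ids, bpf_testmod_kfunc_btf_set);

/* in module init/exit: */
register_kfunc_btf_id_set(&prog_test_kfunc_list, &bpf_testmod_kfunc_btf_set);
unregister_kfunc_btf_id_set(&prog_test_kfunc_list, &bpf_testmod_kfunc_btf_set);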
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-4-memxor@gmail.com
This patch also modifies the BPF verifier to only return error for
invalid kfunc calls specially marked by userspace (with insn->imm == 0,
insn->off == 0) after the verifier has eliminated dead instructions.
This can be handled in the fixup stage, and processing can be skipped
during the add and check stages.
If such an invalid call is dropped, the fixup stage will not encounter
insn->imm as 0; otherwise, it bails out and returns an error.
This will be exposed as weak ksym support in libbpf in later patches.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-3-memxor@gmail.com
This change adds support on the kernel side to allow for BPF programs to
call kernel module functions. Userspace will prepare an array of module
BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
In the kernel, the module BTFs are placed in the auxiliary struct for
bpf_prog, and loaded as needed.
The verifier then uses insn->off to index into the fd_array. insn->off
0 is reserved for vmlinux BTF (for backwards compat), so userspace must
use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
sorted based on offset in an array, and each offset corresponds to one
descriptor, with a max limit up to 256 such module BTFs.
We also change existing kfunc_tab to distinguish each element based on
imm, off pair as each such call will now be distinct.
Another change is to the check_kfunc_call callback, which now includes a
struct module * pointer; this is used in a later patch such that the
kfunc_id and module pointer are matched for dynamically registered BTF
sets from loadable modules, so that the same kfunc_id in two modules
doesn't lead to check_kfunc_call succeeding. For the duration of the
check_kfunc_call, the reference to struct module exists, as it returns
the pointer stored in kfunc_btf_tab.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
When the trace_pid_list was created, the default pid max was 32768.
Creating a bitmask that can hold one bit for all 32768 pids took up 4096
bytes (one page). Having a one-page bitmask was not much of a problem, and
that was used for mapping pids. But today, systems are bigger and can run
more tasks, and now the default pid_max is usually set to 4194304, which
means handling that many pids requires 524288 bytes. Worse yet, the
pid_max can be set to 2^30 (1073741824 or 1G), which would take 134217728
bytes (128M) of memory to store this array.
Since the pid_list array is very sparsely populated, it is a huge waste of
memory to store all possible bits for each pid when most will not be set.
Instead, use a page table scheme to store the array, and allow this to
handle up to 30 bit pids.
The pid_mask will start out with 256 entries for the first 8 MSB bits.
This will cost 1K for 32-bit architectures and 2K for 64-bit. Each of
these will have a 256-entry array to store the next 8 bits of the pid
(another 1 or 2K). These will hold a 2K-byte bitmask (which will cover the
LSB 14 bits, or 16384 pids).
When the trace_pid_list is allocated, it will have the 1K or 2K of upper
bits allocated, and then it will allocate a cache for the next upper
chunks and the lower chunks (default 6 of each). Then when a bit is "set",
these chunks will be pulled from the free list and added to the array. If
the free list gets down to a level (default 2), it will trigger an irqwork
that will refill the cache back up.
On clearing a bit, if the clear causes the bitmask to be zero, that chunk
will then be placed back into the free cache for later use, keeping the
need to allocate more down to a minimum.
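The resulting split of a 30-bit pid, as a sketch (the mask and shift
values follow from the sizes described above):

	upper1 = (pid >> 22) & 0xff;	/* top 8 bits: first 256-entry table  */
	upper2 = (pid >> 14) & 0xff;	/* next 8 bits: second 256-entry table */
	lower  =  pid & 0x3fff;		/* low 14 bits: bit in the 2K bitmask  */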
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of having the logic that does trace_pid_list open coded, wrap it in
abstract functions. This will allow a rewrite of the logic that implements
the trace_pid_list without affecting the users.
Note, this causes a change in behavior. Every time a pid is written into
the set_*_pid file, it creates a new list and uses RCU to update it. If
pid_max is lowered, but there was a pid currently in the list that was
higher than pid_max, those pids will now be removed on updating the list.
The old behavior kept that from happening.
The rewrite of the pid_list logic will no longer depend on pid_max,
and will return the old behavior.
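A sketch of the abstract interface (helper names as introduced by this
change):

	struct trace_pid_list *pid_list = trace_pid_list_alloc();

	trace_pid_list_set(pid_list, pid);		/* add pid to the filter */
	if (trace_pid_list_is_set(pid_list, pid))
		trace_pid_list_clear(pid_list, pid);	/* and remove it again */
	trace_pid_list_free(pid_list);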
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Found an issue within the cgroup_attach_task_all() fn which seems
to exclude cgrp_dfl_root (cgroupv2) while attaching tasks to
the given cgroup. This was noticed when the system was running
qemu/kvm with kernel vhost helper threads. It appears that the
vhost layer which uses cgroup_attach_task_all() fn to assign the
vhost kthread to the right qemu cgroup works fine with cgroupv1
based configuration but not in cgroupv2. With cgroupv2, the vhost
helper thread ends up just belonging to the root cgroup as is
shown below:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
$ sudo pgrep qemu
1916421
$ ps -eL | grep 1916421
1916421 1916421 ? 00:00:01 qemu-system-x86
1916421 1916431 ? 00:00:00 call_rcu
1916421 1916435 ? 00:00:00 IO mon_iothread
1916421 1916436 ? 00:00:34 CPU 0/KVM
1916421 1916439 ? 00:00:00 SPICE Worker
1916421 1916440 ? 00:00:00 vnc_worker
1916433 1916433 ? 00:00:00 vhost-1916421
1916437 1916437 ? 00:00:00 kvm-pit/1916421
$ cat /proc/1916421/cgroup
0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator
$ cat /proc/1916439/cgroup
0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator
$ cat /proc/1916433/cgroup
0::/
From above, it can be seen that the vhost kthread (PID: 1916433)
doesn't seem to belong to the qemu cgroup like other qemu PIDs.
After applying this patch:
$ pgrep qemu
1643
$ ps -eL | grep 1643
1643 1643 ? 00:00:00 qemu-system-x86
1643 1645 ? 00:00:00 call_rcu
1643 1648 ? 00:00:00 IO mon_iothread
1643 1649 ? 00:00:00 CPU 0/KVM
1643 1652 ? 00:00:00 SPICE Worker
1643 1653 ? 00:00:00 vnc_worker
1647 1647 ? 00:00:00 vhost-1643
1651 1651 ? 00:00:00 kvm-pit/1643
$ cat /proc/1647/cgroup
0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The Energy Model has a 1:1 mapping between OPPs and performance states
(em_perf_state). If a CPUFreq driver registers an Energy Model,
inefficiencies found by the latter can be applied to CPUFreq.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The new performance domain flag EM_PERF_DOMAIN_SKIP_INEFFICIENCIES allows
inefficient states to be left out of energy-consumption estimates. This
intends to let the Energy Model know that CPUFreq itself will skip
inefficiencies and such states don't need to be part of the estimation
anymore.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge the current "milliwatts" option into a "flag" field. This intends to
prepare the extension of this structure for inefficient states support in
the Energy Model.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some SoCs, such as the sd855, have OPPs within the same performance domain
whose cost is higher than others with a higher frequency. Even though
those OPPs are interesting from a cooling perspective, it makes no sense
to use them when the device can run at full capacity. Those OPPs handicap
the performance domain, when choosing the most energy-efficient CPU and
are wasting energy. They are inefficient.
Hence, add support for such OPPs to the Energy Model. The table can now
be read skipping inefficient performance states (and by extension,
inefficient OPPs).
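A sketch of such a lookup, assuming a per-state EM_PERF_STATE_INEFFICIENT
flag on struct em_perf_state:

static int em_pd_get_efficient_state(struct em_perf_domain *pd,
				     unsigned long freq)
{
	int i;

	for (i = 0; i < pd->nr_perf_states; i++) {
		struct em_perf_state *ps = &pd->table[i];

		/* Pick the lowest efficient state able to serve freq. */
		if (ps->frequency >= freq &&
		    !(ps->flags & EM_PERF_STATE_INEFFICIENT))
			return i;
	}
	return pd->nr_perf_states - 1;	/* fall back to the highest state */
}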
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, a debug message is printed if an inefficient state is detected
in the Energy Model. Unfortunately, it won't detect if the first state is
inefficient or if two successive states are. Fix this behavior.
Fixes: 27871f7a8a ("PM: Introduce an Energy Model management framework")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since commit 89aafd67f2 ("sched/fair: Use prev instead of new target as recent_used_cpu"),
p->recent_used_cpu is unconditionally set with prev.
Fixes: 89aafd67f2 ("sched/fair: Use prev instead of new target as recent_used_cpu")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210928103544.27489-1-vincent.guittot@linaro.org
Neither wq_worker_sleeping() nor io_wq_worker_sleeping() needs to be
invoked with preemption disabled:
- The worker flag checks operations only need to be serialized against
the worker thread itself.
- The accounting and worker pool operations are serialized with locks.
which means that disabling preemption has neither a reason nor a
value. Remove it and update the stale comment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.kernel.org/r/8735pnafj7.ffs@tglx
Doing cleanups in the tail of schedule() is a latency punishment for the
incoming task. The point of invoking kprobe_flush_task() for a dead task
is that the instances are returned and cannot leak when __schedule() is
kprobed.
Move it into the delayed cleanup.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de
The queued remote wakeup mechanism has turned out to be suboptimal for
RT-enabled kernels. The maximum latencies go up by a factor of > 5x in
certain scenarios.
This is caused by either long wake lists or by a large number of TTWU IPIs
which are processed back to back.
Disable it for RT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.482262764@linutronix.de
Batched task migrations are a source for large latencies as they keep the
scheduler from running while processing the migrations.
Limit the batch size to 8 instead of 32 when running on a RT enabled
kernel.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de
mmdrop() is invoked from finish_task_switch() by the incoming task to drop
the mm which was handed over by the previous task. mmdrop() can be quite
expensive which prevents an incoming real-time task from getting useful
work done.
Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT
kernels it delegates the eventually required invocation of __mmdrop()
to RCU.
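A sketch, assuming an rcu_head ("delayed_drop") is added to mm_struct for
this purpose:

#ifdef CONFIG_PREEMPT_RT
static void __mmdrop_delayed(struct rcu_head *rhp)
{
	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);

	__mmdrop(mm);
}

static inline void mmdrop_sched(struct mm_struct *mm)
{
	/* Punt the expensive teardown to RCU instead of the incoming task. */
	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
}
#else
static inline void mmdrop_sched(struct mm_struct *mm)
{
	mmdrop(mm);
}
#endif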
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de
Make cookie functions static as these are no longer invoked directly
by other code.
No functional change intended.
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210922085735.52812-1-zhangshaokun@hisilicon.com
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the destination CPU, dst_cpu, but also of
its SMT siblings.
If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.
Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
siblings of both dst_cpu and the CPUs in the candidate busiest group.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calderon@linux.intel.com
Create a separate function, sched_asym(). A subsequent changeset will
introduce logic to deal with SMT in conjunction with asymmetric
packing. Such logic will need the statistics of the scheduling
group provided as an argument. Update them before calling sched_asym().
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-6-ricardo.neri-calderon@linux.intel.com
Before deciding to pull tasks when using asymmetric packing of tasks,
on some architectures (e.g., x86) it is necessary to know not only the
state of dst_cpu but also of its SMT siblings. The decision to classify
a candidate busiest group as group_asym_packing is done in
update_sg_lb_stats(). Give this function access to the scheduling domain
statistics, which contains the statistics of the local group.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-5-ricardo.neri-calderon@linux.intel.com
sched_asym_prefer() always returns false when called on the local group. By
checking local_group, we can avoid additional checks and invoking
sched_asym_prefer() when it is not needed. No functional changes are
introduced.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-4-ricardo.neri-calderon@linux.intel.com
There exist situations in which the load balance needs to know the
properties of the CPUs in a scheduling group. When using asymmetric
packing, for instance, the load balancer needs to know not only the
state of dst_cpu but also of its SMT siblings, if any.
Use the flags of the child scheduling domains to initialize scheduling
group flags. This will reflect the properties of the CPUs in the
group.
A subsequent changeset will make use of these new flags. No functional
changes are introduced.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com
With threaded interrupts enabled, the nouveau driver reported the
following:
| Chain exists of:
| &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem
|
| Possible unsafe locking scenario:
|
|  CPU0                    CPU1
|  ----                    ----
|  lock(&cpuset_rwsem);
|                          lock(&device->mutex);
|                          lock(&cpuset_rwsem);
|  lock(&mm->mmap_lock#2);
The device->mutex is nvkm_device::mutex.
Unblocking the lockchain at `cpuset_rwsem' is probably the easiest
thing to do. Move the priority reset to the start of the newly
created thread.
Fixes: 710da3c8ea ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
Currently the boot-defined preempt behaviour (aka dynamic preempt)
selects full preemption by default when the "preempt=" boot parameter
is omitted. However, distros may rather want to default to either
no preemption or voluntary preemption.
To provide this flexibility, make dynamic preemption a visible
Kconfig option and adapt the preemption behaviour selected by the user
to either static or dynamic preemption.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210914103134.11309-1-frederic@kernel.org
After making the struct sched_statistics and its helpers independent of
the fair sched class, we can easily use the schedstats facility for the
deadline sched class.
The schedstat usage in the DL sched class is similar to that in the fair
sched class, for example:
		fair				deadline
enqueue		update_stats_enqueue_fair	update_stats_enqueue_dl
dequeue		update_stats_dequeue_fair	update_stats_dequeue_dl
put_prev_task	update_stats_wait_start		update_stats_wait_start_dl
set_next_task	update_stats_wait_end		update_stats_wait_end_dl
The user can get the schedstats information in the same way as in the
fair sched class, for example:
		fair			deadline
		/proc/[pid]/sched	/proc/[pid]/sched
The output of a deadline task's schedstats is as follows:
$ cat /proc/69662/sched
...
se.sum_exec_runtime : 3067.696449
se.nr_migrations : 0
sum_sleep_runtime : 720144.029661
sum_block_runtime : 0.547853
wait_start : 0.000000
sleep_start : 14131540.828955
block_start : 0.000000
sleep_max : 2999.974045
block_max : 0.283637
exec_max : 1.000269
slice_max : 0.000000
wait_max : 0.002217
wait_sum : 0.762179
wait_count : 733
iowait_sum : 0.547853
iowait_count : 3
nr_migrations_cold : 0
nr_failed_migrations_affine : 0
nr_failed_migrations_running : 0
nr_failed_migrations_hot : 0
nr_forced_migrations : 0
nr_wakeups : 246
nr_wakeups_sync : 2
nr_wakeups_migrate : 0
nr_wakeups_local : 244
nr_wakeups_remote : 2
nr_wakeups_affine : 0
nr_wakeups_affine_attempts : 0
nr_wakeups_passive : 0
nr_wakeups_idle : 0
...
The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace deadline tasks as well.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-9-laoar.shao@gmail.com
The runtime of a DL task is already tracked, so we only need to
add a tracepoint.
One difference between a fair task and a DL task is that there is no
vruntime in a DL task. To reuse the sched_stat_runtime tracepoint, '0'
is passed as the vruntime for a DL task.
The output of this tracepoint for a DL task is as follows:
top-36462 [047] d.h. 6083.452103: sched_stat_runtime: comm=top pid=36462 runtime=409898 [ns] vruntime=0 [ns]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-8-laoar.shao@gmail.com
We want to measure the latency of RT tasks in our production
environment with the schedstats facility, but currently schedstats is
only supported for the fair sched class. This patch enables it for the
RT sched class as well.
After making the struct sched_statistics and its helpers independent of
the fair sched class, we can easily use the schedstats facility for the
RT sched class.
The schedstat usage in the RT sched class is similar to that in the fair
sched class, for example:
		fair				RT
enqueue		update_stats_enqueue_fair	update_stats_enqueue_rt
dequeue		update_stats_dequeue_fair	update_stats_dequeue_rt
put_prev_task	update_stats_wait_start		update_stats_wait_start_rt
set_next_task	update_stats_wait_end		update_stats_wait_end_rt
The user can get the schedstats information in the same way as in the
fair sched class, for example:
		fair			RT
		/proc/[pid]/sched	/proc/[pid]/sched
schedstats is not supported for RT groups.
The output of an RT task's schedstats is as follows:
$ cat /proc/10349/sched
...
sum_sleep_runtime : 972.434535
sum_block_runtime : 960.433522
wait_start : 188510.871584
sleep_start : 0.000000
block_start : 0.000000
sleep_max : 12.001013
block_max : 952.660622
exec_max : 0.049629
slice_max : 0.000000
wait_max : 0.018538
wait_sum : 0.424340
wait_count : 49
iowait_sum : 956.495640
iowait_count : 24
nr_migrations_cold : 0
nr_failed_migrations_affine : 0
nr_failed_migrations_running : 0
nr_failed_migrations_hot : 0
nr_forced_migrations : 0
nr_wakeups : 49
nr_wakeups_sync : 0
nr_wakeups_migrate : 0
nr_wakeups_local : 49
nr_wakeups_remote : 0
nr_wakeups_affine : 0
nr_wakeups_affine_attempts : 0
nr_wakeups_passive : 0
nr_wakeups_idle : 0
...
The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well. The output of these tracepoints for
an RT task is as follows:
- runtime
stress-10352 [004] d.h. 1035.382286: sched_stat_runtime: comm=stress pid=10352 runtime=995769 [ns] vruntime=0 [ns]
[vruntime=0 means it is an RT task]
- wait
<idle>-0 [004] dN.. 1227.688544: sched_stat_wait: comm=stress pid=10352 delay=46849882 [ns]
- blocked
kworker/4:1-465 [004] dN.. 1585.676371: sched_stat_blocked: comm=stress pid=17194 delay=189963 [ns]
- iowait
kworker/4:1-465 [004] dN.. 1585.675330: sched_stat_iowait: comm=stress pid=17189 delay=182848 [ns]
- sleep
sleep-18194 [023] dN.. 1780.891840: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001160770 [ns]
sleep-18196 [023] dN.. 1781.893208: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001161970 [ns]
sleep-18197 [023] dN.. 1782.894544: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001128840 [ns]
[ In sleep.sh, it sleeps 1 sec each time. ]
[lkp@intel.com: reported build failure in earlier version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-7-laoar.shao@gmail.com
The runtime of an RT task is already tracked, so we only need to add a
tracepoint.
One difference between a fair task and an RT task is that an RT task has
no vruntime. To reuse the sched_stat_runtime tracepoint, '0' is passed
as the vruntime for RT tasks.
The output of this tracepoint for an RT task is as follows:
stress-9748 [039] d.h. 113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
stress-9748 [039] d.h. 113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
stress-9748 [039] d.h. 113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-6-laoar.shao@gmail.com
Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
there is no metric to show how long a task stays in D state. Once a task
is in D state, it is blocked in the kernel, for example waiting for a
mutex. The D state is more frequent than iowait, and it is more critical
than S state, so it is worth adding a metric to measure it.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com
The original prototype of the schedstats helpers is
update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)
The cfs_rq in these helpers is used to get the rq_clock, and the se is
used to get the struct sched_statistics and the struct task_struct. In
order to make these helpers available to all sched classes, we can pass
the rq, sched_statistics and task_struct directly.
Then the new helpers are
update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
which are independent of fair sched class.
To avoid vmlinux growing too large or introducing overhead when
!schedstat_enabled(), some new helpers guarded by schedstat_enabled()
are also introduced, as suggested by Mel. These helpers are in
sched/stats.c:
__update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
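For example, a caller-side wrapper following this pattern might look like
the minimal sketch below, assuming the helper names above (not the
verbatim kernel code):
  static inline void
  update_stats_wait_start(struct rq *rq, struct task_struct *p,
                          struct sched_statistics *stats)
  {
          if (!schedstat_enabled())
                  return;

          /* Out-of-line body lives in sched/stats.c. */
          __update_stats_wait_start(rq, p, stats);
  }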
The size of vmlinux is as follows:
                       Before         After
Size of vmlinux        826308552      826304640
The size is a little smaller, as some functions are no longer inlined
after the change.
I also compared the sched performance with 'perf bench sched pipe', as
suggested by Mel. The result is as follows (in usecs/op):
                               Before       After
kernel.sched_schedstats=0      5.2~5.4      5.2~5.4
kernel.sched_schedstats=1      5.3~5.5      5.3~5.5
[These data differ slightly from the previous version because my old
test machine was destroyed, so I had to use a different one.]
Almost no difference.
No functional change.
[lkp@intel.com: reported build failure in prev version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20210905143547.4668-4-laoar.shao@gmail.com
If we want to use the schedstats facility to trace other sched classes,
we should make it independent of the fair sched class. The struct
sched_statistics is the scheduler statistics of a task_struct or a
task_group, so we can move it into struct task_struct and struct
task_group to achieve the goal.
After the patch, schedstats is organized as follows:
struct task_struct {
...
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
...
struct sched_statistics stats;
...
};
Regarding the task group, schedstats is only supported for fair group
sched, and a new struct sched_entity_stats is introduced, as suggested
by Peter:
struct sched_entity_stats {
struct sched_entity se;
struct sched_statistics stats;
} __no_randomize_layout;
Then, with the se embedded in a task_group, we can easily get the stats,
as sketched below.
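A minimal sketch of that lookup, assuming the entity_is_task() and
task_of() helpers from the fair class:
  static inline struct sched_statistics *
  __schedstats_from_se(struct sched_entity *se)
  {
  #ifdef CONFIG_FAIR_GROUP_SCHED
          if (!entity_is_task(se))
                  return &container_of(se, struct sched_entity_stats, se)->stats;
  #endif
          return &task_of(se)->stats;
  }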
The sched_statistics members may be modified frequently when schedstats
is enabled. To avoid disturbing unrelated data that may share a
cacheline with them, struct sched_statistics is defined as cacheline
aligned.
As this patch changes a core scheduler struct, I verified its
performance impact with 'perf bench sched pipe', as suggested by Mel.
Below is the result; all values are in usecs/op.
                               Before       After
kernel.sched_schedstats=0      5.2~5.4      5.2~5.4
kernel.sched_schedstats=1      5.3~5.5      5.3~5.5
[These data differ slightly from the earlier version because my old
test machine was destroyed, so I had to use a different one.]
Almost no impact on sched performance.
No functional change.
[lkp@intel.com: reported build failure in earlier version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
schedstat_enabled() has been already checked, so we can use
__schedstat_set() directly.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20210905143547.4668-2-laoar.shao@gmail.com
Two new statistics are introduced to show the internals of the burst
feature and to explain why burst helps or not.
nr_bursts:  number of periods in which a bandwidth burst occurred
burst_time: cumulative wall-time (in nanoseconds) that any CPU has
            used above quota in the respective periods
Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com
Give reduced sleeper credit to SCHED_IDLE entities. As a result, woken
SCHED_IDLE entities will take longer to preempt normal entities.
The benefit of this change is to make it less likely that a newly woken
SCHED_IDLE entity will preempt a short-running normal entity before it
blocks.
We still give a small sleeper credit to SCHED_IDLE entities, so that
idle<->idle competition retains some fairness.
Example: with HZ=1000, four threads were spawned and affined to one cpu,
one of which was set to SCHED_IDLE. Without this patch, wakeup latency
for the SCHED_IDLE thread was ~1-2ms; with the patch, the wakeup latency
was ~5ms.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Link: https://lore.kernel.org/r/20210820010403.946838-5-joshdon@google.com
Use a small, non-scaled min granularity for SCHED_IDLE entities, when
competing with normal entities. This reduces the latency of getting
a normal entity back on cpu, at the expense of increased context
switch frequency of SCHED_IDLE entities.
The benefit of this change is to reduce the round-robin latency for
normal entities when competing with a SCHED_IDLE entity.
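A simplified sketch of the slice clamping described above; the constant
and helper names here are assumptions for illustration, not the verbatim
kernel code:
  u64 min_gran;

  /* SCHED_IDLE entities get a small, unscaled minimum slice so a normal
   * entity gets back on the cpu quickly. */
  if (se_is_idle(se))
          min_gran = sysctl_sched_idle_min_granularity;
  else
          min_gran = sysctl_sched_min_granularity;

  slice = max_t(u64, slice, min_gran);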
Example: on a machine with HZ=1000, two threads were spawned and affined
to one cpu, one of them SCHED_IDLE. Without this patch, the SCHED_IDLE
thread runs for 4ms and then waits for 1.4s. With this patch, it runs
for 1ms and waits 340ms (as it round-robins with the other thread).
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
Adds cfs_rq->idle_nr_running, which accounts the number of idle entities
directly enqueued on the cfs_rq.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210820010403.946838-3-joshdon@google.com
Tao suggested a two-pass task selection to avoid the retry loop.
Not only does it avoid the retry loop, it results in *much* simpler
code.
This also fixes an issue spotted by Josh Don where, for SMT3+, we can
forget to update max on the first pass and get to do an extra round.
Suggested-by: Tao Zhou <tao.zhou@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Reviewed-by: Vineeth Pillai (Microsoft) <vineethrp@gmail.com>
Link: https://lkml.kernel.org/r/YSS9+k1teA9oPEKl@hirez.programming.kicks-ass.net
With PREEMPT_RT enabled all hrtimer callbacks will be invoked in
softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD.
During boot kthread_bind() is used for the creation of per-CPU threads
and then hangs in wait_task_inactive() if ksoftirqd is not yet up and
running.
The hang had disappeared since commit
26c7295be0 ("kthread: Do not preempt current task if it is going to call schedule()")
but enabling function tracing at boot reliably brings the boot-time
freeze back.
The timer in wait_task_inactive() cannot be reached directly from a user
interface, so it cannot be abused to create a mass wake-up of several
tasks at the same time, which would lead to long sections with disabled
interrupts. Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD.
Switch the timer to HRTIMER_MODE_REL_HARD.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de
Consider a system with some NOHZ-idle CPUs, such that
nohz.idle_cpus_mask = S
nohz.next_balance = T
When a new CPU k goes NOHZ idle (nohz_balance_enter_idle()), we end up
with:
nohz.idle_cpus_mask = S ∪ {k}
nohz.next_balance = T
Note that the nohz.next_balance hasn't changed - it won't be updated until
a NOHZ balance is triggered. This is problematic if the newly NOHZ idle CPU
has an earlier rq.next_balance than the other NOHZ idle CPUs, IOW if:
cpu_rq(k).next_balance < nohz.next_balance
In such scenarios, the existing nohz.next_balance will prevent any NOHZ
balance from happening, which itself will prevent nohz.next_balance from
being updated to this new cpu_rq(k).next_balance. Unnecessary load balance
delays of over 12ms caused by this were observed on an arm64 RB5 board.
Use the new nohz.needs_update flag to mark the presence of newly-idle CPUs
that need their rq->next_balance to be collated into
nohz.next_balance. Trigger a NOHZ_NEXT_KICK when the flag is set.
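A sketch of that flow, using the flag and kick names from the text (the
placement and surrounding logic are assumptions):
  /* nohz_balance_enter_idle(): a newly idle CPU's rq->next_balance must
   * be folded into nohz.next_balance at the next opportunity. */
  WRITE_ONCE(nohz.needs_update, 1);

  /* nohz_balancer_kick(): trigger a balance whose only job may be to
   * refresh nohz.next_balance. */
  if (READ_ONCE(nohz.needs_update))
          flags |= NOHZ_NEXT_KICK;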
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210823111700.2842997-3-valentin.schneider@arm.com
A following patch will trigger NOHZ idle balances as a means to update
nohz.next_balance. Vincent noted that blocked load updates can have
non-negligible overhead, which should be avoided if the intent is to only
update nohz.next_balance.
Add a new NOHZ balance kick flag, NOHZ_NEXT_KICK. Gate NOHZ blocked load
update by the presence of NOHZ_STATS_KICK - currently all NOHZ balance
kicks will have the NOHZ_STATS_KICK flag set, so no change in behaviour is
expected.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210823111700.2842997-2-valentin.schneider@arm.com
Similar to:
commit 8b8e6b5d3b ("kallsyms: strip ThinLTO hashes from static
functions")
It's very common for compilers to modify the symbol name for static
functions as part of optimizing transformations. That makes hooking
static functions (that weren't inlined or DCE'd) with kprobes difficult.
LLVM has yet another name mangling scheme used by thin LTO.
Combine handling of the various schemes by truncating after the first
'.'. Strip off these suffixes so that we can continue to hook such
static functions. Clang releases prior to clang-13 used '$' instead
of '.'.
Link: https://reviews.llvm.org/rGc6e5c4654bd5045fe22a1a52779e48e2038a404c
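A minimal sketch of the combined stripping, under the assumption that
truncating at the first '.' covers all the schemes above:
  static void cleanup_symbol_name(char *s)
  {
          char *res;

          /* Compiler-generated suffixes begin with '.' ('$' on
           * clang < 13); truncate the symbol name at the first one. */
          res = strchr(s, '.');
          if (res)
                  *res = '\0';
  }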
Reported-by: KE.LI(Lieke) <like1@oppo.com>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Padmanabha Srinivasaiah <treasure4paddy@gmail.com>
Suggested-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211004162936.21961-1-ndesaulniers@google.com
Since the openat2(2) syscall uses a struct open_how pointer to communicate
its parameters, they are not usefully recorded by the audit SYSCALL
record's four existing arguments.
Add a new audit record type OPENAT2 that reports the parameters in its
third argument, struct open_how with fields oflag, mode and resolve.
The new record in the context of an event would look like:
time->Wed Mar 17 16:28:53 2021
type=PROCTITLE msg=audit(1616012933.531:184): proctitle=
73797363616C6C735F66696C652F6F70656E617432002F746D702F61756469742D
7465737473756974652D737641440066696C652D6F70656E617432
type=PATH msg=audit(1616012933.531:184): item=1 name="file-openat2"
inode=29 dev=00:1f mode=0100600 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE
cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1616012933.531:184):
item=0 name="/root/rgb/git/audit-testsuite/tests"
inode=25 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT
cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1616012933.531:184):
cwd="/root/rgb/git/audit-testsuite/tests"
type=OPENAT2 msg=audit(1616012933.531:184):
oflag=0100302 mode=0600 resolve=0xa
type=SYSCALL msg=audit(1616012933.531:184): arch=c000003e syscall=437
success=yes exit=4 a0=3 a1=7ffe315f1c53 a2=7ffe315f1550 a3=18
items=2 ppid=528 pid=540 auid=0 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="openat2"
exe="/root/rgb/git/audit-testsuite/tests/syscalls_file/openat2"
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key="testsuite-1616012933-bjAUcEPO"
Link: https://lore.kernel.org/r/d23fbb89186754487850367224b060e26f9b7181.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: tweak subject, wrap example, move AUDIT_OPENAT2 to 1337]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Replace uses of mem_encrypt_active() with calls to cc_platform_has() with
the CC_ATTR_MEM_ENCRYPT attribute.
Remove the implementation of mem_encrypt_active() across all arches.
For s390, since the default implementation of the cc_platform_has()
matches the s390 implementation of mem_encrypt_active(), cc_platform_has()
does not need to be implemented in s390 (the config option
ARCH_HAS_CC_PLATFORM is not set).
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-9-bp@alien8.de
Fix the following W=1 kernel build warning:
kernel/printk/printk.c: In function 'printk_sprint':
kernel/printk/printk.c:1913:9: warning: function 'printk_sprint' might be
a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
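The usual remedy, sketched here with a hypothetical helper rather than
the actual patch, is to annotate the function with the kernel's
__printf() attribute so the compiler knows which parameter is the format
string:
  /* Hypothetical helper: 'fmt' is parameter 2; 0 marks va_list-style
   * varargs rather than checked variadic arguments. */
  static __printf(2, 0) size_t demo_sprint(char *text, const char *fmt,
                                           va_list args);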
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210927142203.124730-1-john.ogness@linutronix.de
clang notices that the pi_get_entry() function would use
uninitialized data if it was called with a non-NULL module
pointer on a kernel that does not support modules:
kernel/printk/index.c:32:6: error: variable 'nr_entries' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
if (!mod) {
^~~~
kernel/printk/index.c:38:13: note: uninitialized use occurs here
if (pos >= nr_entries)
^~~~~~~~~~
kernel/printk/index.c:32:2: note: remove the 'if' if its condition is always true
if (!mod) {
Rework the condition to make it clear to the compiler that we are always
in the second case. Unfortunately the #ifdef is still required as the
definition of 'struct module' is hidden when modules are disabled.
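One way to express that, sketched under assumed field and symbol names
(not the verbatim fix):
  unsigned int nr_entries;

  if (mod) {
  #ifdef CONFIG_MODULES
          nr_entries = mod->printk_index_size;    /* field name assumed */
  #endif
  } else {
          /* Built-in vmlinux entries; both branches now assign
           * nr_entries, so the compiler sees no uninitialized path. */
          nr_entries = __stop_printk_index - __start_printk_index;
  }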
Fixes: 3370155737 ("printk: Userspace format indexing support")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210928093456.2438109-1-arnd@kernel.org
Merge tag 'sched_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Tell the compiler to always inline is_percpu_thread()
- Make sure tunable_scaling buffer is null-terminated after an update
in sysfs
- Fix LTP named regression due to cgroup list ordering
* tag 'sched_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Always inline is_percpu_thread()
sched/fair: Null terminate buffer when updating tunable_scaling
sched/fair: Add ancestors of unthrottled undecayed cfs_rq
Merge tag 'perf_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Make sure the destroy callback is reset when a event initialization
fails
- Update the event constraints for Icelake
- Make sure the active time of an event is updated even for inactive
events
* tag 'perf_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: fix userpage->time_enabled of inactive events
perf/x86/intel: Update event constraints for ICX
perf/x86: Reset destroy callback on event init failure
Daniel Borkmann says:
====================
bpf-next 2021-10-02
We've added 85 non-merge commits during the last 15 day(s) which contain
a total of 132 files changed, 13779 insertions(+), 6724 deletions(-).
The main changes are:
1) Massive update on test_bpf.ko coverage for JITs as preparatory work for
an upcoming MIPS eBPF JIT, from Johan Almbladh.
2) Add a batched interface for RX buffer allocation in AF_XDP buffer pool,
with driver support for i40e and ice from Magnus Karlsson.
3) Add legacy uprobe support to libbpf to complement recently merged legacy
kprobe support, from Andrii Nakryiko.
4) Add bpf_trace_vprintk() as variadic printk helper, from Dave Marchevsky.
5) Support saving the register state in verifier when spilling <8byte bounded
scalar to the stack, from Martin Lau.
6) Add libbpf opt-in for stricter BPF program section name handling as part
of libbpf 1.0 effort, from Andrii Nakryiko.
7) Add a document to help clarifying BPF licensing, from Alexei Starovoitov.
8) Fix skel_internal.h to propagate errno if the loader indicates an internal
error, from Kumar Kartikeya Dwivedi.
9) Fix build warnings with -Wcast-function-type so that the option can later
be enabled by default for the kernel, from Kees Cook.
10) Fix libbpf to ignore STT_SECTION symbols in legacy map definitions as it
otherwise errors out when encountering them, from Toke Høiland-Jørgensen.
11) Teach libbpf to recognize specialized maps (such as for perf RB) and
internally remove BTF type IDs when creating them, from Hengqi Chen.
12) Various fixes and improvements to BPF selftests.
====================
Link: https://lore.kernel.org/r/20211002001327.15169-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace audit syscall class magic numbers with macros.
This required putting the macros into a new header file,
include/linux/audit_arch.h, since the syscall macros were
included for both 64 bit and 32 bit in any compat code, causing
redefinition warnings.
Link: https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: renamed header to audit_arch.h after consulting with Richard]
Signed-off-by: Paul Moore <paul@paul-moore.com>
dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
However, pfn_valid() only checks for availability of the memory map for a
PFN but it does not ensure that the PFN is actually backed by RAM.
As dma_map_resource() is the only method in DMA mapping APIs that has this
check, simply drop the pfn_valid() test from dma_map_resource().
Link: https://lore.kernel.org/all/20210824173741.GC623@arm.com/
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210930013039.11260-2-rppt@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Since commit a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to
list on unthrottle") we add cfs_rqs that have no runnable tasks but are
not fully decayed into the load (leaf) list. We may skip adding some
ancestors and thereby break the tmp_alone_branch invariant. This broke
the LTP test cfs_bandwidth01 and it was partially fixed in commit
fdaba61ef8 ("sched/fair: Ensure that the CFS parent is added after
unthrottling").
I noticed the named test still fails even with the fix (but with low
probability, 1 in ~1000 executions of the test). The reason is that when
bailing out of unthrottle_cfs_rq() early, we may miss adding ancestors
of the unthrottled cfs_rq and thus not join tmp_alone_branch properly.
Fix this by adding the ancestors if we notice the unthrottled cfs_rq was
added to the load list, as sketched below.
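A sketch of that fix, assuming the for_each_sched_entity() and
list_add_leaf_cfs_rq() helpers from fair.c (abridged, not the verbatim
patch):
  /* The unthrottled cfs_rq made it onto the leaf list: walk up and add
   * its ancestors too, so tmp_alone_branch is restored. */
  if (cfs_rq->on_list) {
          for_each_sched_entity(se) {
                  if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
                          break;
          }
  }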
Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Odin Ugedal <odin@uged.al>
Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
Users of rdpmc rely on the mmapped user page to calculate accurate
time_enabled. Currently, userpage->time_enabled is only updated when the
event is added to the pmu. As a result, an inactive event (due to counter
multiplexing) does not have an accurate userpage->time_enabled. This can
be reproduced with something like:
/* open 20 task perf_events "cycles", to create multiplexing */
fd = perf_event_open(); /* open task perf_event "cycles" */
userpage = mmap(fd); /* use mmap and rdpmc */
while (true) {
time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
time_enabled_read = read(fd).time_enabled;
if (time_enabled_mmap > time_enabled_read)
BUG();
}
Fix this by updating userpage for inactive events in merge_sched_in.
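A sketch of that update in merge_sched_in(); the condition is simplified
for illustration and should be treated as an assumption rather than the
verbatim patch:
  /* The event lost the multiplexing round: refresh its mmapped user
   * page so userpage->time_enabled stays accurate for rdpmc users. */
  if (event->state == PERF_EVENT_STATE_INACTIVE)
          perf_event_update_userpage(event);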
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-and-tested-by: Lucian Grijincu <lucian@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210929194313.2398474-1-songliubraving@fb.com
The rw_semaphore and rwlock_t implementations both wake the waiter while
holding rt_mutex_base::wait_lock.
This can be optimized by waking the waiter locklessly, outside of the
locked section, to avoid needless contention on the
rt_mutex_base::wait_lock lock.
Extend rt_mutex_wake_q_add() to also accept task and state, and use it in
__rwbase_read_unlock().
Suggested-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928150006.597310-3-bigeasy@linutronix.de
rt_mutex_wake_q_add() needs to distinguish between sleeping
locks (TASK_RTLOCK_WAIT) and normal locks, which use TASK_NORMAL, in
order to use the proper wake mechanism.
Instead of checking for != TASK_NORMAL, make it more robust and check
explicitly for TASK_RTLOCK_WAIT, which is the reason why a different
wake mechanism is used.
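A minimal sketch of the explicit check (the surrounding structure is
assumed, not the verbatim patch):
  /* TASK_RTLOCK_WAIT is exactly the case that needs the RT-specific
   * wake mechanism; everything else takes the regular wake_q path. */
  if (IS_ENABLED(CONFIG_PREEMPT_RT) && w->wake_state == TASK_RTLOCK_WAIT) {
          /* sleeping-lock wakeup */
          ...
  } else {
          wake_q_add(&wqh->head, w->task);
  }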
No functional change.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928150006.597310-2-bigeasy@linutronix.de
The general rule that rcu_read_lock() held sections cannot voluntarily
sleep applies even on RT kernels, though the substitution of spin/rw
locks on RT enabled kernels has to be exempt from that rule. On !RT a
spin_lock() can obviously nest inside an RCU read side critical section
as the lock acquisition is not going to block, but on RT this is no
longer the case due to the 'sleeping' spinlock substitution.
The RT patches contained a cheap hack to ignore the RCU nesting depth in
might_sleep() checks, which was a pragmatic but incorrect workaround.
Instead of generally ignoring the RCU nesting depth in __might_sleep() and
__might_resched() checks, pass the rcu_preempt_depth() via the offsets
argument to __might_resched() from spin/read/write_lock() which makes the
checks work correctly even in RCU read side critical sections.
The actual blocking on such a substituted lock within a RCU read side
critical section is already handled correctly in __schedule() by treating
it as a "preemption" of the RCU read side critical section.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210923165358.368305497@linutronix.de
For !RT kernels the RCU nest depth in __might_resched() is always
expected to be 0, but on RT kernels it can be non-zero while the preempt
count is expected to always be 0.
Instead of playing magic games in interpreting the 'preempt_offset'
argument, rename it to 'offsets' and use the lower 8 bits for the expected
preempt count, allow handing in the expected RCU nest depth in the upper
bits, and adapt the __might_resched() code and related checks and printks.
The affected call sites are updated in subsequent steps.
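A sketch of the encoding; the macro names are assumptions for
illustration:
  /* Lower 8 bits: expected preempt count; upper bits: expected RCU
   * nest depth. */
  #define MIGHT_RESCHED_RCU_SHIFT      8
  #define MIGHT_RESCHED_PREEMPT_MASK   ((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)

  /* Caller side, e.g. from spin/read/write_lock() on RT kernels: */
  __might_resched(file, line, rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT);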
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210923165358.243232823@linutronix.de
might_sleep() output is pretty informative, but can be confusing at
times, especially with PREEMPT_RCU, when the check triggers due to a
voluntary sleep inside an RCU read side critical section:
BUG: sleeping function called from invalid context at kernel/test.c:110
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
Preemption disabled at: migrate_disable+0x33/0xa0
in_atomic() is 0, but the report still says that preemption was disabled
at migrate_disable(), which is completely useless because preemption is
not actually disabled there. But the interesting information needed to
decode the above, i.e. the RCU nesting depth, is not printed.
That becomes even more confusing when might_sleep() is invoked from
cond_resched_lock() within a RCU read side critical section. Here the
expected preemption count is 1 and not 0.
BUG: sleeping function called from invalid context at kernel/test.c:131
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
Preemption disabled at: test_cond_lock+0xf3/0x1c0
So in_atomic() is set, which is expected as the caller holds a spinlock,
but it's unclear why this is broken and the preempt disable IP is just
pointing at the correct place, i.e. spin_lock(), which is obviously not
helpful either.
Make that more useful in general:
- Print preempt_count() and the expected value
and for the CONFIG_PREEMPT_RCU case:
- Print the RCU read side critical section nesting depth
- Print the preempt disable IP only when preempt count
does not have the expected value.
So the might_sleep() dump from within a preemptible RCU read side
critical section becomes:
BUG: sleeping function called from invalid context at kernel/test.c:110
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
and the cond_resched_lock() case becomes:
BUG: sleeping function called from invalid context at kernel/test.c:141
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
preempt_count: 1, expected: 1
RCU nest depth: 1, expected: 0
which makes it pretty obvious what's going on. For all other cases the
preempt disable IP is still printed as before:
BUG: sleeping function called from invalid context at kernel/test.c: 156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffffffff82b48326>] test_might_sleep+0xbe/0xf8
BUG: sleeping function called from invalid context at kernel/test.c: 163
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 0
Preemption disabled at:
[<ffffffff82b48326>] test_might_sleep+0x1e4/0x280
This also prepares to provide a better debugging output for RT enabled
kernels and their spinlock substitutions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210923165358.181022656@linutronix.de
__might_sleep() vs. ___might_sleep() is hard to distinguish. Aside from
that, the three underscore variant is exposed to provide a checkpoint for
rescheduling points which are distinct from blocking points.
They are semantically a preemption point, which means that scheduling is
state preserving. A real blocking operation, e.g. mutex_lock() or
wait*(), cannot preserve a task state which is not equal to RUNNING.
While technically blocking on a "sleeping" spinlock in RT enabled kernels
falls into the voluntary scheduling category because it has to wait until
the contended spin/rw lock becomes available, the RT lock substitution code
can semantically be mapped to a voluntary preemption because the RT lock
substitution code and the scheduler are providing mechanisms to preserve
the task state and to take regular non-lock related wakeups into account.
Rename ___might_sleep() to __might_resched() to make the distinction of
these functions clear.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210923165357.928693482@linutronix.de
Clang warns:
kernel/locking/test-ww_mutex.c:138:7: error: variable 'ret' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (!ww_mutex_trylock(&mutex, &ctx)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/locking/test-ww_mutex.c:172:9: note: uninitialized use occurs here
return ret;
^~~
kernel/locking/test-ww_mutex.c:138:3: note: remove the 'if' if its condition is always false
if (!ww_mutex_trylock(&mutex, &ctx)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/locking/test-ww_mutex.c:125:9: note: initialize the variable 'ret' to silence this warning
int ret;
^
= 0
1 error generated.
Assign !ww_mutex_trylock(...) to ret so that it is always initialized.
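Sketched, with the surrounding test body abridged, the change looks
like this:
  /* Assign the trylock result so 'ret' is initialized on every path. */
  ret = !ww_mutex_trylock(&mutex, &ctx);
  if (ret) {
          ...
  }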
Fixes: 12235da8c8 ("kernel/locking: Add context to ww_mutex_trylock()")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20210922145822.3935141-1-nathan@kernel.org
On x86, the fake return address on the stack saved by
__kretprobe_trampoline() will be replaced with the real return
address after returning from trampoline_handler(). Before the return
address is fixed up, the real return address can be found in
'current->kretprobe_instances'.
However, since there is a window between updating
'current->kretprobe_instances' and fixing the address on the stack,
if an interrupt happens at that timing and the interrupt handler
does a stacktrace, it may fail to unwind because it can not get
the correct return address from 'current->kretprobe_instances'.
Eliminate that window by fixing the return address
right before updating 'current->kretprobe_instances'.
Link: https://lkml.kernel.org/r/163163057094.489837.9044470370440745866.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
ftrace shows the "[unknown/kretprobe'd]" indicator for all addresses in
the kretprobe_trampoline, but the address modified by kretprobe should
only be kretprobe_trampoline+0.
Link: https://lkml.kernel.org/r/163163056044.489837.794883849706638013.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since the ORC unwinder requires regs->ip to be set up correctly when
unwinding from pt_regs, set the correct return address in regs->ip
before calling the user kretprobe handler.
This allows the kretprobe handler to trace the stack from the
kretprobe's pt_regs with stack_trace_save_regs() (eBPF will do
this), instead of stack tracing from the handler context with
stack_trace_save() (ftrace will do this).
Link: https://lkml.kernel.org/r/163163053237.489837.4272653874525136832.stgit@devnote2
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Introduce kretprobe_find_ret_addr() and is_kretprobe_trampoline().
These APIs will be used by the ORC stack unwinder and ftrace, so that
they can check whether the given address points to kretprobe trampoline
code and query the correct return address in that case.
Link: https://lkml.kernel.org/r/163163046461.489837.1044778356430293962.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since there is now kretprobe_trampoline_addr() for referring to the
address of the kretprobe trampoline code, we don't need to access
kretprobe_trampoline directly.
Make it harder to refer to directly by renaming it to __kretprobe_trampoline().
Link: https://lkml.kernel.org/r/163163045446.489837.14510577516938803097.stgit@devnote2
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The __kretprobe_trampoline_handler() callback, called from low level
arch kprobes methods, has the 'trampoline_address' parameter, which is
entirely superfluous as it basically just replicates:
dereference_kernel_function_descriptor(kretprobe_trampoline)
In fact we had bugs in arch code where it wasn't replicated correctly.
So remove this superfluous parameter and use kretprobe_trampoline_addr()
instead.
Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
~15 years ago kprobes grew the 'arch_deref_entry_point()' __weak function:
3d7e33825d: ("jprobes: make jprobes a little safer for users")
But this is just open-coded dereference_symbol_descriptor() in essence, and
its obscure nature was causing bugs.
Just use the real thing and remove arch_deref_entry_point().
Link: https://lkml.kernel.org/r/163163043630.489837.7924988885652708696.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use the 'bool' type instead of 'int' for the functions which
return a boolean value, because this makes it clear that those
functions don't return any error code.
Link: https://lkml.kernel.org/r/163163041649.489837.17311187321419747536.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since get_optimized_kprobe() is only used inside kprobes,
it doesn't need to use the 'unsigned long' type for its 'addr' parameter.
Make it use 'kprobe_opcode_t *' for the 'addr' parameter, and the
subsequent call of arch_within_optimized_kprobe() should also use
'kprobe_opcode_t *'.
Note that MAX_OPTIMIZED_LENGTH and RELATIVEJUMP_SIZE are defined
by byte-size, but the size of 'kprobe_opcode_t' depends on the
architecture. Therefore, we must be careful when calculating
addresses using those macros.
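A hedged illustration of the required care; the function below is
hypothetical, for illustration only, not the exact kernel code:
  /* MAX_OPTIMIZED_LENGTH is in bytes, but 'addr' is a kprobe_opcode_t
   * pointer, so convert the byte size to an element count first. */
  static bool within_optimized_region(struct optimized_kprobe *op,
                                      kprobe_opcode_t *addr)
  {
          return op->kp.addr <= addr &&
                 addr < op->kp.addr +
                        MAX_OPTIMIZED_LENGTH / sizeof(kprobe_opcode_t);
  }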
Link: https://lkml.kernel.org/r/163163040680.489837.12133032364499833736.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix coding style issues reported by checkpatch.pl and update
comments to quote variable names and add "()" to function
names.
One TODO comment in __disarm_kprobe() is removed because
it has been addressed by a subsequent commit.
Link: https://lkml.kernel.org/r/163163037468.489837.4282347782492003960.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This cleans up the error/notification messages in kprobes-related code.
Basically this defines a 'pr_fmt()' macro for each file and updates
the messages to describe:
- what happened,
- what the kernel is going to do or not do,
- whether the kernel is fine,
- what the user can do about it.
Also, if a message is not needed (e.g. the function returns a unique
error code, or another error message is already shown), remove it,
and replace the message with WARN_*() macros where suitable.
Link: https://lkml.kernel.org/r/163163036568.489837.14085396178727185469.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
arch_check_ftrace_location() was introduced as a weak function in
commit f7f242ff00 ("kprobes: introduce weak
arch_check_ftrace_location() helper function") to allow architectures
to handle kprobes call site on their own.
Recently, the only architecture (csky) to implement
arch_check_ftrace_location() was migrated to using the common
version.
As a result, further clean up the code to drop the weak attribute and
rename the function to remove the architecture specific
implementation.
Link: https://lkml.kernel.org/r/163163035673.489837.2367816318195254104.stgit@devnote2
Signed-off-by: Punit Agrawal <punitagrawal@gmail.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function prepare_kprobe() is called during kprobe registration and
is responsible for ensuring any architecture related preparation for
the kprobe is done before returning.
One of two versions of prepare_kprobe() is chosen depending on the
availability of KPROBE_ON_FTRACE in the kernel configuration.
Simplify the code by dropping the version when KPROBE_ON_FTRACE is not
selected - instead relying on kprobe_ftrace() to return false when
KPROBE_ON_FTRACE is not set.
No functional change.
Link: https://lkml.kernel.org/r/163163033696.489837.9264661820279300788.stgit@devnote2
Signed-off-by: Punit Agrawal <punitagrawal@gmail.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The "enabled" file provides a debugfs interface to arm / disarm
kprobes in the kernel. In order to parse the buffer containing the
values written from userspace, the callback manually parses the user
input to convert it to a boolean value.
As taking a string value from userspace and converting it to boolean
is a common operation, a helper kstrtobool_from_user() is already
available in the kernel. Update the callback to use the common helper
to parse the write buffer from userspace.
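A sketch of the simplified callback using the common helper (the
function name is assumed and the enable/disable plumbing is abridged):
  static ssize_t write_enabled_file_bool(struct file *file,
                  const char __user *user_buf, size_t count, loff_t *ppos)
  {
          bool enable;
          int ret;

          /* Parse "0"/"1"/"y"/"n"/... from userspace in one call. */
          ret = kstrtobool_from_user(user_buf, count, &enable);
          if (ret)
                  return ret;
          ...
          return count;
  }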
Link: https://lkml.kernel.org/r/163163032637.489837.10678039554832855327.stgit@devnote2
Signed-off-by: Punit Agrawal <punitagrawal@gmail.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
debugfs_create_file() takes a pointer argument that can be used during
file operation callbacks (accessible via i_private in the inode
structure). An obvious requirement is for the pointer to refer to
valid memory when used.
When creating the debugfs file to dynamically enable / disable
kprobes, a pointer to a local variable is passed to
debugfs_create_file(); it will go out of scope when the init
function returns. The reason this hasn't triggered random memory
corruption is that the pointer is not accessed during the debugfs
file callbacks.
Since the enabled state is managed by the kprobes_all_disabled
global variable, the local variable is not needed. Fix the incorrect
(and unnecessary) usage of a local variable during debugfs_create_file()
by passing NULL instead.
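Sketched, the call becomes the following (the mode and fops names are
assumptions):
  /* State lives in a global flag, not in i_private, so no data
   * pointer is needed. */
  debugfs_create_file("enabled", 0600, dir, NULL, &fops_kp);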
Link: https://lkml.kernel.org/r/163163031686.489837.4476867635937014973.stgit@devnote2
Fixes: bf8f6e5b3e ("Kprobes: The ON/OFF knob thru debugfs")
Signed-off-by: Punit Agrawal <punitagrawal@gmail.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from mac80211, netfilter and bpf.
Current release - regressions:
- bpf, cgroup: assign cgroup in cgroup_sk_alloc when called from
interrupt
- mdio: revert mechanical patches which broke handling of optional
resources
- dev_addr_list: prevent address duplication
Previous releases - regressions:
- sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb
(NULL deref)
- Revert "mac80211: do not use low data rates for data frames with no
ack flag", fixing broadcast transmissions
- mac80211: fix use-after-free in CCMP/GCMP RX
- netfilter: include zone id in tuple hash again, minimize collisions
- netfilter: nf_tables: unlink table before deleting it (race -> UAF)
- netfilter: log: work around missing softdep backend module
- mptcp: don't return sockets in foreign netns
- sched: flower: protect fl_walk() with rcu (race -> UAF)
- ixgbe: fix NULL pointer dereference in ixgbe_xdp_setup
- smsc95xx: fix stalled rx after link change
- enetc: fix the incorrect clearing of IF_MODE bits
- ipv4: fix rtnexthop len when RTA_FLOW is present
- dsa: mv88e6xxx: 6161: use correct MAX MTU config method for this
SKU
- e100: fix length calculation & buffer overrun in ethtool::get_regs
Previous releases - always broken:
- mac80211: fix using stale frag_tail skb pointer in A-MSDU tx
- mac80211: drop frames from invalid MAC address in ad-hoc mode
- af_unix: fix races in sk_peer_pid and sk_peer_cred accesses (race
-> UAF)
- bpf, x86: Fix bpf mapping of atomic fetch implementation
- bpf: handle return value of BPF_PROG_TYPE_STRUCT_OPS prog
- netfilter: ip6_tables: zero-initialize fragment offset
- mhi: fix error path in mhi_net_newlink
- af_unix: return errno instead of NULL in unix_create1() when over
the fs.file-max limit
Misc:
- bpf: exempt CAP_BPF from checks against bpf_jit_limit
- netfilter: conntrack: make max chain length random, prevent
guessing buckets by attackers
- netfilter: nf_nat_masquerade: make async masq_inet6_event handling
generic, defer conntrack walk to work queue (prevent hogging RTNL
lock)"
* tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
net: stmmac: fix EEE init issue when paired with EEE capable PHYs
net: dev_addr_list: handle first address in __hw_addr_add_ex
net: sched: flower: protect fl_walk() with rcu
net: introduce and use lock_sock_fast_nested()
net: phy: bcm7xxx: Fixed indirect MMD operations
net: hns3: disable firmware compatible features when uninstall PF
net: hns3: fix always enable rx vlan filter problem after selftest
net: hns3: PF enable promisc for VF when mac table is overflow
net: hns3: fix show wrong state when add existing uc mac address
net: hns3: fix mixed flag HCLGE_FLAG_MQPRIO_ENABLE and HCLGE_FLAG_DCB_ENABLE
net: hns3: don't rollback when destroy mqprio fail
net: hns3: remove tc enable checking
net: hns3: do not allow call hns3_nic_net_open repeatedly
ixgbe: Fix NULL pointer dereference in ixgbe_xdp_setup
net: bridge: mcast: Associate the seqcount with its protecting lock.
net: mdio-ipq4019: Fix the error for an optional regs resource
net: hns3: fix hclge_dbg_dump_tm_pg() stack usage
net: mdio: mscc-miim: Fix the mdio controller
af_unix: Return errno instead of NULL in unix_create1().
...
In prealloc_elems_and_freelist(), the multiplication used to calculate
the size passed to bpf_map_area_alloc() could lead to an integer overflow.
As a result, an out-of-bounds write could occur in pcpu_freelist_populate(),
as reported by KASAN:
[...]
[ 16.968613] BUG: KASAN: slab-out-of-bounds in pcpu_freelist_populate+0xd9/0x100
[ 16.969408] Write of size 8 at addr ffff888104fc6ea0 by task crash/78
[ 16.970038]
[ 16.970195] CPU: 0 PID: 78 Comm: crash Not tainted 5.15.0-rc2+ #1
[ 16.970878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 16.972026] Call Trace:
[ 16.972306] dump_stack_lvl+0x34/0x44
[ 16.972687] print_address_description.constprop.0+0x21/0x140
[ 16.973297] ? pcpu_freelist_populate+0xd9/0x100
[ 16.973777] ? pcpu_freelist_populate+0xd9/0x100
[ 16.974257] kasan_report.cold+0x7f/0x11b
[ 16.974681] ? pcpu_freelist_populate+0xd9/0x100
[ 16.975190] pcpu_freelist_populate+0xd9/0x100
[ 16.975669] stack_map_alloc+0x209/0x2a0
[ 16.976106] __sys_bpf+0xd83/0x2ce0
[...]
The possibility of this overflow was originally discussed in [0], but
was overlooked.
Fix the integer overflow by changing elem_size to u64 from u32.
[0] https://lore.kernel.org/bpf/728b238e-a481-eb50-98e9-b0f430ab01e7@gmail.com/
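A sketch of the type change (abridged; the struct and field names follow
the text's description of kernel/bpf/stackmap.c and should be treated as
assumptions):
  /* 64-bit elem_size: the multiplication below can no longer wrap. */
  u64 elem_size = sizeof(struct stack_map_bucket) +
                  (u64)smap->map.value_size;

  smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
                                   smap->map.numa_node);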
Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210930135545.173698-1-th.yasumatsu@gmail.com