linux

Author	SHA1	Message	Date
David S. Miller	2865ba8247	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2021-09-14 The following pull-request contains BPF updates for your net tree. We've added 7 non-merge commits during the last 13 day(s) which contain a total of 18 files changed, 334 insertions(+), 193 deletions(-). The main changes are: 1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song. 2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann. 3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui. 4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker. 5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-09-14 13:09:54 +01:00
Marco Elver	ac20e39e8d	kcsan: selftest: Cleanup and add missing __init Make test_encode_decode() more readable and add missing __init. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:20 -07:00
Marco Elver	78c3d954e2	kcsan: Move ctx to start of argument list It is clearer if ctx is at the start of the function argument list; it'll be more consistent when adding functions with varying arguments but all requiring ctx. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	d627c537c2	kcsan: Support reporting scoped read-write access type Support generating the string representation of scoped read-write accesses for completeness. They will become required in planned changes. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	6c65eb7568	kcsan: Start stack trace with explicit location if provided If an explicit access address is set, as is done for scoped accesses, always start the stack trace from that location. get_stack_skipnr() is changed into sanitize_stack_entries(), which if given an address, scans the stack trace for a matching function and then replaces that entry with the explicitly provided address. The previous reports for scoped accesses were all over the place, which could be quite confusing. We now always point at the start of the scope. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	f4c87dbbef	kcsan: Save instruction pointer for scoped accesses Save the instruction pointer for scoped accesses, so that it becomes possible for the reporting code to construct more accurate stack traces that will show the start of the scope. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	55a55fec50	kcsan: Add ability to pass instruction pointer of access to reporting Add the ability to pass an explicitly set instruction pointer of access from check_access() all the way through to reporting. In preparation of using it in reporting. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	ade3a58b2d	kcsan: test: Fix flaky test case If CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n, then we may also see data races between the writers only. If we get unlucky and never capture a read-write data race, but only the write-write data races, then the test_no_value_change* test cases may incorrectly fail. The second problem is that the initial value needs to be reset, as otherwise we might actually observe a value change at the start. Fix it by also looking for the write-write data races, and resetting the value to what will be written. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	8080428410	kcsan: test: Use kunit_skip() to skip tests Use the new kunit_skip() to skip tests if requirements were not met. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Marco Elver	e80704272f	kcsan: test: Defer kcsan_test_init() after kunit initialization When the test is built into the kernel (not a module), kcsan_test_init() and kunit_init() both use late_initcall(), which means kcsan_test_init() might see a NULL debugfs_rootdir as parent dentry, resulting in kcsan_test_init() and kcsan_debugfs_init() both trying to create a debugfs node named "kcsan" in debugfs root. One of them will show an error and be unsuccessful. Defer kcsan_test_init() until we're sure kunit was initialized. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:41:19 -07:00
Scott Wood	71921a9606	rcutorture: Avoid problematic critical section nesting on PREEMPT_RT rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT. For example: preempt_disable(); rcu_read_lock_bh(); preempt_enable(); rcu_read_unlock_bh(); The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking and continue with then with disabling preemption or interrupts. In the unlocking do it reverse by first enabling interrupts and preemption and BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:16 -07:00
Paul E. McKenney	fd13fe16db	rcutorture: Don't cpuhp_remove_state() if cpuhp_setup_state() failed Currently, in CONFIG_RCU_BOOST kernels, if the rcu_torture_init() function's call to cpuhp_setup_state() fails, rcu_torture_cleanup() gamely passes nonsense to cpuhp_remove_state(). This results in strange and misleading splats. This commit therefore ensures that if the rcu_torture_init() function's call to cpuhp_setup_state() fails, rcu_torture_cleanup() avoids invoking cpuhp_remove_state(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:16 -07:00
Paul E. McKenney	eb77abfdee	rcuscale: Warn on individual rcu_scale_init() error conditions When running rcuscale as a module, any rcu_scale_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running rcuscale built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing rcu_scale_init() errors when running rcuscale built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:16 -07:00
Paul E. McKenney	ed60ad733a	refscale: Warn on individual ref_scale_init() error conditions When running refscale as a module, any ref_scale_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running refscale built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing ref_scale_init() errors when running refscale built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:16 -07:00
Paul E. McKenney	b3b3cc618e	locktorture: Warn on individual lock_torture_init() error conditions When running locktorture as a module, any lock_torture_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running locktorture built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing lock_torture_init() errors when running locktorture built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:16 -07:00
Paul E. McKenney	efeff6b39b	rcutorture: Warn on individual rcu_torture_init() error conditions When running rcutorture as a module, any rcu_torture_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running rcutorture built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init() errors when running rcutorture built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:16 -07:00
Paul E. McKenney	fda84866b1	rcutorture: Suppressing read-exit testing is not an error Currently, specifying the rcutorture.read_exit_burst=0 kernel boot parameter will result in a -EINVAL exit code that will stop the rcutorture test run before it has fully initialized. This commit therefore uses a zero exit code in that case, thus allowing rcutorture.read_exit_burst=0 to complete normally. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:36:15 -07:00
Daniel Borkmann	8520e224f5	bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used. Back in the days, commit `bd1060a1d6` ("sock, cgroup: add sock->sk_cgroup") embedded per-socket cgroup information into sock->sk_cgrp_data and in order to save 8 bytes in struct sock made both mutually exclusive, that is, when cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2 falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp). The assumption made was "there is no reason to mix the two and this is in line with how legacy and v2 compatibility is handled" as stated in `bd1060a1d6`. However, with Kubernetes more widely supporting cgroups v2 as well nowadays, this assumption no longer holds, and the possibility of the v1/v2 mixed mode with the v2 root fallback being hit becomes a real security issue. Many of the cgroup v2 BPF programs are also used for policy enforcement, just to pick _one_ example, that is, to programmatically deny socket related system calls like connect(2) or bind(2). A v2 root fallback would implicitly cause a policy bypass for the affected Pods. In production environments, we have recently seen this case due to various circumstances: i) a different 3rd party agent and/or ii) a container runtime such as [0] in the user's environment configuring legacy cgroup v1 net_cls tags, which triggered implicitly mentioned root fallback. Another case is Kubernetes projects like kind [1] which create Kubernetes nodes in a container and also add cgroup namespaces to the mix, meaning programs which are attached to the cgroup v2 root of the cgroup namespace get attached to a non-root cgroup v2 path from init namespace point of view. And the latter's root is out of reach for agents on a kind Kubernetes node to configure. Meaning, any entity on the node setting cgroup v1 net_cls tag will trigger the bypass despite cgroup v2 BPF programs attached to the namespace root. Generally, this mutual exclusiveness does not hold anymore in today's user environments and makes cgroup v2 usage from BPF side fragile and unreliable. This fix adds proper struct cgroup pointer for the cgroup v2 case to struct sock_cgroup_data in order to address these issues; this implicitly also fixes the tradeoffs being made back then with regards to races and refcount leaks as stated in `bd1060a1d6`, and removes the fallback, so that cgroup v2 BPF programs always operate as expected. [0] https://github.com/nestybox/sysbox/ [1] https://kind.sigs.k8s.io/ Fixes: `bd1060a1d6` ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net	2021-09-13 16:35:58 -07:00
Paul E. McKenney	cbe0d8d914	rcu-tasks: Wait for trc_read_check_handler() IPIs Currently, RCU Tasks Trace initializes the trc_n_readers_need_end counter to the value one, increments it before each trc_read_check_handler() IPI, then decrements it within trc_read_check_handler() if the target task was in a quiescent state (or if the target task moved to some other CPU while the IPI was in flight), complaining if the new value was zero. The rationale for complaining is that the initial value of one must be decremented away before zero can be reached, and this decrement has not yet happened. Except that trc_read_check_handler() is initiated with an asynchronous smp_call_function_single(), which might be significantly delayed. This can result in false-positive complaints about the counter reaching zero. This commit therefore waits for in-flight IPI handlers to complete before decrementing away the initial value of one from the trc_n_readers_need_end counter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:35:08 -07:00
Neeraj Upadhyay	f0b2b2df54	rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() The sync_sched_exp_online_cleanup() checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. Before sending IPI, it checks whether expedited qs need has been already requested for the incoming CPU, by checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which sync_sched_exp_online_cleanup() is running. This works for the case where incoming CPU is same as self. However, for the case where incoming CPU is different from self, expedited request won't get marked, which can potentially delay reporting of expedited quiescent state for the incoming CPU. Fixes: `e015a34112` ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Juri Lelli	1eac0075eb	rcu: Make rcu update module parameters world-readable rcu update module parameters currently don't appear in sysfs and this is a serviceability issue as it might be needed to access their default values at runtime. Fix this issue by changing rcu update module parameters permissions to world-readable. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Juri Lelli	ebb6d30d9e	rcu: Make rcu_normal_after_boot writable again Certain configurations (e.g., systems that make heavy use of netns) need to use synchronize_rcu_expedited() to service RCU grace periods even after boot. Even though synchronize_rcu_expedited() has been traditionally considered harmful for RT for the heavy use of IPIs, it is perfectly usable under certain conditions (e.g. nohz_full). Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_ FULL is defined), but keep its default value to 1 (enabled) to avoid regressions. Users who need synchronize_rcu_expedited() will boot with rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline. Reflect the change in synchronize_rcu_expedited_wait() by removing the WARN related to CONFIG_PREEMPT_RT. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Paul E. McKenney	4aa846f97c	rcu: Make rcutree_dying_cpu() use its "cpu" parameter The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu() ignores it in favor of this_cpu_ptr(). This works at the moment, but it would be better to be consistent. This might also work better given some possible future changes. This commit therefore uses per_cpu_ptr() to avoid ignoring the rcutree_dying_cpu() function's argument. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Paul E. McKenney	768f5d50e6	rcu: Simplify rcu_report_dead() call to rcu_report_exp_rdp() Currently, rcu_report_dead() disables preemption across its call to rcu_report_exp_rdp(), but this is pointless because interrupts are already disabled by the caller. In addition, rcu_report_dead() computes the address of the outgoing CPU's rcu_data structure, which is also pointless because this address is already present in local variable rdp. This commit therefore drops the preemption disabling and passes rdp to rcu_report_exp_rdp(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Paul E. McKenney	2caebefb00	rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting() The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks counter of an incoming CPU when required. It is currently invoked from rcutree_prepare_cpu(), which runs before the incoming CPU is running, and thus on some other CPU. This makes the per-CPU accesses in rcu_dynticks_eqs_online() iffy at best, and it all "works" only because the running CPU cannot possibly be in dyntick-idle mode, which means that rcu_dynticks_eqs_online() never has any effect. It is currently OK for rcu_dynticks_eqs_online() to have no effect, but only because the CPU-offline process just happens to leave ->dynticks in the correct state. After all, if ->dynticks were in the wrong state on a just-onlined CPU, rcutorture would complain bitterly the next time that CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y, for example, those built by rcutorture scenario TREE04. One could argue that this means that rcu_dynticks_eqs_online() is unnecessary, however, removing it would make the CPU-online process vulnerable to slight changes in the CPU-offline process. One could also ask why it is safe to move the rcu_dynticks_eqs_online() call so late in the CPU-online process. Indeed, there was a time when it would not have been safe, which does much to explain its current location. However, the marking of a CPU as online from an RCU perspective has long since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all that is required is that ->dynticks be set correctly by the time that the CPU is marked as online from an RCU perspective. After all, the RCU grace-period kthread does not check to see if offline CPUs are also idle. (In case you were curious, this is one reason why there is quiescent-state reporting as part of the offlining process.) This commit therefore moves the call to rcu_dynticks_eqs_online() from rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed to be running on the incoming CPU. The call to this function must of course be placed before this rcu_cpu_starting() announces this CPU's presence to RCU. Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Paul E. McKenney	ebc88ad491	rcu: Comment rcu_gp_init() code waiting for CPU-hotplug operations Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits for CPU-hotplug operations that might have started before the new grace period did. This commit adds a comment explaining that this wait does not exclude CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Zhouyi Zhou	3ac8587852	rcu: Fix undefined Kconfig macros Invoking scripts/checkkconfigsymbols.py in the Linux-kernel source tree located the following issues: 1. TREE_PREEMPT_RCU Referencing files: arch/sh/configs/sdk7786_defconfig It should now be CONFIG_PREEMPT_RCU. Except that the CONFIG_PREEMPT=y in that same file implies CONFIG_PREEMPT_RCU=y. Therefore, delete the CONFIG_TREE_PREEMPT_RCU=y line. The reason is as follows: In kernel/rcu/Kconfig, we have config PREEMPT_RCU bool default y if PREEMPTION https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt says, "The default value is only assigned to the config symbol if no other value was set by the user (via the input prompt above)." there is no prompt in config PREEMPT_RCU entry, so we are guaranteed to get CONFIG_PREEMPT_RCU=y when CONFIG_PREEMPT is present. 2. RCU_CPU_STALL_INFO Referencing files: arch/xtensa/configs/nommu_kc705_defconfig The old Kconfig option RCU_CPU_STALL_INFO was removed by commit `75c27f119b` ("rcu: Remove CONFIG_RCU_CPU_STALL_INFO"), and the kernel now acts as if this Kconfig option was unconditionally enabled. 3. RCU_NOCB_CPU_ALL Referencing files: Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst This is an old snapshot of the code. I update this from the real rcu_prepare_for_idle() function in kernel/rcu/tree_plugin.h. This change was tested by invoking "make htmldocs". 4. RCU_TORTURE_TESTS Referencing files: kernel/rcu/rcutorture.c Forward-progress checking conflicts with CPU-stall testing, so we should complain at "modprobe rcutorture" when both are enabled. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Paul E. McKenney	9424b867a7	rcu: Eliminate rcu_implicit_dynticks_qs() local variable ruqp The rcu_implicit_dynticks_qs() function's local variable ruqp references the ->rcu_urgent_qs field in the rcu_data structure referenced by the function parameter rdp, with a rather odd method for computing the pointer to this field. This commit therefore simplifies things and saves a couple of lines of code by replacing each instance of ruqp with &rdp->need_heavy_qs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:46 -07:00
Paul E. McKenney	88ee23ef1c	rcu: Eliminate rcu_implicit_dynticks_qs() local variable rnhqp The rcu_implicit_dynticks_qs() function's local variable rnhqp references the ->rcu_need_heavy_qs field in the rcu_data structure referenced by the function parameter rdp, with a rather odd method for computing the pointer to this field. This commit therefore simplifies things and saves a few lines of code by replacing each instance of rnhqp with &rdp->need_heavy_qs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:45 -07:00
Paul E. McKenney	52b030aa27	rcu-nocb: Fix a couple of tree_nocb code-style nits This commit removes a non-value-returning "return" statement at the end of __call_rcu_nocb_wake() and adds a blank line following declarations in nocb_cb_can_run(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:45 -07:00
Paul E. McKenney	2431774f04	rcu: Mark accesses to rcu_state.n_force_qs This commit marks accesses to the rcu_state.n_force_qs. These data races are hard to make happen, but syzkaller was equal to the task. Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com Acked-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-09-13 16:32:45 -07:00
Bixuan Cui	0e6491b559	bpf: Add oversize check before call kvcalloc() Commit `7661809d49` ("mm: don't allow oversized kvmalloc() calls") add the oversize check. When the allocation is larger than what kmalloc() supports, the following warning triggered: WARNING: CPU: 0 PID: 8408 at mm/util.c:597 kvmalloc_node+0x108/0x110 mm/util.c:597 Modules linked in: CPU: 0 PID: 8408 Comm: syz-executor221 Not tainted 5.14.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:kvmalloc_node+0x108/0x110 mm/util.c:597 Call Trace: kvmalloc include/linux/mm.h:806 [inline] kvmalloc_array include/linux/mm.h:824 [inline] kvcalloc include/linux/mm.h:829 [inline] check_btf_line kernel/bpf/verifier.c:9925 [inline] check_btf_info kernel/bpf/verifier.c:10049 [inline] bpf_check+0xd634/0x150d0 kernel/bpf/verifier.c:13759 bpf_prog_load kernel/bpf/syscall.c:2301 [inline] __sys_bpf+0x11181/0x126e0 kernel/bpf/syscall.c:4587 __do_sys_bpf kernel/bpf/syscall.c:4691 [inline] __se_sys_bpf kernel/bpf/syscall.c:4689 [inline] __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:4689 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: syzbot+f3e749d4c662818ae439@syzkaller.appspotmail.com Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210911005557.45518-1-cuibixuan@huawei.com	2021-09-13 16:28:15 -07:00
Andrii Nakryiko	2f38304127	libbpf: Make libbpf_version.h non-auto-generated Turn previously auto-generated libbpf_version.h header into a normal header file. This prevents various tricky Makefile integration issues, simplifies the overall build process, but also allows to further extend it with some more versioning-related APIs in the future. To prevent accidental out-of-sync versions as defined by libbpf.map and libbpf_version.h, Makefile checks their consistency at build time. Simultaneously with this change bump libbpf.map to v0.6. Also undo adding libbpf's output directory into include path for kernel/bpf/preload, bpftool, and resolve_btfids, which is not necessary because libbpf_version.h is just a normal header like any other. Fixes: `0b46b75505` ("libbpf: Add LIBBPF_DEPRECATED_SINCE macro for scheduling API deprecations") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210913222309.3220849-1-andrii@kernel.org	2021-09-13 15:36:47 -07:00
Christophe Leroy	57d4374be9	audit: rename struct node to struct audit_node to prevent future name collisions Future work in the powerpc code results in a name collision with the identified "node" as struct node defined in kernel/audit_tree.c conflicts with struct node defined in include/linux/node.h (below). This patch takes the proactive route and renames the audit code's struct node to audit_node. CC kernel/audit_tree.o kernel/audit_tree.c:33:9: error: redefinition of 'struct node' 33 \| struct node { \| ^~~~ In file included from ./include/linux/cpu.h:17, from ./include/linux/static_call.h:102, from ./arch/powerpc/include/asm/machdep.h:10, from ./arch/powerpc/include/asm/archrandom.h:7, from ./include/linux/random.h:121, from ./include/linux/net.h:18, from ./include/linux/skbuff.h:26, from kernel/audit.h:11, from kernel/audit_tree.c:2: ./include/linux/node.h:84:8: note: originally defined here 84 \| struct node { \| ^~~~ make[2]: *** [kernel/audit_tree.o] Error 1 Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: rewrite subj/desc as the build failure is just a RFC patch] Signed-off-by: Paul Moore <paul@paul-moore.com>	2021-09-13 16:17:30 -04:00
Waiman Long	b94f9ac79a	cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem Since commit `1243dc518c` ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still reference cpuset_mutex which are now incorrect. Change all the references of cpuset_mutex to cpuset_rwsem. Fixes: `1243dc518c` ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2021-09-13 08:06:17 -10:00
Song Liu	856c02dbce	bpf: Introduce helper bpf_get_branch_snapshot Introduce bpf_get_branch_snapshot(), which allows tracing pogram to get branch trace from hardware (e.g. Intel LBR). To use the feature, the user need to create perf_event with proper branch_record filtering on each cpu, and then calls bpf_get_branch_snapshot in the bpf function. On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210910183352.3151445-3-songliubraving@fb.com	2021-09-13 10:53:50 -07:00
Song Liu	c22ac2a3d4	perf: Enable branch record for software events The typical way to access branch record (e.g. Intel LBR) is via hardware perf_event. For CPUs with FREEZE_LBRS_ON_PMI support, PMI could capture reliable LBR. On the other hand, LBR could also be useful in non-PMI scenario. For example, in kretprobe or bpf fexit program, LBR could provide a lot of information on what happened with the function. Add API to use branch record for software use. Note that, when the software event triggers, it is necessary to stop the branch record hardware asap. Therefore, static_call is used to remove some branch instructions in this process. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/20210910183352.3151445-2-songliubraving@fb.com	2021-09-13 10:53:50 -07:00
Hamza Mahfooz	510e1a724a	dma-debug: prevent an error message from causing runtime problems For some drivers, that use the DMA API. This error message can be reached several millions of times per second, causing spam to the kernel's printk buffer and bringing the CPU usage up to 100% (so, it should be rate limited). However, since there is at least one driver that is in the mainline and suffers from the error condition, it is more useful to err_printk() here instead of just rate limiting the error message (in hopes that it will make it easier for other drivers that suffer from this issue to be spotted). Link: https://lkml.kernel.org/r/fd67fbac-64bf-f0ea-01e1-5938ccfab9d0@arm.com Reported-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-09-13 18:29:24 +02:00
Linus Torvalds	56c244382f	Merge tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the idle timer expires in hardirq context, on PREEMPT_RT - Make sure the run-queue balance callback is invoked only on the outgoing CPU * tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Prevent balance_push() on remote runqueues sched/idle: Make the idle timer expire in hard interrupt context	2021-09-12 11:37:41 -07:00
Linus Torvalds	165d05d88c	Merge tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Borislav Petkov: - Fix the futex PI requeue machinery to not return to userspace in inconsistent state - Avoid a potential null pointer dereference in the ww_mutex deadlock check - Other smaller cleanups and optimizations * tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Fix ww_mutex deadlock check futex: Remove unused variable 'vpid' in futex_proxy_trylock_atomic() futex: Avoid redundant task lookup futex: Clarify comment for requeue_pi_wake_futex() futex: Prevent inconsistent state and exit race futex: Return error code instead of assigning it without effect locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT	2021-09-12 11:27:05 -07:00
Linus Torvalds	ce4c8f8820	Merge tag 'trace-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Minor fixes to the processing of the bootconfig tree" * tag 'trace-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: Rename xbc_node_find_child() to xbc_node_find_subkey() tracing/boot: Fix to check the histogram control param is a leaf node tracing/boot: Fix trace_boot_hist_add_array() to check array is value	2021-09-11 10:16:30 -07:00
Yonghong Song	2f1aaf3ea6	bpf, mm: Fix lockdep warning triggered by stack_map_get_build_id_offset() Currently the bpf selftest "get_stack_raw_tp" triggered the warning: [ 1411.304463] WARNING: CPU: 3 PID: 140 at include/linux/mmap_lock.h:164 find_vma+0x47/0xa0 [ 1411.304469] Modules linked in: bpf_testmod(O) [last unloaded: bpf_testmod] [ 1411.304476] CPU: 3 PID: 140 Comm: systemd-journal Tainted: G W O 5.14.0+ #53 [ 1411.304479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1411.304481] RIP: 0010:find_vma+0x47/0xa0 [ 1411.304484] Code: de 48 89 ef e8 ba f5 fe ff 48 85 c0 74 2e 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 2d 9f d8 00 85 c0 75 d4 <0f> 0b 48 89 de 48 8 [ 1411.304487] RSP: 0018:ffffabd440403db8 EFLAGS: 00010246 [ 1411.304490] RAX: 0000000000000000 RBX: 00007f00ad80a0e0 RCX: 0000000000000000 [ 1411.304492] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e [ 1411.304494] RBP: ffff9cf5c2f50000 R08: ffff9cf5c3eb25d8 R09: 00000000fffffffe [ 1411.304496] R10: 0000000000000001 R11: 00000000ef974e19 R12: ffff9cf5c39ae0e0 [ 1411.304498] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9cf5c39ae0e0 [ 1411.304501] FS: 00007f00ae754780(0000) GS:ffff9cf5fba00000(0000) knlGS:0000000000000000 [ 1411.304504] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1411.304506] CR2: 000000003e34343c CR3: 0000000103a98005 CR4: 0000000000370ee0 [ 1411.304508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1411.304510] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1411.304512] Call Trace: [ 1411.304517] stack_map_get_build_id_offset+0x17c/0x260 [ 1411.304528] __bpf_get_stack+0x18f/0x230 [ 1411.304541] bpf_get_stack_raw_tp+0x5a/0x70 [ 1411.305752] RAX: 0000000000000000 RBX: 5541f689495641d7 RCX: 0000000000000000 [ 1411.305756] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e [ 1411.305758] RBP: ffff9cf5c02b2f40 R08: ffff9cf5ca7606c0 R09: ffffcbd43ee02c04 [ 1411.306978] bpf_prog_32007c34f7726d29_bpf_prog1+0xaf/0xd9c [ 1411.307861] R10: 0000000000000001 R11: 0000000000000044 R12: ffff9cf5c2ef60e0 [ 1411.307865] R13: 0000000000000005 R14: 0000000000000000 R15: ffff9cf5c2ef6108 [ 1411.309074] bpf_trace_run2+0x8f/0x1a0 [ 1411.309891] FS: 00007ff485141700(0000) GS:ffff9cf5fae00000(0000) knlGS:0000000000000000 [ 1411.309896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1411.311221] syscall_trace_enter.isra.20+0x161/0x1f0 [ 1411.311600] CR2: 00007ff48514d90e CR3: 0000000107114001 CR4: 0000000000370ef0 [ 1411.312291] do_syscall_64+0x15/0x80 [ 1411.312941] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1411.313803] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1411.314223] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1411.315082] RIP: 0033:0x7f00ad80a0e0 [ 1411.315626] Call Trace: [ 1411.315632] stack_map_get_build_id_offset+0x17c/0x260 To reproduce, first build `test_progs` binary: make -C tools/testing/selftests/bpf -j60 and then run the binary at tools/testing/selftests/bpf directory: ./test_progs -t get_stack_raw_tp The warning is due to commit `5b78ed24e8` ("mm/pagemap: add mmap_assert_locked() annotations to find_vma()") which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function asserts that mm->mmap_lock needs to be held. But this is not the case for bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held in bpf_get_stack[id]() use case, the above warning is emitted during test run. This patch fixed the issue by (1). using mmap_read_trylock() instead of mmap_read_trylock_non_owner() to satisfy lockdep checking in find_vma(), and (2). droping lockdep for mmap_lock right before the irq_work_queue(). The function mmap_read_trylock_non_owner() is also removed since after this patch nobody calls it any more. Fixes: `5b78ed24e8` ("mm/pagemap: add mmap_assert_locked() annotations to find_vma()") Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Luigi Rizzo <lrizzo@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/bpf/20210909155000.1610299-1-yhs@fb.com	2021-09-10 22:24:23 +02:00
Masami Hiramatsu	5dfe50b055	bootconfig: Rename xbc_node_find_child() to xbc_node_find_subkey() Rename xbc_node_find_child() to xbc_node_find_subkey() for clarifying that function returns a key node (no value node). Since there are xbc_node_for_each_child() (loop on all child nodes) and xbc_node_for_each_subkey() (loop on only subkey nodes), this name distinction is necessary to avoid confusing users. Link: https://lkml.kernel.org/r/163119459826.161018.11200274779483115300.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-09-09 19:14:33 -04:00
Masami Hiramatsu	5f8895b27d	tracing/boot: Fix to check the histogram control param is a leaf node Since xbc_node_find_child() doesn't ensure the returned node is a leaf node (key-value pair or do not have subkeys), use xbc_node_find_value to ensure the histogram control parameter is a leaf node in trace_boot_compose_hist_cmd(). Link: https://lkml.kernel.org/r/163119459059.161018.18341288218424528962.stgit@devnote2 Fixes: `e66ed86ca6` ("tracing/boot: Add per-event histogram action options") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-09-09 19:14:33 -04:00
Masami Hiramatsu	a3928f877e	tracing/boot: Fix trace_boot_hist_add_array() to check array is value trace_boot_hist_add_array() uses the combination of xbc_node_find_child() and xbc_node_get_child() to get the child node of the key node. But since it missed to check the child node is data node or not, user can pass the subkey node for the array node (anode). To avoid this issue, check the array node is a data node. Actually, there is xbc_node_find_value(node, key, vnode), which ensures the @vnode is a value node, so use it in trace_boot_hist_add_array() to fix this issue. Link: https://lkml.kernel.org/r/163119458308.161018.1516455973625940212.stgit@devnote2 Fixes: `e66ed86ca6` ("tracing/boot: Add per-event histogram action options") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-09-09 19:14:33 -04:00
Quentin Monnet	0b46b75505	libbpf: Add LIBBPF_DEPRECATED_SINCE macro for scheduling API deprecations Introduce a macro LIBBPF_DEPRECATED_SINCE(major, minor, message) to prepare the deprecation of two API functions. This macro marks functions as deprecated when libbpf's version reaches the values passed as an argument. As part of this change libbpf_version.h header is added with recorded major (LIBBPF_MAJOR_VERSION) and minor (LIBBPF_MINOR_VERSION) libbpf version macros. They are now part of libbpf public API and can be relied upon by user code. libbpf_version.h is installed system-wide along other libbpf public headers. Due to this new build-time auto-generated header, in-kernel applications relying on libbpf (resolve_btfids, bpftool, bpf_preload) are updated to include libbpf's output directory as part of a list of include search paths. Better fix would be to use libbpf's make_install target to install public API headers, but that clean up is left out as a future improvement. The build changes were tested by building kernel (with KBUILD_OUTPUT and O= specified explicitly), bpftool, libbpf, selftests/bpf, and resolve_btfids builds. No problems were detected. Note that because of the constraints of the C preprocessor we have to write a few lines of macro magic for each version used to prepare deprecation (0.6 for now). Also, use LIBBPF_DEPRECATED_SINCE() to schedule deprecation of btf__get_from_id() and btf__load(), which are replaced by btf__load_from_kernel_by_id() and btf__load_into_kernel(), respectively, starting from future libbpf v0.6. This is part of libbpf 1.0 effort ([0]). [0] Closes: https://github.com/libbpf/libbpf/issues/278 Co-developed-by: Quentin Monnet <quentin@isovalent.com> Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210908213226.1871016-1-andrii@kernel.org	2021-09-09 23:28:05 +02:00
Linus Torvalds	43175623dd	Merge tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: - Add migrate-disable counter to tracing header - Fix error handling in event probes - Fix missed unlock in osnoise in error path - Fix merge issue with tools/bootconfig - Clean up bootconfig data when init memory is removed - Fix bootconfig to loop only on subkeys - Have kernel command lines override bootconfig options - Increase field counts for synthetic events - Have histograms dynamic allocate event elements to save space - Fixes in testing and documentation * tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/boot: Fix to loop on only subkeys selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events tracing: Dynamically allocate the per-elt hist_elt_data array tracing: synth events: increase max fields count tools/bootconfig: Show whole test command for each test case bootconfig: Fix missing return check of xbc_node_compose_key function tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh docs: bootconfig: Add how to use bootconfig for kernel parameters init/bootconfig: Reorder init parameter from bootconfig and cmdline init: bootconfig: Remove all bootconfig data when the init memory is removed tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads() tracing: Fix some alloc_event_probe() error handling bugs tracing: Add migrate-disabled counter to tracing output.	2021-09-09 13:11:15 -07:00
Thomas Gleixner	868ad33bfa	sched: Prevent balance_push() on remote runqueues sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance callback after changing priorities or the scheduling class of a task. The run-queue for which the callback is invoked can be local or remote. That's not a problem for the regular rq::push_work which is serialized with a busy flag in the run-queue struct, but for the balance_push() work which is only valid to be invoked on the outgoing CPU that's wrong. It not only triggers the debug warning, but also leaves the per CPU variable push_work unprotected, which can result in double enqueues on the stop machine list. Remove the warning and validate that the function is invoked on the outgoing CPU. Fixes: `ae79270232` ("sched: Optimize finish_lock_switch()") Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx	2021-09-09 11:27:23 +02:00
Sebastian Andrzej Siewior	9848417926	sched/idle: Make the idle timer expire in hard interrupt context The intel powerclamp driver will setup a per-CPU worker with RT priority. The worker will then invoke play_idle() in which it remains in the idle poll loop until it is stopped by the timer it started earlier. That timer needs to expire in hard interrupt context on PREEMPT_RT. Otherwise the timer will expire in ksoftirqd as a SOFT timer but that task won't be scheduled on the CPU because its priority is lower than the priority of the worker which is in the idle loop. Always expire the idle timer in hard interrupt context. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210906113034.jgfxrjdvxnjqgtmc@linutronix.de	2021-09-09 10:36:16 +02:00
Peter Zijlstra	e548057270	locking/rtmutex: Fix ww_mutex deadlock check Dan reported that rt_mutex_adjust_prio_chain() can be called with .orig_waiter == NULL however commit `a055fcc132` ("locking/rtmutex: Return success on deadlock for ww_mutex waiters") unconditionally dereferences it. Since both call-sites that have .orig_waiter == NULL don't care for the return value, simply disable the deadlock squash by adding the NULL check. Notably, both callers use the deadlock condition as a termination condition for the iteration; once detected, it is sure that (de)boosting is done. Arguably step [3] would be a more natural termination point, but it's dubious whether adding a third deadlock detection state would improve the code. Fixes: `a055fcc132` ("locking/rtmutex: Return success on deadlock for ww_mutex waiters") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/YS9La56fHMiCCo75@hirez.programming.kicks-ass.net	2021-09-09 10:31:22 +02:00

... 49 50 51 52 53 ...

39759 Commits