Pull irqchip fixes from Marc Zyngier:
- Work around a bad GIC integration on a Renesas platform, where the
interconnect cannot deal with byte-sized MMIO accesses
- Cleanup another Renesas driver abusing the comma operator
- Fix a potential GICv4 memory leak on an error path
- Make the type of 'size' consistent with the rest of the code in
__irq_domain_add()
- Fix a regression in the Armada 370-XP IPI path
- Fix the build for the obviously unloved goldfish-pic
- Some documentation fixes
Link: https://lore.kernel.org/r/20210924090933.2766857-1-maz@kernel.org
net/mptcp/protocol.c
977d293e23 ("mptcp: ensure tx skbs always have the MPTCP ext")
efe686ffce ("mptcp: ensure tx skbs always have the MPTCP ext")
same patch merged in both trees, keep net-next.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Resetting/stopping an itimer eventually leads to it being reprogrammed
with an actual "0" value. As a result the itimer expires on the next
tick, triggering an unexpected signal.
To fix this, make sure that
struct signal_struct::it[CPUCLOCK_PROF/VIRT]::expires is set to 0 when
setitimer() passes a 0 it_value, indicating that the timer must stop.
Fixes: 406dd42bd1 ("posix-cpu-timers: Force next expiration recalc after itimer reset")
Reported-by: Victor Stinner <vstinner@redhat.com>
Reported-by: Chris Hixon <linux-kernel-bugs@hixontech.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210913145332.232023-1-frederic@kernel.org
Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
that the two function are always called back-to-back by architectures
that have rseq. The rseq helper is stubbed out for architectures that
don't support rseq, i.e. this is a nop across the board.
Note, tracehook_notify_resume() is horribly named and arguably does not
belong in tracehook.h as literally every line of code in it has nothing
to do with tracing. But, that's been true since commit a42c6ded82
("move key_repace_session_keyring() into tracehook_notify_resume()")
first usurped tracehook_notify_resume() back in 2012. Punt cleaning that
mess up to future patches.
No functional change intended.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901203030.1292304-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
transferring to a KVM guest, which is roughly equivalent to an exit to
userspace and processes many of the same pending actions. While the task
cannot be in an rseq critical section as the KVM path is reachable only
by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
critical section still apply, e.g. the current CPU needs to be updated if
the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
and other badness in userspace VMMs that use rseq in combination with KVM,
e.g. due to the CPU ID being stale after task migration.
Fixes: 72c3c0fe54 ("x86/kvm: Use generic xfer to guest work function")
Reported-by: Peter Foley <pefoley@google.com>
Bisected-by: Doug Evans <dje@google.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901203030.1292304-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It was found that the following warning was displayed when remounting
controllers from cgroup v2 to v1:
[ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190
:
[ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190
[ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 <0f> 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3
[ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202
[ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a
[ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000
[ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004
[ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000
[ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20
[ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000
[ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0
[ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8043.191804] PKRU: 55555554
[ 8043.194517] Call Trace:
[ 8043.196970] rebind_subsystems+0x18c/0x470
[ 8043.201070] cgroup_setup_root+0x16c/0x2f0
[ 8043.205177] cgroup1_root_to_use+0x204/0x2a0
[ 8043.209456] cgroup1_get_tree+0x3e/0x120
[ 8043.213384] vfs_get_tree+0x22/0xb0
[ 8043.216883] do_new_mount+0x176/0x2d0
[ 8043.220550] __x64_sys_mount+0x103/0x140
[ 8043.224474] do_syscall_64+0x38/0x90
[ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae
It was caused by the fact that rebind_subsystem() disables
controllers to be rebound one by one. If more than one disabled
controllers are originally from the default hierarchy, it means that
cgroup_apply_control_disable() will be called multiple times for the
same default hierarchy. A controller may be killed by css_kill() in
the first round. In the second round, the killed controller may not be
completely dead yet leading to the warning.
To avoid this problem, we collect all the ssid's of controllers that
needed to be disabled from the default hierarchy and then disable them
in one go instead of one by one.
Fixes: 334c3679ec ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In scenarios where containers are frequently created and deleted,
a large number of error logs maybe generated. The logs only show
which node is about to go over the max limit, not the node which
resource request failed. As misc.events has provided relevant
information, maybe we can remove this log.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Introduce misc.events to make it easier for us to understand
the pressure of resources. Currently only the 'max' event is
implemented, which indicates the times the resource is about
to exceeds the max limit.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch adds basic audit io_uring filtering, using as much of the
existing audit filtering infrastructure as possible. In order to do
this we reuse the audit filter rule's syscall mask for the io_uring
operation and we create a new filter for io_uring operations as
AUDIT_FILTER_URING_EXIT/audit_filter_list[7].
Thanks to Richard Guy Briggs for his review, feedback, and work on
the corresponding audit userspace changes.
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This patch adds basic auditing to io_uring operations, regardless of
their context. This is accomplished by allocating audit_context
structures for the io-wq worker and io_uring SQPOLL kernel threads
as well as explicitly auditing the io_uring operations in
io_issue_sqe(). Individual io_uring operations can bypass auditing
through the "audit_skip" field in the struct io_op_def definition for
the operation; although great care must be taken so that security
relevant io_uring operations do not bypass auditing; please contact
the audit mailing list (see the MAINTAINERS file) with any questions.
The io_uring operations are audited using a new AUDIT_URINGOP record,
an example is shown below:
type=UNKNOWN[1336] msg=audit(1631800225.981:37289):
uring_op=19 success=yes exit=0 items=0 ppid=15454 pid=15681
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key=(null)
Thanks to Richard Guy Briggs for review and feedback.
Signed-off-by: Paul Moore <paul@paul-moore.com>
This patch cleans up some of our audit_context handling by
abstracting out the reset and return code fixup handling to dedicated
functions. Not only does this help make things easier to read and
inspect, it allows for easier reuse by future patches. We also
convert the simple audit_context->in_syscall flag into an enum which
can be used to by future patches to indicate a calling context other
than the syscall context.
Thanks to Richard Guy Briggs for review and feedback.
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The support for misrouted IRQs is used on old / legacy systems and is
not feasible on PREEMPT_RT.
Polling for interrupts reduces the overall system performance.
Additionally the interrupt latency depends on the polling frequency and
delays are not desired for real time workloads.
Disable IRQ polling on PREEMPT_RT and let the user know that it is not
enabled. The compiler will optimize the real fixup/poll code out.
[ bigeasy: Update changelog and switch to IS_ENABLED() ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210917223841.c6j6jcaffojrnot3@linutronix.de
Pull perf event fix from Thomas Gleixner:
"A single fix for the perf core where a value read with READ_ONCE() was
checked and then reread which makes all the checks invalid. Reuse the
already read value instead"
* tag 'perf-urgent-2021-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
events: Reuse value read using READ_ONCE instead of re-reading it
Pull locking fixes from Thomas Gleixner:
"A set of updates for the RT specific reader/writer locking base code:
- Make the fast path reader ordering guarantees correct.
- Code reshuffling to make the fix simpler"
[ This plays ugly games with atomic_add_return_release() because we
don't have a plain atomic_add_release(), and should really be cleaned
up, I think - Linus ]
* tag 'locking-urgent-2021-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwbase: Take care of ordering guarantee for fastpath reader
locking/rwbase: Extract __rwbase_write_trylock()
locking/rwbase: Properly match set_and_save_state() to restore_state()
MAX_SNPRINTF_VARARGS and MAX_SEQ_PRINTF_VARARGS are used by bpf helpers
bpf_snprintf and bpf_seq_printf to limit their varargs. Both call into
bpf_bprintf_prepare for print formatting logic and have convenience
macros in libbpf (BPF_SNPRINTF, BPF_SEQ_PRINTF) which use the same
helper macros to convert varargs to a byte array.
Changing shared functionality to support more varargs for either bpf
helper would affect the other as well, so let's combine the _VARARGS
macros to make this more obvious.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210917182911.2426606-2-davemarchevsky@fb.com
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-09-17
We've added 63 non-merge commits during the last 12 day(s) which contain
a total of 65 files changed, 2653 insertions(+), 751 deletions(-).
The main changes are:
1) Streamline internal BPF program sections handling and
bpf_program__set_attach_target() in libbpf, from Andrii.
2) Add support for new btf kind BTF_KIND_TAG, from Yonghong.
3) Introduce bpf_get_branch_snapshot() to capture LBR, from Song.
4) IMUL optimization for x86-64 JIT, from Jie.
5) xsk selftest improvements, from Magnus.
6) Introduce legacy kprobe events support in libbpf, from Rafael.
7) Access hw timestamp through BPF's __sk_buff, from Vadim.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (63 commits)
selftests/bpf: Fix a few compiler warnings
libbpf: Constify all high-level program attach APIs
libbpf: Schedule open_opts.attach_prog_fd deprecation since v0.7
selftests/bpf: Switch fexit_bpf2bpf selftest to set_attach_target() API
libbpf: Allow skipping attach_func_name in bpf_program__set_attach_target()
libbpf: Deprecated bpf_object_open_opts.relaxed_core_relocs
selftests/bpf: Stop using relaxed_core_relocs which has no effect
libbpf: Use pre-setup sec_def in libbpf_find_attach_btf_id()
bpf: Update bpf_get_smp_processor_id() documentation
libbpf: Add sphinx code documentation comments
selftests/bpf: Skip btf_tag test if btf_tag attribute not supported
docs/bpf: Add documentation for BTF_KIND_TAG
selftests/bpf: Add a test with a bpf program with btf_tag attributes
selftests/bpf: Test BTF_KIND_TAG for deduplication
selftests/bpf: Add BTF_KIND_TAG unit tests
selftests/bpf: Change NAME_NTH/IS_NAME_NTH for BTF_KIND_TAG format
selftests/bpf: Test libbpf API function btf__add_tag()
bpftool: Add support for BTF_KIND_TAG
libbpf: Add support for BTF_KIND_TAG
libbpf: Rename btf_{hash,equal}_int to btf_{hash,equal}_int_tag
...
====================
Link: https://lore.kernel.org/r/20210917173738.3397064-1-ast@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull dma-mapping fixes from Christoph Hellwig:
- page align size in sparc32 arch_dma_alloc (Andreas Larsson)
- tone down a new dma-debug message (Hamza Mahfooz)
- fix the kerneldoc for dma_map_sg_attrs (me)
* tag 'dma-mapping-5.15-1' of git://git.infradead.org/users/hch/dma-mapping:
sparc32: page align size in arch_dma_alloc
dma-debug: prevent an error message from causing runtime problems
dma-mapping: fix the kerneldoc for dma_map_sg_attrs
lock_is_held_type(, 1) detects acquired read locks. It only recognized
locks acquired with lock_acquire_shared(). Read locks acquired with
lock_acquire_shared_recursive() are not recognized because a `2' is
stored as the read value.
Rework the check to additionally recognise lock's read value one and two
as a read held lock.
Fixes: e918188611 ("locking: More accurate annotations for read_lock()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20210903084001.lblecrvz4esl4mrr@linutronix.de
i915 will soon gain an eviction path that trylock a whole lot of locks
for eviction, getting dmesg failures like below:
BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48 max: 48!
48 locks held by i915_selftest/5776:
#0: ffff888101a79240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x88/0x160
#1: ffffc900009778c0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.63+0x39/0x1b0 [i915]
#2: ffff88800cf74de8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.63+0x5f/0x1b0 [i915]
#3: ffff88810c7f9e38 (&vm->mutex/1){+.+.}-{3:3}, at: i915_vma_pin_ww+0x1c4/0x9d0 [i915]
#4: ffff88810bad5768 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
#5: ffff88810bad60e8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
...
#46: ffff88811964d768 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
#47: ffff88811964e0e8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_gem_evict_something+0x110/0x860 [i915]
INFO: lockdep is turned off.
Fixing eviction to nest into ww_class_acquire is a high priority, but
it requires a rework of the entire driver, which can only be done one
step at a time.
As an intermediate solution, add an acquire context to
ww_mutex_trylock, which allows us to do proper nesting annotations on
the trylocks, making the above lockdep splat disappear.
This is also useful in regulator_lock_nested, which may avoid dropping
regulator_nesting_mutex in the uncontended path, so use it there.
TTM may be another user for this, where we could lock a buffer in a
fastpath with list locks held, without dropping all locks we hold.
[peterz: rework actual ww_mutex_trylock() implementations]
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YUBGPdDDjKlxAuXJ@hirez.programming.kicks-ass.net
With enabled threaded interrupts the nouveau driver reported the
following:
| Chain exists of:
| &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem
|
| Possible unsafe locking scenario:
|
| CPU0 CPU1
| ---- ----
| lock(&cpuset_rwsem);
| lock(&device->mutex);
| lock(&cpuset_rwsem);
| lock(&mm->mmap_lock#2);
The device->mutex is nvkm_device::mutex.
Unblocking the lockchain at `cpuset_rwsem' is probably the easiest
thing to do. Move the priority assignment to the start of the newly
created thread.
Fixes: 710da3c8ea ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: Patch description]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
When running scftorture as a module, any scf_torture_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running scftorture
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing scf_torture_init()
errors when running scftorture built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, only those IPIs that invoke scftorture's scf_handler()
IPI handler function are counted. This means that runs exercising
only scftorture.weight_resched will look like they have made no forward
progress, resulting in "GP HANG" complaints from the rcutorture scripting.
This commit therefore increments the scf_invoked_count per-CPU counter
immediately after calling resched_cpu().
Fixes: 1ac78b49d6 ("scftorture: Add an alternative IPI vector")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The "all zero weights makes no sense" error is emitted even when
scftorture.weight_resched is non-zero because it was left out of
the enclosing "if" condition. This commit adds it in.
Fixes: 1ac78b49d6 ("scftorture: Add an alternative IPI vector")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If (say) a 10-hour scftorture run is started, but the module parameters
are so nonsensical that the run doesn't even start, then scftorture will
wait the full ten hours when run built into a guest OS. This commit
therefore shuts down the system in this case so that the error is reported
immediately instead of ten hours hence.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit reworks the weighting calculations to allow zero to
be specified to disable a given weight. For example, specifying the
scftorture.weight_resched=0 kernel boot parameter without specifying a
non-zero value for any of the other scftorture.weight_* parameters would
provide the default weights for the others, but would refrain from doing
any resched-based IPIs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Since commit aa40c138cc ("rcu: Report QS for outermost PREEMPT=n
rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict()
is invoked by the inlined rcu_read_unlock() function. However,
rcu_read_unlock_strict() is an empty function in production kernels,
which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n.
There is a mention of rcu_read_unlock_strict() in the BPF verifier,
but this is in a deny-list, meaning that BPF does not care whether
rcu_read_unlock_strict() is ever called.
This commit therefore provides a slight performance improvement
by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from
rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the
pointless call to an empty function.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The cond_resched_rcu_qs() function no longer exists, despite being mentioned
several times in kernel/rcu/tasks.h. This commit therefore updates it to
the current cond_resched_tasks_rcu_qs().
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The trc_wait_for_one_reader() function is called at multiple stages
of trace rcu-tasks GP function, rcu_tasks_wait_gp():
- First, it is called as part of per task function -
rcu_tasks_trace_pertask(), for all non-idle tasks. As part of per task
processing, this function add the task in the holdout list and if the
task is currently running on a CPU, it sends IPI to the task's CPU.
The IPI handler takes action depending on whether task is in trace
rcu-tasks read side critical section or not:
- a. If the task is in trace rcu-tasks read side critical section
(t->trc_reader_nesting != 0), the IPI handler sets the task's
->trc_reader_special.b.need_qs, so that this task notifies exit
from its outermost read side critical section (by decrementing
trc_n_readers_need_end) to the GP handling function.
trc_wait_for_one_reader() also increments trc_n_readers_need_end,
so that the trace rcu-tasks GP handler function waits for this
task's read side exit notification. The IPI handler also sets
t->trc_reader_checked to true, and no further IPIs are sent for
this task, for this trace rcu-tasks grace period and this
task can be removed from holdout list.
- b. If the task is in the process of exiting its trace rcu-tasks
read side critical section, (t->trc_reader_nesting < 0), defer
this task's processing to future calls to trc_wait_for_one_reader().
- c. If task is not in rcu-task read side critical section,
t->trc_reader_nesting == 0, ->trc_reader_checked is set for this
task, so that this task is removed from holdout list.
- Second, trc_wait_for_one_reader() is called as part of post scan, in
function rcu_tasks_trace_postscan(), for all idle tasks.
- Third, in function check_all_holdout_tasks_trace(), this function is
called for each task in the holdout list, but only if there isn't
a pending IPI for the task (->trc_ipi_to_cpu == -1). This function
removed the task from holdout list, if IPI handler has completed the
required work, to ensure that the current trace rcu-tasks grace period
either waits for this task, or this task is not in a trace rcu-tasks
read side critical section.
Now, considering the scenario where smp_call_function_single() fails in
first case, inside rcu_tasks_trace_pertask(). In this case,
->trc_ipi_to_cpu is set to the current CPU for that task. This will
result in trc_wait_for_one_reader() getting skipped in third case,
inside check_all_holdout_tasks_trace(), for this task. This further
results in ->trc_reader_checked never getting set for this task,
and the task not getting removed from holdout list. This can cause
the current trace rcu-tasks grace period to stall.
Fix the above problem, by resetting ->trc_ipi_to_cpu to -1, on
smp_call_function_single() failure, so that future IPI calls can
be send for this task.
Note that all three of the trc_wait_for_one_reader() function's
callers (rcu_tasks_trace_pertask(), rcu_tasks_trace_postscan(),
check_all_holdout_tasks_trace()) hold cpu_read_lock(). This means
that smp_call_function_single() cannot race with CPU hotplug, and thus
should never fail. Therefore, also add a warning in order to report
any such failure in case smp_call_function_single() grows some other
reason for failure.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
call_rcu_tasks_trace() does have read-side primitives - rcu_read_lock_trace()
and rcu_read_unlock_trace(). Fix this information in the comments.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU tasks rude variant does not check whether the current
running context on a CPU is usermode. Read side critical section ends
on transition to usermode execution, by the virtue of usermode
execution being schedulable. Clarify this in comments for
call_rcu_tasks_rude() and synchronize_rcu_tasks_rude().
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Valid CPU numbers can be zero or greater, but the checks for
->trc_ipi_to_cpu and tick_nohz_full_cpu()'s argument are for strictly
greater than. This commit therefore corrects the check for no_hz_full
cpu in show_stalled_task_trace() so as to include cpu 0.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In check_all_holdout_tasks_trace(), firstreport is a pointer argument;
so, check the dereferenced value, instead of checking the pointer.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Early in debugging, it made some sense to differentiate the first
iteration from subsequent iterations, but now this just causes confusion.
This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)"
statement to the beginning of the "for" loop in rcu_tasks_kthread().
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The second argument of rcu_read_unlock_trace_special() is always zero.
When called from exit_tasks_rcu_finish_trace(), it is the constant
zero, and rcu_read_unlock_trace_special() doesn't get called from
rcu_read_unlock_trace() unless the value of local variable "nesting"
is zero because in that case the early return is taken instead.
This commit therefore removes the "nesting" argument from the
rcu_read_unlock_trace_special() function, substituting the constant
zero within that function. This commit also adds a WARN_ON_ONCE()
to rcu_read_lock_trace_held() in case non-zeroness some day appears.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, trc_inspect_reader() treats a task exiting its RCU Tasks
Trace read-side critical section the same as being within that critical
section. However, this can fail because that task might have already
checked its .need_qs field, which means that it might never decrement
the all-important trc_n_readers_need_end counter. Of course, for that
to happen, the task would need to never again execute an RCU Tasks Trace
read-side critical section, but this really could happen if the system's
last trampoline was removed. Note that exit from such a critical section
cannot be treated as a quiescent state due to the possibility of nested
critical sections. This means that if trc_inspect_reader() sees a
negative nesting value, it must set up to try again later.
This commit therefore ignores tasks that are exiting their RCU Tasks
Trace read-side critical sections so that they will be rechecked later.
[ paulmck: Apply feedback from Neeraj Upadhyay and Boqun Feng. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, trc_wait_for_one_reader() atomically increments
the trc_n_readers_need_end counter before sending the IPI
invoking trc_read_check_handler(). All failure paths out of
trc_read_check_handler() and also from the smp_call_function_single()
within trc_wait_for_one_reader() must carefully atomically decrement
this counter. This is more complex than it needs to be.
This commit therefore simplifies things and saves a few lines of
code by dispensing with the atomic decrements in favor of having
trc_read_check_handler() do the atomic increment only in the success case.
In theory, this represents no change in functionality.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Readers of rwbase can lock and unlock without taking any inner lock, if
that happens, we need the ordering provided by atomic operations to
satisfy the ordering semantics of lock/unlock. Without that, considering
the follow case:
{ X = 0 initially }
CPU 0 CPU 1
===== =====
rt_write_lock();
X = 1
rt_write_unlock():
atomic_add(READER_BIAS - WRITER_BIAS, ->readers);
// ->readers is READER_BIAS.
rt_read_lock():
if ((r = atomic_read(->readers)) < 0) // True
atomic_try_cmpxchg(->readers, r, r + 1); // succeed.
<acquire the read lock via fast path>
r1 = X; // r1 may be 0, because nothing prevent the reordering
// of "X=1" and atomic_add() on CPU 1.
Therefore audit every usage of atomic operations that may happen in a
fast path, and add necessary barriers.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210909110203.953991276@infradead.org
In perf_event_addr_filters_apply, the task associated with
the event (event->ctx->task) is read using READ_ONCE at the beginning
of the function, checked, and then re-read from event->ctx->task,
voiding all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the task struct throughout the
function.
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210906015310.12802-1-baptiste.lepers@gmail.com
blk_status_to_errno doesn't appear to perform extra work besides
converting blk_status_t to integer. This patch removes that unnecessary
conversion as the return type of the function is blk_status_t.
Signed-off-by: Falla Coulibaly <fallacoulibalyz@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>