If a CSD-lock stall goes on long enough, it will cause an RCU CPU
stall warning. This additional warning provides much additional
console-log traffic and little additional information. Therefore,
provide a new csd_lock_is_stuck() function that returns true if there
is an ongoing CSD-lock stall. This function will be used by the RCU
CPU stall warnings to provide a one-line indication of the stall when
this function returns true.
[ neeraj.upadhyay: Apply Rik van Riel feedback. ]
[ neeraj.upadhyay: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc
oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns
M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg=
=HZBL
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"VFS:
- Fix the name of file lease slab cache. When file leases were split
out of file locks the name of the file lock slab cache was used for
the file leases slab cache as well.
- Fix a type in take_fd() helper.
- Fix infinite directory iteration for stable offsets in tmpfs.
- When the icache is pruned all reclaimable inodes are marked with
I_FREEING and other processes that try to lookup such inodes will
block.
But some filesystems like ext4 can trigger lookups in their inode
evict callback causing deadlocks. Ext4 does such lookups if the
ea_inode feature is used whereby a separate inode may be used to
store xattrs.
Introduce I_LRU_ISOLATING which pins the inode while its pages are
reclaimed. This avoids inode deletion during inode_lru_isolate()
avoiding the deadlock and evict is made to wait until
I_LRU_ISOLATING is done.
netfs:
- Fault in smaller chunks for non-large folio mappings for
filesystems that haven't been converted to large folios yet.
- Fix the CONFIG_NETFS_DEBUG config option. The config option was
renamed a short while ago and that introduced two minor issues.
First, it depended on CONFIG_NETFS whereas it wants to depend on
CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
does. Second, the documentation for the config option wasn't fixed
up.
- Revert the removal of the PG_private_2 writeback flag as ceph is
using it and fix how that flag is handled in netfs.
- Fix DIO reads on 9p. A program watching a file on a 9p mount
wouldn't see any changes in the size of the file being exported by
the server if the file was changed directly in the source
filesystem. Fix this by attempting to read the full size specified
when a DIO read is requested.
- Fix a NULL pointer dereference bug due to a data race where a
cachefiles cookies was retired even though it was still in use.
Check the cookie's n_accesses counter before discarding it.
nsfs:
- Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
the kernel is writing to userspace.
pidfs:
- Prevent the creation of pidfds for kthreads until we have a
use-case for it and we know the semantics we want. It also confuses
userspace why they can get pidfds for kthreads.
squashfs:
- Fix an unitialized value bug reported by KMSAN caused by a
corrupted symbolic link size read from disk. Check that the
symbolic link size is not larger than expected"
* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Squashfs: sanity check symbolic link size
9p: Fix DIO read through netfs
vfs: Don't evict inode under the inode lru traversing context
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
file: fix typo in take_fd() comment
pidfd: prevent creation of pidfds for kthreads
netfs: clean up after renaming FSCACHE_DEBUG config
libfs: fix infinite directory reads for offset dir
nsfs: fix ioctl declaration
fs/netfs/fscache_cookie: add missing "n_accesses" check
filelock: fix name of file_lease slab cache
netfs: Fault in smaller chunks for non-large folio mappings
This commit uses the new rcu_tasks_torture_stats_print(),
rcu_tasks_trace_torture_stats_print(), and
rcu_tasks_rude_torture_stats_print() functions in order to provide
detailed diagnostics on grace-period, callback, and barrier state when
rcu_scale_writer() hangs.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU keeps a count of the number of callbacks that the current
rcu_barrier() is waiting on, but there is currently no easy way to
work out which callback is stuck. One way to do this is to mark idle
RCU-barrier callbacks by making the ->next pointer point to the callback
itself, and this commit does just that.
Later commits will use this for debug output.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds a .stats function pointer to the rcu_scale_ops structure,
and if this is non-NULL, it is invoked after stack traces are dumped in
response to a rcu_scale_writer() stall.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit improves debuggability by dumping the stacks of
rcu_scale_writer() instances that have not completed in a reasonable
timeframe. These stacks are dumped remotely, but they will be accurate
in the thus-far common case where the stalled rcu_scale_writer() instances
are blocked.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This whitespace-only commit fuses a few lines of code, taking advantage
of the newish 100-character-per-line limit to save a few lines of code.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
process_durations() is not a hot path, but there is no good reason to
iterate over and over the data already in 'buf'.
Using a seq_buf saves some useless strcat() and the need of a temp buffer.
Data is written directly at the correct place.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds the start time, in jiffies, of the most recently started
rcu_barrier_tasks*() operation to the diagnostic output used by rcuscale.
This information can be helpful in distinguishing a hung barrier operation
from a long series of barrier operations.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds rcu_tasks_torture_stats_print(),
rcu_tasks_trace_torture_stats_print(), and
rcu_tasks_rude_torture_stats_print() functions that provide detailed
diagnostics on grace-period, callback, and barrier state.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Each Tasks RCU flavor keeps a count of the number of callbacks that the
current rcu_barrier_tasks*() is waiting on, but there is currently no
easy way to work out which callback is stuck. One way to do this is to
mark idle RCU-barrier callbacks by making the ->next pointer point to
the callback itself, and this commit does just that.
Later commits will use this for debug output.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit provides a rcu_barrier_cb_is_done() function that returns
true if the *rcu_barrier*() callback passed in is done. This will be
used when printing grace-period debugging information.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rtp->tasks_gp_seq grace-period sequence number is not a strict count,
but rather the usual RCU sequence number with the lower few bits tracking
per-grace-period state and the upper bits the count of grace periods
since boot, give or take the initial value. This commit therefore
adjusts this comment.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The current mapping of smp_processor_id() to a CPU processing Tasks-RCU
callbacks makes some assumptions about layout. This commit therefore
adds a WARN_ON() to check these assumptions.
[ neeraj.upadhyay: Replace nr_cpu_ids with rcu_task_cpu_ids. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
For kernels built with CONFIG_FORCE_NR_CPUS=y, the nr_cpu_ids is
defined as NR_CPUS instead of the number of possible cpus, this
will cause the following system panic:
smpboot: Allowing 4 CPUs, 0 hotplug CPUs
...
setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:512 nr_node_ids:1
...
BUG: unable to handle page fault for address: ffffffff9911c8c8
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 15 Comm: rcu_tasks_trace Tainted: G W
6.6.21 #1 5dc7acf91a5e8e9ac9dcfc35bee0245691283ea6
RIP: 0010:rcu_tasks_need_gpcb+0x25d/0x2c0
RSP: 0018:ffffa371c00a3e60 EFLAGS: 00010082
CR2: ffffffff9911c8c8 CR3: 000000040fa20005 CR4: 00000000001706f0
Call Trace:
<TASK>
? __die+0x23/0x80
? page_fault_oops+0xa4/0x180
? exc_page_fault+0x152/0x180
? asm_exc_page_fault+0x26/0x40
? rcu_tasks_need_gpcb+0x25d/0x2c0
? __pfx_rcu_tasks_kthread+0x40/0x40
rcu_tasks_one_gp+0x69/0x180
rcu_tasks_kthread+0x94/0xc0
kthread+0xe8/0x140
? __pfx_kthread+0x40/0x40
ret_from_fork+0x34/0x80
? __pfx_kthread+0x40/0x40
ret_from_fork_asm+0x1b/0x80
</TASK>
Considering that there may be holes in the CPU numbers, use the
maximum possible cpu number, instead of nr_cpu_ids, for configuring
enqueue and dequeue limits.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Closes: https://lore.kernel.org/linux-input/CALMA0xaTSMN+p4xUXkzrtR5r6k7hgoswcaXx7baR_z9r5jjskw@mail.gmail.com/T/#u
Reported-by: Zhixu Liu <zhixu.liu@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. This commit therefore removes their definitions and boot-time
self-tests.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely
once all architectures have their deep-idle and entry/exit code correctly
marked as inline or noinstr. As a step towards this goal, this commit
therefore removes these two functions from rcuscale testing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely
once all architectures have their deep-idle and entry/exit code correctly
marked as inline or noinstr. As a first step towards this goal, this
commit therefore removes these two functions from rcutorture testing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds an stall_cpu_repeat kernel, which is also the
rcutorture.stall_cpu_repeat boot parameter, to test repeated CPU stalls.
Note that only the first stall will pay attention to the stall_cpu_irqsoff
module parameter. For the second and subsequent stalls, interrupts will
be enabled. This is helpful when testing the interaction between RCU
CPU stall warnings and CSD-lock stall warnings.
Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The two hrtimer_cpu_base_.*_expiry() functions are wrappers around the
locking functions and sparse complains about the missing counterpart.
Add sparse annotation to denote that this bevaviour is expected.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812105326.2240000-3-bigeasy@linutronix.de
timer_sync_wait_running() first releases two locks and then acquires
them again. This is unexpected and sparse complains about it.
Add sparse annotation for timer_sync_wait_running() to note that the
locking is expected.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812105326.2240000-2-bigeasy@linutronix.de
Add an interface for a user-defined workqueue lockdep map, which is
helpful when multiple workqueues are created for the same purpose. This
also helps avoid leaking lockdep maps on each workqueue creation.
v2:
- Add alloc_workqueue_lockdep_map (Tejun)
v3:
- Drop __WQ_USER_OWNED_LOCKDEP (Tejun)
- static inline alloc_ordered_workqueue_lockdep_map (Tejun)
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Will help enable user-defined lockdep maps for workqueues.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Will help enable user-defined lockdep maps for workqueues.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The regressing commit is new in 6.10. It assumed that anytime event->prog
is set bpf_overflow_handler() should be invoked to execute the attached bpf
program. This assumption is false for tracing events, and as a result the
regressing commit broke bpftrace by invoking the bpf handler with garbage
inputs on overflow.
Prior to the regression the overflow handlers formed a chain (of length 0,
1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added
bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog()
(the tracing case) did not. Both set event->prog. The chain of overflow
handlers was replaced by a single overflow handler slot and a fixed call to
bpf_overflow_handler() when appropriate. This modifies the condition there
to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the
previous behavior and fixing bpftrace.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Joe Damato <jdamato@fastly.com>
Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/
Fixes: f11f10bfa1 ("perf/bpf: Call BPF handler directly, not through overflow machinery")
Cc: stable@vger.kernel.org
Tested-by: Joe Damato <jdamato@fastly.com> # bpftrace
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240813151727.28797-1-jdamato@fastly.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
commit 779dbc2e78 ("printk: Avoid non-panic CPUs writing
to ringbuffer") disabled non-panic CPUs to further write messages to
ringbuffer after panicked.
Since the commit, non-panicked CPU's are not allowed to write to
ring buffer after panicked and CPU backtrace which is triggered
after panicked to sample non-panicked CPUs' backtrace no longer
serves its function as it has nothing to print.
Fix the issue by allowing non-panicked CPUs to write into ringbuffer
while CPU backtrace is in flight.
Fixes: 779dbc2e78 ("printk: Avoid non-panic CPUs writing to ringbuffer")
Signed-off-by: Ryo Takakura <takakura@valinux.co.jp>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240812072703.339690-1-takakura@valinux.co.jp
Signed-off-by: Petr Mladek <pmladek@suse.com>
When the domain suffix is not supplied alloc_fwnode_name() unconditionally
adds a separator.
Fix the format strings to get rid of the stray '-' separator.
Fixes: 1e7c052925 ("irqdomain: Allow giving name suffix for domain")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812193101.1266625-3-andriy.shevchenko@linux.intel.com
The code uses if (bus_token) and if (bus_token == DOMAIN_BUS_ANY).
Since bus_token is an enum, the latter is more robust against changes.
Convert all !bus_token checks to explicitely check for DOMAIN_BUS_ANY.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812193101.1266625-2-andriy.shevchenko@linux.intel.com
Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
Further investigation shows that the crash is due to invalid memory access
in stacksafe(). More specifically, it is the following code:
if (exact != NOT_EXACT &&
old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
cur->stack[spi].slot_type[i % BPF_REG_SIZE])
return false;
The 'i' iterates old->allocated_stack.
If cur->allocated_stack < old->allocated_stack the out-of-bound
access will happen.
To fix the issue add 'i >= cur->allocated_stack' check such that if
the condition is true, stacksafe() should fail. Otherwise,
cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is legal.
Fixes: 2793a8b015 ("bpf: exact states comparison for iterator convergence checks")
Cc: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Daniel Hodges <hodgesd@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240812214847.213612-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If a core is offline then enabling SMT should not online CPUs of
this core. By enabling SMT, what is intended is either changing the SMT
value from "off" to "on" or setting the SMT level (threads per core) from a
lower to higher value.
On PowerPC the ppc64_cpu utility can be used, among other things, to
perform the following functions:
ppc64_cpu --cores-on # Get the number of online cores
ppc64_cpu --cores-on=X # Put exactly X cores online
ppc64_cpu --offline-cores=X[,Y,...] # Put specified cores offline
ppc64_cpu --smt={on|off|value} # Enable, disable or change SMT level
If the user has decided to offline certain cores, enabling SMT should
not online CPUs in those cores. This patch fixes the issue and changes
the behaviour as described, by introducing an arch specific function
topology_is_core_online(). It is currently implemented only for PowerPC.
Fixes: 73c58e7e14 ("powerpc: Add HOTPLUG_SMT support")
Reported-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Closes: https://groups.google.com/g/powerpc-utils-devel/c/wrwVzAAnRlI/m/5KJSoqP4BAAJ
Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240731030126.956210-2-nysal@linux.ibm.com
SRCU keeps a count of the number of callbacks that the current
srcu_barrier() is waiting on, but there is currently no easy way to
work out which callback is stuck. One way to do this is to mark idle
SRCU-barrier callbacks by making the ->next pointer point to the callback
itself, and this commit does just that.
Later commits will use this for debug output.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
SRCU maintains the ->srcu_n_exp_nodelay and ->reschedule_count values
to guide heuristics governing auto-expediting of normal SRCU grace
periods and grace-period-state-machine delays. This commit adds KCSAN
ASSERT_EXCLUSIVE_WRITER() calls to check for concurrent updates to
these fields.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Using a higher value for the initial gp sequence counters allows for
wrapping to occur faster. It can help with surfacing any issues that may
be happening as a result of the wrap around.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
It was discovered that isolated CPUs could sometimes be disturbed by
kworkers processing kfree_rcu() works causing higher than expected
latency. It is because the RCU core uses "system_wq" which doesn't have
the WQ_UNBOUND flag to handle all its work items. Fix this violation of
latency limits by using "system_unbound_wq" in the RCU core instead.
This will ensure that those work items will not be run on CPUs marked
as isolated.
Beside the WQ_UNBOUND flag, the other major difference between system_wq
and system_unbound_wq is their max_active count. The system_unbound_wq
has a max_active of WQ_MAX_ACTIVE (512) while system_wq's max_active
is WQ_DFL_ACTIVE (256) which is half of WQ_MAX_ACTIVE.
Reported-by: Vratislav Bendel <vbendel@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-50220
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
- Fix a couple of issues in the NTP code where user supplied values are
neither sanity checked nor clamped to the operating range. This results
in integer overflows and eventualy NTP getting out of sync.
According to the history the sanity checks had been removed in favor of
clamping the values, but the clamping never worked correctly under all
circumstances. The NTP people asked to not bring the sanity checks back
as it might break existing applications.
Make the clamping work correctly and add it where it's missing
- If adjtimex() sets the clock it has to trigger the hrtimer subsystem so
it can adjust and if the clock was set into the future expire timers if
needed. The caller should provide a bitmask to tell hrtimers which
clocks have been adjusted. adjtimex() uses not the proper constant and
uses CLOCK_REALTIME instead, which is 0. So hrtimers adjusts only the
clocks, but does not check for expired timers, which might make them
expire really late. Use the proper bitmask constant instead.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAma4wQ0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWNmEACMeq/vMoqbbhfgmTK2+XKfUarF5AX8
61uK/rY6ysO/Qz1P+3K4j+coxhuz2t0ekjIL6htgPE0yU5JR3/VjjUpGIbBLUZfa
aY9Ciy0OHFyTaoduyLKyiO/O7GyI6j8vMMhhNyQDaK5Zm+pIin18FqW6udg79HYh
bDkVtCWg27M1zFd9aqRAc1EX+uFfCrSUi+1oc+E3/knDrNFUVwKCznAeDQQZii6Y
pGmt733o7RRkABSf5T1bNOEVpbMlZowcf7zF3J57otz/lZFuwjRtTdmuG4ha3grs
B+4FLNRZFEIEFPW0We43gAW1jLNjIL8xgZ6CMUwkUYOGQ21wmMxTOUCwg6/YMa9Y
vBceijrICOa1EsyO28XqgRkfIvhdoNsp+c5rAN4LcQd5T7F0SoQCn9A71LXpPXgK
ulnWjAgpt+ovD2+OFX0Ul5ySY04TgPcNVeJfnZeYxpuShlPg0GX+z0RuMl9aLbc3
y11P0PDJiguZaoUZ8lUU2W6XA+eFEA2ZOqP+L6FZwIaDwutmXSjHR//ZkTcNg4/h
rIbB8SFsq3BSMo3Ry2p/KMYWoZ1fF3Tm3Qp9/wpiAx1YSTJ6x8LGkHHq5c9qP5ba
qJWi0vz8dgTGd2ta/xzglvPVWwT08rvrwACHCTcJp3Jq8uvJ27mQbTvZs6p3cFE6
RkEBGDvEIfADew==
=EY09
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time keeping fixes from Thomas Gleixner:
- Fix a couple of issues in the NTP code where user supplied values are
neither sanity checked nor clamped to the operating range. This
results in integer overflows and eventualy NTP getting out of sync.
According to the history the sanity checks had been removed in favor
of clamping the values, but the clamping never worked correctly under
all circumstances. The NTP people asked to not bring the sanity
checks back as it might break existing applications.
Make the clamping work correctly and add it where it's missing
- If adjtimex() sets the clock it has to trigger the hrtimer subsystem
so it can adjust and if the clock was set into the future expire
timers if needed. The caller should provide a bitmask to tell
hrtimers which clocks have been adjusted.
adjtimex() uses not the proper constant and uses CLOCK_REALTIME
instead, which is 0. So hrtimers adjusts only the clocks, but does
not check for expired timers, which might make them expire really
late. Use the proper bitmask constant instead.
* tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
ntp: Safeguard against time_constant overflow
ntp: Clamp maxerror and esterror to operating range
- The interrupt core fails to honor caller supplied affinity hints for
non-managed interrupts and uses the system default affinity on
startup instead. Set the missing flag in the descriptor to tell the
core to use the provided affinity.
- Fix a shift out of bounds error in the Xilinx driver
- Handle switching to level trigger correctly in the RISCV APLIC
driver. It failed to retrigger the interrupt which causes it to
become stale.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAma4vEUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQpoEACcirhCU0x7jfGj7KWJtnx1dko1gG9G
AN86+1lZaHa63vBysAvvEPFVhrbC9JI09SLFTNYrhFTWk9lZeTr7HC9ZgvH2U/Yp
YrYci/5PMBZow7rHjJUcohGM25xFppskMwtUnp1udNsPbXQvY/cFkzi/p5xwfB7b
S4P10UuZTLBiHYDylphIjIQpf2ltQiXDcdxLGeeYnMVdQ4W5sJVqj39YfZmq+Au3
E2IwDuA6SyPIMuEbs+rxKHNl30QmaGhU4CmzOE6A6bgcZ9u4AbvSf1+3maeOrOQf
Erq3oMPhKemWXHdeTIZiufOjJZjph2qJfMNSzEoYnOO1edA+I9y5BkirngIwUOKX
iAl3Oh5f6GwcNuFeVCAW6xr1jMnRDQ3SJUk89wxfgxtZjTVUTjbbtegm97XirhSf
+QXXgVX9zpPYwbAVdwsCoSYi+Ne9XPj+ylixRKBzx6+4McnAdx3OllyfRhH7Hk53
BuXGmSdy9/n+093jyNzhdyQ/5U1lL2XrUXoNh79M/duBp6RI0jpet406Ui/Q96VB
mXKXG/0imRd/WCWR9KDzKjjWdNcToRcsfta7ZzeekJtFIab6e2+G65lIPALB3rNp
1vNfFEWWTjpHpzN2pmmaRQBbwRBpLUrr89wc2KENuHs7DnQu75O1u5SZLPGwEMEI
VuBxXQUKGxZkTA==
=hR0M
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"Three small fixes for interrupt core and drivers:
- The interrupt core fails to honor caller supplied affinity hints
for non-managed interrupts and uses the system default affinity on
startup instead. Set the missing flag in the descriptor to tell the
core to use the provided affinity.
- Fix a shift out of bounds error in the Xilinx driver
- Handle switching to level trigger correctly in the RISCV APLIC
driver. It failed to retrigger the interrupt which causes it to
become stale"
* tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration
irqchip/xilinx: Fix shift out of bounds
genirq/irqdesc: Honor caller provided affinity in alloc_desc()
Add the __counted_by compiler attribute to the flexible array member
records to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_records before adding a new pointer to the records array.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
- avoid a deadlock with dma-debug and netconsole (Rik van Riel)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAma3CHILHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPMEg/+M/k+Z+KcFgUNBmsyiPckRHAEBwzC+zpHWcaX0uEJ
xYy3F2JW/+ba2p8GzTAwb5gE1DZTmBp0GC9PFYaolRBhhoeRZnWXimO4OynxFf2b
vMrqyFixBYG3GeX+pnLFsT5WB+ZoZVknF/Fvxl9RmJjUh8p4KMJw7CLflu4xqsc6
6lX+IlcV637vZOToFXm2h5todAgqoatz1VhyLekGOL5BEUuDw8QjydqpXd7XAG+T
S0/rIga0fcmTjn+6+5RUzpcyaVAxy/KVzvHx731kjO/ZUswritxlSydZgtD0Tabj
z8N/3ge+TGvekpffSZ6K1EmldCypuQu/WRDlwxDx5LQu4DP4vwkUURToggiQPHlQ
7YP0roOLLfc5zgjQsmpzj/DmymFw+E3bFz6DRw+9f8ftt6rB58ICCO+YXjL4W4aL
wu5IrUsIwoc5W7nBkrlUQZbRTrTrvl62HbuO1pMIirZ4bntuJVYLyOTJ+n7RwYFl
TukNu5WlkHnHvnMt96TGW5lVKBTDGz1aUYUju41USyLpYCZYsKYrHiEAdf0WFB8q
WXprsL6JSSRZ+ukIvucHDdZlBptqaxrLtj3UeALPle05dq12ykG6KOix3FZGVAWA
0WD6SKUV7Z+Cs+WcCnW2zLNuq3NNTiSRCMSvPmSH7soxu3BLRUxPkwTTthgeGlFx
DZs=
=tNDn
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- avoid a deadlock with dma-debug and netconsole (Rik van Riel)
* tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: avoid deadlock between dma debug vs printk and netconsole
Devices can provide multiple interrupt lines. One reason for this is that
a device has multiple subfunctions, each providing its own interrupt line.
Another reason is that a device can be designed to be used (also) on a
system where some of the interrupts can be routed to another processor.
A line often further acts as a demultiplex for specific interrupts
and has it's respective set of interrupt (status, mask, ack, ...)
registers.
Regmap supports the handling of these registers and demultiplexing
interrupts, but the interrupt domain code ends up assigning the same name
for the per interrupt line domains. This causes a naming collision in the
debugFS code and leads to confusion, as /proc/interrupts shows two separate
interrupts with the same domain name and hardware interrupt number.
Instead of adding a workaround in regmap or driver code, allow giving a
name suffix for the domain name when the domain is created.
Add a name_suffix field in the irq_domain_info structure and make
irq_domain_instantiate() use this suffix if it is given when a domain is
created.
[ tglx: Adopt it to the cleanup patch and fixup the invalid NULL return ]
Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/871q2yvk5x.ffs@tglx
irq_domain_set_name() is truly unreadable gunk. Clean it up before adding
more.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com>
Link: https://lore.kernel.org/all/874j7uvkbm.ffs@tglx
irq_domain_create_simple() and irq_domain_create_legacy() use
__irq_domain_instantiate(), but have extra handling of allocating interrupt
descriptors and associating interrupts in them. Some of that is duplicated.
There are also call sites which have conditonals to invoke different
interrupt domain creator functions, where one of them is usually
irq_domain_create_legacy(). Alternatively they associate the interrupts for
the legacy case after creating the domain.
Moving the extra logic of irq_domain_create_simple()/legacy() into
__irq_domain_instantiate() allows to consolidate that.
Introduce hwirq_base and virq_base members in the irq_domain_info
structure, which allows to transport the required information and add the
conditional interrupt descriptor allocation and interrupt association into
__irq_domain_instantiate().
This reduces irq_domain_create_legacy() and irq_domain_create_simple() to
trivial wrappers which fill in the info structure and allows call sites
which must support the legacy case along with more modern mechanism to
select the domain type via the parameters of the info struct.
[ tglx: Massaged change log ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/32d07bd79eb2b5416e24da9e9e8fe5955423dcf9.1723120028.git.mazziesaccount@gmail.com
When running the following:
# cd /sys/kernel/tracing/
# echo 1 > events/sched/sched_waking/enable
# echo 1 > events/sched/sched_switch/enable
# echo 0 > tracing_on
# dd if=per_cpu/cpu0/trace_pipe_raw of=/tmp/raw0.dat
The dd task would get stuck in an infinite loop in the kernel. What would
happen is the following:
When ring_buffer_read_page() returns -1 (no data) then a check is made to
see if the buffer is empty (as happens when the page is not full), it will
call wait_on_pipe() to wait until the ring buffer has data. When it is it
will try again to read data (unless O_NONBLOCK is set).
The issue happens when there's a reader and the file descriptor is closed.
The wait_on_pipe() will return when that is the case. But this loop will
continue to try again and wait_on_pipe() will again return immediately and
the loop will continue and never stop.
Simply check if the file was closed before looping and exit out if it is.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240808235730.78bf63e5@rorschach.local.home
Fixes: 2aa043a55b ("tracing/ring-buffer: Fix wait_on_pipe() race")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It turns out that the WARN_ON_ONCE() call in css_release_work_fn
introduced by commit ab03125268 ("cgroup: Show # of subsystem CSSes
in cgroup.stat") is incorrect. Although css->nr_descendants must be
0 when a css is released and ready to be freed, the corresponding
cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem is activated
and deactivated multiple times with one or more of its previous
activation leaving behind dying csses.
Fix the incorrect warning by removing the cgrp->nr_dying_subsys check.
Fixes: ab03125268 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Closes: https://lore.kernel.org/cgroups/6f301773-2fce-4602-a391-8af7ef00b2fb@redhat.com/T/#t
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The recursive aes-arm-bs module load situation reported by Russell King
is getting fixed in the crypto layer, but this in the meantime fixes the
"recursive load hangs forever" by just making the waiting for the first
module load be interruptible.
This should now match the old behavior before commit 9b9879fc03
("modules: catch concurrent module loads, treat them as idempotent"),
which used the different "wait for module to be ready" code in
module_patient_check_exists().
End result: a recursive module load will still block, but now a signal
will interrupt it and fail the second module load, at which point the
first module will successfully complete loading.
Fixes: 9b9879fc03 ("modules: catch concurrent module loads, treat them as idempotent")
Cc: Russell King <linux@armlinux.org.uk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Have reading of event format files test if the meta data still exists.
When a event is freed, a flag (EVENT_FILE_FL_FREED) in the meta data is
set to state that it is to prevent any new references to it from happening
while waiting for existing references to close. When the last reference
closes, the meta data is freed. But the "format" was missing a check to
this flag (along with some other files) that allowed new references to
happen, and a use-after-free bug to occur.
- Have the trace event meta data use the refcount infrastructure instead
of relying on its own atomic counters.
- Have tracefs inodes use alloc_inode_sb() for allocation instead of
using kmem_cache_alloc() directly.
- Have eventfs_create_dir() return an ERR_PTR instead of NULL as
the callers expect a real object or an ERR_PTR.
- Have release_ei() use call_srcu() and not call_rcu() as all the
protection is on SRCU and not RCU.
- Fix ftrace_graph_ret_addr() to use the task passed in and not current.
- Fix overflow bug in get_free_elt() where the counter can overflow
the integer and cause an infinite loop.
- Remove unused function ring_buffer_nr_pages()
- Have tracefs freeing use the inode RCU infrastructure instead of
creating its own. When the kernel had randomize structure fields
enabled, the rcu field of the tracefs_inode was overlapping the
rcu field of the inode structure, and corrupting it. Instead,
use the destroy_inode() callback to do the initial cleanup of
the code, and then have free_inode() free it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZrTvXxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qu39AP9ze6ELpShDrxbXhf0adbNqG2IXMepa
MMLqfq8tU8E/vAEAuZXJ6rKXeGvKeONa06ocvWJ0dpb2cy/n4hmx+KtM5gI=
=Pkh4
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Have reading of event format files test if the metadata still exists.
When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata
is set to state that it is to prevent any new references to it from
happening while waiting for existing references to close. When the
last reference closes, the metadata is freed. But the "format" was
missing a check to this flag (along with some other files) that
allowed new references to happen, and a use-after-free bug to occur.
- Have the trace event meta data use the refcount infrastructure
instead of relying on its own atomic counters.
- Have tracefs inodes use alloc_inode_sb() for allocation instead of
using kmem_cache_alloc() directly.
- Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
callers expect a real object or an ERR_PTR.
- Have release_ei() use call_srcu() and not call_rcu() as all the
protection is on SRCU and not RCU.
- Fix ftrace_graph_ret_addr() to use the task passed in and not
current.
- Fix overflow bug in get_free_elt() where the counter can overflow the
integer and cause an infinite loop.
- Remove unused function ring_buffer_nr_pages()
- Have tracefs freeing use the inode RCU infrastructure instead of
creating its own.
When the kernel had randomize structure fields enabled, the rcu field
of the tracefs_inode was overlapping the rcu field of the inode
structure, and corrupting it. Instead, use the destroy_inode()
callback to do the initial cleanup of the code, and then have
free_inode() free it.
* tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracefs: Use generic inode RCU for synchronizing freeing
ring-buffer: Remove unused function ring_buffer_nr_pages()
tracing: Fix overflow in get_free_elt()
function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
eventfs: Use SRCU for freeing eventfs_inodes
eventfs: Don't return NULL in eventfs_create_dir()
tracefs: Fix inode allocation
tracing: Use refcount for trace_event_file reference counter
tracing: Have format file honor EVENT_FILE_FL_FREED
Assorted little stuff:
- lockdep fixup for lockdep_set_notrack_class()
- we can now remove a device when using erasure coding without
deadlocking, though we still hit other issues
- the "allocator stuck" timeout is now configurable, and messages are
ratelimited; default timeout has been increased from 10 seconds to 30
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma05JwACgkQE6szbY3K
bnYE+Q//VJ6R/UxDoxjk8zgftRcdgwXod6U+/E0Kj3ZBKLYXkcGaWWmmGMkFafBp
eL7Y3wtHSKiMsHYX9KEdFUZFLe1KI4c16RgNIXk9nwkF+3/+8pEDHKPFuoGHJH3O
HComHGqwVg8Zx2jRNvEkvQ980iH7OBGhCjMFXhJ3xbMGLdw91TQQi49a+Q/vz7QT
y3Cl1dgX5xBl7fqKefsYa+X6mpWi4/6t60vJvatI+bvDfznjI6jN3qGVLlQye7tC
6VbJAjHsPPyNMlWa99UaHqDdaM325zR2ES0bsfHd8Up4iAwO8OgjzYQxpYTgi51i
6DTiGEOV2S8gF+Rnprnbzsnau0hEvrtQY2Ub85TCIGbZJa8b+aDIlq9k8jF36O2E
2CUTleQ/E129RxXpkZGsVRpNmemdCi6rHAcluaFEgezX4FJH8BVOwQQq2Xz7rd7E
3ZP5iAWmX0IgOL0VOCP/ZXl/lEMwSk0VAED3jEbT7f2K7rU9nXDO2bIEx1wXDCm1
b32kvmUi2FBjqLHSqvAPEb52tvvZuliMUY7z9dEx+AX9AVC9kGE+amGexosKb/LY
nWzey+D0cKHtgbkMFrCClkpg75Tnt9ISJbad53+5qhN8an/a71djdj8Zk0UQnQjv
6Amv4Ns1lDo3XGC1QtYkF5HqiWaupbUXAftptpS4Av4X1zZEQIc=
=q1dD
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Assorted little stuff:
- lockdep fixup for lockdep_set_notrack_class()
- we can now remove a device when using erasure coding without
deadlocking, though we still hit other issues
- the 'allocator stuck' timeout is now configurable, and messages are
ratelimited. The default timeout has been increased from 10 seconds
to 30"
* tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
bcachefs: Make allocator stuck timeout configurable, ratelimit messages
bcachefs: Add missing path_traverse() to btree_iter_next_node()
bcachefs: ec should not allocate from ro devs
bcachefs: Improved allocator debugging for ec
bcachefs: Add missing bch2_trans_begin() call
bcachefs: Add a comment for bucket helper types
bcachefs: Don't rely on implicit unsigned -> signed integer conversion
lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
bcachefs: Fix double free of ca->buckets_nouse
Russell King reported that the arm cbc(aes) crypto module hangs when
loaded, and Herbert Xu bisected it to commit 9b9879fc03 ("modules:
catch concurrent module loads, treat them as idempotent"), and noted:
"So what's happening here is that the first modprobe tries to load a
fallback CBC implementation, in doing so it triggers a load of the
exact same module due to module aliases.
IOW we're loading aes-arm-bs which provides cbc(aes). However, this
needs a fallback of cbc(aes) to operate, which is made out of the
generic cbc module + any implementation of aes, or ecb(aes). The
latter happens to also be provided by aes-arm-cb so that's why it
tries to load the same module again"
So loading the aes-arm-bs module ends up wanting to recursively load
itself, and the recursive load then ends up waiting for the original
module load to complete.
This is a regression, in that it used to be that we just tried to load
the module multiple times, and then as we went on to install it the
second time we would instead just error out because the module name
already existed.
That is actually also exactly what the original "catch concurrent loads"
patch did in commit 9828ed3f69 ("module: error out early on concurrent
load of the same module file"), but it turns out that it ends up being
racy, in that erroring out before the module has been fully initialized
will cause failures in dependent module loading.
See commit ac2263b588 (which was the revert of that "error out early")
commit for details about why erroring out before the module has been
initialized is actually fundamentally racy.
Now, for the actual recursive module load (as opposed to just
concurrently loading the same module twice), the race is not an issue.
At the same time it's hard for the kernel to see that this is recursion,
because the module load is always done from a usermode helper, so the
recursion is not some simple callchain within the kernel.
End result: this is not the real fix, but this at least adds a warning
for the situation (admittedly much too late for all the debugging pain
that Russell and Herbert went through) and if we can come to a
resolution on how to detect the recursion properly, this re-organizes
the code to make that easier.
Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
Reported-by: Russell King <linux@armlinux.org.uk>
Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
module.c was renamed to main.c, but the Makefile directive was copy-pasted
verbatim with the old file name. Fix up the file name.
Fixes: cfc1d27789 ("module: Move all into module/")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/bc0cf790b4839c5e38e2fafc64271f620568a39e.1718092070.git.dvyukov@google.com
Add a boot self test that can catch sprious coverage from interrupts.
The coverage callback filters out interrupt code, but only after the
handler updates preempt count. Some code periodically leaks out
of that section and leads to spurious coverage.
Add a best-effort (but simple) test that is likely to catch such bugs.
If the test is enabled on CI systems that use KCOV, they should catch
any issues fast.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/all/7662127c97e29da1a748ad1c1539dd7b65b737b2.1718092070.git.dvyukov@google.com
aren't considered necessary for earlier kernels. 5 are MM and 4 are
non-MM. No identifiable theme here - please see the individual changelogs.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZrQhyAAKCRDdBJ7gKXxA
jvLLAP46sQ/HspAbx+5JoeKBTiX6XW4Hfd+MAk++EaTAyAhnxQD+Mfq7rPOIHm/G
wiXPVvLO8FEx0lbq06rnXvdotaWFrQg=
=mlE4
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Nine hotfixes. Five are cc:stable, the others either pertain to
post-6.10 material or aren't considered necessary for earlier kernels.
Five are MM and four are non-MM. No identifiable theme here - please
see the individual changelogs"
* tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
padata: Fix possible divide-by-0 panic in padata_mt_helper()
mailmap: update entry for David Heidelberg
memcg: protect concurrent access to mem_cgroup_idr
mm: shmem: fix incorrect aligned index when checking conflicts
mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
mm: list_lru: fix UAF for memory cgroup
kcov: properly check for softirq context
MAINTAINERS: Update LTP members and web
selftests: mm: add s390 to ARCH check
There is is no point in doing the perf_pmu_disable() dance just to do
nothing. This happens for ctx_sched_out(.type = EVENT_TIME) for
instance.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240807115550.392851915@infradead.org
Many of the the context reschedule users are of the form:
ctx_sched_out(.type = EVENT_TIME);
... modify context
ctx_resched();
With the idea that the whole reschedule happens with a single
time-stamp, rather than with each ctx_sched_out() advancing time and
ctx_sched_in() re-starting time, creating a non-atomic experience.
However, Kan noticed that since this completely stops time, it
actually looses a bit of time between the stop and start. Worse, now
that we can do partial (per PMU) reschedules, the PMUs that are not
scheduled out still observe the time glitch.
Replace this with:
ctx_time_freeze();
... modify context
ctx_resched();
With the assumption that this happens in a perf_ctx_lock() /
perf_ctx_unlock() pair.
The new ctx_time_freeze() will update time and sets EVENT_FROZEN, and
ensures EVENT_TIME and EVENT_FROZEN remain set, this avoids
perf_event_time_now() from observing a time wobble from not seeing
EVENT_TIME for a little while.
Additionally, this avoids loosing time between
ctx_sched_out(EVENT_TIME) and ctx_sched_in(), which would re-set the
timestamp.
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240807115550.250637571@infradead.org
All the event_function/@func call context already uses perf_ctx_lock()
except for the !ctx->is_active case. Make it all consistent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240807115550.138301094@infradead.org
The context time update code is repeated verbatim a few times.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240807115550.031212518@infradead.org
Currently re-scheduling a context will reschedule all active PMUs for
that context, even if it is known only a single event is added.
Namhyung reported that changing this to only reschedule the affected
PMU when possible provides significant performance gains under certain
conditions.
Therefore, allow partial context reschedules for a specific PMU, that
of the event modified.
While the patch looks somewhat noisy, it mostly just propagates a new
@pmu argument through the callchain and modifies the epc loop to only
pick the 'epc->pmu == @pmu' case.
Reported-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240807115549.920950699@infradead.org
We are hit with a not easily reproducible divide-by-0 panic in padata.c at
bootup time.
[ 10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
[ 10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
[ 10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
[ 10.017908] Workqueue: events_unbound padata_mt_helper
[ 10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
:
[ 10.017963] Call Trace:
[ 10.017968] <TASK>
[ 10.018004] ? padata_mt_helper+0x39/0xb0
[ 10.018084] process_one_work+0x174/0x330
[ 10.018093] worker_thread+0x266/0x3a0
[ 10.018111] kthread+0xcf/0x100
[ 10.018124] ret_from_fork+0x31/0x50
[ 10.018138] ret_from_fork_asm+0x1a/0x30
[ 10.018147] </TASK>
Looking at the padata_mt_helper() function, the only way a divide-by-0
panic can happen is when ps->chunk_size is 0. The way that chunk_size is
initialized in padata_do_multithreaded(), chunk_size can be 0 when the
min_chunk in the passed-in padata_mt_job structure is 0.
Fix this divide-by-0 panic by making sure that chunk_size will be at least
1 no matter what the input parameters are.
Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
Fixes: 004ed42638 ("padata: add basic support for multithreaded jobs")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When collecting coverage from softirqs, KCOV uses in_serving_softirq() to
check whether the code is running in the softirq context. Unfortunately,
in_serving_softirq() is > 0 even when the code is running in the hardirq
or NMI context for hardirqs and NMIs that happened during a softirq.
As a result, if a softirq handler contains a remote coverage collection
section and a hardirq with another remote coverage collection section
happens during handling the softirq, KCOV incorrectly detects a nested
softirq coverate collection section and prints a WARNING, as reported by
syzbot.
This issue was exposed by commit a7f3813e58 ("usb: gadget: dummy_hcd:
Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using
hrtimer and made the timer's callback be executed in the hardirq context.
Change the related checks in KCOV to account for this behavior of
in_serving_softirq() and make KCOV ignore remote coverage collection
sections in the hardirq and NMI contexts.
This prevents the WARNING printed by syzbot but does not fix the inability
of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd
is in use (caused by a7f3813e58); a separate patch is required for that.
Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev
Fixes: 5ff3b30ab5 ("kcov: collect coverage from interrupts")
Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac
Acked-by: Marco Elver <elver@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marcello Sylvester Bauer <sylv@sylv.io>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Because ring_buffer_nr_pages() is not an inline function and user accesses
buffer->buffers[cpu]->nr_pages directly, the function ring_buffer_nr_pages
is removed.
Signed-off-by: Jianhui Zhou <912460177@qq.com>
Link: https://lore.kernel.org/tencent_F4A7E9AB337F44E0F4B858D07D19EF460708@qq.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
"tracing_map->next_elt" in get_free_elt() is at risk of overflowing.
Once it overflows, new elements can still be inserted into the tracing_map
even though the maximum number of elements (`max_elts`) has been reached.
Continuing to insert elements after the overflow could result in the
tracing_map containing "tracing_map->max_size" elements, leaving no empty
entries.
If any attempt is made to insert an element into a full tracing_map using
`__tracing_map_insert()`, it will cause an infinite loop with preemption
disabled, leading to a CPU hang problem.
Fix this by preventing any further increments to "tracing_map->next_elt"
once it reaches "tracing_map->max_elt".
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Instead of using an atomic counter for the trace_event_file reference
counter, use the refcount interface. It has various checks to make sure
the reference counting is correct, and will warn if it detects an error
(like refcount_inc() on '0').
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240726144208.687cce24@rorschach.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When eventfs was introduced, special care had to be done to coordinate the
freeing of the file meta data with the files that are exposed to user
space. The file meta data would have a ref count that is set when the file
is created and would be decremented and freed after the last user that
opened the file closed it. When the file meta data was to be freed, it
would set a flag (EVENT_FILE_FL_FREED) to denote that the file is freed,
and any new references made (like new opens or reads) would fail as it is
marked freed. This allowed other meta data to be freed after this flag was
set (under the event_mutex).
All the files that were dynamically created in the events directory had a
pointer to the file meta data and would call event_release() when the last
reference to the user space file was closed. This would be the time that it
is safe to free the file meta data.
A shortcut was made for the "format" file. It's i_private would point to
the "call" entry directly and not point to the file's meta data. This is
because all format files are the same for the same "call", so it was
thought there was no reason to differentiate them. The other files
maintain state (like the "enable", "trigger", etc). But this meant if the
file were to disappear, the "format" file would be unaware of it.
This caused a race that could be trigger via the user_events test (that
would create dynamic events and free them), and running a loop that would
read the user_events format files:
In one console run:
# cd tools/testing/selftests/user_events
# while true; do ./ftrace_test; done
And in another console run:
# cd /sys/kernel/tracing/
# while true; do cat events/user_events/__test_event/format; done 2>/dev/null
With KASAN memory checking, it would trigger a use-after-free bug report
(which was a real bug). This was because the format file was not checking
the file's meta data flag "EVENT_FILE_FL_FREED", so it would access the
event that the file meta data pointed to after the event was freed.
After inspection, there are other locations that were found to not check
the EVENT_FILE_FL_FREED flag when accessing the trace_event_file. Add a
new helper function: event_file_file() that will make sure that the
event_mutex is held, and will return NULL if the trace_event_file has the
EVENT_FILE_FL_FREED flag set. Have the first reference of the struct file
pointer use event_file_file() and check for NULL. Later uses can still use
the event_file_data() helper function if the event_mutex is still held and
was not released since the event_file_file() call.
Link: https://lore.kernel.org/all/20240719204701.1605950-1-minipli@grsecurity.net/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Ilkka Naulapää <digirigawa@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Vasavi Sirnapalli <vasavi.sirnapalli@broadcom.com>
Link: https://lore.kernel.org/20240730110657.3b69d3c1@gandalf.local.home
Fixes: b63db58e2f ("eventfs/tracing: Add callback for release of an eventfs_inode")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, whenever a caller is providing an affinity hint for an
interrupt, the allocation code uses it to calculate the node and copies the
cpumask into irq_desc::affinity.
If the affinity for the interrupt is not marked 'managed' then the startup
of the interrupt ignores irq_desc::affinity and uses the system default
affinity mask.
Prevent this by setting the IRQD_AFFINITY_SET flag for the interrupt in the
allocator, which causes irq_setup_affinity() to use irq_desc::affinity on
interrupt startup if the mask contains an online CPU.
[ tglx: Massaged changelog ]
Fixes: 45ddcecbfa ("genirq: Use affinity hint in irqdesc allocation")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20240806072044.837827-1-shayd@nvidia.com
When we want to debug the workqueue stall, we can immediately make
a panic to get the information we want.
In some systems, it may be necessary to quickly reboot the system to
escape from a workqueue lockup situation. In this case, we can control
the number of stall detections to generate panic.
workqueue.panic_on_stall sets the number times of the stall to trigger
panic. 0 disables the panic on stall.
Signed-off-by: Sangmoon Kim <sangmoon.kim@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently the dma debugging code can end up indirectly calling printk
under the radix_lock. This happens when a radix tree node allocation
fails.
This is a problem because the printk code, when used together with
netconsole, can end up inside the dma debugging code while trying to
transmit a message over netcons.
This creates the possibility of either a circular deadlock on the same
CPU, with that CPU trying to grab the radix_lock twice, or an ABBA
deadlock between different CPUs, where one CPU grabs the console lock
first and then waits for the radix_lock, while the other CPU is holding
the radix_lock and is waiting for the console lock.
The trace captured by lockdep is of the ABBA variant.
-> #2 (&dma_entry_hash[i].lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x5a/0x90
debug_dma_map_page+0x79/0x180
dma_map_page_attrs+0x1d2/0x2f0
bnxt_start_xmit+0x8c6/0x1540
netpoll_start_xmit+0x13f/0x180
netpoll_send_skb+0x20d/0x320
netpoll_send_udp+0x453/0x4a0
write_ext_msg+0x1b9/0x460
console_flush_all+0x2ff/0x5a0
console_unlock+0x55/0x180
vprintk_emit+0x2e3/0x3c0
devkmsg_emit+0x5a/0x80
devkmsg_write+0xfd/0x180
do_iter_readv_writev+0x164/0x1b0
vfs_writev+0xf9/0x2b0
do_writev+0x6d/0x110
do_syscall_64+0x80/0x150
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #0 (console_owner){-.-.}-{0:0}:
__lock_acquire+0x15d1/0x31a0
lock_acquire+0xe8/0x290
console_flush_all+0x2ea/0x5a0
console_unlock+0x55/0x180
vprintk_emit+0x2e3/0x3c0
_printk+0x59/0x80
warn_alloc+0x122/0x1b0
__alloc_pages_slowpath+0x1101/0x1120
__alloc_pages+0x1eb/0x2c0
alloc_slab_page+0x5f/0x150
new_slab+0x2dc/0x4e0
___slab_alloc+0xdcb/0x1390
kmem_cache_alloc+0x23d/0x360
radix_tree_node_alloc+0x3c/0xf0
radix_tree_insert+0xf5/0x230
add_dma_entry+0xe9/0x360
dma_map_page_attrs+0x1d2/0x2f0
__bnxt_alloc_rx_frag+0x147/0x180
bnxt_alloc_rx_data+0x79/0x160
bnxt_rx_skb+0x29/0xc0
bnxt_rx_pkt+0xe22/0x1570
__bnxt_poll_work+0x101/0x390
bnxt_poll+0x7e/0x320
__napi_poll+0x29/0x160
net_rx_action+0x1e0/0x3e0
handle_softirqs+0x190/0x510
run_ksoftirqd+0x4e/0x90
smpboot_thread_fn+0x1a8/0x270
kthread+0x102/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x11/0x20
This bug is more likely than it seems, because when one CPU has run out
of memory, chances are the other has too.
The good news is, this bug is hidden behind the CONFIG_DMA_API_DEBUG, so
not many users are likely to trigger it.
Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Konstantin Ovsepian <ovs@meta.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
cpu_pwq is used in various percpu functions that expect variable in
__percpu address space. Correct the declaration of cpu_pwq to
struct pool_workqueue __rcu * __percpu *cpu_pwq
to declare the variable as __percpu pointer.
The patch also fixes following sparse errors:
workqueue.c:380:37: warning: duplicate [noderef]
workqueue.c:380:37: error: multiple address spaces given: __rcu & __percpu
workqueue.c:2271:15: error: incompatible types in comparison expression (different address spaces):
workqueue.c:2271:15: struct pool_workqueue [noderef] __rcu *
workqueue.c:2271:15: struct pool_workqueue [noderef] __percpu *
and uncovers a couple of exisiting "incorrect type in assignment"
warnings (from __rcu address space), which this patch does not address.
Found by GCC's named address space checks.
There were no changes in the resulting object files.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When flushing a work item for cancellation, __flush_work() knows that it
exclusively owns the work item through its PENDING bit. 134874e2ee
("workqueue: Allow cancel_work_sync() and disable_work() from atomic
contexts on BH work items") added a read of @work->data to determine whether
to use busy wait for BH work items that are being canceled. While the read
is safe when @from_cancel, @work->data was read before testing @from_cancel
to simplify code structure:
data = *work_data_bits(work);
if (from_cancel &&
!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) && (data & WORK_OFFQ_BH)) {
While the read data was never used if !@from_cancel, this could trigger
KCSAN data race detection spuriously:
==================================================================
BUG: KCSAN: data-race in __flush_work / __flush_work
write to 0xffff8881223aa3e8 of 8 bytes by task 3998 on cpu 0:
instrument_write include/linux/instrumented.h:41 [inline]
___set_bit include/asm-generic/bitops/instrumented-non-atomic.h:28 [inline]
insert_wq_barrier kernel/workqueue.c:3790 [inline]
start_flush_work kernel/workqueue.c:4142 [inline]
__flush_work+0x30b/0x570 kernel/workqueue.c:4178
flush_work kernel/workqueue.c:4229 [inline]
...
read to 0xffff8881223aa3e8 of 8 bytes by task 50 on cpu 1:
__flush_work+0x42a/0x570 kernel/workqueue.c:4188
flush_work kernel/workqueue.c:4229 [inline]
flush_delayed_work+0x66/0x70 kernel/workqueue.c:4251
...
value changed: 0x0000000000400000 -> 0xffff88810006c00d
Reorganize the code so that @from_cancel is tested before @work->data is
accessed. The only problem is triggering KCSAN detection spuriously. This
shouldn't need READ_ONCE() or other access qualifiers.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot+b3e4f2f51ed645fd5df2@syzkaller.appspotmail.com
Fixes: 134874e2ee ("workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items")
Link: http://lkml.kernel.org/r/000000000000ae429e061eea2157@google.com
Cc: Jens Axboe <axboe@kernel.dk>
Missing, failed, or corrupted core dumps might impede crash
investigations. To improve reliability of that process and consequently
the programs themselves, one needs to trace the path from producing
a core dumpfile to analyzing it. That path starts from the core dump file
written to the disk by the kernel or to the standard input of a user
mode helper program to which the kernel streams the coredump contents.
There are cases where the kernel will interrupt writing the core out or
produce a truncated/not-well-formed core dump without leaving a note.
Add logging for the core dump collection failure paths to be able to reason
what has gone wrong when the core dump is malformed or missing.
Report the size of the data written to aid in diagnosing the user mode
helper.
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240718182743.1959160-3-romank@linux.microsoft.com
Signed-off-by: Kees Cook <kees@kernel.org>
With the previous commit that eliminates the overlapping partition
root corner cases in the hotplug code, the partition roots passed down
to generate_sched_domains() should not have overlapping CPUs. Enable
overlapping cpuset check for v2 and warn if that happens.
This patch also has the benefit of increasing test coverage of the new
Union-Find cpuset merging code to cgroup v2.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup/for-6.12 is about to receive updates that are dependent on changes
from both for-6.11-fixes and for-6.12. Pull in for-6.11-fixes.
Signed-off-by: Tejun Heo <tj@kernel.org>
It was found that some hotplug operations may cause multiple
rebuild_sched_domains_locked() calls. Some of those intermediate calls
may use cpuset states not in the final correct form leading to incorrect
sched domain setting.
Fix this problem by using the existing force_rebuild flag to inhibit
immediate rebuild_sched_domains_locked() calls if set and only doing
one final call at the end. Also renaming the force_rebuild flag to
force_sd_rebuild to make its meaning for clear.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for
v2") adds a user writable cpuset.cpus.exclusive file for setting
exclusive CPUs to be used for the creation of partitions. Since then
effective_xcpus depends on both the cpuset.cpus and cpuset.cpus.exclusive
setting. If cpuset.cpus.exclusive is set, effective_xcpus will depend
only on cpuset.cpus.exclusive. When it is not set, effective_xcpus
will be set according to the cpuset.cpus value when the cpuset becomes
a valid partition root.
When cpuset.cpus is being cleared by the user, effective_xcpus should
only be cleared when cpuset.cpus.exclusive is not set. However, that
is not currently the case.
# cd /sys/fs/cgroup/
# mkdir test
# echo +cpuset > cgroup.subtree_control
# cd test
# echo 3 > cpuset.cpus.exclusive
# cat cpuset.cpus.exclusive.effective
3
# echo > cpuset.cpus
# cat cpuset.cpus.exclusive.effective // was cleared
Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is
not set.
Fixes: e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Cc: stable@vger.kernel.org # v6.7+
Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
According to the implementation of cgroup_css_set_fork(), it will fail
if cset cannot be found and the can_fork/cancel_fork methods will not
be called in this case, which means that the argument 'cset' for these
methods must not be NULL, so remove the unrechable paths in them.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The addition of the bases argument to clock_was_set() fixed up all call
sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME
instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0.
As a result the effect of that clock_was_set() notification is incomplete
and might result in timers expiring late because the hrtimer code does
not re-evaluate the affected clock bases.
Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code
which clock bases need to be re-evaluated.
Fixes: 17a1b8826b ("hrtimer: Add bases argument to clock_was_set()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/877ccx7igo.ffs@tglx
Using syzkaller with the recently reintroduced signed integer overflow
sanitizer produces this UBSAN report:
UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
9223372036854775806 + 4 cannot be represented in type 'long'
Call Trace:
handle_overflow+0x171/0x1b0
__do_adjtimex+0x1236/0x1440
do_adjtimex+0x2be/0x740
The user supplied time_constant value is incremented by four and then
clamped to the operating range.
Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping after incrementing which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 4' operation.
The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.
Similar to the fixups for time_maxerror and time_esterror, clamp the user
space supplied value to the operating range.
[ tglx: Switch to clamping ]
Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com
Closes: https://github.com/KSPP/linux/issues/352
Using syzkaller alongside the newly reintroduced signed integer overflow
sanitizer spits out this report:
UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
9223372036854775807 + 500 cannot be represented in type 'long'
Call Trace:
handle_overflow+0x171/0x1b0
second_overflow+0x2d6/0x500
accumulate_nsecs_to_secs+0x60/0x160
timekeeping_advance+0x1fe/0x890
update_wall_time+0x10/0x30
time_maxerror is unconditionally incremented and the result is checked
against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting
in wrap-around to negative space.
Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping in handle_overflow() which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 500' operation.
The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.
Miroslav confirmed that the input value should be clamped to the operating
range and the same applies to time_esterror. The latter is not used by the
kernel, but the value still should be in the operating range as it was
before the sanity check got removed.
Clamp them to the operating range.
[ tglx: Changed it to clamping and included time_esterror ]
Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com
Closes: https://github.com/KSPP/linux/issues/354
Since str_has_prefix() takes the prefix as the 2nd argument and the string
as the first, is_cfi_preamble_symbol() always fails to check the prefix.
Fix the function parameter order so that it correctly check the prefix.
Link: https://lore.kernel.org/all/172260679559.362040.7360872132937227206.stgit@devnote2/
Fixes: de02f2ac5d ("kprobes: Prohibit probing on CFI preamble symbol")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The kernel sleep profile is no longer working due to a recursive locking
bug introduced by commit 42a20f86dc ("sched: Add wrapper for get_wchan()
to keep task blocked")
Booting with the 'profile=sleep' kernel command line option added or
executing
# echo -n sleep > /sys/kernel/profiling
after boot causes the system to lock up.
Lockdep reports
kthreadd/3 is trying to acquire lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70
but task is already holding lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370
with the call trace being
lock_acquire+0xc8/0x2f0
get_wchan+0x32/0x70
__update_stats_enqueue_sleeper+0x151/0x430
enqueue_entity+0x4b0/0x520
enqueue_task_fair+0x92/0x6b0
ttwu_do_activate+0x73/0x140
try_to_wake_up+0x213/0x370
swake_up_locked+0x20/0x50
complete+0x2f/0x40
kthread+0xfb/0x180
However, since nobody noticed this regression for more than two years,
let's remove 'profile=sleep' support based on the assumption that nobody
needs this functionality.
Fixes: 42a20f86dc ("sched: Add wrapper for get_wchan() to keep task blocked")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed in testing as some compilers hoist the access into
the non-preemotible section where the pointer is actually used, but
obviously compilers can rightfully invoke it where the code put it.
Move it into the non-preemptible section right to the actual usage
side to cure it.
- The clocksource watchdog is supposed to emit a warning when the retry
count is greater than one and the number of retries reaches the
limit. The condition is backwards and warns always when the count is
greater than one. Fixup the condition to prevent spamming dmesg.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmavdtQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofrtD/95Ck3FgHRdxZWlnIwBptzW1ApfTjKa
fuwOHBAcFzpNx13DcSyKqclMIM2QxN2lAjAv3m5IeNFO7RN5Ru1aOskPpFMgQIzj
6UfKqvtuSZeCPIqspeN9/RAnqKTRYAFRZcSnE8FcFxuM6dU9zlnLjms1gstTyLS3
HeoQoUe1DT6IpKUDKdvMP8JiwU6/i+xHAeVizEkGZ5Rxo67+UDUqqfcpBr/pIAIN
W0KfekPVLGjvL19gvmaHrPBHi5tEjM1+7HiepKeyc9GjC1FHYGQjNLxXpri3CcI+
VZmya37ZVLKoOFjvHuqmCYzFWrEs1rEfgnBeCglV5lvwxfgPoOk9awpUAR4IWkEz
HMagUUre3kThEztoPzyf7apJmltVC7U++gRfW0i7p/gSfwF9AYAPgWAcg6VyDrxn
hIbKkQvqLNM1ldXWS0tG/scgEAKEM7yYG9BP03ac/mGdFNGa6yucG986ElHoVLSR
S8Dw1E7/F7G5KOqVK6i25JLFgN0ZJNRWMWbd95VBEuZcZ4fzIKug3intNeSUSjDc
zfvKfu65nRr2bHcaxs5MkPqkDqFOVytoQqgYYqstUZ4bRyeI6px/Rmu58VgJF2cP
DmUvf6gqNXbg3g6ijQmhOiCTqzjW67bxFbYDuX68oLfBDmdVPP8MjZOJHV7/ukvp
mjcFL/MYjg6/iw==
=lI2j
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for the timer/clocksource code:
- The recent fix to make the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed in testing as some compilers hoist the access
into the non-preemotible section where the pointer is actually
used, but obviously compilers can rightfully invoke it where the
code put it.
Move it into the non-preemptible section right to the actual usage
side to cure it.
- The clocksource watchdog is supposed to emit a warning when the
retry count is greater than one and the number of retries reaches
the limit.
The condition is backwards and warns always when the count is
greater than one. Fixup the condition to prevent spamming dmesg"
* tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
tick/broadcast: Move per CPU pointer access into the atomic section
- When stime is larger than rtime due to accounting imprecision, then
utime = rtime - stime becomes negative. As this is unsigned math, the
result becomes a huge positive number. Cure it by resetting stime to
rtime in that case, so utime becomes 0.
- Restore consistent state when sched_cpu_deactivate() fails.
When offlining a CPU fails in sched_cpu_deactivate() after the SMT
present counter has been decremented, then the function aborts but
fails to increment the SMT present counter and leaves it imbalanced.
Consecutive operations cause it to underflow. Add the missing fixup
for the error path.
As SMT accounting the runqueue needs to marked online again in the
error exit path to restore consistent state.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmavdT8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodVrEACDLJdjkM2n3T7EL8YjuBjkCW3dGWAZ
umpJGjwMDsT9/oLIU7B1wgX/IdWppssQa+0yXxZy7cKQvfP5VTd4fueuub2k5sJc
yDy5J8N0xRYvOhA0lrnp6jyqhhCZzIGDmSn3G+lDuQuuffaqfFbPkeMwoXmewiyt
72ajFsjeo7q25pm8ALgBhrSKfO5FFV1HJoAyoYKEyT5E/pliKNWrzbcA+PWstMK3
DWmj8dgYk6g/dBwNl6wORlpmcxjcDO65icH5XPSsadwosHe7q1+quIJSqMDyXHNY
qQ5r5f9bvXdq5DPKRON0GJb9gfSQNX5yE/pKdyW75mqHMxJ/pnIIds6h6mLHBewt
eZ5M1a/v8o+QiqQcDogk5DUzZlI46bKZsYLqU9y6v/WgUqa5C4BaEJT7CrQk+6wp
xUB4g3j/+asih55Tq9HKo6PEY8NLj4ytKHgeh0EvEllDxGmnRYR+PEdzLBjuWlAY
ka2/1vaNr/r5grbpQhO6N4txUAASoKF6nx1hq05I/lY45KA+RgeU0mgEN07Pa6HZ
4873Q2CnVUlvMVFulOUkJogGNk7KTDb3e7/+BMsA9Lda/2KmqaOLEh5T6egdLZ0G
feb/UQ6hoYcCD0IAsj9MfEOS3IVhOvtkJSwwLi/j09ucmC+5Ar3v3/Aw1EtTHJHm
ObdoEXJC98RLFA==
=b/q0
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- When stime is larger than rtime due to accounting imprecision, then
utime = rtime - stime becomes negative. As this is unsigned math, the
result becomes a huge positive number.
Cure it by resetting stime to rtime in that case, so utime becomes 0.
- Restore consistent state when sched_cpu_deactivate() fails.
When offlining a CPU fails in sched_cpu_deactivate() after the SMT
present counter has been decremented, then the function aborts but
fails to increment the SMT present counter and leaves it imbalanced.
Consecutive operations cause it to underflow. Add the missing fixup
for the error path.
For SMT accounting the runqueue needs to marked online again in the
error exit path to restore consistent state.
* tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
sched/core: Introduce sched_set_rq_on/offline() helper
sched/smt: Fix unbalance sched_smt_present dec/inc
sched/smt: Introduce sched_smt_present_inc/dec() helper
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
- Ensure that the atomic_cmpxchg() conditions are correct and evaluating
to true on any non-zero value except 1. The missing check of the
return value leads to inconsisted state of the jump label counter.
- Add a missing type conversion in the paravirt spinlock code which
makes loongson build again.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmavcdsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofABD/9AJA5feiAhwbidCafFihfiL1yxzUoy
PLvbZ3YME9N0dVen6LKQnGyA68eXAzHQFXunWSIqMHPxC/L35AJTJ2Qjx9d+vXZu
EigtYox9hkR19ZH1VH/yAOZcc2fvTYPvD/hQ8Wqd/5nEfa8nQq8k5i1/GOP9zZ/Y
LtgbPT98FGG3eUHFWxmINv2Ws3y4iNZLT6tmxUrhoTlcojMEuHvEPmdO0KYorOzS
ri0/OySk5j438LX59rucP53vIFr1yUg2uFqkrV8ru9PqGm0lHVmqG3YkVh9A1VWA
huuXZJba6ixjDtUFyeg9ksW0M5jJxAkl4XQ0suiLt8ySZPTA7LDKq0Py037RECfX
jnmY4gvOgv43mvZfgcThtgzqqxO/Jg8IATve8ljKQKOklhQ/A8B/wJgbzVhIMARQ
xczqB6iM1BuJmfwPUrwX8ibfgATo+HCSlyS+Sob3335Tap/XOjB6cdx1+V8OAWkn
VlTzTJpYXOlOq0JtkIC71pzEqSGmAqscPinwBPj+ZHaUd21lvFxpdXi6r+APjPiY
LsneKztQAfQk/DToYddbqisMcGdMxjcdifr4AtlW03XAdEd//G3NjE13TASaCL33
snRPkzARFUf56bW/isIsQFi/kEax3CZ438O7MN0nkelaURLcDtMYJOM+y3ncybmo
Pr3H/be2I/qmtg==
=mbEV
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two fixes for locking and jump labels:
- Ensure that the atomic_cmpxchg() conditions are correct and
evaluating to true on any non-zero value except 1. The missing
check of the return value leads to inconsisted state of the jump
label counter.
- Add a missing type conversion in the paravirt spinlock code which
makes loongson build again"
* tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_label: Fix the fix, brown paper bags galore
locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()
Currently, the coredump_task_exit() function sets the task state
to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.
But a combination of large memory and slow (and/or highly contended)
mass storage can cause application core dumps to take more than
two minutes, which can cause check_hung_task(), which is invoked by
check_hung_uninterruptible_tasks(), to produce task-blocked splats.
There does not seem to be any reasonable benefit to getting these splats.
Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could
be misleading because the task sleeping in coredump_task_exit() really
is killable, albeit indirectly. See the check of signal->core_state
in prepare_signal() and the check of fatal_signal_pending()
in dump_interrupted(), which bypass the normal unkillability of
TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking
wake_up_process() on any threads sleeping in coredump_task_exit().
Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
Reported-by: Anhad Jai Singh <ffledgling@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Right now, cs_watchdog_read() does clocksource sanity checks based
on WATCHDOG_MAX_SKEW, which sets a floor on any clocksource's
.uncertainty_margin. These sanity checks can therefore act
inappropriately for clocksources with large uncertainty margins.
One reason for a clocksource to have a large .uncertainty_margin is when
that clocksource has long read-out latency, given that it does not make
sense for the .uncertainty_margin to be smaller than the read-out latency.
With the current checks, cs_watchdog_read() could reject all normal
reads from a clocksource with long read-out latencies, such as those
from legacy clocksources that are no longer implemented in hardware.
Therefore, recast the cs_watchdog_read() checks in terms of the
.uncertainty_margin values of the clocksources involved in the timespan in
question. The first covers two watchdog reads and one cs read, so use
twice the watchdog .uncertainty_margin plus that of the cs. The second
covers only a pair of watchdog reads, so use twice the watchdog
.uncertainty_margin.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-4-paulmck@kernel.org
The WATCHDOG_THRESHOLD macro is no longer used to supply a default value
for ->uncertainty_margin, but WATCHDOG_MAX_SKEW now is.
Therefore, update the comments to reflect this change.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-3-paulmck@kernel.org
Add more detail on the rationale for bounding the clocksource
->uncertainty_margin below at about 500ppm.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-1-paulmck@kernel.org
The current "nretries > 1 || nretries >= max_retries" check in
cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if
nretries is greater than 1. The intent is instead to never warn on the
first try, but otherwise warn if the successful retry was the last retry.
Therefore, change that "||" to "&&".
Fixes: db3a34e174 ("clocksource: Retry clock read if long delays detected")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org
CONFIG_HOTPLUG_PARALLEL expects the architecture to implement
arch_cpuhp_init_parallel_bringup() to decide whether paralllel hotplug is
possible and to do the necessary architecture specific initialization.
There are architectures which can enable it unconditionally and do not
require architecture specific initialization.
Provide a weak fallback for arch_cpuhp_init_parallel_bringup() so that
such architectures are not forced to implement empty stub functions.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-2-af59b3bb35c8@flygoat.com
As Documentation/filesystems/sysfs.rst suggested, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-3-luoxueqin@kylinos.cn
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As Documentation/filesystems/sysfs.rst suggested, show() should only
use sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-2-luoxueqin@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Kill the extra get_uprobe() + put_uprobe() in uprobe_unregister() and
move the possibly final put_uprobe() from delete_uprobe() to its only
caller, uprobe_unregister().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132749.GA8817@redhat.com
Fold __uprobe_unregister() into its single caller, uprobe_unregister().
A separate patch to simplify the next change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132744.GA8814@redhat.com
If register_for_each_vma() fails uprobe_register() can safely drop
uprobe->register_rwsem and use uprobe_unregister(). There is no worry
about the races with another register/unregister, consumer_add() was
already called so this case doesn't differ from _unregister() right
after the successful _register().
Yes this means the extra up_write() + down_write(), but this is the
slow and unlikely case anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132739.GA8809@redhat.com
This way uprobe_unregister() and uprobe_apply() can use "struct uprobe *"
rather than inode + offset. This simplifies the code and allows to avoid
the unnecessary find_uprobe() + put_uprobe() in these functions.
TODO: uprobe_unregister() still needs get_uprobe/put_uprobe to ensure that
this uprobe can't be freed before up_write(&uprobe->register_rwsem).
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240801132734.GA8803@redhat.com
It doesn't make any sense to have 2 versions of _register(). Note that
trace_uprobe_enable(), the only user of uprobe_register(), doesn't need
to check tu->ref_ctr_offset to decide which one should be used, it could
safely pass ref_ctr_offset == 0 to uprobe_register_refctr().
Add this argument to uprobe_register(), update the callers, and kill
uprobe_register_refctr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240801132728.GA8800@redhat.com
get_user_pages_remote() and the comment above it make no sense.
There is no task_struct passed into get_user_pages_remote() anymore, and
nowadays mm_account_fault() increments the current->min/maj_flt counters
regardless of FAULT_FLAG_REMOTE.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240801132714.GA8783@redhat.com
The comment above uprobe_write_opcode() is wrong, unapply_uprobe() calls
it under mmap_read_lock() and this is correct.
And it is completely unclear why register_for_each_vma() takes mmap_lock
for writing, add a comment to explain that mmap_write_lock() is needed to
avoid the following race:
- A task T hits the bp installed by uprobe and calls
find_active_uprobe()
- uprobe_unregister() removes this uprobe/bp
- T calls find_uprobe() which returns NULL
- another uprobe_register() installs the bp at the same address
- T calls is_trap_at_addr() which returns true
- T returns to handle_swbp() and gets SIGTRAP.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132709.GA8780@redhat.com
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be `push %rbp` instruction in function
preamble, which means that within that function frame pointer hasn't
been established yet. This leads to consistently missing an actual
caller of the traced function, because perf_callchain_user() only
records current IP (capturing traced function) and then following frame
pointer chain (which would be caller's frame, containing the address of
caller's caller).
So when we have target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, captured stack trace will report
target_1 -> target_3 call chain, which is wrong and confusing.
This patch proposes a x86-64-specific heuristic to detect `push %rbp`
(`push %ebp` on 32-bit architecture) instruction being traced. Given
entire kernel implementation of user space stack trace capturing works
under assumption that user space code was compiled with frame pointer
register (%rbp/%ebp) preservation, it seems pretty reasonable to use
this instruction as a strong indicator that this is the entry to the
function. In that case, return address is still pointed to by %rsp/%esp,
so we fetch it and add to stack trace before proceeding to unwind the
rest using frame pointer-based logic.
We also check for `endbr64` (for 64-bit modes) as another common pattern
for function entry, as suggested by Josh Poimboeuf. Even if we get this
wrong sometimes for uprobes attached not at the function entry, it's OK
because stack trace will still be overall meaningful, just with one
extra bogus entry. If we don't detect this, we end up with guaranteed to
be missing caller function entry in the stack trace, which is worse
overall.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240729175223.23914-1-andrii@kernel.org
This change allows events to use PERF_SAMPLE_READ with inherit
so long as PERF_SAMPLE_TID is also set. This enables sample based
profiling of a group of counters over a hierarchy of processes or
threads. This is useful, for example, for collecting per-thread
counters/metrics, event based sampling of multiple counters as a unit,
access to the enabled and running time when using multiplexing and so
on.
Prior to this, users were restricted to either collecting aggregate
statistics for a multi-threaded/-process application (e.g. with
"perf stat"), or to sample individual threads, or to profile the entire
system (which requires root or CAP_PERFMON, and may produce much more
data than is required). Theoretically a tool could poll for or otherwise
monitor thread/process creation and construct whatever events the user
is interested in using perf_event_open, for each new thread or process,
but this is racy, can lead to file-descriptor exhaustion, and ultimately
just replicates the behaviour of inherit, but in userspace.
This configuration differs from inherit without PERF_SAMPLE_READ in that
the accumulated event count, and consequently any sample (such as if
triggered by overflow of sample_period) will be on a per-thread rather
than on an aggregate basis.
The meaning of read_format::value field of both PERF_RECORD_READ and
PERF_RECORD_SAMPLE is changed such that if the sampled event uses this
new configuration then the values reported will be per-thread rather
than the global aggregate value. This is a change from the existing
semantics of read_format (where PERF_SAMPLE_READ is used without
inherit), but it is necessary to expose the per-thread counter values,
and it avoids reinventing a separate "read_format_thread" field that
otherwise replicates the same behaviour. This change should not break
existing tools, since this configuration was not previously valid and
was rejected by the kernel. Tools that opt into this new mode will need
to account for this when calculating the counter delta for a given
sample. Tools that wish to have both the per-thread and aggregate value
can perform the global aggregation themselves from the per-thread
values.
The change to read_format::value does not affect existing valid
perf_event_attr configurations, nor does it change the behaviour of
calls to "read" on an event descriptor. Both continue to report the
aggregate value for the entire thread/process hierarchy. The difference
between the results reported by "read" and PERF_RECORD_SAMPLE in this
new configuration is justified on the basis that it is not (easily)
possible for "read" to target a specific thread (the caller only has
the fd for the original parent event).
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240730084417.7693-3-ben.gainey@arm.com
nr_pending counts the number of events in the context that
either pending_sigtrap or pending_work, but it is used
to prevent taking the fast path in perf_event_context_sched_out.
Renamed to reflect what it is used for, rather than what it
counts. This change allows using the field to track other
event properties that also require skipping the fast path
without possible confusion over the name.
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240730084417.7693-2-ben.gainey@arm.com
Fixes the following Coccinelle/coccicheck warning reported by
minmax.cocci:
WARNING opportunity for min()
Use const size_t instead of int for the result of min().
Compile-tested with CONFIG_KCSAN=y.
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
help manage different structures in various cgroup subsystems by being
an embedded element inside a larger structure like cpuset or mem_cgroup.
The /proc/cgroups file shows the number of cgroups for each of the
subsystems. With cgroup v1, the number of CSSes is the same as the
number of cgroups. That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may
be responsible.
As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
cgroup.stat file is now being extended to show the number of live and
dying CSSes associated with all the non-inhibited cgroup subsystems that
have been bound to cgroup v2. The number includes CSSes in the current
cgroup as well as in all the descendants underneath it. This will help
us pinpoint which subsystems are responsible for the increasing number
of dying (nr_dying_descendants) cgroups.
The CSSes dying counts are stored in the cgroup structure itself
instead of inside the CSS as suggested by Johannes. This will allow
us to accurately track dying counts of cgroup subsystems that have
recently been disabled in a cgroup. It is now possible that a zero
subsystem number is coupled with a non-zero dying subsystem number.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from root cgroup.stat file
was shown below.
nr_descendants 56
nr_subsys_cpuset 1
nr_subsys_cpu 43
nr_subsys_io 43
nr_subsys_memory 56
nr_subsys_perf_event 57
nr_subsys_hugetlb 1
nr_subsys_pids 56
nr_subsys_rdma 1
nr_subsys_misc 1
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Another sample output from system.slice/cgroup.stat was:
nr_descendants 34
nr_subsys_cpuset 0
nr_subsys_cpu 32
nr_subsys_io 32
nr_subsys_memory 34
nr_subsys_perf_event 35
nr_subsys_hugetlb 0
nr_subsys_pids 34
nr_subsys_rdma 0
nr_subsys_misc 0
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Note that 'debug' controller wasn't used to provide this information because
the controller is not recommended in productions kernels, also many of them
won't enable CONFIG_CGROUP_DEBUG by default.
Similar information could be retrieved with debuggers like drgn but that's
also not always available (e.g. lockdown) and the additional cost of runtime
tracking here is deemed marginal.
tj: Added Michal's paragraphs on why this is not added the debug controller
to the commit message.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Per the example of:
!atomic_cmpxchg(&key->enabled, 0, 1)
the inverse was written as:
atomic_cmpxchg(&key->enabled, 1, 0)
except of course, that while !old is only true for old == 0, old is
true for everything except old == 0.
Fix it to read:
atomic_cmpxchg(&key->enabled, 1, 0) == 1
such that only the 1->0 transition returns true and goes on to disable
the keys.
Fixes: 83ab38ef0a ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lkml.kernel.org/r/20240731105557.GY33588@noisy.programming.kicks-ass.net
The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:
BUG: using smp_processor_id() in preemptible [00000000] code:
caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0
Move it to the actual usage site which is in a non-preemptible region.
Fixes: f7d43dd206 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
The process of constructing scheduling domains
involves multiple loops and repeated evaluations, leading to numerous
redundant and ineffective assessments that impact code efficiency.
Here, we use union-find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.
Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are several functions to decrease attach_in_progress, and they
will wake up cpuset_attach_wq when attach_in_progress is zero. So,
add a helper to make it concise.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the SLAB implementation was removed in v6.8, so the
cpuset_slab_spread_rotor is no longer used and can be removed.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently when a process in a group forks and fails due to it's
parent's max restriction, all the cgroups from 'pids_forking' to root
will generate event notifications but only the cgroups from
'pids_over_limit' to root will increase the counter of PIDCG_MAX.
Consider this scenario: there are 4 groups A, B, C,and D, the
relationships are as follows, and user is watching on C.pids.events.
root->A->B->C->D
When a process in D forks and fails due to B.max restriction, the
user will get a spurious event notification because when he wakes up
and reads C.pids.events, he will find that the content has not changed.
To address this issue, only the cgroups from 'pids_over_limit' to root
will have their PIDCG_MAX counters increased and event notifications
generated.
Fixes: 385a635cac ("cgroup/pids: Make event counters hierarchical")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The child_ecpus_count variable was previously used to update
sibling cpumask when parent's effective_cpus is updated. However, it became
obsolete after commit e2ffe502ba ("cgroup/cpuset: Add
cpuset.cpus.exclusive for v2"). It should be removed.
tj: Restored {} for style consistency.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
For some reason I didn't see this issue on my arm64 or x86-64 builds,
but Stephen Rothwell reports that commit 2accfdb7ef ("profiling:
attempt to remove per-cpu profile flip buffer") left these static
variables around, and the powerpc build is unhappy about them:
kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
| ^~~~~~~~~~~~~~~~
..
So remove these stale left-over remnants too.
Fixes: 2accfdb7ef ("profiling: attempt to remove per-cpu profile flip buffer")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These really can be handled gracefully without killing the machine.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The task pointer which is handed to dequeue_signal() is always current. The
argument along with the first comment about signalfd in that function is
confusing at best. Remove it and use current internally.
Update the stale comment for dequeue_signal() while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Rename posix_timer_event() to posix_timer_queue_signal() as this is what
the function is about.
Consolidate the requeue pending and deactivation updates into that function
as there is no point in doing this in all incarnations of posix timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Posix CPU timers are not updating k_itimer::it_active which makes it
impossible to base decisions in the common posix timer code on it.
Update it when queueing or dequeueing posix CPU timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
hrtimer based and CPU timers have their own way to install the new interval
and to reset overrun and signal handling related data.
Create a helper function and do the same operation for all variants.
This also makes the handling of the interval consistent. It's only stored
when the timer is actually armed, i.e. timer->it_value != 0. Before that it
was stored unconditionally for posix CPU timers and conditionally for the
other posix timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
No requirement for a real list. Spare a few bytes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Keeping the overrun count of the previous setup around is just wrong. The
new setting has nothing to do with the previous one and has to start from a
clean slate.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
No point in doing this all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Avoid the late sighand lock/unlock dance when a timer is not armed to
enforce reevaluation of the timer base so that the process wide CPU timer
sampling can be disabled.
Do it right at the point where the arming decision is made which already
has sighand locked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
A leftover from historical code which describes fiction.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
posix_cpu_timer_set() uses @val as variable for the current time. That's
confusing at best.
Use @now as anywhere else and rewrite the confusing comment about clock
sampling.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
There is no point in arming SIGEV_NONE timers as they never deliver a
signal. timer_gettime() is handling the expiry time correctly and that's
all SIGEV_NONE timers care about.
Prevent arming them and remove the expiry handler code which just disarms
them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reuse the split out __posix_cpu_timer_get() function which does already the
right thing.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in
timer_get(), but the posix CPU timer implementation returns 1 nsec.
Add the missing conditional.
This will be cleaned up in a follow up patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in
timer_get(), but the posix CPU timer implementation returns 1 nsec.
Add the missing conditional.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
timer_gettime() must return the remaining time to the next expiry of a
timer or 0 if the timer is not armed and no signal pending, but posix CPU
timers fail to forward a timer which is already expired.
Add the required logic to address that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
There is no point to return the interval for timers which have been
disarmed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
In preparation for addressing issues in the timer_get() and timer_set()
functions of posix CPU timers.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The TWA_NMI_CURRENT handling very much depends on IRQ_WORK, but that
isn't universally enabled everywhere.
Maybe the IRQ_WORK infrastructure should just be unconditional - x86
ends up indirectly enabling it through unconditionally enabling
PERF_EVENTS, for example. But it also gets enabled by having SMP
support, or even if you just have PRINTK enabled.
But in the meantime TWA_NMI_CURRENT causes tons of build failures on
various odd minimal configs. Which did show up in linux-next, but
despite that nobody bothered to fix it or even inform me until -rc1 was
out.
Fixes: 466e4d801c ("task_work: Add TWA_NMI_CURRENT as an additional notify mode")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: kernelci.org bot <bot@kernelci.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the really old legacy kernel profiling code, which has long
since been obviated by "real profiling" (ie 'prof' and company), and
mainly remains as a source of syzbot reports.
There are anecdotal reports that people still use it for boot-time
profiling, but it's unlikely that such use would care about the old NUMA
optimizations in this code from 2004 (commit ad02973d42: "profile: 512x
Altix timer interrupt livelock fix" in the BK import archive at [1])
So in order to head off future syzbot reports, let's try to simplify
this code and get rid of the per-cpu profile buffers that are quite a
large portion of the complexity footprint of this thing (including CPU
hotplug callbacks etc).
It's unlikely anybody will actually notice, or possibly, as Thomas put
it: "Only people who indulge in nostalgia will notice :)".
That said, if it turns out that this code is actually actively used by
somebody, we can always revert this removal. Thus the "attempt" in the
summary line.
[ Note: in a small nod to "the profiling code can cause NUMA problems",
this also removes the "increment the last entry in the profiling array
on any unknown hits" logic. That would account any program counter in
a module to that single counter location, and might exacerbate any
NUMA cacheline bouncing issues ]
Link: https://lore.kernel.org/all/CAHk-=wgs52BxT4Zjmjz8aNvHWKxf5_ThBY4bYL1Y6CTaNL2dTw@mail.gmail.com/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1]
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot is reporting uninit-value at profile_hits(), for there is a race
window between
if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
return -ENOMEM;
cpumask_copy(prof_cpu_mask, cpu_possible_mask);
in profile_init() and
cpumask_available(prof_cpu_mask) &&
cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
in profile_tick(); prof_cpu_mask remains uninitialzed until cpumask_copy()
completes while cpumask_available(prof_cpu_mask) returns true as soon as
alloc_cpumask_var(&prof_cpu_mask) completes.
We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
call cpumask_copy() from create_proc_profile() on only UP kernels, for
profile_online_cpu() calls cpumask_set_cpu() as needed via
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
removes prof_cpu_mask because it seems unnecessary.
The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test
in profile_tick() is likely always true due to
a CPU cannot call profile_tick() if that CPU is offline
and
cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes
online and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that
CPU becomes offline
. This test could be false during transition between online and offline.
But according to include/linux/cpuhotplug.h , CPUHP_PROFILE_PREPARE
belongs to PREPARE section, which means that the CPU subjected to
profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of
use-after-free bug) because interrupt for that CPU is disabled during
PREPARE section. Therefore, this test is guaranteed to be true, and
can be removed. (Since profile_hits() checks prof_buffer != NULL, we
don't need to check prof_buffer != NULL here unless get_irq_regs() or
user_mode() is such slow that we want to avoid when prof_buffer == NULL).
do_profile_hits() is called from profile_tick() from timer interrupt
only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
prof_buffer is not NULL. But syzbot is also reporting that sometimes
do_profile_hits() is called while current thread is still doing vzalloc(),
where prof_buffer must be NULL at this moment. This indicates that multiple
threads concurrently tried to write to /sys/kernel/profiling interface,
which caused that somebody else try to re-allocate prof_buffer despite
somebody has already allocated prof_buffer. Fix this by using
serialization.
Reported-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I got the following warn report while doing stress test:
jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
<TASK>
__static_key_slow_dec_cpuslocked+0x16/0x70
sched_cpu_deactivate+0x26e/0x2a0
cpuhp_invoke_callback+0x3ad/0x10d0
cpuhp_thread_fun+0x3f5/0x680
smpboot_thread_fn+0x56d/0x8d0
kthread+0x309/0x400
ret_from_fork+0x41/0x70
ret_from_fork_asm+0x1b/0x30
</TASK>
Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(),
the cpu offline failed, but sched_smt_present is decremented before
calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so
fix it by incrementing sched_smt_present in the error path.
Fixes: c5511d03ec ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530
unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger condition:
1). User task run in kernel mode most of time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
Fixes: 3dc167ba57 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
"enum vcpu_state" is not compatible with "u8" type for all targets,
resulting in:
error: initialization of 'u8 *' {aka 'unsigned char *'} from incompatible pointer type 'enum vcpu_state *'
for LoongArch. Correct the type of "old" variable to "u8".
Fixes: fea0e1820b ("locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h")
Closes: https://lore.kernel.org/lkml/20240719024010.3296488-1-maobibo@loongson.cn/
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20240721164552.50175-1-ubizjak@gmail.com
The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are
printed in nanoseconds as unsigned long longs, which is a bit obtuse for
human readers when timing bugs result in negative CSD-lock hold times.
Yes, there are some people to whom it is immediately obvious that
18446744073709551615 is really -1, but for the rest of us...
Therefore, print these numbers as signed long longs, making the negative
hold times immediately apparent.
Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Within the rcu_sr_normal_gp_cleanup_work() function, there is an acquire
load from rcu_state.srs_done_tail, which is expected to be non-NULL.
This commit adds a WARN_ON_ONCE() to check this expectation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rcutorture test suite has specific tests for both of the
NUM_ACTIVE_RCU_POLL_OLDSTATE and NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE
macros provided for RCU polled grace periods. However, with the
advent of NUM_ACTIVE_SRCU_POLL_OLDSTATE, a more generic test is needed.
This commit therefore adds ->poll_active and ->poll_active_full fields
to the rcu_torture_ops structure and converts the existing specific
tests to use these fields, when present.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit points the SRCU ->same_gp_state and ->get_comp_state fields
to same_state_synchronize_srcu() and get_completed_synchronize_srcu(),
allowing them to be tested.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rcu_torture_ops structure's ->get_gp_completed and
->get_gp_completed_full fields are redundant with its ->get_comp_state
and ->get_comp_state_full fields. This commit therefore removes the
former in favor of the latter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The SEGCBLIST_RCU_CORE state can therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The locked callback acceleration handling during the transition can
therefore be removed, along with concurrent batch execution.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The locked callback acceleration handling during the transition can
therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Bypass enqueue can't happen anymore in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The related safety check can therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Currently callbacks can be (de-)offloaded only on online CPUs. This
involves an overly elaborated state machine in order to make sure that
callbacks are always handled during the process while ensuring
synchronization between rcu_core and NOCB kthreads.
The only potential user of NOCB (de-)offloading appears to be a
nohz_full toggling interface through cpusets. And the general agreement
is now to work toward toggling the nohz_full state on offline CPUs to
simplify the whole picture.
Therefore, convert the (de-)offloading to only support offline CPUs.
This involves the following changes:
* Call rcu_barrier() before deoffloading. An offline offloaded CPU may
still carry callbacks in its queue ignored by
rcutree_migrate_callbacks(). Those callbacks must all be flushed
before switching to a regular queue because no more kthreads will
handle those before the CPU ever gets re-onlined.
This means that further calls to rcu_barrier() will find an empty
queue until the CPU goes through rcutree_report_cpu_starting(). As a
result it is guaranteed that further rcu_barrier() won't try to lock
the nocb_lock for that target and thus won't risk an imbalance.
Therefore barrier_mutex doesn't need to be locked anymore upon
deoffloading.
* Assume the queue is empty before offloading, as
rcutree_migrate_callbacks() took care of everything.
This means that further calls to rcu_barrier() will find an empty
queue until the CPU goes through rcutree_report_cpu_starting(). As a
result it is guaranteed that further rcu_barrier() won't risk a
nocb_lock imbalance.
Therefore barrier_mutex doesn't need to be locked anymore upon
offloading.
* No need to flush bypass anymore.
Further simplifications will follow in upcoming patches.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The barrier_mutex is used currently to protect (de-)offloading
operations and prevent from nocb_lock locking imbalance in rcu_barrier()
and shrinker, and also from misordered RCU barrier invocation.
Now since RCU (de-)offloading is going to happen on offline CPUs, an RCU
barrier will have to be executed while transitionning from offloaded to
de-offloaded state. And this can't happen while holding the
barrier_mutex.
Introduce a NOCB mutex to protect (de-)offloading transitions. The
barrier_mutex is still held for now when necessary to avoid barrier
callbacks reordering and nocb_lock imbalance.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When a NOCB CPU fails to create a nocb kthread on bringup, the CPU is
then deoffloaded. The barrier mutex is locked at this stage. It is
typically used to protect against concurrent (de-)offloading and/or
concurrent rcu_barrier() that would otherwise risk a nocb locking
imbalance. However:
* rcu_barrier() can't run concurrently if it's the boot CPU on early
boot-up.
* rcu_barrier() can run concurrently if it's a secondary CPU but it is
expected to see 0 callbacks on this target because it's the first
time it boots.
* (de-)offloading can't happen concurrently with smp_init(), as
rcutorture is initialized later, at least not before device_initcall(),
and userspace isn't available yet.
* (de-)offloading can't happen concurrently with cpu_up(), courtesy of
cpu_hotplug_lock.
But:
* The lazy shrinker might run concurrently with cpu_up(). It shouldn't
try to grab the nocb_lock and risk an imbalance due to lazy_len
supposed to be 0 but be extra cautious.
* Also be cautious against resume from hibernation potential subtleties.
So keep the locking and add some assertions and comments.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
nocb_is_setup is a rarely used field, mostly on boot and CPU hotplug.
It shouldn't occupy the middle of the rcu state hot fields cacheline.
Move it to the end and build it conditionally while at it. More cold
NOCB fields are to come.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Checking for races against concurrent (de-)offloading implies the
creation of !CONFIG_RCU_NOCB_CPU stubs to check if each relevant lock
is held. For now this only implies the nocb_lock but more are to be
expected.
Create instead a NOCB specific version of RCU_LOCKDEP_WARN() to avoid
the proliferation of stubs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The symbols relating to the CT_STATE part of context_tracking.state are now
all prefixed with CT_STATE.
The RCU dynticks counter part of that atomic variable still involves
symbols with different prefixes, align them all to be prefixed with
CT_RCU_WATCHING.
Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Context tracking state related symbols currently use a mix of the
CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_SATE_ (e.g. CT_STATE_MASK) prefixes.
Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics. The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.
These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:
- trivial duplicated macro definitions have been removed
Note that 'trivial' here means that it's obviously kernel code that
already included all the major kernel headers, and thus gets the new
generic MIN/MAX macros automatically.
- non-trivial duplicated macro definitions are guarded with #ifndef
This is the "yes, they define their own versions, but no, the include
situation is not entirely obvious, and maybe they don't get the
generic version automatically" case.
- strange use case #1
A couple of drivers decided that the way they want to describe their
versioning is with
#define MAJ 1
#define MIN 2
#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)
which adds zero value and I just did my Alexander the Great
impersonation, and rewrote that pointless Gordian knot as
#define DRV_VERSION "1.2"
instead.
- strange use case #2
A couple of drivers thought that it's a good idea to have a random
'MIN' or 'MAX' define for a value or index into a table, rather than
the traditional macro that takes arguments.
These values were re-written as C enum's instead. The new
function-line macros only expand when followed by an open
parenthesis, and thus don't clash with enum use.
Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Stop testing the group->parent pointer as it is not guaranteed to be
stable over a chain of operations by design. This includes a warning
which would be nice to have but it produces false positives due to
the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into idle.
As a result the events of such a CPU might not be processed and
timers go stale.
Cure it by splitting the hotplug operation into a prepare and online
callback. The prepare callback is guaranteed to run on an online and
therefore active CPU. This CPU updates the hierarchy and being online
ensures that there is always at least one migrator active which
handles the modified hierarchy correctly when going idle. The online
callback which runs on the incoming CPU then just marks the CPU
active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious
what's going on.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmajmXYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWaGD/4iAuj0S3AQ+odB/Fkg1cvCF8YbOJGy
PEMSDnYtW7ErShEwVQMsWnXCFhIMuqMN0KIzOChpya2ZkmdT47mNwKuwdOMEQZzg
dKhmqGWR+sk4LMszCFbf5u6JjJeQ+nnxthgJ1IieJwC4VEfRceXYk7ng6Wvu1+lU
JEIukUh9nRJWma7FYW8MeNZ4lJGdvawZ5UjUAkPtzeKWn0+0/oqV5t1c8E/1jBbi
sKZW2soL716Xd/3QJUKKmtAcH7yDFwq5AY5bJURr5ztJw/yr2loVdvAPEdn/RQ6f
fzN/J+nu2ig14g/QvhI8Ke+HbHJEZHpo6simSZRbdaqnCX3R/lYZwLDe7EGjqKIb
0slsx2V1UxQ+qIYRplrtr/HGChjG/mXDLPIWRWjsiUAqyygy6QtUIko9AuH99Kd6
7cBjOzajKIAA/J9SUD03VgjXcQ53bW64NMe2pOX9ED1mbfmmu/ROd0neOgksKw5o
G5XQ+T6tNOoHMzJgv4R8PiViVdrf53A/g1wYTY1RR3XI8IWpounkyDExDvbtigGo
N+reKoawDGpXeMAByO2E6UDFNA05NYPjlvSrzTS5ywwyF1qCowKI1Qyup9wA/God
WJvfesmOJHtfcuUcVZf6Pm6+otJiKrT3reauFd6laEbyRuGTKvtNN/pQa6yxiZzT
FTxZKpcYkPE76g==
=G1Lg
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer migration updates from Thomas Gleixner:
"Fixes and minor updates for the timer migration code:
- Stop testing the group->parent pointer as it is not guaranteed to
be stable over a chain of operations by design.
This includes a warning which would be nice to have but it produces
false positives due to the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into
idle. As a result the events of such a CPU might not be processed
and timers go stale.
Cure it by splitting the hotplug operation into a prepare and
online callback. The prepare callback is guaranteed to run on an
online and therefore active CPU. This CPU updates the hierarchy and
being online ensures that there is always at least one migrator
active which handles the modified hierarchy correctly when going
idle. The online callback which runs on the incoming CPU then just
marks the CPU active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious
what's going on"
* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix grammar in comment
timers/migration: Spare write when nothing changed
timers/migration: Rename childmask by groupmask to make naming more obvious
timers/migration: Read childmask and parent pointer in a single place
timers/migration: Use a single struct for hierarchy walk data
timers/migration: Improve tracing
timers/migration: Move hierarchy setup into cpuhotplug prepare callback
timers/migration: Do not rely always on group->parent
catching COVID, so relatively short PR. Including fixes from bpf
and netfilter.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock,
make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaibxAACgkQMUZtbf5S
IruuIRAAu96TiN/urPwmKznyb/Sk8x7p8iUzn6OvPS/TUlFUkURQtOh6M9uvbpN4
x/L//EWkMR0hY4SkBegoiXfb1GS0PjBdWTWUiROm5X9nVHqp5KRZAxWXhjFiS1BO
BIYOT+JfCl7mQiPs90Mys/cEtYOggMBsCZQVIGw/iYoJLFREqxFSONwa0dG+tGMX
jn9WNu4yCVDhJ/jtl2MaTsCNtYUaBUgYrKHJBfNGfJ2Lz/7rH9yFui2WSMlmOd/U
QGeCb1DWURlShlCqY37wNinbFsxWkI5JN00ukTtwFAXLIaqc+zgHcIjrDjTJwK43
F4tKbJT3+bmehMU/h3Uo3c7DhXl7n9zDGiDtbCxnkykp0sFGJpjhDrWydo51c+YB
qW5HaNrII2LiDicOVN8L29ylvKp7AEkClxgivEhZVGGk2f/szJRXfp9u3WBn5kAx
3paH55YN0DEsKbYbb1ZENEI1Vnc/4ff4PxZJCUNKwzcS8wCn1awqwcriK9TjS/cp
fjilNFT4J3/uFrodHWTkx0jJT6UJFT0aF03qPLUH/J5kG+EVukOf1jBPInNdf1si
1j47SpblHUe86HiHphFMt32KZ210lJzWxh8uGma57Y2sB9makdLiK4etrFjkiMJJ
Z8A3kGp3KpFjbuK4tHY25rp+5oxLNNOBNpay29lQrWtCL/NDcaQ=
=9OsH
-----END PGP SIGNATURE-----
Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
A lot of networking people were at a conference last week, busy
catching COVID, so relatively short PR.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock,
make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"
* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
tun: add missing verification for short frame
tap: add missing verification for short frame
mISDN: Fix a use after free in hfcmulti_tx()
gve: Fix an edge case for TSO skb validity check
bnxt_en: update xdp_rxq_info in queue restart logic
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
bpf: Fix a segment issue when downgrading gso_size
net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
MAINTAINERS: make Breno the netconsole maintainer
MAINTAINERS: Update bonding entry
net: nexthop: Initialize all fields in dumped nexthops
net: stmmac: Correct byte order of perfect_match
selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
tipc: Return non-zero value from tipc_udp_addr2str() on error
netfilter: nft_set_pipapo_avx2: disable softinterrupts
ice: Fix recipe read procedure
ice: Add a per-VF limit on number of FDIR filters
net: bonding: correctly annotate RCU in bond_should_notify_peers()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmaiUQcACgkQUqAMR0iA
lPITNQ/+KDdmQljKwpuXlqe01F1mG/LFn5i1Y/a8fZVep/OSmihsgnEqnYzBomTq
CF22tlrH7r6EZ2D5by1fjno/AG6/BAXvwH8jGDr9jhVNNBsneeVYrtMB1UUslR/e
OEFoFKyzpq6VJNmHl5aAM95CEFEkE5uBba4DkJ/hCh3oErc2zP5DRdD9COCkdlIp
+LzQa6XsMjwzrWAMAm2vWdBgePCHVKsAVVFUdfmN28FQw3BcFZEgIvN7vPT7Ee3I
ESKx/Asb3myb1J7bFvDKnpT9O+7EkU/cpQn+HxjiIFVPqFLX3mfSXzgvfNocuPB0
hkIUzA9Sbu2wa+SE7qU1IwHVZj2N612OPso8lG8cbcic/KaYLpd6Kt7bnJe8kBe7
gFGBVmDjvapilQwWteWJcMs2hBxXuq0Xd+CMcXTMKYcLS8Z+4TVAYA8onssrUq+0
Jye8hW/CvST/P7wazVtuQu1fsKKz3SG+dAaXw9/7fYTGQ2LdRoLUkDhLkwqIURjD
j6+pMMuYpxrpeA7yaOD1xLLOKC33OINClgfodjGVHvDlwxsQBhZFchCKsbufYEa1
CxFi8lDcfNVSGuw5x3a6iMqwnqxoeVKpi6eKKgpU81fXnGd4G2NA5jGRMycWmqzX
+uUq6Ot1NO4QVK+pT70GhXMt2rQ6UsC1SyWggeFYBhaaqfIGKRk=
=89Nl
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- trivial printk changes
The bigger "real" printk work is still being discussed.
* tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
vsprintf: add missing MODULE_DESCRIPTION() macro
printk: Rename console_replay_all() and update context
Summary
- const qualify struct ctl_table args in proc_handlers:
This is a prerequisite to moving the static ctl_table structs into .rodata
data which will ensure that proc_handler function pointers cannot be
modified.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmahVOcACgkQupfNUreW
QU8QdgwAiDZMEQ6psvDfLGQlOvHyp21exZFC/i1OCrCq4BmTu83iYcojIKSaZ0q6
pfR9eYTVhNfAGsV2hQJvvUQPSVNESB/A7Z2zGLVQmZcOerAN17L2C7uA1SkbRBel
P3yoMX6wruVCnBgQTcjktoBgpDqaOiqLTSh8Ko512qYOClLf22sCbbk+doiOwMF6
zbzh9bRo+qTuTgW30gwzRNml/lA1/XylSsM/cEFCxsc8k22lCneP+iWKDAfWtsr6
0GL+SvbBTyIEMFFvSw+0MUW30QctxaCYA1pBmn8nmc6eJ2yDly75/r8EWRr7zqDW
SrRiXYDAzNcW3tiOdC2FW/J95VX62DiSQporVZ/OvahBKpQKGaLLNjx1EE3qGKIk
ggnYiDmIO88ZBHoRbDDK9EdGGqrIZfRx9JeFD3bMOrwT3Ffxw30rwcYpzlOyjFz3
TIUeNthGXoQNRxZfTE+iybotKWcXnxCNe4U1Mvc2liY6+Z0jazDWfs+jV7CiuX1G
Xk8jpTCR
=JMg4
-----END PGP SIGNATURE-----
Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl constification from Joel Granados:
"Treewide constification of the ctl_table argument of proc_handlers
using a coccinelle script and some manual code formatting fixups.
This is a prerequisite to moving the static ctl_table structs into
read-only data section which will ensure that proc_handler function
pointers cannot be modified"
* tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: treewide: constify the ctl_table argument of proc_handlers
Three small changes this cycle:
1. Cleaning up an architecture abstraction that is no longer needed
because all the architectures have converged.
2. Actually use the prompt argument to kdb_position_cursor() instead of
ignoring it (functionally this fix is a nop but that was due to luck
rather than good judgement)
3. Fix a -Wformat-security warning.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmaiHisACgkQfOMlXTn3
iKHUgQ//WS9t2GOdKzeOTYKbUfmE/RhdjLzMOqJhbtLn2uV7wbFxbcWDXz7rzcUQ
ipOvaO7wv78AkOasse6cBml8MLkfxvm+qrigFogUDiurRJ06JqkitJTF0fqc0lti
n6n8oHPteEfGY/QXCx/JDoUli4/zfxykUOYqMwKU8WZSjnqwCvNIG9BE17D4QL87
KHVMBpq0NS1eVs39y2mHkQ3PnHzwlf9fyzbmO2mxDIwaAMSR8ZJ5ZkcdH3PM1hhi
vHsbdljhHt9IjZowhjTXUeSHCG7aYbWAGxQNyI6iS936nguJorH2xy5NAXYJo7Ph
gXoEXl0xfeQpRWllzGy2f409ZDLWIRTu8AUjlJLLSJ4E9+aQn5YjPE0oSg/fuudC
49EQo5WGFZz2+1p+7ST5iJp73qGw87A7UWVNtrnxRbq910b4B1k564WBaADjGDh9
TslwOXim+/DwtLXjgUTtiZKxjMKzn+yh58UMwiHeBo84EiQvGFR2ZjBUD7SN2UUN
6ME+5HviUHvXkdWm554l6FRPHwQBuY2XjX0ipW7DtzMJ+rK5cRLc1ylGYyI4Cooz
stMaHTdnCZTacQU4EZTmddQEtfdYmNOU/wu1i91lyudgTlf0SGnH/orD01dskKaT
6L8AU/YobzODZCtyQmXlKeN6OIc1WoBY/Ct84fNSH4wtiVesBcE=
=Se36
-----END PGP SIGNATURE-----
Merge tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Three small changes this cycle:
- Clean up an architecture abstraction that is no longer needed
because all the architectures have converged.
- Actually use the prompt argument to kdb_position_cursor() instead
of ignoring it (functionally this fix is a nop but that was due to
luck rather than good judgement)
- Fix a -Wformat-security warning"
* tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Get rid of redundant kdb_curr_task()
kdb: Use the passed prompt in kdb_position_cursor()
kdb: address -Wformat-security warnings
- fix the order of actions in dmam_free_coherent (Lance Richardson)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmahEd4LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPnhQ/9GAD6MEOVrPxBwdR8gKhM/zIBfhsQgMt1+Htl/dIv
ytPFDnXP6OCl4bPO7SkGRa92VVVAYBDWH7YnWct/cZc/6Kq4HZsAbWQzxzFVTI+/
acHZuppqXY3iewSd3H+noCx92EuMFypG6FPf82gtftnQgWlQl55putR+5Xh1rlap
2wA566eaLiarN+RC8tCjnpZrNMn+Obx4DuyCNLaQtckHaUVlmN5GgaD5yh2+h+jU
RjTAbVhFbVhxeAo0qUVe1bDK2iphyDtgSTeFwCt9OQ5/YaKUQe2wgbL5cpBgJk76
g6BUtiwJ4d+jbR5xxuCJKu7U4PQGZHnOxL3XR5ImEFfQCPGSk4Dqz0VvpINLWfUg
UwfysaCdMyqpVEyV2o5+0d7RpNqnvc4EIG/4AL7fYcTkWFLpBs8tz2AeDf1UafiS
ZvOaYt6tl3Q6Wa8yNFwSt1/06/60Ngj4UOK10j6vn7oATiVH5uVPo8wrT/v6PYeK
mAeQe7QJjWXjWdCQ8Rved9fS6ooh9A+B2GVtQTIK0kIQSkRv3LelLaOW8/L53HUL
KiODBe3Cxa3nD0sktqowrXmh33uTsVBiAyFNFeE0C8kG+CBDbHBAUQ4wZVVKK3ni
6Y0+yrSpFF2yeZZVajuOn3jUpGN1Q1f0I1ImNaWN6CHhIL8D75d/4QK/JI925JCe
tRY=
=0dRz
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- fix the order of actions in dmam_free_coherent (Lance Richardson)
* tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping:
dma: fix call order in dmam_free_coherent
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZqIl1AAKCRDbK58LschI
g/MdAP9oyZV9/IZ6Y6Z1fWfio0SB+yJGugcwbFjWcEtNrzsqJQEAwipQnemAI4NC
HBMfK2a/w7vhAFMXrP/SbkB/gUJJ7QE=
=vovf
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-07-25
We've added 14 non-merge commits during the last 8 day(s) which contain
a total of 19 files changed, 177 insertions(+), 70 deletions(-).
The main changes are:
1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and
BPF sockhash. Also add test coverage for this case, from Michal Luczaj.
2) Fix a segmentation issue when downgrading gso_size in the BPF helper
bpf_skb_adjust_room(), from Fred Li.
3) Fix a compiler warning in resolve_btfids due to a missing type cast,
from Liwei Song.
4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte
boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.
5) Fix a xsk regression to require a flag when actuating tx_metadata_len,
from Stanislav Fomichev.
6) Fix function prototype BTF dumping in libbpf for prototypes that have
no input arguments, from Andrii Nakryiko.
7) Fix stacktrace symbol resolution in perf script for BPF programs
containing subprograms, from Hou Tao.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
bpf: Fix a segment issue when downgrading gso_size
tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
bpf, events: Use prog to emit ksymbol event for main program
selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
bpftool: Fix typo in usage help
libbpf: Fix no-args func prototype BTF dumping syntax
MAINTAINERS: Update powerpc BPF JIT maintainers
MAINTAINERS: Update email address of Naveen
selftests/bpf: fexit_sleep: Fix stack allocation for arm64
====================
Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
- Remove tristate choice support from Kconfig
- Stop using the PROVIDE() directive in the linker script
- Reduce the number of links for the combination of CONFIG_DEBUG_INFO_BTF
and CONFIG_KALLSYMS
- Enable the warning for symbol reference to .exit.* sections by default
- Fix warnings in RPM package builds
- Improve scripts/make_fit.py to generate a FIT image with separate base
DTB and overlays
- Improve choice value calculation in Kconfig
- Fix conditional prompt behavior in choice in Kconfig
- Remove support for the uncommon EMAIL environment variable in Debian
package builds
- Remove support for the uncommon "name <email>" form for the DEBEMAIL
environment variable
- Raise the minimum supported GNU Make version to 4.0
- Remove stale code for the absolute kallsyms
- Move header files commonly used for host programs to scripts/include/
- Introduce the pacman-pkg target to generate a pacman package used in
Arch Linux
- Clean up Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmagBLUVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGmoUQAJ8pnURs0g+Rcyk6bdY/qtXBYkS+
nXpIK1ssFgRRgAQdeszYtvBqLFzb0wRCSie87G1AriD/JkVVTjCCY1For1y+vs0u
a7HfxitHhZpPyZW/T+WMQ3LViNccpkx+DFAcoRH8xOY/XPEJKVUby332jOIXMuyg
+NKIELQJVsLhcDofTUGb5VfIQektw219n5c4jKjXdNk4ZtE24xCRM5X528ZebwWJ
RZhMvJ968PyIH1IRXvNt6dsKBxoGIwPP8IO6yW9hzHaNsBqt7MGSChSel7r1VKpk
iwCNApJvEiVBe5wvTSVOVro7/8p/AZ70CQAqnMJV+dNnRqtGqW7NvL6XAjZRJgJJ
Uxe5NSrXgQd3FtqfcbXLetBgp9zGVt328nHm1HXHR5rFsvoOiTvO7hHPbhA+OoWJ
fs+jHzEXdAMRgsNrczPWU5Svq6MgGe4v8HBf0m8N1Uy65t/O+z9ti2QAw7kIFlbu
/VSFNjw4CHmNxGhnH0khCMsy85FwVIt9Ux+2d6IEc0gP8S1Qa1HgHGAoVI4U51eS
9dxEPVJNPOugaIVHheuS3wimEO6wzaJcQHn4IXaasMA7P6Yo4G/jiGoy4cb9qPTM
Hb+GaOltUy7vDoG4D2LSym8zR8rdKwbIf/5psdZrq/IWVKq5p+p7KWs3aOykSoM7
o6Hb532Ioalhm8je
=BYu7
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove tristate choice support from Kconfig
- Stop using the PROVIDE() directive in the linker script
- Reduce the number of links for the combination of CONFIG_KALLSYMS and
CONFIG_DEBUG_INFO_BTF
- Enable the warning for symbol reference to .exit.* sections by
default
- Fix warnings in RPM package builds
- Improve scripts/make_fit.py to generate a FIT image with separate
base DTB and overlays
- Improve choice value calculation in Kconfig
- Fix conditional prompt behavior in choice in Kconfig
- Remove support for the uncommon EMAIL environment variable in Debian
package builds
- Remove support for the uncommon "name <email>" form for the DEBEMAIL
environment variable
- Raise the minimum supported GNU Make version to 4.0
- Remove stale code for the absolute kallsyms
- Move header files commonly used for host programs to scripts/include/
- Introduce the pacman-pkg target to generate a pacman package used in
Arch Linux
- Clean up Kconfig
* tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (65 commits)
kbuild: doc: gcc to CC change
kallsyms: change sym_entry::percpu_absolute to bool type
kallsyms: unify seq and start_pos fields of struct sym_entry
kallsyms: add more original symbol type/name in comment lines
kallsyms: use \t instead of a tab in printf()
kallsyms: avoid repeated calculation of array size for markers
kbuild: add script and target to generate pacman package
modpost: use generic macros for hash table implementation
kbuild: move some helper headers from scripts/kconfig/ to scripts/include/
Makefile: add comment to discourage tools/* addition for kernel builds
kbuild: clean up scripts/remove-stale-files
kconfig: recursive checks drop file/lineno
kbuild: rpm-pkg: introduce a simple changelog section for kernel.spec
kallsyms: get rid of code for absolute kallsyms
kbuild: Create INSTALL_PATH directory if it does not exist
kbuild: Abort make on install failures
kconfig: remove 'e1' and 'e2' macros from expression deduplication
kconfig: remove SYMBOL_CHOICEVAL flag
kconfig: add const qualifiers to several function arguments
kconfig: call expr_eliminate_yn() at least once in expr_eliminate_dups()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmafyQEACgkQUqAMR0iA
lPIb5Q/+Puck/TtP9W7uyFMDi6AQMm2/Ppja12OV0cpzKwvZci81mvANEJDC2gGa
Wlg4++WVqeA6UfxsPLqDWDWI46dSHWfbMR5d9c5ZZBWh8byNLznBvgQdV04m0KaT
567gXwRaVc3Rx4q0WkoWYn12FFvZV5IZPJbUuM4hKS1WmDrW5tPKteBoxGVujU5N
j9BB3zMHH+cDrgB2+tasUGkEXRLQtesjVrPNQp4lJA535JwnpFk3961+y4cnaAaQ
EqrGy/uUVk6Hl/4Q/JgnYRREmDytMMmvxJJx3FIZm6i5xLOY0y1rJ1YF0z7fL+Eu
+WJi6Uncwo1mFUz5MvKAg2tIUilbOp3awQwxPdPFXoziF0A1mS+S6qHlz6Y8Kk4N
kqgmfIXZeb/K+UCWX6freqqvAg/sJ0JlNIcaGOGZi7+KRwoi8US3eahIAEUxHuis
DRnwtF9qI6YRyjfwy3INDc9Atgnhe6O/mcUpAe1TiSHgbgKT8kQgIQUWb1BwurVE
Ey1Hpci+gpJyj1MW0f1jE08lqQ1O7EdFTJyxJvgjs1IeGA2Zgc0U2GMemSidh6qZ
pqQnEwOfvCdVgR8W4R0Gu8IZYsyCYEawNVHixk2V+GQSdvZsPfJNvnhkqYH/IzUQ
yXKBCZVVBwTQW6zkGI6fjeXgjVH2cEL29vDO4ChQsSH4qEZb/GA=
=k4I2
-----END PGP SIGNATURE-----
Merge tag 'livepatching-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching update from Petr Mladek:
- show patch->replace flag in sysfs
- add or improve few selftests
* tag 'livepatching-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
livepatch: Replace snprintf() with sysfs_emit()
selftests/livepatch: Add selftests for "replace" sysfs attribute
livepatch: Add "replace" sysfs attribute
selftests: livepatch: Test atomic replace against multiple modules
selftests/livepatch: define max test-syscall processes
This simplifies the handling of platform MSI and wire to MSI controllers
and removes about 500 lines of legacy code.
Aside of that it paves the way for ARM/ARM64 to utilize the dynamic
allocation of PCI/MSI interrupts and to support the upcoming non
standard IMS (Interrupt Message Store) mechanism on PCIe devices
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaeheUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocX4D/wLYD+DQDpA3U1XS8jPNE4vKcmBBNX8
Mj4qdHsY8fK+FhmtLsj8FL3iSTymPtgXzFupXGS+5iFG3LhbW8JWEbqjbowcJ1c8
/4w8sKyyWdCSScrCTrH4A3RrLNDAX3DzSMqqi17sdETuwtN0RJiXgcm/CwRXETmn
kVqB7ddalyAR0Z2N/ym1fkuwyBAdeu3cBxMy/BWR6GFae1dAGe8Kr8GsmmuzBTFi
DQSmkh6kZntTn9y+K7juXF+1q8InolmHiOOUeoUJachSCyp6nu9W2+S2MVUiuOA2
X1/Ei3eKvkBHFDd7phZnIrVecuNehAQEV6BRMKOYBiDG4lwD6vCbbr9/YF5vBGni
tbZAetk9VBpIj0YRVAz7WkLC2JmVbw4znlrDwe8+xeLeDwRXl9f4Xc1Udr0qKgpd
1bNE1zG1z45v5J3OtJLJ4MCYcUCsQgv1CkUlNEdz5+NhXHT+W+oKJor/0WYJ3Qwe
iqTEJ9BA1/SzvngwIt/uoMZlEjBl/0/T1UEMJvP/7oEqjl/UAEWGlpKnID3hsDc2
GcIEOJod6hWzyPyeJUI6RpCHy4ZG93WL7Ks+lvzfp381yoDL5/KlveDtSomyuzYF
2xXHUAvw8MAYfJ/CFft/DYme8sBpn1cxAMWdctEiAn0qfS7X1RNZ/RhQ2OXxRw3q
tNpc0jEen9m72A==
=2adH
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2024-07-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI interrupt updates from Thomas Gleixner:
"Switch ARM/ARM64 over to the modern per device MSI domains.
This simplifies the handling of platform MSI and wire to MSI
controllers and removes about 500 lines of legacy code.
Aside of that it paves the way for ARM/ARM64 to utilize the dynamic
allocation of PCI/MSI interrupts and to support the upcoming non
standard IMS (Interrupt Message Store) mechanism on PCIe devices"
* tag 'irq-msi-2024-07-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
irqchip/gic-v3-its: Correctly fish out the DID for platform MSI
irqchip/gic-v3-its: Correctly honor the RID remapping
genirq/msi: Move msi_device_data to core
genirq/msi: Remove platform MSI leftovers
irqchip/irq-mvebu-icu: Remove platform MSI leftovers
irqchip/irq-mvebu-sei: Switch to MSI parent
irqchip/mvebu-odmi: Switch to parent MSI
irqchip/mvebu-gicp: Switch to MSI parent
irqchip/irq-mvebu-icu: Prepare for real per device MSI
irqchip/imx-mu-msi: Switch to MSI parent
irqchip/gic-v2m: Switch to device MSI
irqchip/gic_v3_mbi: Switch over to parent domain
genirq/msi: Remove platform_msi_create_device_domain()
irqchip/mbigen: Remove platform_msi_create_device_domain() fallback
irqchip/gic-v3-its: Switch platform MSI to MSI parent
irqchip/irq-msi-lib: Prepare for DOMAIN_BUS_WIRED_TO_MSI
irqchip/mbigen: Prepare for real per device MSI
irqchip/irq-msi-lib: Prepare for DEVICE MSI to replace platform MSI
irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
irqchip/irq-msi-lib: Prepare for PCI MSI/MSIX
...
- Core:
- Provide a new mechanism to create interrupt domains. The existing
interfaces have already too many parameters and it's a pain to expand
any of this for new required functionality.
The new function takes a pointer to a data structure as argument. The
data structure combines all existing parameters and allows for easy
extension.
The first extension for this is to handle the instantiation of
generic interrupt chips at the core level and to allow drivers to
provide extra init/exit callbacks.
This is necessary to do the full interrupt chip initialization before
the new domain is published, so that concurrent usage sites won't see
a half initialized interrupt domain. Similar problems exist on
teardown.
This has turned out to be a real problem due to the deferred and
parallel probing which was added in recent years.
Handling this at the core level allows to remove quite some accrued
boilerplate code in existing drivers and avoids horrible workarounds
at the driver level.
- The usual small improvements all over the place
- Drivers
- Add support for LAN966x OIC and RZ/Five SoC
- Split the STM ExtI driver into a microcontroller and a SMP version to
allow building the latter as a module for multi-platform kernels.
- Enable MSI support for Armada 370XP on platforms which do not support
IPIs.
- The usual small fixes and enhancements all over the place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaVJbUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXTuD/9Tc9BhY5CW7HQkdPQu2Db1O+esprkQ
Uo9lMpTTpPiy9btg4LONzLf4mjbufZpyKBxkRWoZFO0Zj5q4UE9NZYh7EcxrF5Tl
CIFJmyteLsYuOyCmPrtSDSovonXjQKYBE3u2LVJNNkwEkhYbYW9sqIKeT8nneLv6
53gd28ESFUEUjHNTblw/eXviweyUKSXc0qyg+3hgZQPMoh9RkdkEPvyaw9Y/s5Ce
FelLLxzMqX86dR2TJMLqiaGiMpUu/kl+Yz2m5c77TwA2D68qjhHywbtKtlH7b3C6
LMHu2dMrrKSJrLL8roVIYJdHAd1TKWVdnYhqv9WBHFTu1sDuztpR44mewbo8exUU
L2RgVSGYNmeFC3p4wztWYSQfIVa9uOg7+TnJJdh7G0jLIeKM/TbufWqDAJAuoVPL
QhGbZ5xNbZJZ8bvhhItjxpRN/kPs44p3mUGyRJBQzm+mDN118bqfmQzhLcwRbfE2
smp73SQzg9alG2rGdNVEqkKmp8zhg2Crx2VCeVdgbeOxWQRet9zLWcp4FfCEUE9e
eK3iEi8z+rmwafaf3rsxYdrdIRLaUmcni0v7R/16cJH/Cs7bU3Re8XyGhevo3lsO
pJiP5wZDxbckwXNpLm3S/qPDW7vSCnuFPF7QmOvC3a70PsD+E4NKUgiwJuHtn/ZV
pFBKzbQgCsowQA==
=QCRH
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:
"Core:
- Provide a new mechanism to create interrupt domains. The existing
interfaces have already too many parameters and it's a pain to
expand any of this for new required functionality.
The new function takes a pointer to a data structure as argument.
The data structure combines all existing parameters and allows for
easy extension.
The first extension for this is to handle the instantiation of
generic interrupt chips at the core level and to allow drivers to
provide extra init/exit callbacks.
This is necessary to do the full interrupt chip initialization
before the new domain is published, so that concurrent usage sites
won't see a half initialized interrupt domain. Similar problems
exist on teardown.
This has turned out to be a real problem due to the deferred and
parallel probing which was added in recent years.
Handling this at the core level allows to remove quite some accrued
boilerplate code in existing drivers and avoids horrible
workarounds at the driver level.
- The usual small improvements all over the place
Drivers:
- Add support for LAN966x OIC and RZ/Five SoC
- Split the STM ExtI driver into a microcontroller and a SMP version
to allow building the latter as a module for multi-platform
kernels
- Enable MSI support for Armada 370XP on platforms which do not
support IPIs
- The usual small fixes and enhancements all over the place"
* tag 'irq-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
irqdomain: Fix the kernel-doc and plug it into Documentation
genirq: Set IRQF_COND_ONESHOT in request_irq()
irqchip/imx-irqsteer: Handle runtime power management correctly
irqchip/gic-v3: Pass #redistributor-regions to gic_of_setup_kvm_info()
irqchip/bcm2835: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
irqchip/gic-v4: Make sure a VPE is locked when VMAPP is issued
irqchip/gic-v4: Substitute vmovp_lock for a per-VM lock
irqchip/gic-v4: Always configure affinity on VPE activation
Revert "irqchip/dw-apb-ictl: Support building as module"
Revert "Loongarch: Support loongarch avec"
arm64: Kconfig: Allow build irq-stm32mp-exti driver as module
ARM: stm32: Allow build irq-stm32mp-exti driver as module
irqchip/stm32mp-exti: Allow building as module
irqchip/stm32mp-exti: Rename internal symbols
irqchip/stm32-exti: Split MCU and MPU code
arm64: Kconfig: Select STM32MP_EXTI on STM32 platforms
ARM: stm32: Use different EXTI driver on ARMv7m and ARMv7a
irqchip/stm32-exti: Add CONFIG_STM32MP_EXTI
irqchip/dw-apb-ictl: Support building as module
irqchip/riscv-aplic: Simplify the initialization code
...
The wakeup value is written unconditionally in tmigr_cpu_new_timer(). When
there was no new next timer expiry that needs to be propagated, then the
value that was read before is written. This is not required.
Move the write to the place where wakeup value is changed changed.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-7-757baa7803fe@linutronix.de
childmask in the group reflects the mask that is required to 'reference'
this group in the parent. When reading childmask, this might be confusing,
as this suggests, that this is the mask of the child of the group.
Clarify this by renaming childmask in the tmigr_group and tmc_group by
groupmask.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-6-757baa7803fe@linutronix.de
Reading the childmask and parent pointer is required when propagating
changes through the hierarchy. At the moment this reads are spread all over
the place which makes it harder to follow.
Move those reads to a single place to keep code clean.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-5-757baa7803fe@linutronix.de
Two different structs are defined for propagating data from one to another
level when walking the hierarchy. Several struct members exist in both
structs which makes generalization harder.
Merge those two structs into a single one and use it directly in
walk_groups() and the corresponding function pointers instead of
introducing pointer casting all over the place.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-4-757baa7803fe@linutronix.de
Trace points of inactive and active propagation are located at the end of
the related functions. The interesting information of those trace points is
the updated group state. When trace points are not located directly at the
place where group state changed, order of trace points in traces could be
confusing.
Move inactive and active propagation trace points directly after update of
group state values.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-3-757baa7803fe@linutronix.de
When a CPU comes online the first time, it is possible that a new top level
group will be created. In general all propagation is done from the bottom
to top. This minimizes complexity and prevents possible races. But when a
new top level group is created, the formely top level group needs to be
connected to the new level. This is the only time, when the direction to
propagate changes is changed: the changes are propagated from top (new top
level group) to bottom (formerly top level group).
This introduces two races (see (A) and (B)) as reported by Frederic:
(A) This race happens, when marking the formely top level group as active,
but the last active CPU of the formerly top level group goes idle. Then
it's likely that formerly group is no longer active, but marked
nevertheless as active in new top level group:
[GRP0:0]
migrator = 0
active = 0
nextevt = KTIME_MAX
/ \
0 1 .. 7
active idle
0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
nextevt = KTIME_MAX
\
[GRP0:0] [GRP0:1]
migrator = 0 migrator = TMIGR_NONE
active = 0 active = NONE
nextevt = KTIME_MAX nextevt = KTIME_MAX
/ \
0 1 .. 7 8
active idle !online
1) CPU 8 is booting and creates a new group in first level GRP0:1 and
therefore also a new top group GRP1:0. For now the setup code proceeded
only until the connected between GRP0:1 to the new top group. The
connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
still !online.
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = TMIGR_NONE
active = 0 active = NONE
nextevt = KTIME_MAX nextevt = KTIME_MAX
/ \
0 1 .. 7 8
active idle !online
2) Setup code now connects GRP0:0 to GRP1:0 and observes while in
tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
prepares to call tmigr_active_up() on it. It hasn't done it yet.
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = TMIGR_NONE migrator = TMIGR_NONE
active = NONE active = NONE
nextevt = KTIME_MAX nextevt = KTIME_MAX
/ \
0 1 .. 7 8
idle idle !online
3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
tmigr_update_events() and it propagates the change to the top (no change
there and no wakeup programmed since there is no timer).
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = TMIGR_NONE migrator = TMIGR_NONE
active = NONE active = NONE
nextevt = KTIME_MAX nextevt = KTIME_MAX
/ \
0 1 .. 7 8
idle idle !online
4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0
active in GRP1:0
[GRP1:0]
migrator = GRP0:0
active = GRP0:0, GRP0:1
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = TMIGR_NONE migrator = 8
active = NONE active = 8
nextevt = KTIME_MAX nextevt = KTIME_MAX
/ \ |
0 1 .. 7 8
idle idle active
5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out
of tmigr_cpu_online().
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T8
/ \
[GRP0:0] [GRP0:1]
migrator = TMIGR_NONE migrator = TMIGR_NONE
active = NONE active = NONE
nextevt = KTIME_MAX nextevt = T8
/ \ |
0 1 .. 7 8
idle idle idle
5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
But it's not really active, so T8 gets ignored.
--> The update which is done in third step is not noticed by setup code. So
a wrong migrator is set to top level group and a timer could get
ignored.
(B) Reading group->parent and group->childmask when an hierarchy update is
ongoing and reaches the formerly top level group is racy as those values
could be inconsistent. (The notation of migrator and active now slightly
changes in contrast to the above example, as now the childmasks are used.)
[GRP1:0]
migrator = TMIGR_NONE
active = 0x00
nextevt = KTIME_MAX
\
[GRP0:0] [GRP0:1]
migrator = TMIGR_NONE migrator = TMIGR_NONE
active = 0x00 active = 0x00
nextevt = KTIME_MAX nextevt = KTIME_MAX
childmask= 0 childmask= 1
parent = NULL parent = GRP1:0
/ \
0 1 .. 7 8
idle idle !online
childmask=1
1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining
but did not yet connect GRP0:0 to GRP1:0.
[GRP1:0]
migrator = TMIGR_NONE
active = 0x00
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = TMIGR_NONE migrator = TMIGR_NONE
active = 0x00 active = 0x00
nextevt = KTIME_MAX nextevt = KTIME_MAX
childmask= 0 childmask= 1
parent = GRP1:0 parent = GRP1:0
/ \
0 1 .. 7 8
idle idle !online
childmask=1
2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates
parent pointer of GRP0:0 and ...
[GRP1:0]
migrator = TMIGR_NONE
active = 0x00
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = 0x01 migrator = TMIGR_NONE
active = 0x01 active = 0x00
nextevt = KTIME_MAX nextevt = KTIME_MAX
childmask= 0 childmask= 1
parent = GRP1:0 parent = GRP1:0
/ \
0 1 .. 7 8
active idle !online
childmask=1
tmigr_walk.childmask = 0
3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was
TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data
structure tmigr_walk (as update of childmask is not yet
visible/updated). And now ...
[GRP1:0]
migrator = TMIGR_NONE
active = 0x00
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = 0x01 migrator = TMIGR_NONE
active = 0x01 active = 0x00
nextevt = KTIME_MAX nextevt = KTIME_MAX
childmask= 2 childmask= 1
parent = GRP1:0 parent = GRP1:0
/ \
0 1 .. 7 8
active idle !online
childmask=1
tmigr_walk.childmask = 0
4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup
code).
[GRP1:0]
migrator = 0x00
active = 0x00
nextevt = KTIME_MAX
/ \
[GRP0:0] [GRP0:1]
migrator = 0x01 migrator = TMIGR_NONE
active = 0x01 active = 0x00
nextevt = KTIME_MAX nextevt = KTIME_MAX
childmask= 2 childmask= 1
parent = GRP1:0 parent = GRP1:0
/ \
0 1 .. 7 8
active idle !online
childmask=1
tmigr_walk.childmask = 0
5) CPU 0 sees the connection to GRP1:0 and now propagates active state to
GRP1:0 but with childmask = 0 as stored in propagation data structure.
--> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs
it looks like GRP1:0 is always active.
To prevent those races, the setup of the hierarchy is moved into the
cpuhotplug prepare callback. The prepare callback is not executed by the
CPU which will come online, it is executed by the CPU which prepares
onlining of the other CPU. This CPU is active while it is connecting the
formerly top level to the new one. This prevents from (A) to happen and it
also prevents from any further walk above the formerly top level until that
active CPU becomes inactive, releasing the new ->parent and ->childmask
updates to be visible by any subsequent walk up above the formerly top
level hierarchy. This prevents from (B) to happen. The direction for the
updates is now forced to look like "from bottom to top".
However if the active CPU prevents from tmigr_cpu_(in)active() to walk up
with the update not-or-half visible, nothing prevents walking up to the new
top with a 0 childmask in tmigr_handle_remote_up() or
tmigr_requires_handle_remote_up() if the active CPU doing the prepare is
not the migrator. But then it looks fine because:
* tmigr_check_migrator() should just return false
* The migrator is active and should eventually observe the new childmask
at some point in a future tick.
Split setup functionality of online callback into the cpuhotplug prepare
callback and setup hotplug state. Change init call into early_initcall() to
make sure an already active CPU prepares everything for newly upcoming
CPUs. Reorder the code, that all prepare related functions are close to
each other and online and offline callbacks are also close together.
Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240717094940.18687-1-anna-maria@linutronix.de
When reading group->parent without holding the group lock it is racy
against CPUs coming online the first time and thereby creating another
level of the hierarchy. This is not a problem when this value is read once
to decide whether to abort a propagation or not. The worst outcome is an
unnecessary/early CPU wake up. But it is racy when reading it several times
during a single 'action' (like activation, deactivation, checking for
remote timer expiry,...) and relying on the consitency of this value
without holding the lock. This happens at the moment e.g. in
tmigr_inactive_up() which is also calling tmigr_udpate_events(). Code relys
on group->parent not to change during this 'action'.
Update parent struct member description to explain the above only
once. Remove parent pointer checks when they are not mandatory (like update
of data->childmask). Remove a warning, which would be nice but the trigger
of this warning is not reliable and add expand the data structure member
description instead. Expand a comment, why it is safe to rely on parent
pointer here (inside hierarchy update).
Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-1-757baa7803fe@linutronix.de
Kuan-Wei Chiu has significantly reworked the min_heap library code and
has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally more
rational.
- Kuan-Wei Chiu has sent along some maintenance work against our sorting
library code in the series "lib/sort: Optimizations and cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB
command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2GvwAKCRDdBJ7gKXxA
jlf/AP48xP5ilIHbtpAKm2z+MvGuTxJQ5VSC0UXFacuCbc93lAEA+Yo+vOVRmh6j
fQF2nVKyKLYfSz7yqmCyAaHWohIYLgg=
=Stxz
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- In the series "treewide: Refactor heap related implementation",
Kuan-Wei Chiu has significantly reworked the min_heap library code
and has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally
more rational.
- Kuan-Wei Chiu has sent along some maintenance work against our
sorting library code in the series "lib/sort: Optimizations and
cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
GDB command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
ia64: scrub ia64 from poison.h
watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
tsacct: replace strncpy() with strscpy()
lib/bch.c: use swap() to improve code
test_bpf: convert comma to semicolon
init/modpost: conditionally check section mismatch to __meminit*
init: remove unused __MEMINIT* macros
nilfs2: Constify struct kobj_type
nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
math: rational: add missing MODULE_DESCRIPTION() macro
lib/zlib: add missing MODULE_DESCRIPTION() macro
fs: ufs: add MODULE_DESCRIPTION()
lib/rbtree.c: fix the example typo
ocfs2: add bounds checking to ocfs2_check_dir_entry()
fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
coredump: simplify zap_process()
selftests/fpu: add missing MODULE_DESCRIPTION() macro
compiler.h: simplify data_race() macro
build-id: require program headers to be right after ELF header
resource: add missing MODULE_DESCRIPTION()
...
walkers") is known to cause a performance regression
(https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
Yu has a fix which I'll send along later via the hotfixes branch.
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability of
cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of the
zeropage in MAP_SHARED mappings. I don't see any runtime effects here -
more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Is anyone reading this stuff? If so, email me!
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large folio
userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's self
testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are
"mm: memcg: separate legacy cgroup v1 code and put under config
option" and
"mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of excessive
correctable memory errors. In order to permit userspace to monitor and
handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from migrate
folio" teaches the kernel to appropriately handle migration from
poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
refcount increments. So these paes can first be moved aside if they
reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
for much faster reading of vma information. The series is "query VMAs
from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance Yang
improves the kernel's presentation of developer information related to
multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and not
very useful feature from slab fault injection.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
=z0Lf
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these paes can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
Commit cf8e865810 ("arch: Remove Itanium (IA-64) architecture")
removed the last use of the absolute kallsyms.
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/all/20240221202655.2423854-1-jannh@google.com/
[masahiroy@kernel.org: rebase the code and reword the commit description]
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmaahiEUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vwypg/+LSzrx0CyyXruwwkjuoMIzqXoEpxV
SSdJv47E9rnJymQvd0RAeNyc1BPbtRcP1FdEvV/G1ovb8qJSOJgU22PSSiMQsQ0h
2WGBl1ShubQDDLBdy1AggAsRJhIH4P4tWZ4k5Ftz6WZPWA1UcrDqmjN4d02UIYZb
A3YYcBEIm6bvrixxy+xq/Ii7S9A2idikabDLLGXOMSliFHx0ehWDNXyQEBONlrDh
rEHih21rPtOltVEdJl7yF+SIA467HI09NuXfTviHWnJ1hinFoSlEHIhz4j+i+r//
xOj7iDqtk/UAIToVsxtwgOnElNwY6ab/h/t1AmSSxX4FUEV2TiS1YEpUfX7pByt+
dytgvepjQyycC/ZHUtRZFZ6+1M0z+Vgb5c3+jXyPh8pQEPqmXt8+KYVIi/wychmJ
Opo4xniiDoKHSZ4E0bg/wMbe9yVCjTpX0i0S7BbNa/TRjud6vAhXvgx/y092jsdg
h4lU0ywNCgea/rZFHZYomPjncx9xJ+rtOaH+/dVQhCm/wuRHnj7tJGZnl5LfCWVw
+yNOcExQaE+lRvKqp6mQvUva3+4UArAL2tnFC00tGd0emRLIvXrxY2lF1sqp9wCZ
AJu65El4nnpFNU7vJR7x4X31BvcdquFEvfofPxPXbPz09N8hPRhkunKzgd5ftKZS
mcxMfStvIFXiMEM=
=vw2i
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Define PCIE_RESET_CONFIG_DEVICE_WAIT_MS for the generic 100ms
required after reset before config access (Kevin Xie)
- Define PCIE_T_RRS_READY_MS for the generic 100ms required after
reset before config access (probably should be unified with
PCIE_RESET_CONFIG_DEVICE_WAIT_MS) (Damien Le Moal)
Resource management:
- Rename find_resource() to find_resource_space() to be more
descriptive (Ilpo Järvinen)
- Export find_resource_space() for use by PCI core, which needs to
learn whether there is available space for a bridge window (Ilpo
Järvinen)
- Prevent double counting of resources so window size doesn't grow on
each remove/rescan cycle (Ilpo Järvinen)
- Relax bridge window sizing algorithm so a device doesn't break
simply because it was removed and rescanned (Ilpo Järvinen)
- Evaluate the ACPI PRESERVE_BOOT_CONFIG _DSM in
pci_register_host_bridge() (not acpi_pci_root_create()) so we can
unify it with similar DT functionality (Vidya Sagar)
- Extend use of DT "linux,pci-probe-only" property so it works
per-host bridge as well as globally (Vidya Sagar)
- Unify support for ACPI PRESERVE_BOOT_CONFIG _DSM and the DT
"linux,pci-probe-only" property in pci_preserve_config() (Vidya
Sagar)
Driver binding:
- Add devres infrastructure for managed request and map of partial
BAR resources (Philipp Stanner)
- Deprecate pcim_iomap_table() because uses like
"pcim_iomap_table()[0]" have no good way to return errors (Philipp
Stanner)
- Add an always-managed pcim_request_region() for use instead of
pci_request_region() and similar, which are sometimes managed
depending on whether pcim_enable_device() has been called
previously (Philipp Stanner)
- Reimplement pcim_set_mwi() so it doesn't need to keep store MWI
state (Philipp Stanner)
- Add pcim_intx() for use instead of pci_intx(), which is sometimes
managed depending on whether pcim_enable_device() has been called
previously (Philipp Stanner)
- Add managed pcim_iomap_range() to allow mapping of a partial BAR
(Philipp Stanner)
- Fix a devres mapping leak in drm/vboxvideo (Philipp Stanner)
Error handling:
- Add missing bridge locking in device reset path and add a warning
for other possible lock issues (Dan Williams)
- Fix use-after-free on concurrent DPC and hot-removal (Lukas Wunner)
Power management:
- Disable AER and DPC during suspend to avoid spurious wakeups if
they share an interrupt with PME (Kai-Heng Feng)
PCIe native device hotplug:
- Detect if a device was removed or replaced during system sleep so
we don't assume a new device is the one that used to be there
(Lukas Wunner)
Virtualization:
- Add an ACS quirk for Broadcom BCM5760X multi-function NIC; it
prevents transactions between functions even though it doesn't
advertise ACS, so the functions can be attached individually via
VFIO (Ajit Khaparde)
Peer-to-peer DMA:
- Add a "pci=config_acs=" kernel command-line parameter to relax
default ACS settings to enable additional peer-to-peer
configurations. Requires expert knowledge of topology and ACS
operation (Vidya Sagar)
Endpoint framework:
- Remove unused struct pci_epf_group.type_group (Christophe JAILLET)
- Fix error handling in vpci_scan_bus() and epf_ntb_epc_cleanup()
(Dan Carpenter)
- Make struct pci_epc_class constant (Greg Kroah-Hartman)
- Remove unused pci_endpoint_test_bar_{readl,writel} functions
(Jiapeng Chong)
- Rename "BME" to "Bus Master Enable" (Manivannan Sadhasivam)
- Rename struct pci_epc_event_ops.core_init() callback to epc_init()
(Manivannan Sadhasivam)
- Move DMA init to MHI .epc_init() callback for uniformity
(Manivannan Sadhasivam)
- Cancel EPF test delayed work when link goes down (Manivannan
Sadhasivam)
- Add struct pci_epc_event_ops.epc_deinit() callback for cleanup
needed on fundamental reset (Manivannan Sadhasivam)
- Add 64KB alignment to endpoint test to support Rockchip rk3588
(Niklas Cassel)
- Optimize endpoint test by using memcpy() instead of readl() (Niklas
Cassel)
Device tree bindings:
- Add generic "ats-supported" property to advertise that a PCIe Root
Complex supports ATS (Jean-Philippe Brucker)
Amazon Annapurna Labs PCIe controller driver:
- Validate IORESOURCE_BUS presence to avoid NULL pointer dereference
(Aleksandr Mishin)
Axis ARTPEC-6 PCIe controller driver:
- Rename .cpu_addr_fixup() parameter to reflect that it is a PCI
address, not a CPU address (Niklas Cassel)
Freescale i.MX6 PCIe controller driver:
- Convert to agnostic GPIO API (Andy Shevchenko)
Freescale Layerscape PCIe controller driver:
- Make struct mobiveil_rp_ops constant (Christophe JAILLET)
- Use new generic dw_pcie_ep_linkdown() to handle link-down events
(Manivannan Sadhasivam)
HiSilicon Kirin PCIe controller driver:
- Convert to agnostic GPIO API (Andy Shevchenko)
- Use _scoped() iterator for OF children to ensure refcounts are
decremented at loop exit (Javier Carrasco)
Intel VMD host bridge driver:
- Create sysfs "domain" symlink before downstream devices are exposed
to userspace by pci_bus_add_devices() (Jiwei Sun)
Loongson PCIe controller driver:
- Enable MSI when LS7A is used with new CPUs that have integrated
PCIe Root Complex, e.g., Loongson-3C6000, so downstream devices can
use MSI (Huacai Chen)
Microchip AXI PolarFlare PCIe controller driver:
- Move pcie-microchip-host.c to a new PLDA directory (Minda Chen)
- Factor PLDA generic items out to a common
plda,xpressrich3-axi-common.yaml binding (Minda Chen)
- Factor PLDA generic data structures and code out to shared
pcie-plda.h, pcie-plda-host.c (Minda Chen)
- Add PLDA generic interrupt handling with a .request_event_irq()
callback for vendor-specific events (Minda Chen)
- Add PLDA generic host init/deinit and map bus functions for use by
vendor-specific drivers (Minda Chen)
- Rework to use PLDA core (Minda Chen)
Microsoft Hyper-V host bridge driver:
- Return zero, not garbage, when reading PCI_INTERRUPT_PIN (Wei Liu)
NVIDIA Tegra194 PCIe controller driver:
- Remove unused struct tegra_pcie_soc (Dr. David Alan Gilbert)
- Set 64KB inbound ATU alignment restriction (Jon Hunter)
Qualcomm PCIe controller driver:
- Make the MHI reg region mandatory for X1E80100, since all PCIe
controllers have it (Abel Vesa)
- Prevent use of uninitialized data and possible error pointer
dereference (Dan Carpenter)
- Return error, not success, if dev_pm_opp_find_freq_floor() fails
(Dan Carpenter)
- Add Operating Performance Points (OPP) support to scale performance
state based on aggregate link bandwidth to improve SoC power
efficiency (Krishna chaitanya chundru)
- Vote for the CPU-PCIe ICC (interconnect) path to ensure it stays
active even if other drivers don't vote for it (Krishna chaitanya
chundru)
- Use devm_clk_bulk_get_all() to get all the clocks from DT to avoid
writing out all the clock names (Manivannan Sadhasivam)
- Add DT binding and driver support for the SA8775P SoC (Mrinmay
Sarkar)
- Add HDMA support for the SA8775P SoC (Mrinmay Sarkar)
- Override the SA8775P NO_SNOOP default to avoid possible memory
corruption (Mrinmay Sarkar)
- Make sure resources are disabled during PERST# assertion, even if
the link is already disabled (Manivannan Sadhasivam)
- Use new generic dw_pcie_ep_linkdown() to handle link-down events
(Manivannan Sadhasivam)
- Add DT and endpoint driver support for the SA8775P SoC (Mrinmay
Sarkar)
- Add Hyper DMA (HDMA) support for the SA8775P SoC and enable it in
the EPF MHI driver (Mrinmay Sarkar)
- Set PCIE_PARF_NO_SNOOP_OVERIDE to override the default NO_SNOOP
attribute on the SA8775P SoC (both Root Complex and Endpoint mode)
to avoid possible memory corruption (Mrinmay Sarkar)
Renesas R-Car PCIe controller driver:
- Demote WARN() to dev_warn_ratelimited() in rcar_pcie_wakeup() to
avoid unnecessary backtrace (Marek Vasut)
- Add DT and driver support for R-Car V4H (R8A779G0) host and
endpoint. This requires separate proprietary firmware (Yoshihiro
Shimoda)
Rockchip PCIe controller driver:
- Assert PERST# for 100ms after power is stable (Damien Le Moal)
- Wait PCIE_T_RRS_READY_MS (100ms) after reset before starting
configuration (Damien Le Moal)
- Use GPIOD_OUT_LOW flag while requesting ep_gpio to fix a firmware
crash on Qcom-based modems with Rockpro64 board (Manivannan
Sadhasivam)
Rockchip DesignWare PCIe controller driver:
- Factor common parts of rockchip-dw-pcie DT binding to be shared by
Root Complex and Endpoint mode (Niklas Cassel)
- Add missing INTx signals to common DT binding (Niklas Cassel)
- Add eDMA items to DT binding for Endpoint controller (Niklas
Cassel)
- Fix initial dw-rockchip PERST# GPIO value to prevent unnecessary
short assert/deassert that causes issues with some WLAN controllers
(Niklas Cassel)
- Refactor dw-rockchip and add support for Endpoint mode (Niklas
Cassel)
- Call pci_epc_init_notify() and drop dw_pcie_ep_init_notify()
wrapper (Niklas Cassel)
- Add error messages in .probe() error paths to improve user
experience (Uwe Kleine-König)
Samsung Exynos PCIe controller driver:
- Use bulk clock APIs to simplify clock setup (Shradha Todi)
StarFive PCIe controller driver:
- Add DT binding and driver support for the StarFive JH7110
PLDA-based PCIe controller (Minda Chen)
Synopsys DesignWare PCIe controller driver:
- Add generic support for sending PME_Turn_Off when system suspends
(Frank Li)
- Fix incorrect interpretation of iATU slot 0 after PERST#
assert/deassert (Frank Li)
- Use msleep() instead of usleep_range() while waiting for link
(Konrad Dybcio)
- Refactor dw_pcie_edma_find_chip() to enable adding support for
Hyper DMA (HDMA) (Manivannan Sadhasivam)
- Enable drivers to supply the eDMA channel count since some can't
auto detect this (Manivannan Sadhasivam)
- Call pci_epc_init_notify() and drop dw_pcie_ep_init_notify()
wrapper (Manivannan Sadhasivam)
- Pass the eDMA mapping format directly from drivers instead of
maintaining a capability for it (Manivannan Sadhasivam)
- Add generic dw_pcie_ep_linkdown() to notify EPF drivers about
link-down events and restore non-sticky DWC registers lost on link
down (Manivannan Sadhasivam)
- Add vendor-specific "apb" reg name, interrupt names, INTx names to
generic binding (Niklas Cassel)
- Enforce DWC restriction that 64-bit BARs must start with an
even-numbered BAR (Niklas Cassel)
- Consolidate args of dw_pcie_prog_outbound_atu() into a structure
(Yoshihiro Shimoda)
- Add support for endpoints to send Message TLPs, e.g., for INTx
emulation (Yoshihiro Shimoda)
TI DRA7xx PCIe controller driver:
- Rename .cpu_addr_fixup() parameter to reflect that it is a PCI
address, not a CPU address (Niklas Cassel)
TI Keystone PCIe controller driver:
- Validate IORESOURCE_BUS presence to avoid NULL pointer dereference
(Aleksandr Mishin)
- Work around AM65x/DRA80xM Errata #i2037 that corrupts TLPs and
causes processor hangs by limiting Max_Read_Request_Size (MRRS) and
Max_Payload_Size (MPS) (Kishon Vijay Abraham I)
- Leave BAR 0 disabled for AM654x to fix a regression caused by
6ab15b5e70 ("PCI: dwc: keystone: Convert .scan_bus() callback to
use add_bus"), which caused a 45-second boot delay (Siddharth
Vadapalli)
Xilinx Versal CPM PCIe controller driver:
- Fix overlapping bridge registers and 32-bit BAR addresses in DT
binding (Thippeswamy Havalige)
MicroSemi Switchtec management driver:
- Make struct switchtec_class constant (Greg Kroah-Hartman)
Miscellaneous:
- Remove unused struct acpi_handle_node (Dr. David Alan Gilbert)
- Add missing MODULE_DESCRIPTION() macros (Jeff Johnson)"
* tag 'pci-v6.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (154 commits)
PCI: loongson: Enable MSI in LS7A Root Complex
PCI: Extend ACS configurability
PCI: Add missing bridge lock to pci_bus_lock()
drm/vboxvideo: fix mapping leaks
PCI: Add managed pcim_iomap_range()
PCI: Remove legacy pcim_release()
PCI: Add managed pcim_intx()
PCI: vmd: Create domain symlink before pci_bus_add_devices()
PCI: qcom: Prevent use of uninitialized data in qcom_pcie_suspend_noirq()
PCI: qcom: Prevent potential error pointer dereference
PCI: qcom: Fix missing error code in qcom_pcie_probe()
PCI: Give pcim_set_mwi() its own devres cleanup callback
PCI: Move struct pci_devres.pinned bit to struct pci_dev
PCI: Remove struct pci_devres.enabled status bit
PCI: Document hybrid devres hazards
PCI: Add managed pcim_request_region()
PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()
PCI: Add managed partial-BAR request and map infrastructure
PCI: Add devres helpers for iomap table
PCI: Add and use devres helper for bit masks
...
Here is a small set of tty and serial driver updates for 6.11-rc1. Not
much happened this cycle, unlike the previous kernel release which had
lots of "excitement" in this part of the kernel. Included in here are
the following changes:
- dt binding updates for new platforms
- 8250 driver updates
- various small serial driver fixes and updates
- printk/console naming and matching attempt #2 (was reverted for
6.10-final, should be good to go this time around, acked by the
relevant maintainers).
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZppbCQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymV1ACeIY5kgipqY7w4d3/7PcpKMiftrisAn0hr6csj
Gan+k3cuVGlasGkaQ5/B
=35VK
-----END PGP SIGNATURE-----
Merge tag 'tty-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial updates from Greg KH:
"Here is a small set of tty and serial driver updates for 6.11-rc1. Not
much happened this cycle, unlike the previous kernel release which had
lots of "excitement" in this part of the kernel. Included in here are
the following changes:
- dt binding updates for new platforms
- 8250 driver updates
- various small serial driver fixes and updates
- printk/console naming and matching attempt #2 (was reverted for
6.10-final, should be good to go this time around, acked by the
relevant maintainers).
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (22 commits)
Documentation: kernel-parameters: Add DEVNAME:0.0 format for serial ports
serial: core: Add serial_base_match_and_update_preferred_console()
printk: Add match_devname_and_update_preferred_console()
serial: sc16is7xx: hardware reset chip if reset-gpios is defined in DT
dt-bindings: serial: sc16is7xx: add reset-gpios
dt-bindings: serial: vt8500-uart: convert to json-schema
serial: 8250_platform: Explicitly show we initialise ISA ports only once
tty: add missing MODULE_DESCRIPTION() macros
dt-bindings: serial: mediatek,uart: add MT7988
serial: sh-sci: Add support for RZ/V2H(P) SoC
dt-bindings: serial: Add documentation for Renesas RZ/V2H(P) (R9A09G057) SCIF support
dt-bindings: serial: renesas,scif: Make 'interrupt-names' property as required
dt-bindings: serial: renesas,scif: Validate 'interrupts' and 'interrupt-names'
dt-bindings: serial: renesas,scif: Move ref for serial.yaml at the end
riscv: dts: starfive: jh7110: Add the core reset and jh7110 compatible for uarts
serial: 8250_dw: Use reset array API to get resets
dt-bindings: serial: snps-dw-apb-uart: Add one more reset signal for StarFive JH7110 SoC
serial: 8250: Extract platform driver
serial: 8250: Extract RSA bits
serial: imx: stop casting struct uart_port to struct imx_port
...
- reduce duplicate swiotlb pool lookups (Michael Kelley)
- minor small fixes (Yicong Yang, Yang Li)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmaZ+C0LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPtUw/9FZr0azAzx1E5nqDNfa5sisqoPZUFKcHYBYmjhcgL
9NvDKbKqriRBt8fAMtjGpQqUfpIytfa4p0BGLYnzYKJ8Im+MqC9KYq7lhoNVQGO3
iz8T8clT9TPoVDkiMn/LXlf6dqE0SQtcSh4TJ1c+aTFFlez5jXClFu376rcbdewM
n+nyGL93uaCeRpeQlywyBsUdpu9NnWAJUYi/OszWSTlX/1ucG5kD9kNQ4sOFL0S/
WszylOCIuur9EweIvMgPa5X5yCUsr/n95EthbIQRiZq/TSkLpJ3DkTk3VYaaKLsJ
imq5Q6dqOnPplL22X6GgvXEDYd1OYzYSvbB93HcxuGnIIYLvdFMV+McCTkwL7XJV
v9fDNsx8Wn7cS+6OpT0YxfTFvzHXFAzRTqVoqA/lCWo8eaclzUKM3qOoIQxjl1Wh
1IDoHUhiH0m0ecS+ZzljjDrn39RzU6+5XJiJzCmzYlqL/9f0SumTCsNrEmF3W6l9
GmuP2Q49Rbh+Ru+oyx3KGlBS9DQxTB+heBnjkwigtAiwoO8Z2V9Vwg/a2Lzl3RkN
vDaHqEo6f8gAc/QlonOjN9UfuR5U8bxGUps1zw9O0CeTDqSDXmJKvkddfkBSAY7s
bIW+sObA5yRrPuwz4ALykw3h1UJXdZNJIEYjin3CJIHr3K0DZADWXXSXCSao16kC
MXU=
=PzrU
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.11-2024-07-19' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- reduce duplicate swiotlb pool lookups (Michael Kelley)
- minor small fixes (Yicong Yang, Yang Li)
* tag 'dma-mapping-6.11-2024-07-19' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix kernel-doc description for swiotlb_del_transient
swiotlb: reduce swiotlb pool lookups
dma-mapping: benchmark: Don't starve others when doing the test
Since commit 0108a4e9f3 ("bpf: ensure main program has an extable"),
prog->aux->func[0]->kallsyms is left as uninitialized. For BPF programs
with subprogs, the symbol for the main program is missing just as shown
in the output of perf script below:
ffffffff81284b69 qp_trie_lookup_elem+0xb9 ([kernel.kallsyms])
ffffffffc0011125 bpf_prog_a4a0eb0651e6af8b_lookup_qp_trie+0x5d (bpf...)
ffffffff8127bc2b bpf_for_each_array_elem+0x7b ([kernel.kallsyms])
ffffffffc00110a1 +0x25 ()
ffffffff8121a89a trace_call_bpf+0xca ([kernel.kallsyms])
Fix it by always using prog instead prog->aux->func[0] to emit ksymbol
event for the main program. After the fix, the output of perf script
will be correct:
ffffffff81284b96 qp_trie_lookup_elem+0xe6 ([kernel.kallsyms])
ffffffffc001382d bpf_prog_a4a0eb0651e6af8b_lookup_qp_trie+0x5d (bpf...)
ffffffff8127bc2b bpf_for_each_array_elem+0x7b ([kernel.kallsyms])
ffffffffc0013779 bpf_prog_245c55ab25cfcf40_qp_trie_lookup+0x25 (bpf...)
ffffffff8121a89a trace_call_bpf+0xca ([kernel.kallsyms])
Fixes: 0108a4e9f3 ("bpf: ensure main program has an extable")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Krister Johansen <kjlx@templeofstupid.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240714065533.1112616-1-houtao@huaweicloud.com
dmam_free_coherent() frees a DMA allocation, which makes the
freed vaddr available for reuse, then calls devres_destroy()
to remove and free the data structure used to track the DMA
allocation. Between the two calls, it is possible for a
concurrent task to make an allocation with the same vaddr
and add it to the devres list.
If this happens, there will be two entries in the devres list
with the same vaddr and devres_destroy() can free the wrong
entry, triggering the WARN_ON() in dmam_match.
Fix by destroying the devres entry before freeing the DMA
allocation.
Tested:
kokonut //net/encryption
http://sponge2/b9145fe6-0f72-4325-ac2f-a84d81075b03
Fixes: 9ac7849e35 ("devres: device resource management")
Signed-off-by: Lance Richardson <rlance@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Additional fixes on top of the original 6.11 pull request:
- undefined behaviour fixes, originally noted as breaking userspace LTO
builds
- fix a spurious warning in fsck_err, reported by Marcin
- fix an integer overflow on trans->nr_updates, also reported by Marcin;
this broke during deletion of highly fragmented indirect extents
- Add comments for lockdep functions
======
- Metadata version 1.8: Stripe sectors accounting, BCH_DATA_unstriped
This splits out the accounting of dirty sectors and stripe sectors in
alloc keys; this lets us see stripe buckets that still have unstriped
data in them.
This is needed for ensuring that erasure coding is working correctly, as
well as completing stripe creation after a crash.
- Metadata version 1.9: Disk accounting rewrite
The previous disk accounting scheme relied heavily on percpu counters
that were also sharded by outstanding journal buffer; it was fast but
not extensible or scalable, and meant that all accounting counters were
recorded in every journal entry.
The new disk accounting scheme stores accounting as normal btree keys;
updates are deltas until they are flushed by the btree write buffer.
This means we have no practical limit on the number of counters, and a
new tagged union format that's easy to extend.
We now have counters for compression type/ratio, per-snapshot-id usage,
per-btree-id usage, and pending rebalance work.
- Self healing on read IO/checksum error
data is now automatically rewritten if we get a read error and then a
successful retry
- Mount API conversion (thanks to Thomas Bertschinger)
- Better lockdep coverage
Previously, btree node locks were tracked individually by lockdep, like
any other lock. But we may take _many_ btree node locks simultaneously,
we easily blow through the limit of 48 locks that lockdep can track,
leading to lockdep turning itself off.
Tracking each btree node lock individually isn't really necessary since
we have our own cycle detector for deadlock avoidance and centralized
tracking of btree node locks, so we now have a single lockdep_map in
btree_trans for "any btree nodes are locked".
- some more small incremental work towards online check_allocations
- lots more debugging improvements, fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmaZmVYACgkQE6szbY3K
bnayLBAAr/RB75neEVzNqzE/qLoz9HBfPs7NrGNZg3bLzie9BvcEKGf3VUs2mm83
6qN/bCJRBhd2vdMcFvIrYYqvD+F2qFFFrBjTzY20toReCwO6q5o8Ihfzv+uSu1qn
Lg/6AbwwDwHPgSLcFM6yVfNsNBpOZ4tru7xble5/89RGp5uhbU38tsFwkUcngN/t
4GWjtUYC+rFZaJSbt+wtGtM++nSURhMu1rxbR3MnVLT3JNW0wiG+FnRymCRQOYwx
eslI6wP3JArTKMvACJqXTDivmjUPNdvsJ26LW6b5KBLi411OgV219ek0mJlkc5gl
lOOZ5LPmPqn0BxsIdqOd+/OGOpvYkfWEyDj6/46K8KKFt+i++vCTzqIeL06HGQFS
oCZajoHseLTlWPZTnoi3/KPWkKSJyaROrvFPSHT5/9sJfeeFt88hNSMP9g4+S4GI
QSXK70GzEVjxr7YzUZwZUmRGWHt7YcS/qkTKJZRkLwUbr2BnrKEJhOa0t/i3RopN
glFP/hP3dObTcWUF4xH90htfD2E/wHbP7nPwNCL6367n18Uj4TJD09vtM8xHsXSR
YXECCjfskVDHAQJw/aV5jF1NpaeSW6g/bILi8tQhvRpfzLLvsaVGxA1xp5v+VZEQ
w3WBepPofEE1EaVfKNeHCiE7STAnSVbWYWTatmjPdqVCyT7C9S0=
=tnhh
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-07-18.2' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs updates from Kent Overstreet:
- Metadata version 1.8: Stripe sectors accounting, BCH_DATA_unstriped
This splits out the accounting of dirty sectors and stripe sectors in
alloc keys; this lets us see stripe buckets that still have unstriped
data in them.
This is needed for ensuring that erasure coding is working correctly,
as well as completing stripe creation after a crash.
- Metadata version 1.9: Disk accounting rewrite
The previous disk accounting scheme relied heavily on percpu counters
that were also sharded by outstanding journal buffer; it was fast but
not extensible or scalable, and meant that all accounting counters
were recorded in every journal entry.
The new disk accounting scheme stores accounting as normal btree
keys; updates are deltas until they are flushed by the btree write
buffer.
This means we have no practical limit on the number of counters, and
a new tagged union format that's easy to extend.
We now have counters for compression type/ratio, per-snapshot-id
usage, per-btree-id usage, and pending rebalance work.
- Self healing on read IO/checksum error
Data is now automatically rewritten if we get a read error and then a
successful retry
- Mount API conversion (thanks to Thomas Bertschinger)
- Better lockdep coverage
Previously, btree node locks were tracked individually by lockdep,
like any other lock. But we may take _many_ btree node locks
simultaneously, we easily blow through the limit of 48 locks that
lockdep can track, leading to lockdep turning itself off.
Tracking each btree node lock individually isn't really necessary
since we have our own cycle detector for deadlock avoidance and
centralized tracking of btree node locks, so we now have a single
lockdep_map in btree_trans for "any btree nodes are locked".
- Some more small incremental work towards online check_allocations
- Lots more debugging improvements
- Fixes, including:
- undefined behaviour fixes, originally noted as breaking userspace
LTO builds
- fix a spurious warning in fsck_err, reported by Marcin
- fix an integer overflow on trans->nr_updates, also reported by
Marcin; this broke during deletion of highly fragmented indirect
extents
* tag 'bcachefs-2024-07-18.2' of https://evilpiepirate.org/git/bcachefs: (120 commits)
lockdep: Add comments for lockdep_set_no{validate,track}_class()
bcachefs: Fix integer overflow on trans->nr_updates
bcachefs: silence silly kdoc warning
bcachefs: Fix fsck warning about btree_trans not passed to fsck error
bcachefs: Add an error message for insufficient rw journal devs
bcachefs: varint: Avoid left-shift of a negative value
bcachefs: darray: Don't pass NULL to memcpy()
bcachefs: Kill bch2_assert_btree_nodes_not_locked()
bcachefs: Rename BCH_WRITE_DONE -> BCH_WRITE_SUBMITTED
bcachefs: __bch2_read(): call trans_begin() on every loop iter
bcachefs: show none if label is not set
bcachefs: drop packed, aligned from bkey_inode_buf
bcachefs: btree node scan: fall back to comparing by journal seq
bcachefs: Add lockdep support for btree node locks
lockdep: lockdep_set_notrack_class()
bcachefs: Improve copygc_wait_to_text()
bcachefs: Convert clock code to u64s
bcachefs: Improve startup message
bcachefs: Self healing on read IO error
bcachefs: Make read_only a mount option again, but hidden
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmaXl0kACgkQu+CwddJF
iJrOlgf+N/G7BmgoW2CBF7mKsvCYs+pX3xeBuxPtsuq4FD386nsPFMN8gWAYLG3q
ZU1z1S+0M8LhTg6/G9jMYLHt2Y7WhYbhFTjTHmULJkuhMDTUP9CRYy4XZ+hdPtHF
30ezSdJQF9x/XxCSaaRVK1s+SMVHFg5xAOHKpfkNSamcMz9g+ZkYyPBr10/VoKd0
JqwhW7r6hrlvWAiqY3QKCOvohIWglgvBUnNjUGMh1cUkOE2aYLYHklhRwICKgA6z
p/2BUXiAEWUtgBkUrizwm/pdhJXLs0pOeYarVZP1v83tQMxyrc6XLNnqhvxP3DPW
31thF5Rf9I8WaWTczXhxsAwFjqO3KQ==
=4uf9
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"The most prominent change this time is the kmem_buckets based
hardening of kmalloc() allocations from Kees Cook.
We have also extended the kmalloc() alignment guarantees for
non-power-of-two sizes in a way that benefits rust.
The rest are various cleanups and non-critical fixups.
- Dedicated bucket allocator (Kees Cook)
This series [1] enhances the probabilistic defense against heap
spraying/grooming of CONFIG_RANDOM_KMALLOC_CACHES from last year.
kmalloc() users that are known to be useful for exploits can get
completely separate set of kmalloc caches that can't be shared with
other users. The first converted users are alloc_msg() and
memdup_user().
The hardening is enabled by CONFIG_SLAB_BUCKETS.
- Extended kmalloc() alignment guarantees (Vlastimil Babka)
For years now we have guaranteed natural alignment for power-of-two
allocations, but nothing was defined for other sizes (in practice,
we have two such buckets, kmalloc-96 and kmalloc-192).
To avoid unnecessary padding in the rust layer due to its alignment
rules, extend the guarantee so that the alignment is at least the
largest power-of-two divisor of the requested size.
This fits what rust needs, is a superset of the existing
power-of-two guarantee, and does not in practice change the layout
(and thus does not add overhead due to padding) of the kmalloc-96
and kmalloc-192 caches, unless slab debugging is enabled for them.
- Cleanups and non-critical fixups (Chengming Zhou, Suren
Baghdasaryan, Matthew Willcox, Alex Shi, and Vlastimil Babka)
Various tweaks related to the new alloc profiling code, folio
conversion, debugging and more leftovers after SLAB"
Link: https://lore.kernel.org/all/20240701190152.it.631-kees@kernel.org/ [1]
* tag 'slab-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/memcg: alignment memcg_data define condition
mm, slab: move prepare_slab_obj_exts_hook under CONFIG_MEM_ALLOC_PROFILING
mm, slab: move allocation tagging code in the alloc path into a hook
mm/util: Use dedicated slab buckets for memdup_user()
ipc, msg: Use dedicated slab buckets for alloc_msg()
mm/slab: Introduce kmem_buckets_create() and family
mm/slab: Introduce kvmalloc_buckets_node() that can take kmem_buckets argument
mm/slab: Plumb kmem_buckets into __do_kmalloc_node()
mm/slab: Introduce kmem_buckets typedef
slab, rust: extend kmalloc() alignment guarantees to remove Rust padding
slab: delete useless RED_INACTIVE and RED_ACTIVE
slab: don't put freepointer outside of object if only orig_size
slab: make check_object() more consistent
mm: Reduce the number of slab->folio casts
mm, slab: don't wrap internal functions with alloc_hooks()
Up until now, the function graph tracer could only have a single user
attached to it. If another user tried to attach to the function graph
tracer while one was already attached, it would fail. Allowing function
graph tracer to have more than one user has been asked for since 2009, but
it required a rewrite to the logic to pull it off so it never happened.
Until now!
There's three systems that trace the return of a function. That is
kretprobes, function graph tracer, and BPF. kretprobes and function graph
tracing both do it similarly. The difference is that kretprobes uses a
shadow stack per callback and function graph tracer creates a shadow stack
for all tasks. The function graph tracer method makes it possible to trace
the return of all functions. As kretprobes now needs that feature too,
allowing it to use function graph tracer was needed. BPF also wants to
trace the return of many probes and its method doesn't scale either.
Having it use function graph tracer would improve that.
By allowing function graph tracer to have multiple users allows both
kretprobes and BPF to use function graph tracer in these cases. This will
allow kretprobes code to be removed in the future as it's version will no
longer be needed. Note, function graph tracer is only limited to 16
simultaneous users, due to shadow stack size and allocated slots.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZpbWlxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgtvAP9jxmgEiEhz4Bpe1vRKVSMYK6ozXHTT
7MFKRMeQqQ8zeAEA2sD5Zrt9l7zKzg0DFpaDLgc3/yh14afIDxzTlIvkmQ8=
=umuf
-----END PGP SIGNATURE-----
Merge tag 'ftrace-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace updates from Steven Rostedt:
"Rewrite of function graph tracer to allow multiple users
Up until now, the function graph tracer could only have a single user
attached to it. If another user tried to attach to the function graph
tracer while one was already attached, it would fail. Allowing
function graph tracer to have more than one user has been asked for
since 2009, but it required a rewrite to the logic to pull it off so
it never happened. Until now!
There's three systems that trace the return of a function. That is
kretprobes, function graph tracer, and BPF. kretprobes and function
graph tracing both do it similarly. The difference is that kretprobes
uses a shadow stack per callback and function graph tracer creates a
shadow stack for all tasks. The function graph tracer method makes it
possible to trace the return of all functions. As kretprobes now needs
that feature too, allowing it to use function graph tracer was needed.
BPF also wants to trace the return of many probes and its method
doesn't scale either. Having it use function graph tracer would
improve that.
By allowing function graph tracer to have multiple users allows both
kretprobes and BPF to use function graph tracer in these cases. This
will allow kretprobes code to be removed in the future as it's version
will no longer be needed.
Note, function graph tracer is only limited to 16 simultaneous users,
due to shadow stack size and allocated slots"
* tag 'ftrace-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (49 commits)
fgraph: Use str_plural() in test_graph_storage_single()
function_graph: Add READ_ONCE() when accessing fgraph_array[]
ftrace: Add missing kerneldoc parameters to unregister_ftrace_direct()
function_graph: Everyone uses HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, remove it
function_graph: Fix up ftrace_graph_ret_addr()
function_graph: Make fgraph_update_pid_func() a stub for !DYNAMIC_FTRACE
function_graph: Rename BYTE_NUMBER to CHAR_NUMBER in selftests
fgraph: Remove some unused functions
ftrace: Hide one more entry in stack trace when ftrace_pid is enabled
function_graph: Do not update pid func if CONFIG_DYNAMIC_FTRACE not enabled
function_graph: Make fgraph_do_direct static key static
ftrace: Fix prototypes for ftrace_startup/shutdown_subops()
ftrace: Assign RCU list variable with rcu_assign_ptr()
ftrace: Assign ftrace_list_end to ftrace_ops_list type cast to RCU
ftrace: Declare function_trace_op in header to quiet sparse warning
ftrace: Add comments to ftrace_hash_move() and friends
ftrace: Convert "inc" parameter to bool in ftrace_hash_rec_update_modify()
ftrace: Add comments to ftrace_hash_rec_disable/enable()
ftrace: Remove "filter_hash" parameter from __ftrace_hash_rec_update()
ftrace: Rename dup_hash() and comment it
...
- Set rtla/osnoise default threshold to 1us from 5us
The 5us default was missing noise that people cared about.
Changing it to 1us makes it work as expected.
- Restructure how sched_switch prev_comm and next_comm was being saved.
The prev_comm was being saved along with the other next fields, and the
next_comm was being saved along with the other prev fields. This is just
a cosmetic change.
- Have the allocation of pid_list use GFP_NOWAIT instead of GFP_KERNEL.
The allocation can happen in irq_work context, but luckily, the size
was by default so large, it was never triggered. But in case it ever is,
use the NOWAIT allocation in the interrupt context.
- Fix some kernel doc errors.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZpbRchQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qt9qAQCM/FNa1mRRlnCxMD+SrcGJmW4j/VvE
5c+QY5ezCYxGjwEA5tYwXttoZ1WHGRWCvuRSnkWidE1K0HpRyITo3BVzaQU=
=Sw4T
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"Trivial updates for 6.11:
- Set rtla/osnoise default threshold to 1us from 5us
The 5us default was missing noise that people cared about. Changing
it to 1us makes it work as expected.
- Restructure how sched_switch prev_comm and next_comm was being saved
The prev_comm was being saved along with the other next fields, and
the next_comm was being saved along with the other prev fields.
This is just a cosmetic change.
- Have the allocation of pid_list use GFP_NOWAIT instead of GFP_KERNEL
The allocation can happen in irq_work context, but luckily, the
size was by default so large, it was never triggered. But in case
it ever is, use the NOWAIT allocation in the interrupt context.
- Fix some kernel doc errors"
* tag 'trace-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
trace/pid_list: Change gfp flags in pid_list_fill_irq()
tracing/sched: sched_switch: place prev_comm and next_comm in right order
rtla/osnoise: set the default threshold to 1us
tracing: Fix trace_pid_list_free() kernel-doc
Uprobes:
- x86/shstk: Make return uprobe work with shadow stack.
- Add uretprobe syscall which speeds up the uretprobe 10-30% faster. This
syscall is automatically used from user-space trampolines which are
generated by the uretprobe. If this syscall is used by normal
user program, it will cause SIGILL. Note that this is currently only
implemented on x86_64.
(This also has 2 fixes for adjusting the syscall number to avoid conflict
with new *attrat syscalls.)
- uprobes/perf: fix user stack traces in the presence of pending uretprobe.
This corrects the uretprobe's trampoline address in the stacktrace with
correct return address.
- selftests/x86: Add a return uprobe with shadow stack test.
- selftests/bpf: Add uretprobe syscall related tests.
. test case for register integrity check.
. test case with register changing case.
. test case for uretprobe syscall without uprobes (expected to be failed).
. test case for uretprobe with shadow stack.
- selftests/bpf: add test validating uprobe/uretprobe stack traces
- MAINTAINERS: Add uprobes entry. This does not specify the tree but to
clarify who maintains and reviews the uprobes.
Kprobes:
- tracing/kprobes: Test case cleanups. Replace redundant WARN_ON_ONCE() +
pr_warn() with WARN_ONCE() and remove unnecessary code from selftest.
- tracing/kprobes: Add symbol counting check when module loads. This
checks the uniqueness of the probed symbol on modules. The same check
has already done for kernel symbols.
(This also has a fix for build error with CONFIG_MODULES=n)
Cleanup:
- Add MODULE_DESCRIPTION() macros for fprobe and kprobe examples.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmaWYxwbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bsUgH/3JcSzDZujQWCZ1f4fJn
QecvTFSYcCl6ck8+/3wm4EsgeCXIFOyPnoPc7k2Gm+l6Dlk1DKGV6wV4tuKFUq9X
9mplcwoVA0Ln+EX9zv9v4s99yUGxcU9xjgC9XT7J52SvqYncPIi6dR0Z9wlJBmyd
Bx3cZk+wSzCYaoqYngI2fKlzsEcYgDIP999fQPRi0HGzNZujc4xeJyjCTC/48yWO
9kreRQq6wFdgRQTwMcR/fKPDKIGZQCU8jkXv5crVV5K3rNaBcwBmCJJMP8PzPU0V
UQ0+8RZK+Qk8SBwXcMNVRqm/efTderob4IYxP8OBe5wjAIE7+vu8r6sqwxRIS54M
Cyg=
=DRSr
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
"Uprobes:
- x86/shstk: Make return uprobe work with shadow stack
- Add uretprobe syscall which speeds up the uretprobe 10-30% faster.
This syscall is automatically used from user-space trampolines
which are generated by the uretprobe. If this syscall is used by
normal user program, it will cause SIGILL. Note that this is
currently only implemented on x86_64.
(This also has two fixes for adjusting the syscall number to avoid
conflict with new *attrat syscalls.)
- uprobes/perf: fix user stack traces in the presence of pending
uretprobe. This corrects the uretprobe's trampoline address in the
stacktrace with correct return address
- selftests/x86: Add a return uprobe with shadow stack test
- selftests/bpf: Add uretprobe syscall related tests.
- test case for register integrity check
- test case with register changing case
- test case for uretprobe syscall without uprobes (expected to fail)
- test case for uretprobe with shadow stack
- selftests/bpf: add test validating uprobe/uretprobe stack traces
- MAINTAINERS: Add uprobes entry. This does not specify the tree but
to clarify who maintains and reviews the uprobes
Kprobes:
- tracing/kprobes: Test case cleanups.
Replace redundant WARN_ON_ONCE() + pr_warn() with WARN_ONCE() and
remove unnecessary code from selftest
- tracing/kprobes: Add symbol counting check when module loads.
This checks the uniqueness of the probed symbol on modules. The
same check has already done for kernel symbols
(This also has a fix for build error with CONFIG_MODULES=n)
Cleanup:
- Add MODULE_DESCRIPTION() macros for fprobe and kprobe examples"
* tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
MAINTAINERS: Add uprobes entry
selftests/bpf: Change uretprobe syscall number in uprobe_syscall test
uprobe: Change uretprobe syscall scope and number
tracing/kprobes: Fix build error when find_module() is not available
tracing/kprobes: Add symbol counting check when module loads
selftests/bpf: add test validating uprobe/uretprobe stack traces
perf,uprobes: fix user stack traces in the presence of pending uretprobes
tracing/kprobe: Remove cleanup code unrelated to selftest
tracing/kprobe: Integrate test warnings into WARN_ONCE
selftests/bpf: Add uretprobe shadow stack test
selftests/bpf: Add uretprobe syscall call from user space test
selftests/bpf: Add uretprobe syscall test for regs changes
selftests/bpf: Add uretprobe syscall test for regs integrity
selftests/x86: Add return uprobe shadow stack test
uprobe: Add uretprobe syscall to speed up return probe
uprobe: Wire up uretprobe system call
x86/shstk: Make return uprobe work with shadow stack
samples: kprobes: add missing MODULE_DESCRIPTION() macros
fprobe: add missing MODULE_DESCRIPTION() macro
Now that the platform MSI hack is gone, nothing needs to know about struct
msi_device_data outside of the core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Shivamurthy Shastri <shivamurthy.shastri@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240623142236.003295177@linutronix.de
For systems on which the performance counter can expire early due to turbo
modes the watchdog handler has a safety net in place which validates that
since the last watchdog event there has at least 4/5th of the watchdog
period elapsed.
This works reliably only after the first watchdog event because the per
CPU variable which holds the timestamp of the last event is never
initialized.
So a first spurious event will validate against a timestamp of 0 which
results in a delta which is likely to be way over the 4/5 threshold of the
period. As this might happen before the first watchdog hrtimer event
increments the watchdog counter, this can lead to false positives.
Fix this by initializing the timestamp before enabling the hardware event.
Reset the rearm counter as well, as that might be non zero after the
watchdog was disabled and reenabled.
Link: https://lkml.kernel.org/r/87frsfu15a.ffs@tglx
Fixes: 7edaeb6841 ("kernel/watchdog: Prevent false positives with turbo modes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This mostly reverts commit af3b854492 ("mm/page_alloc.c: allow error
injection"). The commit made should_fail_alloc_page() a noinline function
that's always called from the page allocation hotpath, even if it's empty
because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to
disable it and prevent the associated function call overhead.
As with the preceding patch "mm, slab: put should_failslab back behind
CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the
should_fail_alloc_page() back behind the config option. When enabled, the
ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a
complete revert.
Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-2-9e2651945d68@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "revert unconditional slab and page allocator fault injection
calls".
These two patches largely revert commits that added function call overhead
into slab and page allocation hotpaths and that cannot be currently
disabled even though related CONFIG_ options do exist.
A much more involved solution that can keep the callsites always existing
but hidden behind a static key if unused, is possible [1] and can be
pursued by anyone who believes it's necessary. Meanwhile the fact the
should_failslab() error injection is already not functional on kernels
built with current gcc without anyone noticing [2], and lukewarm response
to [1] suggests the need is not there. I believe it will be more fair to
have the state after this series as a baseline for possible further
optimisation, instead of the unconditional overhead.
For example a possible compromise for anyone who's fine with an empty
function call overhead but not the full CONFIG_FAILSLAB /
CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a
static key check only inside should_failslab() and
should_fail_alloc_page() before performing the more expensive checks.
[1] https://lore.kernel.org/all/20240620-fault-injection-statickeys-v2-0-e23947d3d84b@suse.cz/#t
[2] https://github.com/bpftrace/bpftrace/issues/3258
This patch (of 2):
This mostly reverts commit 4f6923fbb3 ("mm: make should_failslab always
available for fault injection"). The commit made should_failslab() a
noinline function that's always called from the slab allocation hotpath,
even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and
there is no option to disable that call. This is visible in profiles and
the function call overhead can be noticeable especially with cpu
mitigations.
Meanwhile the bpftrace program example in the commit silently does not
work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the
empty function gets a .constprop clone that is actually being called
(uselessly) from the slab hotpath, while the error injection is hooked to
the original function that's not being called at all [1].
Thus put the whole should_failslab() function back behind
CONFIG_SHOULD_FAILSLAB. It's not a complete revert of 4f6923fbb3 - the
int return type that returns -ENOMEM on failure is preserved, as well
ALLOW_ERROR_INJECTION annotation. The BTF_ID() record that was meanwhile
added is also guarded by CONFIG_SHOULD_FAILSLAB.
[1] https://github.com/bpftrace/bpftrace/issues/3258
Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-0-9e2651945d68@suse.cz
Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-1-9e2651945d68@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
patchsets (devmem among them) did not make it in time.
Core & protocols
----------------
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT.
- Use flex array for netdevice priv area, ensure its cache alignment.
- Add a sysctl knob to allow user to specify a default rto_min at socket
init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful.
- Support scheduling transmission of packets based on CLOCK_TAI.
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off
using cpusets.
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address.
- Allow configuration of multipath hash seed, to both allow synchronizing
hashing of two routers, and preventing partial accidental sync.
- Improve TCP compliance with RFC 9293 for simultaneous connect().
- Support sending NAT keepalives in IPsec ESP in UDP states. Userspace
IKE daemon had to do this before, but the kernel can better keep
track of it.
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled.
- Introduce IPPROTO_SMC for selecting SMC when socket is created.
- Allow UDP GSO transmit from devices with no checksum offload.
- openvswitch: add packet sampling via psample, separating the sampled
traffic from "upcall" packets sent to user space for forwarding.
- nf_tables: shrink memory consumption for transaction objects.
Things we sprinkled into general kernel code
--------------------------------------------
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver
for QCA6390).
- Add IRQ information in sysfs for auxiliary bus.
- Introduce guard definition for local_lock.
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures.
BPF
---
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered.
- Add new kfuncs for a generic, open-coded bits iterator.
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head.
- Support resilient split BTF which cuts down on duplication and makes
BTF as compact as possible WRT BTF from modules.
- Add support for dumping kfunc prototypes from BTF which enables both
detecting as well as dumping compilable prototypes for kfuncs.
- riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter.
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs.
Driver API
----------
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose.
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits.
- Track additional RSS contexts in the core, make sure configuration
changes don't break them.
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP
data paths.
- Support updating firmware on SFP modules.
Tests and tooling
-----------------
- mptcp: use net/lib.sh to manage netns.
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints.
- openvswitch: make test self-contained (don't depend on OvS CLI tools).
Drivers
-------
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead
and skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps
to catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead of
in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaWjBwACgkQMUZtbf5S
IrvuSRAAkJuEzTRqgURBCe4eNEQde6mJJig7l2CKHwCbFiHZpRkFHf8qKbcGWbL6
uLW33SWnKtJVDhxVKWHLq635XW7BAa80YhqGw21GDi+mIEhWXZglHj3xbXNxsMfE
4eg/kG4BkfYWFmHaXOwVWV/mr7nXf6j7WmXNeXEi32ufE1j0OL+YlQenKnMj8yP2
j9JmYa2Chwppng1SblHmcjmGkdNVwFhStKeCG+2K7v06wdDH/QYBlbgUv9gw/cxp
NlW//wgiaeX40U4O3kDwt9C+LDoh+0VrDDeVdQ+IsScLtY3PhAzEoKolFYTq2HSr
I1JpoaHNnyNsJq3DZrACQ5WlH4yDn6C2EUB6dxNnFaI9F1ZPsi+7MTl6Sei1AklD
TuQTj/lxOACBwW2Q77NU72uoxiIUauesGPHcnrAFuoCIEhZF0mso7k59BvrXhsOP
QwcLbQdc1YHNkqv/Vc7NBY+ruMsYB+5Ubbhhj2p27dp/CWFIwxI29fze4dn2uhO6
ejHN3mbqwPdSzg12YJtM6Iq61Cnwo2eVSvhTxl+ZVSZtI4nu2arzR+y7QTYmNrXP
6tkgVN9UsWeLl2xJ8wyyqL5mcvNHP2rPXWZ2X56iTaa26m+UlleeQ7YRaYtQAAr0
Ec/vlDMX64SwHhd+qwE99DXGQf2g+KklHKSLsnajJUVrWFTlRI0=
=opz8
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Not much excitement - a handful of large patchsets (devmem among them)
did not make it in time.
Core & protocols:
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT
- Use flex array for netdevice priv area, ensure its cache alignment
- Add a sysctl knob to allow user to specify a default rto_min at
socket init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful
- Support scheduling transmission of packets based on CLOCK_TAI
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned
off using cpusets
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address
- Allow configuration of multipath hash seed, to both allow
synchronizing hashing of two routers, and preventing partial
accidental sync
- Improve TCP compliance with RFC 9293 for simultaneous connect()
- Support sending NAT keepalives in IPsec ESP in UDP states.
Userspace IKE daemon had to do this before, but the kernel can
better keep track of it
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled
- Introduce IPPROTO_SMC for selecting SMC when socket is created
- Allow UDP GSO transmit from devices with no checksum offload
- openvswitch: add packet sampling via psample, separating the
sampled traffic from "upcall" packets sent to user space for
forwarding
- nf_tables: shrink memory consumption for transaction objects
Things we sprinkled into general kernel code:
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver for
QCA6390) [ Already merged separately - Linus ]
- Add IRQ information in sysfs for auxiliary bus
- Introduce guard definition for local_lock
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures
BPF:
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered
- Add new kfuncs for a generic, open-coded bits iterator
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head
- Support resilient split BTF which cuts down on duplication and
makes BTF as compact as possible WRT BTF from modules
- Add support for dumping kfunc prototypes from BTF which enables
both detecting as well as dumping compilable prototypes for kfuncs
- riscv64 BPF JIT improvements in particular to add 12-argument
support for BPF trampolines and to utilize bpf_prog_pack for the
latter
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs
Driver API:
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits
- Track additional RSS contexts in the core, make sure configuration
changes don't break them
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
ESP data paths
- Support updating firmware on SFP modules
Tests and tooling:
- mptcp: use net/lib.sh to manage netns
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints
- openvswitch: make test self-contained (don't depend on OvS CLI
tools)
Drivers:
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead and
skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps to
catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead
of in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591"
* tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits)
eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering"
tcp: Replace strncpy() with strscpy()
wifi: ath12k: fix build vs old compiler
tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child().
eth: fbnic: Write the TCAM tables used for RSS control and Rx to host
eth: fbnic: Add L2 address programming
eth: fbnic: Add basic Rx handling
eth: fbnic: Add basic Tx handling
eth: fbnic: Add link detection
eth: fbnic: Add initial messaging to notify FW of our presence
eth: fbnic: Implement Rx queue alloc/start/stop/free
eth: fbnic: Implement Tx queue alloc/start/stop/free
eth: fbnic: Allocate a netdevice and napi vectors with queues
eth: fbnic: Add FW communication mechanism
eth: fbnic: Add message parsing for FW messages
eth: fbnic: Add register init to set PCIe/Ethernet device config
eth: fbnic: Allocate core device specific structures and devlink interface
eth: fbnic: Add scaffolding for Meta's NIC driver
PCI: Add Meta Platforms vendor ID
net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK
...
This KUnit next update for Linux 6.11-rc1 consists of:
-- adds vm_mmap() allocation resource manager
-- converts usercopy kselftest to KUnit
-- disables usercopy testing on !CONFIG_MMU
-- adds MODULE_DESCRIPTION() to core, list, and usercopy tests
-- adds tests for assertion formatting functions - assert.c
-- introduces KUNIT_ASSERT_MEMEQ and KUNIT_ASSERT_MEMNEQ macros
-- fixes KUNIT_ASSERT_STRNEQ comments to make it clear that it is
an assertion
-- renames KUNIT_ASSERT_FAILURE to KUNIT_FAIL_AND_ABORT
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmaWpCYACgkQCwJExA0N
QxwdPQ/9G26Q+xhbieosvXHu/04ZWTcuUP/cFRv56jLH9bKm25YbW8WZzKM/imE5
So35IT6SIYlwxn9fYyriPz372h3ZC522cu8tIVrUh5Uo3O5LbzQqdrxos9a+RuCg
u6lenSksAjJRZ3S3IKDJ1ErxLnPYKyjjZFwDmV1+0Xxy30SwzFEbQqj9lY2Q4iGs
KWBm0lrFPipbHdBqZcPB/mxIDyF6rhe+oeuOPU8uag6ncNN31xMpDanU8O6XEAz9
QoAiDICANbVKTRKG5xXgmsJtyLF8GON4e49kEYtCLdnESPc39hQtf3cTHeYI22HC
7OWhhOySifNIukFj1hVtxnN3ZfjtBGmbCwe5rXZFvMovE3YwAplKK61GoOaI9UV0
qPk5GGrAb/xEh2HZ9tgf8+CsqmnPQLGnVt2h3u3c28u4YzbkinqVj20KYsye39zz
KzJsO2yDJH4LlIJjc8XWof1cyyo0TIJQVOwJqAieOPePnfs4zabmVOus8y1Cj07V
iAvQTPPoZ165zA1cl0iSMolKkXeAgf2FjlEGbODrktKKX6Ag/PKVp3e6PW28zJbp
0p1V1IDQQAlEhbcRAZb+5y1voh+hcy++KyPwpj7lAVkmHd7RoK/mDL3W+oLdOTrB
aXWs4JOlkmtUaz3EpAQZuvhYWVW7DexR9rU1SF44UAVzSdZSndw=
=nnFR
-----END PGP SIGNATURE-----
Merge tag 'linux_kselftest-kunit-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
- add vm_mmap() allocation resource manager
- convert usercopy kselftest to KUnit
- disable usercopy testing on !CONFIG_MMU
- add MODULE_DESCRIPTION() to core, list, and usercopy tests
- add tests for assertion formatting functions - assert.c
- introduce KUNIT_ASSERT_MEMEQ and KUNIT_ASSERT_MEMNEQ macros
- fix KUNIT_ASSERT_STRNEQ comments to make it clear that it is an
assertion
- rename KUNIT_ASSERT_FAILURE to KUNIT_FAIL_AND_ABORT
* tag 'linux_kselftest-kunit-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Introduce KUNIT_ASSERT_MEMEQ and KUNIT_ASSERT_MEMNEQ macros
kunit: Rename KUNIT_ASSERT_FAILURE to KUNIT_FAIL_AND_ABORT for readability
kunit: Fix the comment of KUNIT_ASSERT_STRNEQ as assertion
kunit: executor: Simplify string allocation handling
kunit/usercopy: Add missing MODULE_DESCRIPTION()
kunit/usercopy: Disable testing on !CONFIG_MMU
usercopy: Convert test_user_copy to KUnit test
kunit: test: Add vm_mmap() allocation resource manager
list: test: add the missing MODULE_DESCRIPTION() macro
kunit: add missing MODULE_DESCRIPTION() macros to core modules
list: test: remove unused struct 'klist_test_struct'
kunit: Cover 'assert.c' with tests
- Intel PT support enhancements & fixes
- Fix leaked SIGTRAP events
- Improve and fix the Intel uncore driver
- Add support for Intel HBM and CXL uncore counters
- Add Intel Lake and Arrow Lake support
- AMD uncore driver fixes
- Make SIGTRAP and __perf_pending_irq() work on RT
- Micro-optimizations
- Misc cleanups and fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaWjncRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iZyg//TSafjCK4N9fyXrPdPqf8L7ntX5uYf0rd
uVZpEo/+VGvuFhznHnZIV2DLetvuwYZcUWszCqQMYfokGGi6WI1/k4MeZkSpN5QE
p5mFk6gW3cmpHT9bECg7mKQH+w7Qna/b6mnA0HYTFxPGmQKdQDl1/S+ZsgWedxpC
4V3re7/FzenFVS45DwSMPi9s7uZzZhVhTSgb4XLy+0Da4S0iRULItBa8HT8HmqE5
v5aQlw3mmwKPUWvyPMi3Sw6RRWK3C+n5ZxWswSYoLSM3dsp1ZD+YYqtOv2GqAx8v
JoL0SOnGnNCfxGHh0kz5D2hztDvq61Enotih2gz7HxvdWh2DasNp4yS1USGQhu5h
VJnKNA0TfOUaYqWFVj0EgRVhDX79lMwSHTkR1DZd4vM2GDigHeRPh0zGSn2w/koV
oCRxFfBoktHBnX0Te1NE2BhojbuKp25vTGK6GriVcHt/RNpuz6hTxsjdJzHCAlVX
M349l0EpUJafvfaIN9zF22uw22J8P9y9JYqI6ebkUIKiuoT9LuafVYhQupSE9H4u
IqlozPCTNw6eAQcUo03gkl3n+SY/DZH6eU2ycKgEp3r7TDGYbJPwxY1BgOHbwi4U
lySM07leso2accSVAz7GDMI3ejj6Sx64asWS1FSwbajDflouaIK2jtey+1IOdXfv
hHY65tomV8U=
=gguT
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
- Intel PT support enhancements & fixes
- Fix leaked SIGTRAP events
- Improve and fix the Intel uncore driver
- Add support for Intel HBM and CXL uncore counters
- Add Intel Lake and Arrow Lake support
- AMD uncore driver fixes
- Make SIGTRAP and __perf_pending_irq() work on RT
- Micro-optimizations
- Misc cleanups and fixes
* tag 'perf-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
perf/x86/intel: Add a distinct name for Granite Rapids
perf/x86/intel/ds: Fix non 0 retire latency on Raptorlake
perf/x86/intel: Hide Topdown metrics events if the feature is not enumerated
perf/x86/intel/uncore: Fix the bits of the CHA extended umask for SPR
perf: Split __perf_pending_irq() out of perf_pending_irq()
perf: Don't disable preemption in perf_pending_task().
perf: Move swevent_htable::recursion into task_struct.
perf: Shrink the size of the recursion counter.
perf: Enqueue SIGTRAP always via task_work.
task_work: Add TWA_NMI_CURRENT as an additional notify mode.
perf: Move irq_work_queue() where the event is prepared.
perf: Fix event leak upon exec and file release
perf: Fix event leak upon exit
task_work: Introduce task_work_cancel() again
task_work: s/task_work_cancel()/task_work_cancel_func()/
perf/x86/amd/uncore: Fix DF and UMC domain identification
perf/x86/amd/uncore: Avoid PMU registration if counters are unavailable
perf/x86/intel: Support Perfmon MSRs aliasing
perf/x86/intel: Support PERFEVTSEL extension
perf/x86: Add config_mask to represent EVENTSEL bitmask
...
- Update Daniel Bristot de Oliveira's entry in MAINTAINERS,
and credit him in CREDITS.
- Harmonize the lock-yielding behavior on dynamically selected
preemption models with static ones.
- Reorganize the code a bit: split out sched/syscalls.c to reduce
the size of sched/core.c
- Micro-optimize psi_group_change()
- Fix set_load_weight() for SCHED_IDLE tasks
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaVtVARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iqTQ/9GLNzNBnl0oBWCiybeQjyWsZ6BiZi48R0
C1g9/RKy++OyGOjn/yqYK0Kg8cdfoGzHGioMMAucHFW1nXZwVw17xAJK127N0apF
83up7AnFJw/JGr1bI0FwuozqHAs4Z5KzHTv2KBxhYuO77lyYna6/t0liRUbF8ZUZ
I/nqav7wDB8RBIB5hEJ/uYLDX7qWdUlyFB+mcvV4ANA99yr++OgipCp6Ob3Rz3cP
O676nKJY4vpNbZ/B6bpKg8ezULRP8re2qD3GJRf2huS63uu/Z5ct7ouLVZ1DwN53
mFDBTYUMI2ToV0pseikuqwnmrjxAKcEajTyZpD3vckafd2TlWIopkQZoQ9XLLlIZ
DxO+KoekaHTSVy8FWlO8O+iE3IAdUUgECEpNveX45Pb7nFP+5dtFqqnVIdNqCq5e
zEuQvizaa5m+A1POZhZKya+z9jbLXXx+gtPCbbADTBWtuyl8azUIh3vjn0bykmv4
IVV/wvUm+BPEIhnKusZZOgB0vLtxUdntBBfUSxqoSOad9L+0/UtSKoKI6wvW00q8
ZkW+85yS3YFiN9W61276RLis2j7OAjE0eDJ96wfhooma2JRDJU4Wmg5oWg8x3WuA
JRmK0s63Qik5gpwG5rHQsR5jNqYWTj5Lp7So+M1kRfFsOM/RXQ/AneSXZu/P7d65
LnYWzbKu76c=
=lLab
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Update Daniel Bristot de Oliveira's entry in MAINTAINERS,
and credit him in CREDITS
- Harmonize the lock-yielding behavior on dynamically selected
preemption models with static ones
- Reorganize the code a bit: split out sched/syscalls.c to reduce
the size of sched/core.c
- Micro-optimize psi_group_change()
- Fix set_load_weight() for SCHED_IDLE tasks
- Misc cleanups & fixes
* tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Update MAINTAINERS and CREDITS
sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
sched/psi: Optimise psi_group_change a bit
sched/core: Drop spinlocks on contention iff kernel is preemptible
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
sched/balance: Skip unnecessary updates to idle load balancer's flags
idle: Remove stale RCU comment
sched/headers: Move struct pre-declarations to the beginning of the header
sched/core: Clean up kernel/sched/sched.h a bit
sched/core: Simplify prefetch_curr_exec_start()
sched: Fix spelling in comments
sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
- Jump label fixes, including a perf events fix that originally
manifested as jump label failures, but was a serialization bug
at the usage site.
- Mark down_write*() helpers as __always_inline, to improve
WCHAN debuggability.
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaVEV8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iWbQ/+KBa29udVx0qiX/1R/+4EBNKzFDyvhobu
9XuAUZJZ9g46PFqnXlnA/VhGff52M2I+1scva11OPnb2NnOhAN1yUFQJaZh0HM+y
GTsaLQEdJfNKv1Z0i05AnYE7BKFEnoksWInsOg6Or5KoML5RS6vIyLWsBQpCjelM
ZxAYNHVpI5qeeUT7dqy9bkBxUXgOpc5X5+qZmmBnBDTGvhsPviV88gL1Ol2tjmGi
hhALK9RdLMltrNEejDc9sBcLw/qRA5QOUDDxSoBykSOBaljJB+3BcTCuXdUJT24Y
xpufbESHyl7HAmdvRU+n4ypo4kuKSIcyF1+V6LcMmNi0jSsyUG7x+SuhJetQHd2z
kLsYD18JbHHBVz7/047DnVTtyTcifsDpJi4MqEYYrUhMw1ns/goEiLRgSEYdB/FT
eCjdZxbCzkKzmV42/gqkM/mv8/pWkhvAqZ1/LNkHSOR/JSa5X3NeR0qrj18OKSrS
Du+ycGDqk1PqVcqiC1heBayIU/QoY/sEkdktTtbAKSujKCJ+tYhPtqMki6JaOh5I
VPp9ekcbs+Vvuu1qNDcueoe3TqxO2wfchKMNcSD8+I68EZtxHWZN3iHtTOlqGYVr
uZuG0V2WSgGvUi7k0UIPpnnmZ2+XejA++VnjVitXM3Z+xf7MlrPNy/4fzUbkgTQn
T8/Tt7SytQE=
=T1KT
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Jump label fixes, including a perf events fix that originally
manifested as jump label failures, but was a serialization bug at the
usage site
- Mark down_write*() helpers as __always_inline, to improve WCHAN
debuggability
- Misc cleanups and fixes
* tag 'locking-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Add __always_inline annotation to __down_write_common() and inlined callers
jump_label: Simplify and clarify static_key_fast_inc_cpus_locked()
jump_label: Clarify condition in static_key_fast_inc_not_disabled()
jump_label: Fix concurrency issues in static_key_slow_dec()
perf/x86: Serialize set_attr_rdpmc()
cleanup: Standardize the header guard define's name
Summary
* Remove "->procname == NULL" check when iterating through sysctl table arrays
Removing sentinels in ctl_table arrays reduces the build time size and
runtime memory consumed by ~64 bytes per array. With all ctl_table
sentinels gone, the additional check for ->procname == NULL that worked in
tandem with the ARRAY_SIZE to calculate the size of the ctl_table arrays is
no longer needed and has been removed. The sysctl register functions now
returns an error if a sentinel is used.
* Preparation patches for sysctl constification
Constifying ctl_table structs prevents the modification of proc_handler
function pointers as they would reside in .rodata. The ctl_table arguments
in sysctl utility functions are const qualified in preparation for a future
treewide proc_handler argument constification commit.
* Misc fixes
Increase robustness of set_ownership by providing sane default ownership
values in case the callee doesn't set them. Bound check proc_dou8vec_minmax
to avoid loading buggy modules and give sysctl testing module a name to
avoid compiler complaints.
Testing
* This got push to linux-next in v6.10-rc2, so it has had more than a month
of testing
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmaWdz4ACgkQupfNUreW
QU/WKQwAkSuUz42yCQye77BK+Z8ANcTF1f3aI/wfv2nahq1GaSrNBpqUiXvEe9Tt
KD2lM1PWiQfizVLIDPh96yxa5q69GQrPPOA/V1jwIXmk/HRpjjoONCFNNXVRCTls
VCqDz/RatuXvzO35Yn87MnWnxv6PiX7X/zq/3WikVsUI381kvTgC6OwZxdFM52w4
ESwOa3LeOovtRnqV5dpHr6DCQKyd0N52nPxgXvaerjlsJsv7PlezN7z9YyLOOfmW
xUD7X6LQcJq7HcEukaB6I9o2GQOi4yYXL2YOzed7qu9Thu+lasEoN3Bd7P+ilXkc
JY6EXJ5o+d69PewKRuJ1QvD7wrHIkhNMNbMtvehNay124wAHDy3KtonFzyvlX4wE
qCHBYc6rySJNhSqwVp9MoksOZfDM99pVIOs9YVIjc90Zzu5J7tORgYWRVOHTcAtj
fd8nMdkK3+ZANapygFCyew6GueIzaqlQwveVgLGw4vc5L3ClknmURit3y487Pzdg
B+BEVlsp
=bs2G
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Remove "->procname == NULL" check when iterating through sysctl table
arrays
Removing sentinels in ctl_table arrays reduces the build time size
and runtime memory consumed by ~64 bytes per array. With all
ctl_table sentinels gone, the additional check for ->procname == NULL
that worked in tandem with the ARRAY_SIZE to calculate the size of
the ctl_table arrays is no longer needed and has been removed. The
sysctl register functions now returns an error if a sentinel is used.
- Preparation patches for sysctl constification
Constifying ctl_table structs prevents the modification of
proc_handler function pointers as they would reside in .rodata. The
ctl_table arguments in sysctl utility functions are const qualified
in preparation for a future treewide proc_handler argument
constification commit.
- Misc fixes
Increase robustness of set_ownership by providing sane default
ownership values in case the callee doesn't set them. Bound check
proc_dou8vec_minmax to avoid loading buggy modules and give sysctl
testing module a name to avoid compiler complaints.
* tag 'sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: Warn on an empty procname element
sysctl: Remove ctl_table sentinel code comments
sysctl: Remove "child" sysctl code comments
sysctl: Remove superfluous empty allocations from sysctl internals
sysctl: Replace nr_entries with ctl_table_size in new_links
sysctl: Remove check for sentinel element in ctl_table arrays
mm profiling: Remove superfluous sentinel element from ctl_table
locking: Remove superfluous sentinel element from kern_lockdep_table
sysctl: Add module description to sysctl-testing
sysctl: constify ctl_table arguments of utility function
utsname: constify ctl_table arguments of utility function
sysctl: move the extra1/2 boundary check of u8 to sysctl_check_table_array
sysctl: always initialize i_uid/i_gid
- interrupt SECCOMP_IOCTL_NOTIF_RECV when all users exit (Andrei Vagin)
- Update selftests to check for expected NOTIF_RECV exits (Andrei Vagin)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmaVTOkACgkQiXL039xt
wCayUA//fJTZUq+idihdKMql9SEwYh0uIJQpGOxxBEesUksZUM4alTCswzmheLFL
q9BxlWWJJfUsp3djeZK+0vnDv3izaR+LfA1JPEJf64ImbympbEenVXK0ZkmVrTqg
bNAlD5c/LVpzYtB7cnOaglq18uUja6/E+EQvNYz5NLHrIhYYCieJZIiFATkHQ9Lj
3Wq3g9FWEa5pZxpKbEI3UA2HllADnmHeb/Z78Zdvyue5lOOvsBIheQfL4m0pW38x
xgBWNglIg7b+X+YgwYSv8w50Lhn4SJVtIynWnwzBz19qFJRQL7oJRj1zyFHZPCwZ
ajHVIj5LOuts/BYxSiGzczxVqZaAqeOyCY5e8G+Mjk5ZD5kLYznbbcrIFIUIaHpx
rpRD/TVVwJ3PHsOIpWHwrXKgoKnbe/0n8lJT+Ehnm/2lLrlGyZj9hLyl6/+JsizE
dGIWgE2emykYI+52IRRYSZaw4hLb+CU52d1vd5a35wUk1ie5fcVGZAWnaul23x0I
maQtXcyB6tYuhX3oPnxoxFVqGvCKGi3Tc5N+Vg4JR/RzTy2H7fZ02DRZq8Vs0QST
EgO3cpCD3035qxkK6ivaV4ebPJjkL158D/+uFyne+PWQlSfrmXJvhfggPFA+Oqtb
y8PTUV73+HxDiruXbGZ7MD8C7d+ZvGI/D+xheohYrijhivcXxcc=
=tlFN
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
- interrupt SECCOMP_IOCTL_NOTIF_RECV when all users exit (Andrei Vagin)
- Update selftests to check for expected NOTIF_RECV exits (Andrei
Vagin)
* tag 'seccomp-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: check that a zombie leader doesn't affect others
selftests/seccomp: add test for NOTIF_RECV and unused filters
seccomp: release task filters when the task exits
seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZpS2TQAKCRCAXGG7T9hj
vjryAQDy08vSiCNYnO4K0AXO9KhCLsXMpbdevTE0yHdtSW/IwQD/eOmrntgBArA4
PfQanbzM3Rj+h6p1zsfvW98DgmFrfAQ=
=tG6C
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- some trivial cleanups
- a fix for the Xen timer
- add boot time selectable debug capability to the Xen multicall
handling
- two fixes for the recently added Xen irqfd handling
* tag 'for-linus-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: remove deprecated xen_nopvspin boot parameter
x86/xen: eliminate some private header files
x86/xen: make some functions static
xen: make multicall debug boot time selectable
xen/arm: Convert comma to semicolon
xen: privcmd: Fix possible access to a freed kirqfd instance
xen: privcmd: Switch from mutex to spinlock for irqfds
xen: add missing MODULE_DESCRIPTION() macros
x86/xen: Convert comma to semicolon
x86/xen/time: Reduce Xen timer tick
xen/manage: Constify struct shutdown_handler
Most of this is part of my ongoing work to clean up the system call
tables. In this bit, all of the newer architectures are converted to
use the machine readable syscall.tbl format instead in place of complex
macros in include/uapi/asm-generic/unistd.h.
This follows an earlier series that fixed various API mismatches
and in turn is used as the base for planned simplifications.
The other two patches are dead code removal and a warning fix.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmaVB1cACgkQYKtH/8kJ
UicMqxAAnYKOxfjoMIhYYK6bl126wg/vIcDcjIR9cNWH21Nhn3qxn11ZXau3S7xv
3l/HreEhyEQr4gC2a70IlXyHUadYOlrk+83OURrunWk1oKPmZlMKcfPVbtp8GL7x
PUNXQfwM1XZLveKwufY24hoZdwKC+Y/5WLc1t0ReznJuAqgeO2rM9W5dnV5bAfCp
he3F5hFcr196Dz3/GJjJIWrY+cbwfmZWsNtj1vFTL5/r/LuCu8HTkqhsGj8tE5BJ
NGVEEXbp5eaVTCIGqJWhnuZcsnKN9kM51M7CtdwWf8OTckUVuJap5OsDVKQkWkGl
bLPbd2jhDltph0sah51hAIvv4WdkThW76u9FRW7KR3fo7ra67eF7l5j7wc1lE2JB
GwLJ1X56Bxe1GhvvNTlDmb7DrnlP/DMPuRv3Z6xyH6l8iZ2pMGlnAxuw6Bs1s6Y5
WSs36ZpnS0ctgjfx37ZITsZSvbKFPpQFJP4siwS8aRNv/NFALNNdFyOCY5lNzspZ
0dxwjn6/7UpHE4MKh6/hvCg2QwupXXBTRytibw+75/rOsR+EYlmtuONtyq2sLUHe
ktJ5pg+8XuZm27+wLffuluzmY7sv2F8OU4cTYeM60Ynmc6pRzwUY6/VhG52S1/mU
Ua4VgYIpzOtlLrYmz5QTWIZpdSFSVbIc/3pLriD6hn4Mvg+BwdA=
=XOhL
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Most of this is part of my ongoing work to clean up the system call
tables. In this bit, all of the newer architectures are converted to
use the machine readable syscall.tbl format instead in place of
complex macros in include/uapi/asm-generic/unistd.h.
This follows an earlier series that fixed various API mismatches and
in turn is used as the base for planned simplifications.
The other two patches are dead code removal and a warning fix"
* tag 'asm-generic-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
vmlinux.lds.h: catch .bss..L* sections into BSS")
fixmap: Remove unused set_fixmap_offset_io()
riscv: convert to generic syscall table
openrisc: convert to generic syscall table
nios2: convert to generic syscall table
loongarch: convert to generic syscall table
hexagon: use new system call table
csky: convert to generic syscall table
arm64: rework compat syscall macros
arm64: generate 64-bit syscall.tbl
arm64: convert unistd_32.h to syscall.tbl format
arc: convert to generic syscall table
clone3: drop __ARCH_WANT_SYS_CLONE3 macro
kbuild: add syscall table generation to scripts/Makefile.asm-headers
kbuild: verify asm-generic header list
loongarch: avoid generating extra header files
um: don't generate asm/bpf_perf_event.h
csky: drop asm/gpio.h wrapper
syscalls: add generic scripts/syscall.tbl
conflicts:
- Carve out CPU hotplug function declarations into a separate header
with the goal to be able to use the lockdep assertions in a more
flexible manner
- As a result, refactor cacheinfo code after carving out a function
to return the cache ID associated with a given cache level
- Cleanups
- Add support to be able to kexec TDX guests. For that
- Expand ACPI MADT CPU offlining support
- Add machinery to prepare CoCo guests memory before kexec-ing into a new
kernel
- Cleanup, readjust and massage related code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVCYoACgkQEsHwGGHe
VUoi6g//Up/4vMzcjqzrndXfl0aP+NpK4zNud+ZPP4Qza2yPhKydniMvkWVQ8DTx
jQaGk/tJDeFG6ofOzGkmBGyuZzuO4D7E0XFyXZZeVgSvdk2Af5vaWu1D3e4i4MiM
Ox4H8NtWnC4MozP0hos4qB0vtYaBWVJkNvIXDVF6162zLwEmbuyrpFe3glscwIxv
hMZR/C47RHcEeOb7yA4m/gJ+AqMe9OKradoNJkkfDpnYr6CYsbmpY09or2WYuvoI
0gevkIe6Q9HMcq3CQl6/pR8IgbA5VmGi7iCiE1ihgTPwR3AaU8llzBqYdSgezFrk
68A7oGeUZQeifQgjwkreZclMtsGEeGWVOB0Bh3Jgr6uaWGFXtpydi/hc73wbTz+F
IazKQcKQYjaPW/9UG+0+cFTQlCgQ+WxwqAsN1uqzL6gMgmC9B+TM//xzk5nVxpOd
ouf8T85tyceIPCKepGE/bWEHYYCjfbqBMyQT6RHmxUKbb1/PIsbzN26cenkZmPXT
cpwurWVG7mRQJRqTrsS+D+opP1h/jOdkpwGlBfl1s0sX6RZuMFBk+7TlMMs61Cyo
PWtrLV7Dr369cuXE72wIgfBAao2AS8kFshc7Atokq7/XfL9cCWHeqIcu7yvParP5
WY43YQv8XPGI7ZnPqULByTY0Wxg8TFk8whamx97kEp8uy2HmbQU=
=k+T+
-----END PGP SIGNATURE-----
Merge tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 confidential computing updates from Borislav Petkov:
"Unrelated x86/cc changes queued here to avoid ugly cross-merges and
conflicts:
- Carve out CPU hotplug function declarations into a separate header
with the goal to be able to use the lockdep assertions in a more
flexible manner
- As a result, refactor cacheinfo code after carving out a function
to return the cache ID associated with a given cache level
- Cleanups
Add support to be able to kexec TDX guests:
- Expand ACPI MADT CPU offlining support
- Add machinery to prepare CoCo guests memory before kexec-ing into a
new kernel
- Cleanup, readjust and massage related code"
* tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed
x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
x86/mm: Introduce kernel_ident_mapping_free()
x86/smp: Add smp_ops.stop_this_cpu() callback
x86/acpi: Do not attempt to bring up secondary CPUs in the kexec case
x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structure
x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump
x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
x86/tdx: Convert shared memory back to private on kexec
x86/mm: Add callbacks to prepare encrypted memory for kexec
x86/tdx: Account shared memory
x86/mm: Return correct level from lookup_address() if pte is none
x86/mm: Make x86_platform.guest.enc_status_change_*() return an error
x86/kexec: Keep CR4.MCE set during kexec for TDX guest
x86/relocate_kernel: Use named labels for less confusion
cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
cpu/hotplug: Add support for declaring CPU offlining not supported
x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
x86/acpi: Extract ACPI MADT wakeup code into a separate file
x86/kexec: Remove spurious unconditional JMP from from identity_mapped()
...
GPIOLIB core:
- rework kfifo handling rework in the character device code
- improve the labeling of GPIOs requested as interrupts and show more info
on interrupt-only GPIOs in debugfs
- remove unused APIs
- unexport interfaces that are only used from the core GPIO code
- drop the return value from gpiochip_set_desc_names() as it cannot fail
- move a string array definition out of a header and into a specific
compilation unit
- convert the last user of gpiochip_get_desc() other than GPIO core to using
a safer alternative
- use array_index_nospec() where applicable
New drivers:
- add a "virtual GPIO consumer" module that allows requesting GPIOs from actual
hardware and driving tests of the in-kernel GPIO API from user-space over
debugfs
- add a GPIO-based "sloppy" logic analyzer module useful for "first glance"
debugging on remote boards
Driver improvements:
- add support for a new model to gpio-pca953x
- lock GPIOs as interrupts in gpio-sim when the lines are requested as irqs
via the simulator domain + some other minor improvements
- improve error reporting in gpio-syscon
- convert gpio-ath79 to using dynamic GPIO base and range
- use pcibios_err_to_errno() for converting PCIBIOS error codes to errno
vaues in gpio-amd8111 and gpio-rdc321x
- allow building gpio-brcmstb for the BCM2835 architecture
DT bindings:
- convert DT bindings for lsi,zevio, mpc8xxx, and atmel to DT schema
- document new properties for aspeed,gpio, fsl,qoriq-gpio and gpio-vf610
- document new compatibles for pca953x and fsl,qoriq-gpio
Documentation:
- document stricter behavior of the GPIO character device uAPI with regards to
reconfiguring requested line without direction set
- clarify the effect of the active-low flag on line values and edges
- remove documentation for the legacy GPIO API in order to stop tempting
people to use it
- document the preference for using pread() for reading edge events in the
sysfs API
Other:
- add an extended initializer to the interrupt simulator allowing to specify
a number of callbacks callers can use to be notified about irqs being
requested and released
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmaU3n4ACgkQEacuoBRx
13LnYxAAgIfV2MQxdlB8+I3+PrObiom9uykNpoj8Ho3FiGroJEmSi2Vg9NOP5j8n
7THjBDFqKk0USMdzGgWDe+u0oCpql8ONLd+lxPiKzRxkebMVzlumeNzWNEE3wqxO
MdV3AOs9DLM1a4MAuv9E8PgooBVR8Cyqs3tc3wwpZRKoSZBIzwrjFL3tO1P8Dezv
9xoPqIiMJRBZr8jifU/ZRdLG3gYKqgQH1Mha7bm94ebUwA6q/hxtGYAtc2a3Q+dF
6lPrPONJBN6/YwvmoDddm2ppoiyWN7QdX9DQjJvKBcNRTZSE1EAggdh8kNnCoa1d
+PeClIAJLl8ZSkdMS8yvMZIpduK4gTl7yEBMkER1d0JoJLkTowqKsvONgU12Npr2
3rwbpACt/kVt5v0lRwaafj5vnD3NgiiVCeuZZz99ICbrXqe6rYszMIemKDYWWlTn
kEFrTM5ql+dwAfvp8Y9JZf4oOgInHbF3LBKM34PKMW9D0a4aQC/HTfmtHobeNHzn
FmY9ysHjMG6fvuwnkpojW6N3/LLwt+TX8jik9x0O42AE7qXn6a8U2g6RUg6rJOdd
mUiIX3+rn+AaI6eKPvUNp2h391jH1K3hBCAca4cNAIKpqPuE/A/B5RyZZnL5Q7HQ
Iz2G3hSlTBVPf7QWMkBUfMzQMwmvqfoKsZljC5y7YgafJc5cf4k=
=kK88
-----END PGP SIGNATURE-----
Merge tag 'gpio-updates-for-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio updates from Bartosz Golaszewski:
"The majority of added lines are two new modules: the GPIO virtual
consumer module that improves our ability to add automated tests for
the kernel API and the "sloppy" logic analyzer module that uses the
GPIO API to implement a coarse-grained debugging tool for useful for
remote development.
Other than that we have the usual assortment of various driver
extensions, improvements to the core GPIO code, DT-bindings and other
documentation updates as well as an extension to the interrupt
simulator:
GPIOLIB core:
- rework kfifo handling rework in the character device code
- improve the labeling of GPIOs requested as interrupts and show more
info on interrupt-only GPIOs in debugfs
- remove unused APIs
- unexport interfaces that are only used from the core GPIO code
- drop the return value from gpiochip_set_desc_names() as it cannot
fail
- move a string array definition out of a header and into a specific
compilation unit
- convert the last user of gpiochip_get_desc() other than GPIO core
to using a safer alternative
- use array_index_nospec() where applicable
New drivers:
- add a "virtual GPIO consumer" module that allows requesting GPIOs
from actual hardware and driving tests of the in-kernel GPIO API
from user-space over debugfs
- add a GPIO-based "sloppy" logic analyzer module useful for "first
glance" debugging on remote boards
Driver improvements:
- add support for a new model to gpio-pca953x
- lock GPIOs as interrupts in gpio-sim when the lines are requested
as irqs via the simulator domain + some other minor improvements
- improve error reporting in gpio-syscon
- convert gpio-ath79 to using dynamic GPIO base and range
- use pcibios_err_to_errno() for converting PCIBIOS error codes to
errno vaues in gpio-amd8111 and gpio-rdc321x
- allow building gpio-brcmstb for the BCM2835 architecture
DT bindings:
- convert DT bindings for lsi,zevio, mpc8xxx, and atmel to DT schema
- document new properties for aspeed,gpio, fsl,qoriq-gpio and
gpio-vf610
- document new compatibles for pca953x and fsl,qoriq-gpio
Documentation:
- document stricter behavior of the GPIO character device uAPI with
regards to reconfiguring requested line without direction set
- clarify the effect of the active-low flag on line values and edges
- remove documentation for the legacy GPIO API in order to stop
tempting people to use it
- document the preference for using pread() for reading edge events
in the sysfs API
Other:
- add an extended initializer to the interrupt simulator allowing to
specify a number of callbacks callers can use to be notified about
irqs being requested and released"
* tag 'gpio-updates-for-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (41 commits)
gpio: mc33880: Convert comma to semicolon
gpio: virtuser: actually use the "trimmed" local variable
dt-bindings: gpio: convert Atmel GPIO to json-schema
gpio: virtuser: new virtual testing driver for the GPIO API
dt-bindings: gpio: vf610: Allow gpio-line-names to be set
gpio: sim: lock GPIOs as interrupts when they are requested
genirq/irq_sim: add an extended irq_sim initializer
dt-bindings: gpio: fsl,qoriq-gpio: Add compatible string fsl,ls1046a-gpio
gpiolib: unexport gpiochip_get_desc()
gpio: add sloppy logic analyzer using polling
Documentation: gpio: Reconfiguration with unset direction (uAPI v2)
Documentation: gpio: Reconfiguration with unset direction (uAPI v1)
dt-bindings: gpio: fsl,qoriq-gpio: add common property gpio-line-names
gpio: ath79: convert to dynamic GPIO base allocation
pinctrl: da9062: replace gpiochip_get_desc() with gpio_device_get_desc()
gpiolib: put gpio_suffixes in a single compilation unit
Documentation: gpio: Clarify effect of active low flag on line edges
Documentation: gpio: Clarify effect of active low flag on line values
gpiolib: Remove data-less gpiochip_add() function
gpio: sim: use devm_mutex_init()
...
cpus_read_lock() is dropped from workqueue creation path but there were
still remaining lockdep_assert_cpus_held() triggering spurious lockdep
failures. Remove them.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpW6Zw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGWLmAQCKp0zrzmJYfsU7KYFFKyUy1PukIe8NZ2kaqNcX
7FyDZwD6A5k6SZfS/efXWYEwbQZo0MCOS5V1/M7KjqGR/ZeIEgc=
=7t+F
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.11-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"cpus_read_lock() was dropped from workqueue creation path but there
were still remaining lockdep_assert_cpus_held() triggering spurious
lockdep failures. Remove them"
* tag 'wq-for-6.11-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Remove unneeded lockdep_assert_cpus_held()
* Virtual CPU hotplug support for arm64 ACPI systems
* cpufeature infrastructure cleanups and making the FEAT_ECBHB ID bits
visible to guests
* CPU errata: expand the speculative SSBS workaround to more CPUs
* arm64 ACPI:
- acpi=nospcr option to disable SPCR as default console for arm64
- Move some ACPI code (cpuidle, FFH) to drivers/acpi/arm64/
* GICv3, use compile-time PMR values: optimise the way regular IRQs are
masked/unmasked when GICv3 pseudo-NMIs are used, removing the need for
a static key in fast paths by using a priority value chosen
dynamically at boot time
* arm64 perf updates:
- Rework of the IMX PMU driver to enable support for I.MX95
- Enable support for tertiary match groups in the CMN PMU driver
- Initial refactoring of the CPU PMU code to prepare for the fixed
instruction counter introduced by Arm v9.4
- Add missing PMU driver MODULE_DESCRIPTION() strings
- Hook up DT compatibles for recent CPU PMUs
* arm64 kselftest updates:
- Kernel mode NEON fp-stress
- Cleanups, spelling mistakes
* arm64 Documentation update with a minor clarification on TBI
* Miscellaneous:
- Fix missing IPI statistics
- Implement raw_smp_processor_id() using thread_info rather than a
per-CPU variable (better code generation)
- Make MTE checking of in-kernel asynchronous tag faults conditional
on KASAN being enabled
- Minor cleanups, typos
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmaQKN4ACgkQa9axLQDI
XvE0Nw/+JZ6OEQ+DMUHXZfbWanvn1p0nVOoEV3MYVpOeQK1ILYCoDapatLNIlet0
wcja7tohKbL1ifc7GOqlkitu824LMlotncrdOBycRqb/4C5KuJ+XhygFv5hGfX0T
Uh2zbo4w52FPPEUMICfEAHrKT3QB9tv7f66xeUNbWWFqUn3rY02/ZVQVVdw6Zc0e
fVYWGUUoQDR7+9hRkk6tnYw3+9YFVAUAbLWk+DGrW7WsANi6HuJ/rBMibwFI6RkG
SZDZHum6vnwx0Dj9H7WrYaQCvUMm7AlckhQGfPbIFhUk6pWysfJtP5Qk49yiMl7p
oRk/GrSXpiKumuetgTeOHbokiE1Nb8beXx0OcsjCu4RrIaNipAEpH1AkYy5oiKoT
9vKZErMDtQgd96JHFVaXc+A3D2kxVfkc1u7K3TEfVRnZFV7CN+YL+61iyZ+uLxVi
d9xrAmwRsWYFVQzlZG3NWvSeQBKisUA1L8JROlzWc/NFDwTqDGIt/zS4pZNL3+OM
EXW0LyKt7Ijl6vPXKCXqrODRrPlcLc66VMZxofZOl0/dEqyJ+qLL4GUkWZu8lTqO
BqydYnbTSjiDg/ntWjTrD0uJ8c40Qy7KTPEdaPqEIQvkDEsUGlOnhAQjHrnGNb9M
psZtpDW2xm7GykEOcd6rgSz4Xeky2iLsaR4Wc7FTyDS0YRmeG44=
=ob2k
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The biggest part is the virtual CPU hotplug that touches ACPI,
irqchip. We also have some GICv3 optimisation for pseudo-NMIs that has
been queued via the arm64 tree. Otherwise the usual perf updates,
kselftest, various small cleanups.
Core:
- Virtual CPU hotplug support for arm64 ACPI systems
- cpufeature infrastructure cleanups and making the FEAT_ECBHB ID
bits visible to guests
- CPU errata: expand the speculative SSBS workaround to more CPUs
- GICv3, use compile-time PMR values: optimise the way regular IRQs
are masked/unmasked when GICv3 pseudo-NMIs are used, removing the
need for a static key in fast paths by using a priority value
chosen dynamically at boot time
ACPI:
- 'acpi=nospcr' option to disable SPCR as default console for arm64
- Move some ACPI code (cpuidle, FFH) to drivers/acpi/arm64/
Perf updates:
- Rework of the IMX PMU driver to enable support for I.MX95
- Enable support for tertiary match groups in the CMN PMU driver
- Initial refactoring of the CPU PMU code to prepare for the fixed
instruction counter introduced by Arm v9.4
- Add missing PMU driver MODULE_DESCRIPTION() strings
- Hook up DT compatibles for recent CPU PMUs
Kselftest updates:
- Kernel mode NEON fp-stress
- Cleanups, spelling mistakes
Miscellaneous:
- arm64 Documentation update with a minor clarification on TBI
- Fix missing IPI statistics
- Implement raw_smp_processor_id() using thread_info rather than a
per-CPU variable (better code generation)
- Make MTE checking of in-kernel asynchronous tag faults conditional
on KASAN being enabled
- Minor cleanups, typos"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (69 commits)
selftests: arm64: tags: remove the result script
selftests: arm64: tags_test: conform test to TAP output
perf: add missing MODULE_DESCRIPTION() macros
arm64: smp: Fix missing IPI statistics
irqchip/gic-v3: Fix 'broken_rdists' unused warning when !SMP and !ACPI
ACPI: Add acpi=nospcr to disable ACPI SPCR as default console on ARM64
Documentation: arm64: Update memory.rst for TBI
arm64/cpufeature: Replace custom macros with fields from ID_AA64PFR0_EL1
KVM: arm64: Replace custom macros with fields from ID_AA64PFR0_EL1
perf: arm_pmuv3: Include asm/arm_pmuv3.h from linux/perf/arm_pmuv3.h
perf: arm_v6/7_pmu: Drop non-DT probe support
perf/arm: Move 32-bit PMU drivers to drivers/perf/
perf: arm_pmuv3: Drop unnecessary IS_ENABLED(CONFIG_ARM64) check
perf: arm_pmuv3: Avoid assigning fixed cycle counter with threshold
arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU
perf: imx_perf: add support for i.MX95 platform
perf: imx_perf: fix counter start and config sequence
perf: imx_perf: refactor driver for imx93
perf: imx_perf: let the driver manage the counter usage rather the user
perf: imx_perf: add macro definitions for parsing config attr
...
- Lai fixed a bug where CPU hotplug and workqueue attribute changes race
leaving some workqueues not fully updated. This involved refactoring and
changing how online CPUs are tracked. The resulting code is cleaner.
- Workqueue watchdog touch operation was causing too much cacheline
contention on very large machines. Nicholas improved scalabililty by
avoiding unnecessary global updates.
- Code cleanups and minor rescuer behavior improvement.
- The last commit 58629d4871 ("workqueue: Always queue work items to the
newest PWQ for order workqueues") is a cherry-picked straggler commit from
for-6.10-fixes, a fix for a bug which may not actually trigger.
Unfortunately, maybe because for-6.10-fixes was branched off at a
different point from for-6.11, I couldn't persuade git request-pull to
generate clean diffstat if pulled into for-6.11.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpSn2g4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGe4UAQCc4Zo2qShh6M97r7K5Epa0XmHKqGcOvafiEZQx
qwXfyAD9HWBrwPfdA+L1hHnMRTtZJapvWnBRK/swdg8Rou5Grgk=
=t60j
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
- Lai fixed a bug where CPU hotplug and workqueue attribute changes
race leaving some workqueues not fully updated. This involved
refactoring and changing how online CPUs are tracked. The resulting
code is cleaner.
- Workqueue watchdog touch operation was causing too much cacheline
contention on very large machines. Nicholas improved scalabililty by
avoiding unnecessary global updates.
- Code cleanups and minor rescuer behavior improvement.
- The last commit 58629d4871 ("workqueue: Always queue work items to
the newest PWQ for order workqueues") is a cherry-picked straggler
commit from for-6.10-fixes, a fix for a bug which may not actually
trigger.
* tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (24 commits)
workqueue: Always queue work items to the newest PWQ for order workqueues
workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()
workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()
workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()
workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()
workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()
workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask
workqueue: Add wq_online_cpumask
workqueue: Init rescuer's affinities as the wq's effective cpumask
workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.
workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs()
workqueue: Make rescuer initialization as the last step of the creation of a new wq
workqueue: Register sysfs after the whole creation of the new wq
workqueue: Simplify goto statement
workqueue: Update cpumasks after only applying it successfully
workqueue: Improve scalability of workqueue watchdog touch
workqueue: wq_watchdog_touch is always called with valid CPU
workqueue: Remove useless pool->dying_workers
workqueue: Detach workers directly in idle_cull_fn()
workqueue: Don't bind the rescuer in the last working cpu
...
- Added Michal Koutný as a maintainer.
- Counters in pids.events were behaving inconsistently. pids.events made
properly hierarchical and pids.events.local added.
- misc.peak and misc.events.local added.
- cpuset remote partition creation and cpuset.cpus.exclusive handling
improved.
- Code cleanups, non-critical fixes, doc updates.
- for-6.10-fixes is merged in to receive two non-critical fixes that didn't
trigger pull.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpSsdw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGSEMAQDQ5VfcRz+rW20ez5IAgyN3EKIwSbW6pY6jojgj
bJtJSQD/TzA8DoRxcCvTdHcZcwJ2e2zBVcuM8NkZHfSCNiPrrgs=
=5f3I
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Added Michal Koutný as a maintainer
- Counters in pids.events were behaving inconsistently. pids.events
made properly hierarchical and pids.events.local added
- misc.peak and misc.events.local added
- cpuset remote partition creation and cpuset.cpus.exclusive handling
improved
- Code cleanups, non-critical fixes, doc updates
- for-6.10-fixes is merged in to receive two non-critical fixes that
didn't trigger pull
* tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
cgroup: Add Michal Koutný as a maintainer
cgroup/misc: Introduce misc.events.local
cgroup/rstat: add force idle show helper
cgroup: Protect css->cgroup write under css_set_lock
cgroup/misc: Introduce misc.peak
cgroup_misc: add kernel-doc comments for enum misc_res_type
cgroup/cpuset: Prevent UAF in proc_cpuset_show()
selftest/cgroup: Update test_cpuset_prs.sh to match changes
cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus
cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition
selftest/cgroup: Fix test_cpuset_prs.sh problems reported by test robot
cgroup/cpuset: Fix remote root partition creation problem
cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit()
cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCE
kernel/cgroup: cleanup cgroup_base_files when fail to add cgroup_psi_files
selftests: cgroup: Add basic tests for pids controller
selftests: cgroup: Lexicographic order in Makefile
cgroup/pids: Add pids.events.local
cgroup/pids: Make event counters hierarchical
...
This series contains on commit that improves the documentation for the
new __data_racy type qualifier to the data_race() macro's kernel-doc
header and to the LKMM's access-marking documentation.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaRubITHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jLWlD/99wMLUIfPh1cvWVVyhv8tytWuLJn6y
olwukg9pOCz6WGucECfjAC2kFivSQYxS3b54K697tF4hnABEtpsx7ozkv11kgyR8
niHNv4P+L+0J7CnM/g7gkLrAGosdo+PF7rUhX3u3kpeasVjJ5NyK379jUeTjrDrc
ia34vjbVVNdJ5v0c4ITxbbV/NKecbsadRSqDLzjtNrFPkwo/yfFCrz8ddztadsTd
4jftt/L4Up51QQ8NAgiHsHdp+iY/FRjwd/QtvEXSVUZ38sGseH+eNqn4hMuyTnka
cjnHwAlTbEfAuR3Mdcz25ToiaI6qNxptvvM7E9+LtHuBhI+h6vEvdvPMmSFo48Rv
kzdPix39O5dqpu49nz/Qsmmr/AWn9i3M5oht62YMJ5twoutDFAcc5MMppPT6sNsX
Kq3fn8fjRvdHqgE3jBYOgDDWEZuTxdtcdv8wKfplID/SgyjXL48fghMEl62b2O1d
X6PrHhKARk7RKMndAPk0UNz4m8T5oDmKu9Z9mF8UidjeZjYU0tVYPAoN1H/9lVdH
Kn2DoTW5Oz2A21z37r5SGAro0QBE50ZMYA+xH+Ab3fyfFcd4l4ia/wu06c3AVJFc
DqgKJ7f3YQ7tHGLf0AEpO6o/nNZGxwDHJ5Pjbteu2RwbCcUHmUtKxrM+McPXTvuv
MjW/IAvvPdMxCw==
=r0/X
-----END PGP SIGNATURE-----
Merge tag 'kcsan.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
- improve the documentation for the new __data_racy type qualifier
to the data_race() macro's kernel-doc header and to the LKMM's
access-marking documentation
- add missing MODULE_DESCRIPTION
* tag 'kcsan.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: Add missing MODULE_DESCRIPTION() macro
kcsan: Add example to data_race() kerneldoc header
This pull request adds MODULE_DESCRIPTION() to torture.c, locktorture.c,
and scftorture.c, and also adds static to a global variable that is used
only in scftorture.c.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaRxOcTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jCa6D/9/Q/2MKzFMH3MzBVQXNtELbxw5r/IE
7BWcNE3MUt6If2ZyybvF3KoyGp4cfjyyox0EZWDeOHc4yJ85DVflzG9kn52tc9xZ
QBgLQxtovnyoqHGjCKt4ULYAUBl6KlL2H+YOAYc576w5EcL+kbomjXDJEH5HtbAe
p7KT+bWCcYOkz2irEoK32mlv3nG8eVzySODAbtNKQQv9fkiwftPAD/cRTYbCsEKH
jk9JLToAbBbN8LXqWGVa/FrTZeXIt8DieIffsJi1t7K//k6nnGojUqgOAej5A6rS
bGQZs7Z/Yb3KCF8hbWOqV0ofgBOAHWo/kE8BvcCwvnFV8d6DrEucVT5i/FQugvrE
HA/HwsS+Bymp70/1iZHtKA+p710/Snh99dr6Okylj8wUhMJYK9UmA0RIq9QIkhhU
VNXg7SmRdT/oZnxHXR2iqWSVjrpgAEgzT6Jw0PLLeFpVS+e95LtOh0SFvPCaNVJ9
bDyYedB1Rbfsovg0XShGeNNTNKZrgqWav7XfpHPz/YnwJNahWJb+xjZ7L3+wIjAv
CIcAaqU6nk+TnjG5zXKRdtV/8/qcPbe8Pij9TPd2iJ4NvONLDAGkxDLy/aY0uVei
NGxsmM4F+Bgd9ry8xrvQ8xQg54VRABI6BUwdRiYUo7+WPdZ2DPbjWiAQf5FtQ+ng
HosxD6XdCSa0Iw==
=WdlA
-----END PGP SIGNATURE-----
Merge tag 'torture.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull torture-test updates from Paul McKenney:
"This adds MODULE_DESCRIPTION() to torture.c, locktorture.c, and
scftorture.c, and also adds 'static' to a global variable that is used
only in scftorture.c"
* tag 'torture.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
scftorture: Make torture_type static
scftorture: Add MODULE_DESCRIPTION()
locktorture: Add MODULE_DESCRIPTION()
torture: Add MODULE_DESCRIPTION()
doc.2024.06.06a: Update Tasks RCU and Tasks Rude RCU description in
Requirements.rst and clarify rcu_assign_pointer() and
rcu_dereference() ordering properties.
fixes.2024.07.04a: Add lockdep assertions for RCU readers, limit inline
wakeups for callback-bypass synchronize_rcu(), add an
rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter,
add Uladzislau Rezki as RCU maintainer, and fix a subtle
callback-migration memory-ordering issue.
mb.2024.06.28a: Remove a number of redundant memory barriers.
nocb.2024.06.03a: Remove unnecessary bypass-list lock-contention
mitigation, use parking API instead of open-coded ad-hoc
equivalent, and upgrade obsolete comments.
rcu-tasks.2024.06.06a: Revert avoidance of a deadlock that can no
longer occur and properly synchronize Tasks Trace RCU checking
of runqueues.
rcutorture.2024.06.06a: Add tests for handling of double-call_rcu()
bug, add missing MODULE_DESCRIPTION, and add a script that
histograms the number of calls to RCU updaters.
srcu.2024.06.18a: Fill out SRCU polled-grace-period API.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaR7/QTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jGwAEACJKef2LryG6khoJdorWbvRf1V2k23H
19CxXexCE4UoGsgGST9z1/5rM8kBdNhdhQ0JB9CitW+zGlXpOM79/mO3gALKMj++
YBPw9B5EM622H2cKJGFzoHFSO4X9nM1CCMeuFCo6bVsbWfMtX3ENqsYl2IQy1JkB
pGiKqcNXGWU0mdUcZKs/8ilfLG1NhaLwrkfinlsP9V1+8z8LxxDH5Qh27AT3rIvu
W87OITTZoHlUaDVHYTautHTZoqM381xv9kNoQlS9lpH/gcFOPiO9DLj8NcLjkJ4y
S/OrxOwfQ+BGKwnk8daFQFAc3Nr9KeVAQH7CbOW7guARhj3z97J0+wPm6nZGEE2s
tDzg8zLT9LtbmUypJLurl29+wFE4fPNsnd69XDONbMFN1Ox2tJM3dd/rPCsHSUvz
kEOK9gUreHOv7/Ou6UIHlYVlHY7HHuD7TAsrhaaWk7CEmlY31UKwXG+fMl1FAnSy
F3PcBF/1M687RRFWVeMlug/+0/+ghtc+kZ1YyR79KZR6dI0C7ueQbCBGztCCtFDz
RjrHcDifS0Y2GNQO9+zAyrJvttidRATdYDeFstk+8nnta3CnYzxCp4rn5hs3Ss3N
AJVJm244jR3AcoL4V/tQwiQlYh9ZYN5tZ7qxFiASdtV50Uc8HoIrWXeP0Ar+GHiV
2z/f5fKF4+5clQ==
=7a1C
-----END PGP SIGNATURE-----
Merge tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Update Tasks RCU and Tasks Rude RCU description in Requirements.rst
and clarify rcu_assign_pointer() and rcu_dereference() ordering
properties
- Add lockdep assertions for RCU readers, limit inline wakeups for
callback-bypass synchronize_rcu(), add an
rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add
Uladzislau Rezki as RCU maintainer, and fix a subtle
callback-migration memory-ordering issue
- Remove a number of redundant memory barriers
- Remove unnecessary bypass-list lock-contention mitigation, use
parking API instead of open-coded ad-hoc equivalent, and upgrade
obsolete comments
- Revert avoidance of a deadlock that can no longer occur and properly
synchronize Tasks Trace RCU checking of runqueues
- Add tests for handling of double-call_rcu() bug, add missing
MODULE_DESCRIPTION, and add a script that histograms the number of
calls to RCU updaters
- Fill out SRCU polled-grace-period API
* tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
rcu: Eliminate lockless accesses to rcu_sync->gp_count
MAINTAINERS: Add Uladzislau Rezki as RCU maintainer
rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
rcu/exp: Remove redundant full memory barrier at the end of GP
rcu: Remove full memory barrier on RCU stall printout
rcu: Remove full memory barrier on boot time eqs sanity check
rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
rcu: Remove superfluous full memory barrier upon first EQS snapshot
rcu: Remove full ordering on second EQS snapshot
srcu: Fill out polled grace-period APIs
srcu: Update cleanup_srcu_struct() comment
srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE
srcu: Disable interrupts directly in srcu_gp_end()
rcu: Disable interrupts directly in rcu_gp_init()
rcu/tree: Reduce wake up for synchronize_rcu() common case
rcu/tasks: Fix stale task snaphot for Tasks Trace
tools/rcu: Add rcu-updaters.sh script
rcutorture: Add missing MODULE_DESCRIPTION() macros
rcutorture: Fix rcu_torture_fwd_cb_cr() data race
...
- Core:
- Make the takeover of a hrtimer based broadcast timer reliable during
CPU hot-unplug. The current implementation suffers from a race which
can lead to broadcast timer starvation in the worst case.
- VDSO related cleanups and simplifications
- Small cleanups and enhancements all over the place
- PTP:
- Replace the architecture specific base clock to clocksource, e.g. ART
to TSC, conversion function with generic functionality to avoid
exposing such internals to drivers and convert all existing drivers
over. This also allows to provide functionality which converts the
other way round in the core code based on the same parameter set.
- Provide a function to convert CLOCK_REALTIME to the base clock to
support the upcoming PPS output driver on Intel platforms.
- Drivers:
- A set of Device Tree bindings for new hardware
- Cleanups and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaUOM0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofolD/9kK+aYdDj1gCFuZXZ2wTgMMxFmf/91
0UcsGRuBJiIXs3H3iizQ0Mb0cdTW6qZJoBp0jPlvUSm0BEKdEgE1uRX2RuAPZ/Gq
4/54ZJVopKSgAqeJFmqQubRVSv2XdMRAAJT0o1oUG3jZ0c6u8vqArIh5ZCnu13l/
tsNOeYLYzQFyA30eHSJ/KjQ2zHwAhJnl5a/b7pdAvxmlN37bGgKEpglv+9zwFiDB
K/kWbpb/oED9WOmoQy5QYi8iSvLQHEhFGrqzXV3fegu/B/mBBf/bpsisVx7Z1m2R
nzxNqg86RdMjNR6giwBETZjm7YxM+gKb9nCBNILjbjWZFC4tyrBkLGJ+KniTRNyZ
M5R4X1oP/14h00qXmCgIEFWysXaJRewYI+TIm8R2rLXrR6Tf3c4oL6fHQJxy3X52
7A+4Z/vOk/KX6PxYmLC+xQDukhFh2nirVYsP1oNM9yC9zR/wkBBXTTmUSAI+8m8l
KphniSPS2HMSBI6TtgOT8SKY7lRUZTnafBZq7wRXCv0Zz8AXoofgQDmBkXC99BkB
MjLvRotJVJvY9a8LtA7htjDg/jiEMa0wHRNAGNSbflKoAKrJzoE5WbFxFZKbq3vZ
o8cEYRMAIP+X+qn+oymT45XXXQlifZiccJdAi9FqDTvplEib2jmTmH6Ae5Khkr4l
Lbzh/nSKVN7lOg==
=8GjP
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timers, timekeeping and related functionality:
Core:
- Make the takeover of a hrtimer based broadcast timer reliable
during CPU hot-unplug. The current implementation suffers from a
race which can lead to broadcast timer starvation in the worst
case.
- VDSO related cleanups and simplifications
- Small cleanups and enhancements all over the place
PTP:
- Replace the architecture specific base clock to clocksource, e.g.
ART to TSC, conversion function with generic functionality to avoid
exposing such internals to drivers and convert all existing drivers
over. This also allows to provide functionality which converts the
other way round in the core code based on the same parameter set.
- Provide a function to convert CLOCK_REALTIME to the base clock to
support the upcoming PPS output driver on Intel platforms.
Drivers:
- A set of Device Tree bindings for new hardware
- Cleanups and enhancements all over the place"
* tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
clocksource/drivers/realtek: Add timer driver for rtl-otto platforms
dt-bindings: timer: Add schema for realtek,otto-timer
dt-bindings: timer: Add SOPHGO SG2002 clint
dt-bindings: timer: renesas,tmu: Add R-Car Gen2 support
dt-bindings: timer: renesas,tmu: Add RZ/G1 support
dt-bindings: timer: renesas,tmu: Add R-Mobile APE6 support
clocksource/drivers/mips-gic-timer: Correct sched_clock width
clocksource/drivers/mips-gic-timer: Refine rating computation
clocksource/drivers/sh_cmt: Address race condition for clock events
clocksource/driver/arm_global_timer: Remove unnecessary ‘0’ values from err
clocksource/drivers/arm_arch_timer: Remove unnecessary ‘0’ values from irq
tick/broadcast: Make takeover of broadcast hrtimer reliable
tick/sched: Combine WARN_ON_ONCE and print_once
x86/vdso: Remove unused include
x86/vgtod: Remove unused typedef gtod_long_t
x86/vdso: Fix function reference in comment
vdso: Add comment about reason for vdso struct ordering
vdso/gettimeofday: Clarify comment about open coded function
timekeeping: Add missing kernel-doc function comments
tick: Remove unnused tick_nohz_get_idle_calls()
...
- Reverse the order of iteration when freezing secondary CPUs for
hibernation.
This avoids that drivers like the Intel uncore performance counter have
to transfer the assignement of handling the per package uncore events
for every CPU in a package, which is a considerable speedup on larger
systems.
- Add a missing destroy_work_on_stack() invocation in smp_call_on_cpu()
to prevent debug objects to emit a false positive warning when the
stack is freed.
- Small cleanups in comments and a str_plural() conversion
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaUNnYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocjbEACqfc8EkTEJczAMbAaq7Nq7jYGRt5e8
UyWpbnT1re63TAiGUEAr7pPJd6ucXAwPOHBt0V+j1HIr2UsaGT2/cfnlSipR68xh
UmqPxXpuba+JeZDS0Bf6gRXKYBogVWFgjP8cp+f9IKr6xMgwJc7ujB9mX0YKmIgz
bDoaFS+NFSD1ZS7tuidLqfU9UmJYRDRhyZg124HwDXG20zHR2CHgP4QQ2bHOvxZU
LKbdjoHBmGtkLomS+R1UxsT+onsnE0c1EN37LX0mUE5L1YbUTcbZlXLLG7T35jOO
rvai+EKVbA2KUAtrM/LZ8WZ0Lt5DTjMrouyzv6of7N2WljijdlxMb04sXl7kdng+
rohRfDB6yNhQhEnDx6fd+IP/JCpPCctCmkN/QbvMBfnRnTe1yNg/gmnWaY3RAHNM
GBifAxSEosMn7AMnSs/or6DAVNHVxI3Ms+r4Xb/o1JvGx7PXiBULh1Nww5WdxmBl
IraXNR/R0qXUokXmNtJrq3aG/SepKWAc0BJJc3b6zA9tf5rsizTaxetTVZLS6jOX
/DHJOOgAlLRDZdE53YpdL3HVVTdM/BDSM0xpMO7uxdJ4laJ9s+7dFGk9KXxu24qM
6dIG6hn7D9XT0q05sP+r7qO0ygIe0Qorg5bA+xgvpYdPNrVfQpjzkj5jwkspvk+l
5/gpTUmVm8Vwhw==
=M4eq
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
"A small set of SMP/CPU hotplug updates:
- Reverse the order of iteration when freezing secondary CPUs for
hibernation.
This avoids that drivers like the Intel uncore performance counter
have to transfer the assignement of handling the per package uncore
events for every CPU in a package, which is a considerable speedup
on larger systems.
- Add a missing destroy_work_on_stack() invocation in
smp_call_on_cpu() to prevent debug objects to emit a false positive
warning when the stack is freed.
- Small cleanups in comments and a str_plural() conversion"
* tag 'smp-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()
cpu/hotplug: Reverse order of iteration in freeze_secondary_cpus()
smp: Use str_plural() to fix Coccinelle warnings
cpu/hotplug: Fix typo in comment
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaTgusQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr+1EAC4I7pRAM341sfmhe/9QQKMM8VzGwy5Tlr1
AFLO3BujRTl6X8S9fQjIjN1coW6u4F42I19+vVlxqvB7CUnqt9VWpexEjxe4K0FR
R+hIZW+fWV9K/eMrcsLcI7oReN5kIihHOzzy3wz0rENoGB5dCl6JAZMHDUCSqP0/
ZJJQ5ut8ah20Y/myHnzP5o4TfdE7nGo73Di2YoE2g3KqeX/dlAKW9+5hqKzzrHhM
2U25k/6KLy0ROzKpy2qW0QRE3pT5udoHLK2ue9+XwXF8JWVTlfVkHBzGY7NstyyT
z07SEzW1q4xV1HdCwGDAU7cL2NJMRXSG0p2WZTm8QyaVTdsZQvEx08GLsVdLvFH5
Gg+oOaxVE+INzW+/Lwz7lFHgq6XEjdAlEAOXDtGkZoni6Rt6iCzFCW6RTf/guy8o
Cub7tatMyegxai9+FTN/oFVoydRR0tsMf0OHrWnLOperh9CaxAwXvmKFeT/UTwiB
KIuIOJop7aThJbiV42a/xwTrEjNMZRv6uVBBEtJX3rxpmIhqTbjcAv9rKMmgtLMk
s6yX1MvYdOLhhEDyoUBX0dJdEETBf3KbnYIwi8kb4Sbkw/ZDgnkmSxFysom61wUF
byAFEpah3ZFR8aES0uNKUE6UHK6i5qqp0Za/n6gA927E/WGCU9ndaS+01gyknog0
8FqFYwruHQ==
=50CO
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Here are the io_uring updates queued up for 6.11.
Nothing major this time around, various minor improvements and
cleanups/fixes. This contains:
- Add bind/listen opcodes. Main motivation is to support direct
descriptors, to avoid needing a regular fd just for doing these two
operations (Gabriel)
- Probe fixes (Gabriel)
- Treat io-wq work flags as atomics. Not fixing a real issue, but may
as well and it silences a KCSAN warning (me)
- Cleanup of rsrc __set_current_state() usage (me)
- Add 64-bit for {m,f}advise operations (me)
- Improve performance of data ring messages (me)
- Fix for ring message overflow posting (Pavel)
- Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an
io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added
for faster task_work signaling for io_uring, bundling it with this
pull (Pavel)
- Add Pavel as a co-maintainer
- Various cleanups (me, Thorsten)"
* tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits)
io_uring/net: check socket is valid in io_bind()/io_listen()
kernel: rerun task_work while freezing in get_signal()
io_uring/io-wq: limit retrying worker initialisation
io_uring/napi: Remove unnecessary s64 cast
io_uring/net: cleanup io_recv_finish() bundle handling
io_uring/msg_ring: fix overflow posting
MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer
io_uring/msg_ring: use kmem_cache_free() to free request
io_uring/msg_ring: check for dead submitter task
io_uring/msg_ring: add an alloc cache for io_kiocb entries
io_uring/msg_ring: improve handling of target CQE posting
io_uring: add io_add_aux_cqe() helper
io_uring: add remote task_work execution helper
io_uring/msg_ring: tighten requirement for remote posting
io_uring: Allocate only necessary memory in io_probe
io_uring: Fix probe of disabled operations
io_uring: Introduce IORING_OP_LISTEN
io_uring: Introduce IORING_OP_BIND
net: Split a __sys_listen helper for io_uring
net: Split a __sys_bind helper for io_uring
...
pid_list_fill_irq() runs via irq_work.
When CONFIG_PREEMPT_RT is disabled, it would run in irq_context.
so it shouldn't sleep while memory allocation.
Change gfp flags from GFP_KERNEL to GFP_NOWAIT to prevent sleep in
irq_work.
This change wouldn't impact functionality in practice because the worst-size
is 2K.
Cc: stable@goodmis.org
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Link: https://lore.kernel.org/20240704150226.1359936-1-yeoreum.yun@arm.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: levi.yun <yeoreum.yun@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEF0AAKCRCRxhvAZXjc
oq0TAQDjfTLN75RwKQ34RIFtRun2q+OMfBQtSegtaccqazghyAD/QfmPuZDxB5DL
rsI/5k5O4VupIKrEdIaqvNxmkmDsSAc=
=bf7E
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Support passing NULL along AT_EMPTY_PATH for statx().
NULL paths with any flag value other than AT_EMPTY_PATH go the
usual route and end up with -EFAULT to retain compatibility (Rust
is abusing calls of the sort to detect availability of statx)
This avoids path lookup code, lockref management, memory allocation
and in case of NULL path userspace memory access (which can be
quite expensive with SMAP on x86_64)
- Don't block i_writecount during exec. Remove the
deny_write_access() mechanism for executables
- Relax open_by_handle_at() permissions in specific cases where we
can prove that the caller had sufficient privileges to open a file
- Switch timespec64 fields in struct inode to discrete integers
freeing up 4 bytes
Fixes:
- Fix false positive circular locking warning in hfsplus
- Initialize hfs_inode_info after hfs_alloc_inode() in hfs
- Avoid accidental overflows in vfs_fallocate()
- Don't interrupt fallocate with EINTR in tmpfs to avoid constantly
restarting shmem_fallocate()
- Add missing quote in comment in fs/readdir
Cleanups:
- Don't assign and test in an if statement in mqueue. Move the
assignment out of the if statement
- Reflow the logic in may_create_in_sticky()
- Remove the usage of the deprecated ida_simple_xx() API from procfs
- Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new
mount api early
- Rename variables in copy_tree() to make it easier to understand
- Replace WARN(down_read_trylock, ...) abuse with proper asserts in
various places in the VFS
- Get rid of user_path_at_empty() and drop the empty argument from
getname_flags()
- Check for error while copying and no path in one branch in
getname_flags()
- Avoid redundant smp_mb() for THP handling in do_dentry_open()
- Rename parent_ino to d_parent_ino and make it use RCU
- Remove unused header include in fs/readdir
- Export in_group_capable() helper and switch f2fs and fuse over to
it instead of open-coding the logic in both places"
* tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
ipc: mqueue: remove assignment from IS_ERR argument
vfs: rename parent_ino to d_parent_ino and make it use RCU
vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
stat: use vfs_empty_path() helper
fs: new helper vfs_empty_path()
fs: reflow may_create_in_sticky()
vfs: remove redundant smp_mb for thp handling in do_dentry_open
fuse: Use in_group_or_capable() helper
f2fs: Use in_group_or_capable() helper
fs: Export in_group_or_capable()
vfs: reorder checks in may_create_in_sticky
hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode()
proc: Remove usage of the deprecated ida_simple_xx() API
hfsplus: fix to avoid false alarm of circular locking
Improve readability of copy_tree
vfs: shave a branch in getname_flags
vfs: retire user_path_at_empty and drop empty arg from getname_flags
vfs: stop using user_path_at_empty in do_readlinkat
tmpfs: don't interrupt fallocate with EINTR
fs: don't block i_writecount during exec
...
This reimplements commit 951bcae6c5 ("kallsyms: Avoid weak references
for kallsyms symbols") because I am not a big fan of PROVIDE().
As an alternative solution, this commit prepends one more kallsyms step.
KSYMS .tmp_vmlinux.kallsyms0.S # added
AS .tmp_vmlinux.kallsyms0.o # added
LD .tmp_vmlinux.btf
BTF .btf.vmlinux.bin.o
LD .tmp_vmlinux.kallsyms1
NM .tmp_vmlinux.kallsyms1.syms
KSYMS .tmp_vmlinux.kallsyms1.S
AS .tmp_vmlinux.kallsyms1.o
LD .tmp_vmlinux.kallsyms2
NM .tmp_vmlinux.kallsyms2.syms
KSYMS .tmp_vmlinux.kallsyms2.S
AS .tmp_vmlinux.kallsyms2.o
LD vmlinux
Step 0 takes /dev/null as input, and generates .tmp_vmlinux.kallsyms0.o,
which has a valid kallsyms format with the empty symbol list, and can be
linked to vmlinux. Since it is really small, the added compile-time cost
is negligible.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
To ensure non-reentrancy, __queue_work() attempts to enqueue a work
item to the pool of the currently executing worker. This is not only
unnecessary for an ordered workqueue, where order inherently suggests
non-reentrancy, but it could also disrupt the sequence if the item is
not enqueued on the newest PWQ.
Just queue it to the newest PWQ and let order management guarantees
non-reentrancy.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 4c065dbce1 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org # v6.9+
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
Add a new helper to disable lockdep tracking entirely for a given class.
This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more since given that we have
centralized lock management and a cycle detector.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
(clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of
PSI IRQ time accounting in the hotpath
- Fix a task_struct leak due to missing to decrement the refcount when
the task is enqueued before the timer which is supposed to do that,
expires
- Revert an attempt to expedite detaching of movable tasks, as finding
those could become very costly. Turns out the original issue wasn't
even hit by anyone
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaTmqUACgkQEsHwGGHe
VUos3BAAgeZdeFiqop5TuNPURy7DDFpl/Ibwe9Wv1PGvZ70WHT0Aqf6S+woE91+g
uRR9VZnyS7ODUEP4PD43zFeBHbrt6mZkKTyPRKxiylZpJGOp1KGfGmaxPEoi+kC+
3rwphrs7F6cJ0H4mKvqj5+x1jA19L/RZ7LqZ4tZBicwkZXmBnk4Hy9mlO/5Neb2Q
SqhzzCSVgpUW3mVvpPetst8N26R7BTYkejA3RWmCr8xFLB9nyzLBX5uGPtolv1QZ
B5gRtK5ZY2tohdKaShFqdiSUDoUAKuO2WPLSS/ALZDfwK5a+Pue7uGt97OSHhVLt
fTCcPcWDiNj5t/uA7FYXA9wiTQmATUzPtvj2urf/mVupaMLZJiadnQvX0Ya9YrCr
9dyowk7me3326FvUzeqga12cyUtPxElfVsb3KzT1YzSyu7nd/Kezn8Lie41xzZJL
cSqhex3Xl8Eoxvkf2gIy9FnBrx8t47B8SYCSK+JvjeGzIgeBpiHM4zsd5RE5xZ8d
9m2uhlFXV+fIlDrqO1RA9k3yRfvbnuGrIHQUo0B2EKC/u/aSvrQVShQwnKzjM1u7
mXMMyPUxyiDi2VLCTUcKqhLmf77TA0/1px+4dQ7E+ar6tvZxeBUrhoTM9jbNqDWl
g5ShAHBWQXifoIytyhFNM5cfJnkRll97LyKm3LXyG6yZ7VK/JL8=
=gxU0
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix a performance regression when measuring the CPU time of a thread
(clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of
PSI IRQ time accounting in the hotpath
- Fix a task_struct leak due to missing to decrement the refcount when
the task is enqueued before the timer which is supposed to do that,
expires
- Revert an attempt to expedite detaching of movable tasks, as finding
those could become very costly. Turns out the original issue wasn't
even hit by anyone
* tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
sched/deadline: Fix task_struct reference leak
Revert "sched/fair: Make sure to try to detach at least one movable task"
initialized in the code path anyway right after on the ARM arch
timer and the ARM global timer (Li kunyu)
- Fix a race condition in the interrupt leading to a deadlock on the
SH CMT driver. Note that this fix was not tested on the platform
using this timer but the fix seems reasonable enough to be picked
confidently (Niklas Söderlund)
- Increase the rating of the gic-timer and use the configured width
clocksource register on the MIPS architecture (Jiaxun Yang)
- Add the DT bindings for the TMU on the Renesas platforms (Geert
Uytterhoeven)
- Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
Bonnefille)
- Add the rtl-otto timer driver along with the DT bindings for the
Realtek platform (Chris Packham)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAmaRQh0ACgkQqDIjiipP
6E+rfQgAqkAWZ9BjswxV8Fg+Hj+a1cSohKjDczqitQF5rJm25X5VvMwlXVa3XQGm
yemh4tKPpll02LOiYCTyqOWzNrkVS9VsoBd5rrYjRX5aSv7UD35EXklLj4P/INwX
O9CRGD6aK4Xbw66xxheYHSSh+2iRs2x2mq61+/VdcIBlAwpQo+vx7McRoJZZI+2t
NFIXw8RF5dDlmmAaqiB0WnPAtcOK3SDo9fu1LEAX1ZAzvbZriLo7XLnL7ibySWVe
BW1n7Ore6PN5Dvz7jMfTsOQsgAlVv6MPfp/s4EDqMfBLVqXNirzXrdhiee/ahnYP
vyzQyU5HPCMiIYS45mhJF0OyDd3wyw==
=wuYA
-----END PGP SIGNATURE-----
Merge tag 'timers-v6.11-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core
Pull clocksource/event driver updates from Daniel Lezcano:
- Remove unnecessary local variables initialization as they will be
initialized in the code path anyway right after on the ARM arch
timer and the ARM global timer (Li kunyu)
- Fix a race condition in the interrupt leading to a deadlock on the
SH CMT driver. Note that this fix was not tested on the platform
using this timer but the fix seems reasonable enough to be picked
confidently (Niklas Söderlund)
- Increase the rating of the gic-timer and use the configured width
clocksource register on the MIPS architecture (Jiaxun Yang)
- Add the DT bindings for the TMU on the Renesas platforms (Geert
Uytterhoeven)
- Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
Bonnefille)
- Add the rtl-otto timer driver along with the DT bindings for the
Realtek platform (Chris Packham)
Link: https://lore.kernel.org/all/91cd05de-4c5d-4242-a381-3b8a4fe6a2a2@linaro.org
Describe the pool argument in the kernel-doc comment for
swiotlb_del_transient.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZpGVmAAKCRDbK58LschI
gxB4AQCgquQis63yqTI36j4iXBT+TuxHEBNoQBSLyzYdrLS1dgD/S5DRJDA+3LD+
394hn/VtB1qvX5vaqjsov4UIwSMyxA0=
=OhSn
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-07-12
We've added 23 non-merge commits during the last 3 day(s) which contain
a total of 18 files changed, 234 insertions(+), 243 deletions(-).
The main changes are:
1) Improve BPF verifier by utilizing overflow.h helpers to check
for overflows, from Shung-Hsi Yu.
2) Fix NULL pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
when attr->attach_prog_fd was not specified, from Tengda Wu.
3) Fix arm64 BPF JIT when generating code for BPF trampolines with
BPF_TRAMP_F_CALL_ORIG which corrupted upper address bits,
from Puranjay Mohan.
4) Remove test_run callback from lwt_seg6local_prog_ops which never worked
in the first place and caused syzbot reports,
from Sebastian Andrzej Siewior.
5) Relax BPF verifier to accept non-zero offset on KF_TRUSTED_ARGS/
/KF_RCU-typed BPF kfuncs, from Matt Bobrowski.
6) Fix a long standing bug in libbpf with regards to handling of BPF
skeleton's forward and backward compatibility, from Andrii Nakryiko.
7) Annotate btf_{seq,snprintf}_show functions with __printf,
from Alan Maguire.
8) BPF selftest improvements to reuse common network helpers in sk_lookup
test and dropping the open-coded inetaddr_len() and make_socket() ones,
from Geliang Tang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
selftests/bpf: Test for null-pointer-deref bugfix in resolve_prog_type()
bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
selftests/bpf: DENYLIST.aarch64: Skip fexit_sleep again
bpf: use check_sub_overflow() to check for subtraction overflows
bpf: use check_add_overflow() to check for addition overflows
bpf: fix overflow check in adjust_jmp_off()
bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o
bpf: annotate BTF show functions with __printf
bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG
selftests/bpf: Close obj in error path in xdp_adjust_tail
selftests/bpf: Null checks for links in bpf_tcp_ca
selftests/bpf: Use connect_fd_to_fd in sk_lookup
selftests/bpf: Use start_server_addr in sk_lookup
selftests/bpf: Use start_server_str in sk_lookup
selftests/bpf: Close fd in error path in drop_on_reuseport
selftests/bpf: Add ASSERT_OK_FD macro
selftests/bpf: Add backlog for network_helper_opts
selftests/bpf: fix compilation failure when CONFIG_NF_FLOW_TABLE=m
bpf: Remove tst_run from lwt_seg6local_prog_ops.
bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU
...
====================
Link: https://patch.msgid.link/20240712212448.5378-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace the deprecated[1] use of strncpy() in bacct_add_tsk(). Since this
is UAPI, include trailing padding in the copy.
Link: https://github.com/KSPP/linux/issues/90 [1]
Link: https://lkml.kernel.org/r/20240711171308.work.995-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Cc: "Dr. Thomas Orgis" <thomas.orgis@uni-hamburg.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ismael Luceno <ismael@iodev.co.uk>
Cc: Peng Liu <liupeng256@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On powerpc 8xx, when a page is 8M size, the information is in the PMD
entry. So allow architectures to provide __pte_leaf_size() instead of
pte_leaf_size() and provide the PMD entry to that function.
When __pte_leaf_size() is not defined, define it as a pte_leaf_size() so
that architectures not interested in the PMD arguments are not impacted.
Only define a default pte_leaf_size() when __pte_leaf_size() is not
defined to make sure nobody adds new calls to pte_leaf_size() in the core.
Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently the event counting provided by misc.events is hierarchical,
it's not practical if user is only concerned with events of a
specified cgroup. Therefore, introduce misc.events.local collect events
specific to the given cgroup.
This is analogous to memory.events.local and pids.events.local.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>