Commit Graph

39300 Commits

Author SHA1 Message Date
Neeraj Upadhyay
a03ae49c47 rcu/tree: Add comment to describe GP-done condition in fqs loop
Add a comment to explain why !rcu_preempt_blocked_readers_cgp() condition
is required on root rnp node, for GP completion check in rcu_gp_fqs_loop().

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Paul E. McKenney
9bdb5b3a8d rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
This commit saves a line of code by initializing the rcu_gp_fqs()
function's first_gp_fqs local variable in its declaration.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Joel Fernandes (Google)
82d26c36cc rcu/kvfree: Remove useless monitor_todo flag
monitor_todo is not needed as the work struct already tracks
if work is pending. Just use that to know if work is pending
using schedule_delayed_work() helper.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:40:00 -07:00
Zqiang
e2bb1288a3 rcu: Cleanup RCU urgency state for offline CPU
When a CPU is slow to provide a quiescent state for a given grace
period, RCU takes steps to encourage that CPU to get with the
quiescent-state program in a more timely fashion.  These steps
include these flags in the rcu_data structure:

1.	->rcu_urgent_qs, which causes the scheduling-clock interrupt to
	request an otherwise pointless context switch from the scheduler.

2.	->rcu_need_heavy_qs, which causes both cond_resched() and RCU's
	context-switch hook to do an immediate momentary quiscent state.

3.	->rcu_need_heavy_qs, which causes the scheduler-clock tick to
	be enabled even on nohz_full CPUs with only one runnable task.

These flags are of course cleared once the corresponding CPU has passed
through a quiescent state.  Unless that quiescent state is the CPU
going offline, which means that when the CPU comes back online, it will
needlessly consume additional CPU time and incur additional latency,
which constitutes a minor but very real performance bug.

This commit therefore adds the call to rcu_disable_urgency_upon_qs()
that clears these flags to the CPU-hotplug offlining code path.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:40:00 -07:00
Johannes Berg
800d6acf40 rcu: tiny: Record kvfree_call_rcu() call stack for KASAN
When running KASAN with Tiny RCU (e.g. under ARCH=um, where
a working KASAN patch is now available), we don't get any
information on the original kfree_rcu() (or similar) caller
when a problem is reported, as Tiny RCU doesn't record this.

Add the recording, which required pulling kvfree_call_rcu()
out of line for the KASAN case since the recording function
(kasan_record_aux_stack_noalloc) is neither exported, nor
can we include kasan.h into rcutiny.h.

without KASAN, the patch has no size impact (ARCH=um kernel):
    text       data         bss         dec        hex    filename
 6151515    4423154    33148520    43723189    29b29b5    linux
 6151515    4423154    33148520    43723189    29b29b5    linux + patch

with KASAN, the impact on my build was minimal:
    text       data         bss         dec        hex    filename
13915539    7388050    33282304    54585893    340ea25    linux
13911266    7392114    33282304    54585684    340e954    linux + patch
   -4273      +4064         +-0        -209

Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Chen Zhongjin
9c9b26b0df locking/csd_lock: Change csdlock_debug from early_param to __setup
The csdlock_debug kernel-boot parameter is parsed by the
early_param() function csdlock_debug().  If set, csdlock_debug()
invokes static_branch_enable() to enable csd_lock_wait feature, which
triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and
CONFIG_SPARSEMEM_VMEMMAP=n.

With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in
static_key_enable() and returns NULL, resulting in a NULL dereference
because mem_section is initialized only later in sparse_init().

This is also a problem for powerpc because early_param() functions
are invoked earlier than jump_label_init(), also resulting in
static_key_enable() failures.  These failures cause the warning "static
key 'xxx' used before call to jump_label_init()".

Thus, early_param is too early for csd_lock_wait to run
static_branch_enable(), so changes it to __setup to fix these.

Fixes: 8d0968cc6b ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging")
Cc: stable@vger.kernel.org
Reported-by: Chen jingwen <chenjingwen6@huawei.com>
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Paul E. McKenney
b3ade95b8e rcu: Forbid RCU_STRICT_GRACE_PERIOD in TINY_RCU kernels
The RCU_STRICT_GRACE_PERIOD Kconfig option does nothing in kernels
built with CONFIG_TINY_RCU=y, so this commit adjusts the dependencies
to disallow this combination.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:39:59 -07:00
Zqiang
70a82c3c55 rcu: Immediately boost preempted readers for strict grace periods
The intent of the CONFIG_RCU_STRICT_GRACE_PERIOD Konfig option is to
cause normal grace periods to complete quickly in order to better catch
errors resulting from improperly leaking pointers from RCU read-side
critical sections.  However, kernels built with this option enabled still
wait for some hundreds of milliseconds before boosting RCU readers that
have been preempted within their current critical section.  The value
of this delay is set by the CONFIG_RCU_BOOST_DELAY Kconfig option,
which defaults to 500 milliseconds.

This commit therefore causes kernels build with strict grace periods
to ignore CONFIG_RCU_BOOST_DELAY.  This causes rcu_initiate_boost()
to start boosting immediately after all CPUs on a given leaf rcu_node
structure have passed through their quiescent states.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:39:59 -07:00
Zqiang
52c1d81ee2 rcu: Add rnp->cbovldmask check in rcutree_migrate_callbacks()
Currently, the rcu_node structure's ->cbovlmask field is set in call_rcu()
when a given CPU is suffering from callback overload.  But if that CPU
goes offline, the outgoing CPU's callbacks is migrated to the running
CPU, which is likely to overload the running CPU.  However, that CPU's
bit in its leaf rcu_node structure's ->cbovlmask field remains zero.

Initially, this is OK because the outgoing CPU's bit remains set.
However, that bit will be cleared at the next end of a grace period,
at which time it is quite possible that the running CPU will still
be overloaded.  If the running CPU invokes call_rcu(), then overload
will be checked for and the bit will be set.  Except that there is no
guarantee that the running CPU will invoke call_rcu(), in which case the
next grace period will fail to take the running CPU's overload condition
into account.  Plus, because the bit is not set, the end of the grace
period won't check for overload on this CPU.

This commit therefore adds a call to check_cb_ovld_locked() in
rcutree_migrate_callbacks() to set the running CPU's ->cbovlmask bit
appropriately.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:39:59 -07:00
Patrick Wang
48f8070f5d rcu: Avoid tracing a few functions executed in stop machine
Stop-machine recently started calling additional functions while waiting:

----------------------------------------------------------------
Former stop machine wait loop:
do {
    cpu_relax(); => macro
    ...
} while (curstate != STOPMACHINE_EXIT);
-----------------------------------------------------------------
Current stop machine wait loop:
do {
    stop_machine_yield(cpumask); => function (notraced)
    ...
    touch_nmi_watchdog(); => function (notraced, inside calls also notraced)
    ...
    rcu_momentary_dyntick_idle(); => function (notraced, inside calls traced)
} while (curstate != MULTI_STOP_EXIT);
------------------------------------------------------------------

These functions (and the functions that they call) must be marked
notrace to prevent them from being updated while they are executing.
The consequences of failing to mark these functions can be severe:

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu: 	1-...!: (0 ticks this GP) idle=14f/1/0x4000000000000000 softirq=3397/3397 fqs=0
  rcu: 	3-...!: (0 ticks this GP) idle=ee9/1/0x4000000000000000 softirq=5168/5168 fqs=0
  	(detected by 0, t=8137 jiffies, g=5889, q=2 ncpus=4)
  Task dump for CPU 1:
  task:migration/1     state:R  running task     stack:    0 pid:   19 ppid:     2 flags:0x00000000
  Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
  Call Trace:
  Task dump for CPU 3:
  task:migration/3     state:R  running task     stack:    0 pid:   29 ppid:     2 flags:0x00000000
  Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
  Call Trace:
  rcu: rcu_preempt kthread timer wakeup didn't happen for 8136 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
  rcu: 	Possible timer handling issue on cpu=2 timer-softirq=594
  rcu: rcu_preempt kthread starved for 8137 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
  rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
  rcu: RCU grace-period kthread stack dump:
  task:rcu_preempt     state:I stack:    0 pid:   14 ppid:     2 flags:0x00000000
  Call Trace:
    schedule+0x56/0xc2
    schedule_timeout+0x82/0x184
    rcu_gp_fqs_loop+0x19a/0x318
    rcu_gp_kthread+0x11a/0x140
    kthread+0xee/0x118
    ret_from_exception+0x0/0x14
  rcu: Stack dump where RCU GP kthread last ran:
  Task dump for CPU 2:
  task:migration/2     state:R  running task     stack:    0 pid:   24 ppid:     2 flags:0x00000000
  Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
  Call Trace:

This commit therefore marks these functions notrace:
 rcu_preempt_deferred_qs()
 rcu_preempt_need_deferred_qs()
 rcu_preempt_deferred_qs_irqrestore()

[ paulmck: Apply feedback from Neeraj Upadhyay. ]

Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:39:59 -07:00
Paul E. McKenney
fb77dccfc7 rcu: Decrease FQS scan wait time in case of callback overloading
The force-quiesce-state loop function rcu_gp_fqs_loop() checks for
callback overloading and does an immediate initial scan for idle CPUs
if so.  However, subsequent rescans will be carried out at as leisurely a
rate as they always are, as specified by the rcutree.jiffies_till_next_fqs
module parameter.  It might be tempting to just continue immediately
rescanning, but this turns the RCU grace-period kthread into a CPU hog.
It might also be tempting to reduce the time between rescans to a single
jiffy, but this can be problematic on larger systems.

This commit therefore divides the normal time between rescans by three,
rounding up.  Thus a small system running at HZ=1000 that is suffering
from callback overload will wait only one jiffy instead of the normal
three between rescans.

[ paulmck: Apply Neeraj Upadhyay feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:39:59 -07:00
Neeraj Upadhyay
4f2bfd9494 srcu: Make expedited RCU grace periods block even less frequently
The purpose of commit 282d8998e9 ("srcu: Prevent expedited GPs
and blocking readers from consuming CPU") was to prevent a long
series of never-blocking expedited SRCU grace periods from blocking
kernel-live-patching (KLP) progress.  Although it was successful, it also
resulted in excessive boot times on certain embedded workloads running
under qemu with the "-bios QEMU_EFI.fd" command line.  Here "excessive"
means increasing the boot time up into the three-to-four minute range.
This increase in boot time was due to the more than 6000 back-to-back
invocations of synchronize_rcu_expedited() within the KVM host OS, which
in turn resulted from qemu's emulation of a long series of MMIO accesses.

Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace
periods") did not significantly help this particular use case.

Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values
of non-sleeping per phase counts on a system with preemption enabled,
and observed the following boot times:

+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
+──────────────────────────+────────────────+
| 100                      | 30.053         |
| 150                      | 25.151         |
| 200                      | 20.704         |
| 250                      | 15.748         |
| 500                      | 11.401         |
| 1000                     | 11.443         |
| 10000                    | 11.258         |
| 1000000                  | 11.154         |
+──────────────────────────+────────────────+

Analysis on the experiment results show additional improvements with
CPU-bound delays approaching one jiffy in duration. This improvement was
also seen when number of per-phase iterations were scaled to one jiffy.

This commit therefore scales per-grace-period phase number of non-sleeping
polls so that non-sleeping polls extend for about one jiffy. In addition,
the delay-calculation call to srcu_get_delay() in srcu_gp_end() is
replaced with a simple check for an expedited grace period.  This change
schedules callback invocation immediately after expedited grace periods
complete, which results in greatly improved boot times.  Testing done
by Marc and Zhangfei confirms that this change recovers most of the
performance degradation in boottime; for CONFIG_HZ_250 configuration,
specifically, boot times improve from 3m50s to 41s on Marc's setup;
and from 2m40s to ~9.7s on Zhangfei's setup.

In addition to the changes to default per phase delays, this
change adds 3 new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
This allows users to configure the srcu grace period scanning delays in
order to more quickly react to additional use cases.

Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:39:59 -07:00
Paul E. McKenney
8f870e6eb8 srcu: Block less aggressively for expedited grace periods
Commit 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers
from consuming CPU") fixed a problem where a long-running expedited SRCU
grace period could block kernel live patching.  It did so by giving up
on expediting once a given SRCU expedited grace period grew too old.

Unfortunately, this added excessive delays to boots of virtual embedded
systems specifying "-bios QEMU_EFI.fd" to qemu.  This commit therefore
makes the transition away from expediting less aggressive, increasing
the per-grace-period phase number of non-sleeping polls of readers from
one to three and increasing the required grace-period age from one jiffy
(actually from zero to one jiffies) to two jiffies (actually from one
to two jiffies).

Fixes: 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: chenxiang (M)" <chenxiang66@hisilicon.com>
Cc: Shameerali Kolothum Thodi  <shameerali.kolothum.thodi@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
2022-07-19 11:39:59 -07:00
Linus Torvalds
727c3991df Merge tag 'sched-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix plugging a race between sched_setscheduler()
  and balance_push().

  sched_setscheduler() spliced the balance callbacks accross a lock
  break which makes it possible for an interleaving schedule() to
  observe an empty list"

* tag 'sched-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix balance_push() vs __sched_setscheduler()
2022-06-19 09:51:00 -05:00
Linus Torvalds
4afb65156a Merge tag 'locking-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull lockdep fix from Thomas Gleixner:
 "A RT fix for lockdep.

  lockdep invokes prandom_u32() to create cookies. This worked until
  prandom_u32() was switched to the real random generator, which takes a
  spinlock for extraction, which does not work on RT when invoked from
  atomic contexts.

  lockdep has no requirement for real random numbers and it turns out
  sched_clock() is good enough to create the cookie. That works
  everywhere and is faster"

* tag 'locking-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Use sched_clock() for random numbers
2022-06-19 09:47:41 -05:00
Linus Torvalds
36da9f5fb6 Merge tag 'irq-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of interrupt subsystem updates:

  Core:

   - Ensure runtime power management for chained interrupts

  Drivers:

   - A collection of OF node refcount fixes

   - Unbreak MIPS uniprocessor builds

   - Fix xilinx interrupt controller Kconfig dependencies

   - Add a missing compatible string to the Uniphier driver"

* tag 'irq-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/loongson-liointc: Use architecture register to get coreid
  irqchip/uniphier-aidet: Add compatible string for NX1 SoC
  dt-bindings: interrupt-controller/uniphier-aidet: Add bindings for NX1 SoC
  irqchip/realtek-rtl: Fix refcount leak in map_interrupts
  irqchip/gic-v3: Fix refcount leak in gic_populate_ppi_partitions
  irqchip/gic-v3: Fix error handling in gic_populate_ppi_partitions
  irqchip/apple-aic: Fix refcount leak in aic_of_ic_init
  irqchip/apple-aic: Fix refcount leak in build_fiq_affinity
  irqchip/gic/realview: Fix refcount leak in realview_gic_of_init
  irqchip/xilinx: Remove microblaze+zynq dependency
  genirq: PM: Use runtime PM for chained interrupts
2022-06-19 09:45:16 -05:00
Linus Torvalds
93d17c1c8c Merge tag 'printk-for-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fixes from Petr Mladek:
 "Make the global console_sem available for CPU that is handling panic()
  or shutdown.

  This is an old problem when an existing console lock owner might block
  console output, but it became more visible with the kthreads"

* tag 'printk-for-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Wait for the global console lock when the system is going down
  printk: Block console kthreads when direct printing will be required
2022-06-17 14:57:42 -05:00
Petr Mladek
38335cc5ff Merge branch 'rework/kthreads' into for-linus 2022-06-17 16:36:48 +02:00
Linus Torvalds
0639b599f6 Merge tag 'audit-pr-20220616' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
 "A single audit patch to fix a problem where we were not properly
  freeing memory allocated when recording information related to a
  module load"

* tag 'audit-pr-20220616' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: free module name
2022-06-16 15:53:38 -07:00
Christian Göttsche
ef79c396c6 audit: free module name
Reset the type of the record last as the helper `audit_free_module()`
depends on it.

    unreferenced object 0xffff888153b707f0 (size 16):
      comm "modprobe", pid 1319, jiffies 4295110033 (age 1083.016s)
      hex dump (first 16 bytes):
        62 69 6e 66 6d 74 5f 6d 69 73 63 00 6b 6b 6b a5  binfmt_misc.kkk.
      backtrace:
        [<ffffffffa07dbf9b>] kstrdup+0x2b/0x50
        [<ffffffffa04b0a9d>] __audit_log_kern_module+0x4d/0xf0
        [<ffffffffa03b6664>] load_module+0x9d4/0x2e10
        [<ffffffffa03b8f44>] __do_sys_finit_module+0x114/0x1b0
        [<ffffffffa1f47124>] do_syscall_64+0x34/0x80
        [<ffffffffa200007e>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

Cc: stable@vger.kernel.org
Fixes: 12c5e81d3f ("audit: prepare audit_context for use in calling contexts beyond syscalls")
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-06-15 19:28:44 -04:00
Petr Mladek
b87f02307d printk: Wait for the global console lock when the system is going down
There are reports that the console kthreads block the global console
lock when the system is going down, for example, reboot, panic.

First part of the solution was to block kthreads in these problematic
system states so they stopped handling newly added messages.

Second part of the solution is to wait when for the kthreads when
they are actively printing. It solves the problem when a message
was printed before the system entered the problematic state and
the kthreads managed to step in.

A busy waiting has to be used because panic() can be called in any
context and in an unknown state of the scheduler.

There must be a timeout because the kthread might get stuck or sleeping
and never release the lock. The timeout 10s is an arbitrary value
inspired by the softlockup timeout.

Link: https://lore.kernel.org/r/20220610205038.GA3050413@paulmck-ThinkPad-P17-Gen-1
Link: https://lore.kernel.org/r/CAMdYzYpF4FNTBPZsEFeWRuEwSies36QM_As8osPWZSr2q-viEA@mail.gmail.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220615162805.27962-3-pmladek@suse.com
2022-06-15 22:04:15 +02:00
Petr Mladek
c3230283e2 printk: Block console kthreads when direct printing will be required
There are known situations when the console kthreads are not
reliable or does not work in principle, for example, early boot,
panic, shutdown.

For these situations there is the direct (legacy) mode when printk() tries
to get console_lock() and flush the messages directly. It works very well
during the early boot when the console kthreads are not available at all.
It gets more complicated in the other situations when console kthreads
might be actively printing and block console_trylock() in printk().

The same problem is in the legacy code as well. Any console_lock()
owner could block console_trylock() in printk(). It is solved by
a trick that the current console_lock() owner is responsible for
printing all pending messages. It is actually the reason why there
is the risk of softlockups and why the console kthreads were
introduced.

The console kthreads use the same approach. They are responsible
for printing the messages by definition. So that they handle
the messages anytime when they are awake and see new ones.
The global console_lock is available when there is nothing
to do.

It should work well when the problematic context is correctly
detected and printk() switches to the direct mode. But it seems
that it is not enough in practice. There are reports that
the messages are not printed during panic() or shutdown()
even though printk() tries to use the direct mode here.

The problem seems to be that console kthreads become active in these
situation as well. They steel the job before other CPUs are stopped.
Then they are stopped in the middle of the job and block the global
console_lock.

First part of the solution is to block console kthreads when
the system is in a problematic state and requires the direct
printk() mode.

Link: https://lore.kernel.org/r/20220610205038.GA3050413@paulmck-ThinkPad-P17-Gen-1
Link: https://lore.kernel.org/r/CAMdYzYpF4FNTBPZsEFeWRuEwSies36QM_As8osPWZSr2q-viEA@mail.gmail.com
Suggested-by: John Ogness <john.ogness@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220615162805.27962-2-pmladek@suse.com
2022-06-15 22:03:38 +02:00
Sami Tolvanen
57cd6d157e cfi: Fix __cfi_slowpath_diag RCU usage with cpuidle
RCU_NONIDLE usage during __cfi_slowpath_diag can result in an invalid
RCU state in the cpuidle code path:

  WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter+0xe4/0x138
  ...
  Call trace:
    rcu_eqs_enter+0xe4/0x138
    rcu_idle_enter+0xa8/0x100
    cpuidle_enter_state+0x154/0x3a8
    cpuidle_enter+0x3c/0x58
    do_idle.llvm.6590768638138871020+0x1f4/0x2ec
    cpu_startup_entry+0x28/0x2c
    secondary_start_kernel+0x1b8/0x220
    __secondary_switched+0x94/0x98

Instead, call rcu_irq_enter/exit to wake up RCU only when needed and
disable interrupts for the entire CFI shadow/module check when we do.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20220531175910.890307-1-samitolvanen@google.com
Fixes: cf68fffb66 ("add support for Clang CFI")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2022-06-13 09:18:46 -07:00
Sebastian Andrzej Siewior
4051a81774 locking/lockdep: Use sched_clock() for random numbers
Since the rewrote of prandom_u32(), in the commit mentioned below, the
function uses sleeping locks which extracing random numbers and filling
the batch.
This breaks lockdep on PREEMPT_RT because lock_pin_lock() disables
interrupts while calling __lock_pin_lock(). This can't be moved earlier
because the main user of the function (rq_pin_lock()) invokes that
function after disabling interrupts in order to acquire the lock.

The cookie does not require random numbers as its goal is to provide a
random value in order to notice unexpected "unlock + lock" sites.

Use sched_clock() to provide random numbers.

Fixes: a0103f4d86f88 ("random32: use real rng for non-deterministic randomness")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YoNn3pTkm5+QzE5k@linutronix.de
2022-06-13 10:29:57 +02:00
Peter Zijlstra
04193d590b sched: Fix balance_push() vs __sched_setscheduler()
The purpose of balance_push() is to act as a filter on task selection
in the case of CPU hotplug, specifically when taking the CPU out.

It does this by (ab)using the balance callback infrastructure, with
the express purpose of keeping all the unlikely/odd cases in a single
place.

In order to serve its purpose, the balance_push_callback needs to be
(exclusively) on the callback list at all times (noting that the
callback always places itself back on the list the moment it runs,
also noting that when the CPU goes down, regular balancing concerns
are moot, so ignoring them is fine).

And here-in lies the problem, __sched_setscheduler()'s use of
splice_balance_callbacks() takes the callbacks off the list across a
lock-break, making it possible for, an interleaving, __schedule() to
see an empty list and not get filtered.

Fixes: ae79270232 ("sched: Optimize finish_lock_switch()")
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net
2022-06-13 10:15:07 +02:00
Linus Torvalds
b0cb8db396 Merge tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "Tetsuo's patch to trigger build warnings if system-wide wq's are
  flushed along with a TP type update and trivial comment update"

* tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Switch to new kerneldoc syntax for named variable macro argument
  workqueue: Fix type of cpu in trace event
  workqueue: Wrap flush_workqueue() using a macro
2022-06-12 11:16:00 -07:00
Linus Torvalds
1bc27dec7e Merge tag 'pm-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix an intel_idle issue introduced during the 5.16 development
  cycle and two recent regressions in the system reboot/poweroff code.

  Specifics:

   - Fix CPUIDLE_FLAG_IRQ_ENABLE handling in intel_idle (Peter Zijlstra)

   - Allow all platforms to use the global poweroff handler and make
     non-syscall poweroff code paths work again (Dmitry Osipenko)"

* tag 'pm-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
  kernel/reboot: Fix powering off using a non-syscall code paths
  kernel/reboot: Use static handler for register_platform_power_off()
2022-06-10 11:49:27 -07:00
Rafael J. Wysocki
67e59f8d01 Merge branch 'pm-sysoff'
Merge fixes for regressions introduced by the recent rework of the
system reboot/poweroff code.

* pm-sysoff:
  kernel/reboot: Fix powering off using a non-syscall code paths
  kernel/reboot: Use static handler for register_platform_power_off()
2022-06-10 20:24:10 +02:00
Linus Torvalds
f2ecc964b9 Merge tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:

 - a small cleanup removing "export" of an __init function

 - a small series adding a new infrastructure for platform flags

 - a series adding generic virtio support for Xen guests (frontend side)

* tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: unexport __init-annotated xen_xlate_map_ballooned_pages()
  arm/xen: Assign xen-grant DMA ops for xen-grant DMA devices
  xen/grant-dma-ops: Retrieve the ID of backend's domain for DT devices
  xen/grant-dma-iommu: Introduce stub IOMMU driver
  dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
  xen/virtio: Enable restricted memory access using Xen grant mappings
  xen/grant-dma-ops: Add option to restrict memory access under Xen
  xen/grants: support allocating consecutive grants
  arm/xen: Introduce xen_setup_dma_ops()
  virtio: replace arch_has_restricted_virtio_memory_access()
  kernel: add platform_has() infrastructure
2022-06-10 09:57:11 -07:00
Linus Torvalds
825464e79d Merge tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - eth: amt: fix possible null-ptr-deref in amt_rcv()

  Previous releases - regressions:

   - tcp: use alloc_large_system_hash() to allocate table_perturb

   - af_unix: fix a data-race in unix_dgram_peer_wake_me()

   - nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling

   - eth: ixgbe: fix unexpected VLAN rx in promisc mode on VF

  Previous releases - always broken:

   - ipv6: fix signed integer overflow in __ip6_append_data

   - netfilter:
       - nat: really support inet nat without l3 address
       - nf_tables: memleak flow rule from commit path

   - bpf: fix calling global functions from BPF_PROG_TYPE_EXT programs

   - openvswitch: fix misuse of the cached connection on tuple changes

   - nfc: nfcmrvl: fix memory leak in nfcmrvl_play_deferred

   - eth: altera: fix refcount leak in altera_tse_mdio_create

  Misc:

   - add Quentin Monnet to bpftool maintainers"

* tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  net: amd-xgbe: fix clang -Wformat warning
  tcp: use alloc_large_system_hash() to allocate table_perturb
  net: dsa: realtek: rtl8365mb: fix GMII caps for ports with internal PHY
  net: dsa: mv88e6xxx: correctly report serdes link failure
  net: dsa: mv88e6xxx: fix BMSR error to be consistent with others
  net: dsa: mv88e6xxx: use BMSR_ANEGCOMPLETE bit for filling an_complete
  net: altera: Fix refcount leak in altera_tse_mdio_create
  net: openvswitch: fix misuse of the cached connection on tuple changes
  net: ethernet: mtk_eth_soc: fix misuse of mem alloc interface netdev[napi]_alloc_frag
  ip_gre: test csum_start instead of transport header
  au1000_eth: stop using virt_to_bus()
  ipv6: Fix signed integer overflow in l2tp_ip6_sendmsg
  ipv6: Fix signed integer overflow in __ip6_append_data
  nfc: nfcmrvl: Fix memory leak in nfcmrvl_play_deferred
  nfc: st21nfca: fix incorrect sizing calculations in EVT_TRANSACTION
  nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling
  nfc: st21nfca: fix incorrect validating logic in EVT_TRANSACTION
  net: ipv6: unexport __init-annotated seg6_hmac_init()
  net: xfrm: unexport __init-annotated xfrm4_protocol_init()
  net: mdio: unexport __init-annotated mdio_bus_init()
  ...
2022-06-09 12:06:52 -07:00
Marc Zyngier
668a9fe5c6 genirq: PM: Use runtime PM for chained interrupts
When requesting an interrupt, we correctly call into the runtime
PM framework to guarantee that the underlying interrupt controller
is up and running.

However, we fail to do so for chained interrupt controllers, as
the mux interrupt is not requested along the same path.

Augment __irq_do_set_handler() to call into the runtime PM code
in this case, making sure the PM flow is the same for all interrupts.

Reported-by: Lucas Stach <l.stach@pengutronix.de>
Tested-by: Liu Ying <victor.liu@nxp.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/26973cddee5f527ea17184c0f3fccb70bc8969a0.camel@pengutronix.de
2022-06-09 15:58:13 +01:00
Linus Torvalds
34f4335c16 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:

 - syzkaller NULL pointer dereference

 - TDP MMU performance issue with disabling dirty logging

 - 5.14 regression with SVM TSC scaling

 - indefinite stall on applying live patches

 - unstable selftest

 - memory leak from wrong copy-and-paste

 - missed PV TLB flush when racing with emulation

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  KVM: x86: do not set st->preempted when going back to user space
  KVM: SVM: fix tsc scaling cache logic
  KVM: selftests: Make hyperv_clock selftest more stable
  KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
  x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
  KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
  entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  KVM: Don't null dereference ops->destroy
2022-06-08 09:16:31 -07:00
Dmitry Osipenko
2b8c612c61 kernel/reboot: Fix powering off using a non-syscall code paths
There are other methods of powering off machine than the reboot syscall.
Previously we missed to cover those methods and it created power-off
regression for some machines, like the PowerPC e500.

Fix this problem by moving the legacy sys-off handler registration to
the latest phase of power-off process and making the kernel_can_power_off()
check the legacy pm_power_off presence.

Tested-by: Michael Ellerman <mpe@ellerman.id.au> # ppce500
Reported-by: Michael Ellerman <mpe@ellerman.id.au> # ppce500
Fixes: da007f171f ("kernel/reboot: Change registration order of legacy power-off handler")
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-07 19:42:31 +02:00
Toke Høiland-Jørgensen
f858c2b2ca bpf: Fix calling global functions from BPF_PROG_TYPE_EXT programs
The verifier allows programs to call global functions as long as their
argument types match, using BTF to check the function arguments. One of the
allowed argument types to such global functions is PTR_TO_CTX; however the
check for this fails on BPF_PROG_TYPE_EXT functions because the verifier
uses the wrong type to fetch the vmlinux BTF ID for the program context
type. This failure is seen when an XDP program is loaded using
libxdp (which loads it as BPF_PROG_TYPE_EXT and attaches it to a global XDP
type program).

Fix the issue by passing in the target program type instead of the
BPF_PROG_TYPE_EXT type to bpf_prog_get_ctx() when checking function
argument compatibility.

The first Fixes tag refers to the latest commit that touched the code in
question, while the second one points to the code that first introduced
the global function call verification.

v2:
- Use resolve_prog_type()

Fixes: 3363bd0cfb ("bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support")
Fixes: 51c39bb1d5 ("bpf: Introduce function-by-function verification")
Reported-by: Simon Sundberg <simon.sundberg@kau.se>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20220606075253.28422-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:41:20 -07:00
Dan Carpenter
fd58f7df24 bpf: Use safer kvmalloc_array() where possible
The kvmalloc_array() function is safer because it has a check for
integer overflows.  These sizes come from the user and I was not
able to see any bounds checking so an integer overflow seems like a
realistic concern.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/Yo9VRVMeHbALyjUH@kili
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:40:53 -07:00
Tetsuo Handa
c4f135d643 workqueue: Wrap flush_workqueue() using a macro
Since flush operation synchronously waits for completion, flushing
system-wide WQs (e.g. system_wq) might introduce possibility of deadlock
due to unexpected locking dependency. Tejun Heo commented at [1] that it
makes no sense at all to call flush_workqueue() on the shared WQs as the
caller has no idea what it's gonna end up waiting for.

Although there is flush_scheduled_work() which flushes system_wq WQ with
"Think twice before calling this function! It's very easy to get into
trouble if you don't take great care." warning message, syzbot found a
circular locking dependency caused by flushing system_wq WQ [2].

Therefore, let's change the direction to that developers had better use
their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is
inevitable.

Steps for converting system-wide WQs into local WQs are explained at [3],
and a conversion to stop flushing system-wide WQs is in progress. Now we
want some mechanism for preventing developers who are not aware of this
conversion from again start flushing system-wide WQs.

Since I found that WARN_ON() is complete but awkward approach for teaching
developers about this problem, let's use __compiletime_warning() for
incomplete but handy approach. For completeness, we will also insert
WARN_ON() into __flush_workqueue() after all in-tree users stopped calling
flush_scheduled_work().

Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-07 07:07:14 -10:00
Seth Forshee
3e684903a8 entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
A livepatch transition may stall indefinitely when a kvm vCPU is heavily
loaded. To the host, the vCPU task is a user thread which is spending a
very long time in the ioctl(KVM_RUN) syscall. During livepatch
transition, set_notify_signal() will be called on such tasks to
interrupt the syscall so that the task can be transitioned. This
interrupts guest execution, but when xfer_to_guest_mode_work() sees that
TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
exit to user mode is unnecessary, and guest execution is resumed without
transitioning the task for the livepatch.

This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
is expected to break tasks out of interruptible kernel loops and cause
them to return to userspace. Change xfer_to_guest_mode_work() to handle
TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
loop that an exit to userpsace is needed. Any pending task_work will be
run when get_signal() is called from exit_to_user_mode_loop(), so there
is no longer any need to run task work from xfer_to_guest_mode_work().

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:19:00 -04:00
Linus Torvalds
e71e60cd74 Merge tag 'dma-mapping-5.19-2022-06-06' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - fix a regressin in setting swiotlb ->force_bounce (me)

 - make dma-debug less chatty (Rob Clark)

* tag 'dma-mapping-5.19-2022-06-06' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix setting ->force_bounce
  dma-debug: make things less spammy under memory pressure
2022-06-06 17:56:18 -07:00
Juergen Gross
2130a790ca kernel: add platform_has() infrastructure
Add a simple infrastructure for setting, resetting and querying
platform feature flags.

Flags can be either global or architecture specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-06-06 08:06:00 +02:00
Linus Torvalds
e17fee8976 Merge tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull delay-accounting update from Andrew Morton:
 "A single featurette for delay accounting.

  Delayed a bit because, unusually, it had dependencies on both the
  mm-stable and mm-nonmm-stable queues"

* tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  delayacct: track delays from write-protect copy
2022-06-05 16:58:27 -07:00
Linus Torvalds
bc1e02c3e5 Merge tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Fix the fallout of sysctl code move which placed the init function
  wrong"

* tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Fix sysctl move
2022-06-05 10:42:40 -07:00
Linus Torvalds
fa11c28046 Merge tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:

  - Make the ICL event constraints match reality

  - Remove a unused local variable

* tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Remove unused local variable
  perf/x86/intel: Fix event constraints for ICL
2022-06-05 10:40:31 -07:00
Linus Torvalds
cbd76edeab Merge tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount handling updates from Al Viro:
 "Cleanups (and one fix) around struct mount handling.

  The fix is usermode_driver.c one - once you've done kern_mount(), you
  must kern_unmount(); simple mntput() will end up with a leak. Several
  failure exits in there messed up that way... In practice you won't hit
  those particular failure exits without fault injection, though"

* tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move mount-related externs from fs.h to mount.h
  blob_to_mnt(): kern_unmount() is needed to undo kern_mount()
  m->mnt_root->d_inode->i_sb is a weird way to spell m->mnt_sb...
  linux/mount.h: trim includes
  uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
2022-06-04 19:00:05 -07:00
Linus Torvalds
67850b7bdc Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace_stop cleanups from Eric Biederman:
 "While looking at the ptrace problems with PREEMPT_RT and the problems
  Peter Zijlstra was encountering with ptrace in his freezer rewrite I
  identified some cleanups to ptrace_stop that make sense on their own
  and move make resolving the other problems much simpler.

  The biggest issue is the habit of the ptrace code to change
  task->__state from the tracer to suppress TASK_WAKEKILL from waking up
  the tracee. No other code in the kernel does that and it is straight
  forward to update signal_wake_up and friends to make that unnecessary.

  Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
  then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
  on the fact that all stopped states except the special stop states can
  tolerate spurious wake up and recover their state.

  The state of stopped and traced tasked is changed to be stored in
  task->jobctl as well as in task->__state. This makes it possible for
  the freezer to recover tasks in these special states, as well as
  serving as a general cleanup. With a little more work in that
  direction I believe TASK_STOPPED can learn to tolerate spurious wake
  ups and become an ordinary stop state.

  The TASK_TRACED state has to remain a special state as the registers
  for a process are only reliably available when the process is stopped
  in the scheduler. Fundamentally ptrace needs acess to the saved
  register values of a task.

  There are bunch of semi-random ptrace related cleanups that were found
  while looking at these issues.

  One cleanup that deserves to be called out is from commit 57b6de08b5
  ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
  makes a change that is technically user space visible, in the handling
  of what happens to a tracee when a tracer dies unexpectedly. According
  to our testing and our understanding of userspace nothing cares that
  spurious SIGTRAPs can be generated in that case"

* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  ptrace: Always take siglock in ptrace_resume
  ptrace: Don't change __state
  ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
  ptrace: Document that wait_task_inactive can't fail
  ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
  signal: Use lockdep_assert_held instead of assert_spin_locked
  ptrace: Remove arch_ptrace_attach
  ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
  ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
  signal: Replace __group_send_sig_info with send_signal_locked
  signal: Rename send_signal send_signal_locked
2022-06-03 16:13:25 -07:00
Linus Torvalds
1ec6574a3c Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kthread updates from Eric Biederman:
 "This updates init and user mode helper tasks to be ordinary user mode
  tasks.

  Commit 40966e316f ("kthread: Ensure struct kthread is present for
  all kthreads") caused init and the user mode helper threads that call
  kernel_execve to have struct kthread allocated for them. This struct
  kthread going away during execve in turned made a use after free of
  struct kthread possible.

  Here, commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
  init and umh") is enough to fix the use after free and is simple
  enough to be backportable.

  The rest of the changes pass struct kernel_clone_args to clean things
  up and cause the code to make sense.

  In making init and the user mode helpers tasks purely user mode tasks
  I ran into two complications. The function task_tick_numa was
  detecting tasks without an mm by testing for the presence of
  PF_KTHREAD. The initramfs code in populate_initrd_image was using
  flush_delayed_fput to ensuere the closing of all it's file descriptors
  was complete, and flush_delayed_fput does not work in a userspace
  thread.

  I have looked and looked and more complications and in my code review
  I have not found any, and neither has anyone else with the code
  sitting in linux-next"

* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched: Update task_tick_numa to ignore tasks without an mm
  fork: Stop allowing kthreads to call execve
  fork: Explicitly set PF_KTHREAD
  init: Deal with the init process being a user mode process
  fork: Generalize PF_IO_WORKER handling
  fork: Explicity test for idle tasks in copy_thread
  fork: Pass struct kernel_clone_args into copy_thread
  kthread: Don't allocate kthread_struct for init and umh
2022-06-03 16:03:05 -07:00
Linus Torvalds
21873bd66b Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
 "Most of issues addressed were introduced during this merging window.

   - Initialise jump labels before setup_machine_fdt(), needed by commit
     f5bda35fba ("random: use static branch for crng_ready()").

   - Sparse warnings: missing prototype, incorrect __user annotation.

   - Skip SVE kselftest if not sufficient vector lengths supported"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  kselftest/arm64: signal: Skip SVE signal test if not enough VLs supported
  arm64: Initialize jump labels before setup_machine_fdt()
  arm64: hibernate: Fix syntax errors in comments
  arm64: Remove the __user annotation for the restore_za_context() argument
  ftrace/fgraph: fix increased missing-prototypes warnings
2022-06-03 14:05:34 -07:00
Linus Torvalds
58f9d52ff6 Merge tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Current release - new code bugs:

   - af_packet: make sure to pull the MAC header, avoid skb panic in GSO

   - ptp_clockmatrix: fix inverted logic in is_single_shot()

   - netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag

   - dt-bindings: net: adin: fix adi,phy-output-clock description syntax

   - wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning

  Previous releases - regressions:

   - Revert "net: af_key: add check for pfkey_broadcast in function
     pfkey_process"

   - tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd

   - nf_tables: disallow non-stateful expression in sets earlier

   - nft_limit: clone packet limits' cost value

   - nf_tables: double hook unregistration in netns path

   - ping6: fix ping -6 with interface name

  Previous releases - always broken:

   - sched: fix memory barriers to prevent skbs from getting stuck in
     lockless qdiscs

   - neigh: set lower cap for neigh_managed_work rearming, avoid
     constantly scheduling the probe work

   - bpf: fix probe read error on big endian in ___bpf_prog_run()

   - amt: memory leak and error handling fixes

  Misc:

   - ipv6: expand & rename accept_unsolicited_na to accept_untracked_na"

* tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits)
  net/af_packet: make sure to pull mac header
  net: add debug info to __skb_pull()
  net: CONFIG_DEBUG_NET depends on CONFIG_NET
  stmmac: intel: Add RPL-P PCI ID
  net: stmmac: use dev_err_probe() for reporting mdio bus registration failure
  tipc: check attribute length for bearer name
  ice: fix access-beyond-end in the switch code
  nfp: remove padding in nfp_nfdk_tx_desc
  ax25: Fix ax25 session cleanup problems
  net: usb: qmi_wwan: Add support for Cinterion MV31 with new baseline
  sfc/siena: fix wrong tx channel offset with efx_separate_tx_channels
  sfc/siena: fix considering that all channels have TX queues
  socket: Don't use u8 type in uapi socket.h
  net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()
  net: ping6: Fix ping -6 with interface name
  macsec: fix UAF bug for real_dev
  octeontx2-af: fix error code in is_valid_offset()
  wifi: mac80211: fix use-after-free in chanctx code
  bonding: guard ns_targets by CONFIG_IPV6
  tcp: tcp_rtx_synack() can be called from process context
  ...
2022-06-02 12:50:16 -07:00
Saravana Kannan
73503963b7 module: Fix prefix for module.sig_enforce module param
Commit cfc1d27789 ("module: Move all into module/") changed the prefix
of the module param by moving/renaming files.  A later commit also moves
the module_param() into a different file, thereby changing the prefix
yet again.

This would break kernel cmdline compatibility and also userspace
compatibility at /sys/module/module/parameters/sig_enforce.

So, set the prefix back to "module.".

Fixes: cfc1d27789 ("module: Move all into module/")
Link: https://lore.kernel.org/lkml/20220602034111.4163292-1-saravanak@google.com/
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-02 12:44:33 -07:00
Dmitry Osipenko
587b9bfe06 kernel/reboot: Use static handler for register_platform_power_off()
The register_platform_power_off() fails on m68k platform due to the
memory allocation error that happens at a very early boot time when
memory allocator isn't available yet. Fix it by using a static sys-off
handler for the platform-level power-off handlers.

Fixes: f0f7e5265b ("m68k: Switch to new sys-off handler API")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-02 20:36:13 +02:00
Linus Torvalds
7c9e960c63 Merge tag 'livepatching-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching cleanup from Petr Mladek:

 - Remove duplicated livepatch code [Christophe]

* tag 'livepatching-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Remove klp_arch_set_pc() and asm/livepatch.h
2022-06-02 08:55:01 -07:00