linux

History

Srivatsa S. Bhat 8d056c48e4 CPU hotplug, smp: flush any pending IPI callbacks before CPU offline	2014-06-23 16:47:43 -07:00

There is a race between the CPU offline code (within stop-machine) and the smp-call-function code, which can lead to getting IPIs on the outgoing CPU, after it has gone offline.

Specifically, this can happen when using smp_call_function_single_async() to send the IPI, since this API allows sending asynchronous IPIs from IRQ disabled contexts. The exact race condition is described below.

During CPU offline, in stop-machine, we don't enforce any rule in the _DISABLE_IRQ stage, regarding the order in which the outgoing CPU and the other CPUs disable their local interrupts. Due to this, we can encounter a situation in which an IPI is sent by one of the other CPUs to the outgoing CPU (while it is still online), but the outgoing CPU ends up noticing it only after it has gone offline.

         CPU 1                                      CPU 2
     (Online CPU)                            (CPU going offline)

    Enter _PREPARE stage                     Enter _PREPARE stage

                                             Enter _DISABLE_IRQ stage

                                           =
    Got a device interrupt, and            | Didn't notice the IPI
    the interrupt handler sent an          | since interrupts were
    IPI to CPU 2 using                     | disabled on this CPU.
    smp_call_function_single_async()       |
                                           =

    Enter _DISABLE_IRQ stage

    Enter _RUN stage                         Enter _RUN stage

                                           =
    Busy loop with interrupts              | Invoke take_cpu_down()
    disabled.                              | and take CPU 2 offline
                                           =

    Enter _EXIT stage                        Enter _EXIT stage

    Re-enable interrupts                     Re-enable interrupts

                                             The pending IPI is noted
                                             immediately, but alas,
                                             the CPU is offline at
                                             this point.

This of course, makes the smp-call-function IPI handler code running on CPU 2 unhappy and it complains about "receiving an IPI on an offline CPU".

One real example of the scenario on CPU 1 is the block layer's complete-request call-path:

    __blk_complete_request()  [interrupt-handler]
        raise_blk_irq()
            smp_call_function_single_async()

However, if we look closely, the block layer does check that the target CPU is online before firing the IPI. So in this case, it is actually the unfortunate ordering/timing of events in the stop-machine phase that leads to receiving IPIs after the target CPU has gone offline.

In reality, getting a late IPI on an offline CPU is not too bad by itself (this can happen even due to hardware latencies in IPI send-receive). It is a bug only if the target CPU really went offline without executing all the callbacks queued on its list. (Note that a CPU is free to execute its pending smp-call-function callbacks in a batch, without waiting for the corresponding IPIs to arrive for each one of those callbacks).

So, fixing this issue can be broken up into two parts:

1. Ensure that a CPU goes offline only after executing all the callbacks queued on it.

2. Modify the warning condition in the smp-call-function IPI handler code such that it warns only if an offline CPU got an IPI and that CPU had gone offline with callbacks still pending in its queue.

Achieving part 1 is straight-forward - just flush (execute) all the queued callbacks on the outgoing CPU in the CPU_DYING stage[1], including those callbacks for which the source CPU's IPIs might not have been received on the outgoing CPU yet. Once we do this, an IPI that arrives late on the CPU going offline (either due to the race mentioned above, or due to hardware latencies) will be completely harmless, since the outgoing CPU would have executed all the queued callbacks before going offline.

Overall, this fix (parts 1 and 2 put together) additionally guarantees that we will see a warning only when the IPI-sender code is buggy - that is, if it queues the callback _after_ the target CPU has gone offline.

[1]. The CPU_DYING part needs a little more explanation: by the time we execute the CPU_DYING notifier callbacks, the CPU would have already been marked offline. But we want to flush out the pending callbacks at this stage, ignoring the fact that the CPU is offline. So restructure the IPI handler code so that we can by-pass the "is-cpu-offline?" check in this particular case. (Of course, the right solution here is to fix CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING notifiers, but this requires a lot of audit to ensure that this change doesn't break any existing code; hence lets go with the solution proposed above until that is done).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
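
The two-part fix described in the commit message above can be modeled in a few lines of userspace C. This is only a toy sketch under stated assumptions, not the code in kernel/smp.c: every name here (toy_cpu, toy_queue_callback, toy_flush_queue, toy_ipi_handler, toy_take_cpu_down, toy_count) is hypothetical, the per-CPU queue is a fixed-size array instead of a lockless list, and the warning is a counter rather than a kernel WARN. It shows a flush routine with a flag that lets the CPU_DYING-like path bypass the "is-cpu-offline?" check, while the IPI handler path warns only when an offline CPU still has callbacks pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS     4   /* hypothetical toy size */
#define TOY_MAX_PENDING 8   /* hypothetical queue depth */

typedef void (*toy_func_t)(void *info);

/* Toy stand-in for per-CPU state; not a kernel structure. */
struct toy_cpu {
	bool online;
	toy_func_t func[TOY_MAX_PENDING];
	void *info[TOY_MAX_PENDING];
	size_t pending;         /* callbacks queued but not yet run */
	unsigned int warnings;  /* counts "IPI on offline CPU" warnings */
};

static struct toy_cpu toy_cpus[TOY_NR_CPUS];

/* Sender side: queue a callback on the target CPU. */
static bool toy_queue_callback(int cpu, toy_func_t func, void *info)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	struct toy_cpu *c = &toy_cpus[cpu];

	if (c->pending == TOY_MAX_PENDING)
		return false;
	c->func[c->pending] = func;
	c->info[c->pending] = info;
	c->pending++;
	return true;
}

/*
 * Run every queued callback and return how many ran.
 * Part 2 of the fix: warn only if the caller asked for the check,
 * the CPU is offline, and callbacks are still pending.
 */
static size_t toy_flush_queue(int cpu, bool warn_cpu_offline)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	struct toy_cpu *c = &toy_cpus[cpu];
	size_t ran = c->pending;

	if (warn_cpu_offline && !c->online && c->pending > 0)
		c->warnings++;

	for (size_t i = 0; i < ran; i++)
		c->func[i](c->info[i]);
	c->pending = 0;
	return ran;
}

/* IPI arrival path: always performs the offline check. */
static size_t toy_ipi_handler(int cpu)
{
	return toy_flush_queue(cpu, true);
}

/*
 * Part 1 of the fix: the CPU is already marked offline when the
 * CPU_DYING-like step runs, so flush with the check bypassed.
 */
static void toy_take_cpu_down(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_cpus[cpu].online = false;
	toy_flush_queue(cpu, false);
}

/* Example callback: increments the int that info points to. */
static void toy_count(void *info)
{
	++*(int *)info;
}
```

In this model, a callback queued while the CPU is online gets run by toy_take_cpu_down(), so a late toy_ipi_handler() call finds an empty queue and stays silent. Only a callback queued after the CPU went offline trips the warning counter, which matches the commit's claim that a warning then points at a buggy sender.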
..
debug	kernel/printk: use symbolic defines for console loglevels	2014-06-04 16:54:17 -07:00
events	Merge branch 'perf/core' into perf/urgent, to pick up the latest fixes	2014-06-14 14:10:08 +02:00
gcov	gcov: add support for GCC 4.9	2014-06-10 15:34:46 -07:00
irq	genirq: Improve documentation to match current implementation	2014-05-27 10:16:44 +02:00
locking	Merge branch 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-06-21 07:06:02 -10:00
power	x86, kaslr: boot-time selectable with hibernation	2014-06-16 23:30:44 +02:00
printk	kernel/printk: use symbolic defines for console loglevels	2014-06-04 16:54:17 -07:00
rcu	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next	2014-06-03 12:57:53 -07:00
sched	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-06-12 19:42:15 -07:00
time	Merge branch 'akpm' (patchbomb from Andrew) into next	2014-06-04 16:55:13 -07:00
trace	One bug fix that goes back to 3.10. Accessing a non existent buffer	2014-06-12 21:07:25 -07:00
.gitignore
acct.c	ipc, kernel: clear whitespace	2014-06-06 16:08:14 -07:00
async.c
audit_tree.c
audit_watch.c
audit.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next	2014-06-12 14:27:40 -07:00
audit.h
auditfilter.c	Merge git://git.infradead.org/users/eparis/audit	2014-04-12 12:38:53 -07:00
auditsc.c	auditsc: audit_krule mask accesses need bounds checking	2014-06-10 08:44:40 -07:00
backtracetest.c	kernel/backtracetest.c: replace no level printk by pr_info()	2014-06-04 16:54:14 -07:00
bounds.c
capability.c	fs,userns: Change inode_capable to capable_wrt_inode_uidgid	2014-06-10 13:57:22 -07:00
cgroup_freezer.c	cgroup: remove css_parent()	2014-05-16 13:22:48 -04:00
cgroup.c	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2014-06-09 15:03:33 -07:00
compat.c	kernel/compat.c: use sizeof() instead of sizeof	2014-06-04 16:54:19 -07:00
configs.c
context_tracking.c	x86/kprobes: Fix build errors and blacklist context_track_user	2014-06-14 09:07:44 +02:00
cpu_pm.c
cpu.c	More ACPI and power management updates for 3.16-rc1	2014-06-12 13:14:19 -07:00
cpuset.c	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2014-06-09 15:03:33 -07:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c	kernel/exec_domain.c: code clean-up	2014-06-04 16:54:15 -07:00
exit.c	signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch]	2014-06-06 16:08:11 -07:00
extable.c
fork.c	ptrace: fix fork event messages across pid namespaces	2014-06-06 16:08:11 -07:00
freezer.c
futex_compat.c
futex.c	Merge branch 'next' (accumulated 3.16 merge window patches) into master	2014-06-08 11:31:16 -07:00
groups.c
hrtimer.c	Merge branch 'perf/urgent' into perf/core, to resolve conflict and to prepare for new patches	2014-06-06 07:55:06 +02:00
hung_task.c	kernel/hung_task.c: convert simple_strtoul to kstrtouint	2014-06-04 16:54:15 -07:00
irq_work.c
itimer.c
jump_label.c
kallsyms.c	kernel: use macros from compiler.h instead of __attribute__((...))	2014-04-07 16:36:11 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks	locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks	2014-06-06 07:58:28 +02:00
Kconfig.preempt
kexec.c	kernel/kexec.c: convert printk to pr_foo()	2014-06-06 16:08:12 -07:00
kmod.c	signals: change wait_for_helper() to use kernel_sigaction()	2014-06-06 16:08:12 -07:00
kprobes.c	kprobes: Show blacklist entries via debugfs	2014-04-24 10:26:41 +02:00
ksysfs.c	kobject: Make support for uevent_helper optional.	2014-04-25 12:00:49 -07:00
kthread.c	kthread: fix return value of kthread_create() upon SIGKILL.	2014-06-04 16:53:51 -07:00
latencytop.c	kernel/latencytop.c: convert seq_printf to seq_puts	2014-06-04 16:54:15 -07:00
Makefile
module_signing.c
module-internal.h
module.c	Most of this is cleaning up various driver sysfs permissions so we can	2014-06-11 16:09:14 -07:00
notifier.c	kprobes, notifier: Use NOKPROBE_SYMBOL macro in notifier	2014-04-24 10:26:39 +02:00
nsproxy.c
padata.c
panic.c	kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers	2014-06-06 16:08:12 -07:00
params.c	param: hand arguments after -- straight to init	2014-04-28 11:48:34 +09:30
pid_namespace.c
pid.c
posix-cpu-timers.c
posix-timers.c
profile.c	kernel/profile.c: use static const char instead of static char	2014-06-06 16:08:13 -07:00
ptrace.c
range.c
reboot.c	kernel/reboot.c: convert simple_strtoul to kstrtoint	2014-06-04 16:54:15 -07:00
relay.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-04-12 14:49:50 -07:00
res_counter.c	kernel/res_counter.c: replace simple_strtoull by kstrtoull	2014-06-04 16:54:15 -07:00
resource.c	resources: Clarify sanity check message	2014-05-23 10:47:21 -06:00
seccomp.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next	2014-06-12 14:27:40 -07:00
signal.c	signals: introduce kernel_sigaction()	2014-06-06 16:08:12 -07:00
smp.c	CPU hotplug, smp: flush any pending IPI callbacks before CPU offline	2014-06-23 16:47:43 -07:00
smpboot.c
smpboot.h
softirq.c	Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu	2014-05-22 11:36:10 +02:00
stacktrace.c
stop_machine.c	kernel/stop_machine.c: kernel-doc warning fix	2014-06-04 16:54:15 -07:00
sys_ni.c	sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL	2014-06-04 16:54:14 -07:00
sys.c	sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()	2014-05-22 11:16:36 +02:00
sysctl_binary.c
sysctl.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next	2014-06-19 07:50:07 -10:00
system_certificates.S
system_keyring.c
task_work.c
taskstats.c
test_kprobes.c
time.c
timeconst.bc
timer.c	timer: Prevent overflow in apply_slack	2014-04-30 13:46:17 +02:00
torture.c	torture: Remove __init from torture_init_begin/end	2014-05-14 09:46:30 -07:00
tracepoint.c	kernel/tracepoint.c: kernel-doc fixes	2014-06-04 16:54:15 -07:00
tsacct.c
uid16.c
up.c
user_namespace.c	kernel/user_namespace.c: kernel-doc/checkpatch fixes	2014-06-06 16:08:13 -07:00
user-return-notifier.c
user.c	kernel/user.c: drop unused field 'files' from user_struct	2014-06-04 16:54:16 -07:00
utsname_sysctl.c	sysctl: convert use of typedef ctl_table to struct ctl_table	2014-06-06 16:08:16 -07:00
utsname.c
watchdog.c	kernel/watchdog.c:touch_softlockup_watchdog(): use raw_cpu_write()	2014-04-18 16:40:08 -07:00
workqueue_internal.h	workqueue: rename manager_mutex to attach_mutex	2014-05-20 10:59:32 -04:00
workqueue.c	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2014-06-09 14:56:49 -07:00