linux/kernel
Srivatsa S. Bhat 8d056c48e4 CPU hotplug, smp: flush any pending IPI callbacks before CPU offline
There is a race between the CPU offline code (within stop-machine) and
the smp-call-function code, which can lead to getting IPIs on the
outgoing CPU, *after* it has gone offline.

Specifically, this can happen when using
smp_call_function_single_async() to send the IPI, since this API allows
sending asynchronous IPIs from IRQ disabled contexts.  The exact race
condition is described below.

During CPU offline, in stop-machine, we don't enforce any rule in the
_DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
the other CPUs disable their local interrupts.  Due to this, we can
encounter a situation in which an IPI is sent by one of the other CPUs
to the outgoing CPU (while it is *still* online), but the outgoing CPU
ends up noticing it only *after* it has gone offline.

              CPU 1                                         CPU 2
          (Online CPU)                               (CPU going offline)

       Enter _PREPARE stage                          Enter _PREPARE stage

                                                     Enter _DISABLE_IRQ stage

                                                   =
       Got a device interrupt, and                 | Didn't notice the IPI
       the interrupt handler sent an               | since interrupts were
       IPI to CPU 2 using                          | disabled on this CPU.
       smp_call_function_single_async()            |
                                                   =

       Enter _DISABLE_IRQ stage

       Enter _RUN stage                              Enter _RUN stage

                                  =
       Busy loop with interrupts  |                  Invoke take_cpu_down()
       disabled.                  |                  and take CPU 2 offline
                                  =

       Enter _EXIT stage                             Enter _EXIT stage

       Re-enable interrupts                          Re-enable interrupts

                                                     The pending IPI is noted
                                                     immediately, but alas,
                                                     the CPU is offline at
                                                     this point.

This of course, makes the smp-call-function IPI handler code running on
CPU 2 unhappy and it complains about "receiving an IPI on an offline
CPU".

One real example of the scenario on CPU 1 is the block layer's
complete-request call-path:

	__blk_complete_request() [interrupt-handler]
	    raise_blk_irq()
	        smp_call_function_single_async()

However, if we look closely, the block layer does check that the target
CPU is online before firing the IPI.  So in this case, it is actually
the unfortunate ordering/timing of events in the stop-machine phase that
leads to receiving IPIs after the target CPU has gone offline.

In reality, getting a late IPI on an offline CPU is not too bad by
itself (this can happen even due to hardware latencies in IPI
send-receive).  It is a bug only if the target CPU really went offline
without executing all the callbacks queued on its list.  (Note that a
CPU is free to execute its pending smp-call-function callbacks in a
batch, without waiting for the corresponding IPIs to arrive for each one
of those callbacks).

So, fixing this issue can be broken up into two parts:

1. Ensure that a CPU goes offline only after executing all the
   callbacks queued on it.

2. Modify the warning condition in the smp-call-function IPI handler
   code such that it warns only if an offline CPU got an IPI *and* that
   CPU had gone offline with callbacks still pending in its queue.

Achieving part 1 is straight-forward - just flush (execute) all the
queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
including those callbacks for which the source CPU's IPIs might not have
been received on the outgoing CPU yet.  Once we do this, an IPI that
arrives late on the CPU going offline (either due to the race mentioned
above, or due to hardware latencies) will be completely harmless, since
the outgoing CPU would have executed all the queued callbacks before
going offline.

Overall, this fix (parts 1 and 2 put together) additionally guarantees
that we will see a warning only when the *IPI-sender code* is buggy -
that is, if it queues the callback _after_ the target CPU has gone
offline.

[1].  The CPU_DYING part needs a little more explanation: by the time we
execute the CPU_DYING notifier callbacks, the CPU would have already
been marked offline.  But we want to flush out the pending callbacks at
this stage, ignoring the fact that the CPU is offline.  So restructure
the IPI handler code so that we can by-pass the "is-cpu-offline?" check
in this particular case.  (Of course, the right solution here is to fix
CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
notifiers, but this requires a lot of audit to ensure that this change
doesn't break any existing code; hence lets go with the solution
proposed above until that is done).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:43 -07:00
..
debug kernel/printk: use symbolic defines for console loglevels 2014-06-04 16:54:17 -07:00
events Merge branch 'perf/core' into perf/urgent, to pick up the latest fixes 2014-06-14 14:10:08 +02:00
gcov gcov: add support for GCC 4.9 2014-06-10 15:34:46 -07:00
irq genirq: Improve documentation to match current implementation 2014-05-27 10:16:44 +02:00
locking Merge branch 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-06-21 07:06:02 -10:00
power x86, kaslr: boot-time selectable with hibernation 2014-06-16 23:30:44 +02:00
printk kernel/printk: use symbolic defines for console loglevels 2014-06-04 16:54:17 -07:00
rcu Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-03 12:57:53 -07:00
sched Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-06-12 19:42:15 -07:00
time Merge branch 'akpm' (patchbomb from Andrew) into next 2014-06-04 16:55:13 -07:00
trace One bug fix that goes back to 3.10. Accessing a non existent buffer 2014-06-12 21:07:25 -07:00
.gitignore Ignore generated file kernel/x509_certificate_list 2013-12-10 18:21:34 +00:00
acct.c ipc, kernel: clear whitespace 2014-06-06 16:08:14 -07:00
async.c
audit_tree.c inotify: Fix reporting of cookies for inotify events 2014-02-18 11:17:17 +01:00
audit_watch.c inotify: Fix reporting of cookies for inotify events 2014-02-18 11:17:17 +01:00
audit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-06-12 14:27:40 -07:00
audit.h audit: Use struct net not pid_t to remember the network namespce to reply in 2014-03-20 10:10:53 -04:00
auditfilter.c Merge git://git.infradead.org/users/eparis/audit 2014-04-12 12:38:53 -07:00
auditsc.c auditsc: audit_krule mask accesses need bounds checking 2014-06-10 08:44:40 -07:00
backtracetest.c kernel/backtracetest.c: replace no level printk by pr_info() 2014-06-04 16:54:14 -07:00
bounds.c mm: do not allocate page->ptl dynamically, if spinlock_t fits to long 2013-12-20 12:25:45 -08:00
capability.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
cgroup_freezer.c cgroup: remove css_parent() 2014-05-16 13:22:48 -04:00
cgroup.c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-06-09 15:03:33 -07:00
compat.c kernel/compat.c: use sizeof() instead of sizeof 2014-06-04 16:54:19 -07:00
configs.c
context_tracking.c x86/kprobes: Fix build errors and blacklist context_track_user 2014-06-14 09:07:44 +02:00
cpu_pm.c
cpu.c More ACPI and power management updates for 3.16-rc1 2014-06-12 13:14:19 -07:00
cpuset.c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-06-09 15:03:33 -07:00
crash_dump.c
cred.c
delayacct.c kernel/delayacct.c: remove redundant checking in __delayacct_add_tsk() 2013-11-13 12:09:12 +09:00
dma.c
elfcore.c switch elf_core_write_extra_phdrs() to dump_emit() 2013-11-09 00:16:23 -05:00
exec_domain.c kernel/exec_domain.c: code clean-up 2014-06-04 16:54:15 -07:00
exit.c signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch] 2014-06-06 16:08:11 -07:00
extable.c asmlinkage: Make main_extable_sort_needed visible 2014-02-13 18:13:22 -08:00
fork.c ptrace: fix fork event messages across pid namespaces 2014-06-06 16:08:11 -07:00
freezer.c libata, freezer: avoid block device removal while system is frozen 2013-12-19 13:50:32 -05:00
futex_compat.c compat: Get rid of (get|put)_compat_time(val|spec) 2014-02-02 14:09:12 -08:00
futex.c Merge branch 'next' (accumulated 3.16 merge window patches) into master 2014-06-08 11:31:16 -07:00
groups.c kernel/groups.c: remove return value of set_groups 2014-04-03 16:21:05 -07:00
hrtimer.c Merge branch 'perf/urgent' into perf/core, to resolve conflict and to prepare for new patches 2014-06-06 07:55:06 +02:00
hung_task.c kernel/hung_task.c: convert simple_strtoul to kstrtouint 2014-06-04 16:54:15 -07:00
irq_work.c perf/x86: Warn to early_printk() in case irq_work is too slow 2014-02-21 21:49:07 +01:00
itimer.c
jump_label.c static_key: WARN on usage before jump_label_init was called 2013-10-19 19:45:35 -04:00
kallsyms.c kernel: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:36:11 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS 2013-11-15 09:32:22 +09:00
Kconfig.locks locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks 2014-06-06 07:58:28 +02:00
Kconfig.preempt
kexec.c kernel/kexec.c: convert printk to pr_foo() 2014-06-06 16:08:12 -07:00
kmod.c signals: change wait_for_helper() to use kernel_sigaction() 2014-06-06 16:08:12 -07:00
kprobes.c kprobes: Show blacklist entries via debugfs 2014-04-24 10:26:41 +02:00
ksysfs.c kobject: Make support for uevent_helper optional. 2014-04-25 12:00:49 -07:00
kthread.c kthread: fix return value of kthread_create() upon SIGKILL. 2014-06-04 16:53:51 -07:00
latencytop.c kernel/latencytop.c: convert seq_printf to seq_puts 2014-06-04 16:54:15 -07:00
Makefile Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-03-31 14:13:25 -07:00
module_signing.c keys: change asymmetric keys to use common hash definitions 2013-10-25 17:15:18 -04:00
module-internal.h KEYS: Separate the kernel signature checking keyring from module signing 2013-09-25 17:17:01 +01:00
module.c Most of this is cleaning up various driver sysfs permissions so we can 2014-06-11 16:09:14 -07:00
notifier.c kprobes, notifier: Use NOKPROBE_SYMBOL macro in notifier 2014-04-24 10:26:39 +02:00
nsproxy.c
padata.c padata: Fix wrong usage of rcu_dereference() 2013-12-05 21:28:42 +08:00
panic.c kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers 2014-06-06 16:08:12 -07:00
params.c param: hand arguments after -- straight to init 2014-04-28 11:48:34 +09:30
pid_namespace.c pid_namespace: pidns_get() should check task_active_pid_ns() != NULL 2014-04-02 16:20:21 -07:00
pid.c pidns: fix free_pid() to handle the first fork failure 2013-09-30 14:31:03 -07:00
posix-cpu-timers.c posix-timers: Convert abuses of BUG_ON to WARN_ON 2013-12-09 16:56:29 +01:00
posix-timers.c
profile.c kernel/profile.c: use static const char instead of static char 2014-06-06 16:08:13 -07:00
ptrace.c kernel/compat: convert to COMPAT_SYSCALL_DEFINE 2014-03-06 15:35:10 +01:00
range.c
reboot.c kernel/reboot.c: convert simple_strtoul to kstrtoint 2014-06-04 16:54:15 -07:00
relay.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
res_counter.c kernel/res_counter.c: replace simple_strtoull by kstrtoull 2014-06-04 16:54:15 -07:00
resource.c resources: Clarify sanity check message 2014-05-23 10:47:21 -06:00
seccomp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-06-12 14:27:40 -07:00
signal.c signals: introduce kernel_sigaction() 2014-06-06 16:08:12 -07:00
smp.c CPU hotplug, smp: flush any pending IPI callbacks before CPU offline 2014-06-23 16:47:43 -07:00
smpboot.c
smpboot.h
softirq.c Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu 2014-05-22 11:36:10 +02:00
stacktrace.c
stop_machine.c kernel/stop_machine.c: kernel-doc warning fix 2014-06-04 16:54:15 -07:00
sys_ni.c sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL 2014-06-04 16:54:14 -07:00
sys.c sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice() 2014-05-22 11:16:36 +02:00
sysctl_binary.c kernel/sysctl_binary.c: use scnprintf() instead of snprintf() 2013-11-13 12:09:33 +09:00
sysctl.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next 2014-06-19 07:50:07 -10:00
system_certificates.S KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
system_keyring.c KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
task_work.c
taskstats.c genetlink: only pass array to genl_register_family_with_ops() 2013-11-19 16:39:05 -05:00
test_kprobes.c
time.c
timeconst.bc
timer.c timer: Prevent overflow in apply_slack 2014-04-30 13:46:17 +02:00
torture.c torture: Remove __init from torture_init_begin/end 2014-05-14 09:46:30 -07:00
tracepoint.c kernel/tracepoint.c: kernel-doc fixes 2014-06-04 16:54:15 -07:00
tsacct.c
uid16.c
up.c smp: Rename __smp_call_function_single() to smp_call_function_single_async() 2014-02-24 14:47:15 -08:00
user_namespace.c kernel/user_namespace.c: kernel-doc/checkpatch fixes 2014-06-06 16:08:13 -07:00
user-return-notifier.c
user.c kernel/user.c: drop unused field 'files' from user_struct 2014-06-04 16:54:16 -07:00
utsname_sysctl.c sysctl: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
utsname.c
watchdog.c kernel/watchdog.c:touch_softlockup_watchdog(): use raw_cpu_write() 2014-04-18 16:40:08 -07:00
workqueue_internal.h workqueue: rename manager_mutex to attach_mutex 2014-05-20 10:59:32 -04:00
workqueue.c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2014-06-09 14:56:49 -07:00