linux/kernel
Heiko Carstens dbd87b5af0 nohz: Fix get_next_timer_interrupt() vs cpu hotplug
This fixes a bug as seen on 2.6.32 based kernels where timers got
enqueued on offline cpus.

If a cpu goes offline it might still have pending timers. These will
be migrated during CPU_DEAD handling after the cpu is offline.
However while the cpu is going offline it will schedule the idle task
which will then call tick_nohz_stop_sched_tick().

That function in turn will call get_next_timer_intterupt() to figure
out if the tick of the cpu can be stopped or not. If it turns out that
the next tick is just one jiffy off (delta_jiffies == 1)
tick_nohz_stop_sched_tick() incorrectly assumes that the tick should
not stop and takes an early exit and thus it won't update the load
balancer cpu.

Just afterwards the cpu will be killed and the load balancer cpu could
be the offline cpu.

On 2.6.32 based kernel get_nohz_load_balancer() gets called to decide
on which cpu a timer should be enqueued (see __mod_timer()). Which
leads to the possibility that timers get enqueued on an offline cpu.
These will never expire and can cause a system hang.

This has been observed 2.6.32 kernels. On current kernels
__mod_timer() uses get_nohz_timer_target() which doesn't have that
problem. However there might be other problems because of the too
early exit tick_nohz_stop_sched_tick() in case a cpu goes offline.

The easiest and probably safest fix seems to be to let
get_next_timer_interrupt() just lie and let it say there isn't any
pending timer if the current cpu is offline.

I also thought of moving migrate_[hr]timers() from CPU_DEAD to
CPU_DYING, but seeing that there already have been fixes at least in
the hrtimer code in this area I'm afraid that this could add new
subtle bugs.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08 20:15:07 +01:00
..
debug kdb: fix crash when KDB_BASE_CMD_MAX is exceeded 2010-11-17 13:54:57 -06:00
gcov llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
irq genirq: Fix incorrect proc spurious output 2010-12-01 08:44:26 +01:00
power PM / Hibernate: Fix memory corruption related to swap 2010-12-06 23:52:08 +01:00
time ntp: Clamp PLL update interval 2010-09-09 20:48:37 +02:00
trace Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-11-27 07:28:17 +09:00
.gitignore
acct.c pass a struct path to vfs_statfs 2010-08-09 16:48:42 -04:00
async.c async: use workqueue for worker pool 2010-07-14 11:29:46 +02:00
audit_tree.c in untag_chunk() we need to do alloc_chunk() a bit earlier 2010-10-30 02:18:32 -04:00
audit_watch.c audit: make functions static 2010-10-30 01:42:19 -04:00
audit.c audit: Use rcu for task lookup protection 2010-10-30 08:45:42 -04:00
audit.h audit: make functions static 2010-10-30 01:42:19 -04:00
auditfilter.c Audit: add support to match lsm labels on user audit messages 2010-10-30 01:41:57 -04:00
auditsc.c audit mmap 2010-10-30 08:45:43 -04:00
backtracetest.c
bounds.c
capability.c sched: Remove remaining USER_SCHED code 2010-04-02 20:12:00 +02:00
cgroup_freezer.c cgroup_freezer: update_freezer_state() does incorrect state transitions 2010-10-27 18:03:08 -07:00
cgroup.c convert cgroup and cpuset 2010-10-29 04:17:06 -04:00
compat.c compat: Make compat_alloc_user_space() incorporate the access_ok() 2010-09-14 16:08:45 -07:00
configs.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
cpu.c sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining 2010-06-08 21:40:36 +02:00
cpuset.c convert cgroup and cpuset 2010-10-29 04:17:06 -04:00
cred.c signals: move cred_guard_mutex from task_struct to signal_struct 2010-10-27 18:03:12 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c sys_personality: remove the bogus checks in sys_personality()->__set_personality() path 2010-08-09 20:45:05 -07:00
exit.c do_exit(): make sure that we run with get_fs() == USER_DS 2010-12-02 14:51:16 -08:00
extable.c
fork.c Sched: fix skip_clock_update optimization 2010-12-08 20:15:06 +01:00
freezer.c
futex_compat.c futex: Address compiler warnings in exit_robust_list 2010-11-10 13:27:50 +01:00
futex.c futex: Address compiler warnings in exit_robust_list 2010-11-10 13:27:50 +01:00
groups.c kernel/groups.c: fix integer overflow in groups_search 2010-09-09 18:57:24 -07:00
hrtimer.c hrtimer: Preserve timer state in remove_hrtimer() 2010-10-14 13:29:59 +02:00
hung_task.c lockup detector: Fix grammar by adding a missing "to" in the comments 2010-08-17 09:11:52 +02:00
hw_breakpoint.c perf,hw_breakpoint: Initialize hardware api earlier 2010-11-12 14:51:55 +01:00
irq_work.c irq_work: Drop cmpxchg() result 2010-11-18 13:18:47 +01:00
itimer.c
jump_label.c jump label: Make arch_jump_label_text_poke_early() optional 2010-10-29 12:56:13 -04:00
kallsyms.c Revert "kernel: make /proc/kallsyms mode 400 to reduce ease of attacking" 2010-11-19 11:54:40 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kexec.c use clear_page()/copy_page() in favor of memset()/memcpy() on whole pages 2010-10-26 16:52:13 -07:00
kfifo.c kfifo: fix scatterlist usage 2010-10-01 10:50:58 -07:00
kmod.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
kprobes.c jump label: Fix error with preempt disable holding mutex 2010-10-29 12:55:55 -04:00
ksysfs.c sysfs: add struct file* to bin_attr callbacks 2010-05-21 09:37:31 -07:00
kthread.c kthread: implement kthread_data() 2010-06-29 10:07:09 +02:00
latencytop.c latencytop: fix per task accumulator 2010-11-12 07:55:31 -08:00
lockdep_internals.h lockdep: No need to disable preemption in debug atomic ops 2010-05-04 05:38:16 +02:00
lockdep_proc.c lockstat: Make lockstat counting per cpu 2010-04-06 00:15:37 +02:00
lockdep_states.h
lockdep.c lockdep: Check the depth of subclass 2010-10-18 18:44:26 +02:00
Makefile Merge branch 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-10-21 18:52:11 -07:00
module.c tracing: Fix module use of trace_bprintk() 2010-11-10 22:19:24 -05:00
mutex-debug.c
mutex-debug.h
mutex.c mutex: Fix annotations to include it in kernel-locking docbook 2010-09-03 08:19:51 +02:00
mutex.h
notifier.c
ns_cgroup.c cgroup: notify ns_cgroup deprecated 2010-10-27 18:03:09 -07:00
nsproxy.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
padata.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2010-08-04 15:23:14 -07:00
panic.c lib/bug.c: add oops end marker to WARN implementation 2010-08-11 08:59:22 -07:00
params.c param: locking for kernel parameters 2010-08-11 23:04:20 +09:30
perf_event.c perf: Fix the software context switch counter 2010-11-26 15:00:59 +01:00
pid_namespace.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
pid.c Add RCU check for find_task_by_vpid(). 2010-08-19 17:18:02 -07:00
pm_qos_params.c PM / PM QoS: Fix reversed min and max 2010-11-15 22:45:22 +01:00
posix-cpu-timers.c posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call 2010-11-10 13:07:06 +01:00
posix-timers.c posix_timer: Move copy_to_user(created_timer_id) down in timer_create() 2010-07-23 15:08:12 +02:00
printk.c nohz: Fix printk_needs_cpu() return value on offline cpus 2010-11-26 15:03:12 +01:00
profile.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
ptrace.c signals: move cred_guard_mutex from task_struct to signal_struct 2010-10-27 18:03:12 -07:00
range.c kernel/range.c: fix clean_sort_range() for the case of full array 2010-11-12 07:55:31 -08:00
rcupdate.c Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu 2010-10-07 09:43:11 +02:00
rcutiny_plugin.h rcu: performance fixes to TINY_PREEMPT_RCU callback checking 2010-08-27 10:51:17 -07:00
rcutiny.c rcu: Add a TINY_PREEMPT_RCU 2010-08-20 08:55:00 -07:00
rcutorture.c rcu: fix sparse errors in rcutorture.c 2010-09-23 09:16:42 -07:00
rcutree_plugin.h rcu: fix _oddness handling of verbose stall warnings 2010-09-02 16:15:30 -07:00
rcutree_trace.c rcu: Add tracing data to support queueing models 2010-09-23 09:16:53 -07:00
rcutree.c rcu: using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value 2010-10-07 10:41:06 -07:00
rcutree.h rcu: Add tracing data to support queueing models 2010-09-23 09:16:53 -07:00
relay.c Clean up relay_alloc_page_array() slightly by using vzalloc rather than vmalloc and memset 2010-11-05 08:21:34 -07:00
res_counter.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
resource.c Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 2010-10-28 11:59:52 -07:00
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c rtmutex-tester: make it build without BKL 2010-10-19 11:29:56 +02:00
rtmutex.c
rtmutex.h
rwsem.c
sched_clock.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
sched_cpupri.c sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_cpupri.h sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_debug.c sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug 2010-07-21 21:46:12 +02:00
sched_fair.c sched: Fix idle balancing 2010-11-18 13:12:33 +01:00
sched_features.h sched: Remove irq time from available CPU power 2010-10-18 20:52:27 +02:00
sched_idletask.c sched: Cure load average vs NO_HZ woes 2010-04-23 11:02:02 +02:00
sched_rt.c sched: Do not account irq time to current task 2010-10-18 20:52:26 +02:00
sched_stats.h sched_stat: Update sched_info_queue/dequeue() code comments 2010-10-24 13:29:01 +02:00
sched_stoptask.c sched: Fix cross-sched-class wakeup preemption 2010-11-11 14:37:23 +01:00
sched.c Sched: fix skip_clock_update optimization 2010-12-08 20:15:06 +01:00
seccomp.c
semaphore.c
signal.c signals: annotate lock context change on ptrace_stop() 2010-10-27 18:03:12 -07:00
smp.c Typedef SMP call function pointer 2010-10-27 17:28:36 +01:00
softirq.c Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-10-27 18:48:00 -07:00
spinlock.c
srcu.c kernel: Remove undead ifdef CONFIG_DEBUG_LOCK_ALLOC 2010-09-23 09:14:51 -07:00
stacktrace.c
stop_machine.c stop_machine: convert cpu notifier to return encapsulate errno value 2010-10-26 16:52:15 -07:00
sys_ni.c powerpc: define a compat_sys_recv cond_syscall 2010-09-23 17:03:55 +10:00
sys.c pid: make setpgid() system call use RCU read-side critical section 2010-08-31 17:00:18 -07:00
sysctl_binary.c sysctl: don't use own implementation of hex_to_bin() 2010-05-25 08:07:05 -07:00
sysctl_check.c sysctl: min/max bounds are optional 2010-10-15 14:42:24 -07:00
sysctl.c kernel/sysctl.c: Fix build failure with !CONFIG_PRINTK 2010-11-16 07:56:09 -08:00
taskstats.c taskstats: split fill_pid function 2010-10-27 18:03:17 -07:00
test_kprobes.c kprobes: Fix selftest to clear flags field for reusing probes 2010-10-14 08:55:27 +02:00
time.c time: Kill off CONFIG_GENERIC_TIME 2010-07-27 12:40:54 +02:00
timeconst.pl
timer.c nohz: Fix get_next_timer_interrupt() vs cpu hotplug 2010-12-08 20:15:07 +01:00
tracepoint.c jump_label: Use more consistent naming 2010-10-18 19:58:56 +02:00
tsacct.c taskstats: use real microsecond granularity for CPU times 2010-10-27 18:03:17 -07:00
uid16.c
up.c
user_namespace.c user_ns: Introduce user_nsmap_uid and user_ns_map_gid. 2010-06-16 14:55:34 -07:00
user-return-notifier.c
user.c kernel/user.c: add lock release annotation on free_user() 2010-10-26 16:52:15 -07:00
utsname_sysctl.c
utsname.c
wait.c docbook: add more wait/wake/completion to device-drivers docbook 2010-10-26 17:32:41 -07:00
watchdog.c watchdog: Fix section mismatch and potential undefined behavior. 2010-11-05 17:45:35 -07:00
workqueue_sched.h workqueue: implement concurrency managed dynamic worker pool 2010-06-29 10:07:14 +02:00
workqueue.c workqueues: s/ON_STACK/ONSTACK/ 2010-10-26 16:52:14 -07:00