linux/kernel
Tejun Heo 29187a9eea workqueue: fix subtle pool management issue which can stall whole worker_pool
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role before it proceeds to
execute work items: it creates more workers and summons the rescuers
if worker creation doesn't succeed in a timely manner.

This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value.  This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items && !nr_running && !nr_idle
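
As a rough illustration, the condition above can be written as a
predicate over a toy pool structure.  struct pool_model and
model_need_to_create_worker() are hypothetical names used only for
this sketch; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct worker_pool. */
struct pool_model {
	int nr_pending;	/* queued work items; stable under pool->lock */
	int nr_running;	/* runnable workers; may change asynchronously */
	int nr_idle;	/* idle workers; stable under pool->lock */
};

/* Mirrors: pending work items && !nr_running && !nr_idle */
bool model_need_to_create_worker(const struct pool_model *p)
{
	assert(p != NULL);
	return p->nr_pending > 0 && p->nr_running == 0 && p->nr_idle == 0;
}
```

Only nr_running can flip the result while pool->lock is held, which
is the field the rest of this description is concerned with.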

The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume.  While it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable in between, making it
non-zero.

If this happens, manage_workers() could return false even with zero
nr_idle, making the worker, the last idle one, proceed to execute work
items.  If all workers of the pool then end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.
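
The hazard can be sketched with a toy model of the pre-patch
behavior.  This is a heavily simplified sketch, not the kernel code:
struct pool_model, old_manage_workers() and
last_worker_starts_working() are hypothetical names, locking is
omitted, and worker creation is assumed to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct worker_pool. */
struct pool_model {
	int nr_pending;		/* queued work items */
	int nr_running;		/* runnable workers; may change asynchronously */
	int nr_idle;		/* idle workers */
	bool has_manager;	/* another worker already holds the manager role */
};

/* Toy model of the pre-patch manager: two different reasons yield false. */
bool old_manage_workers(struct pool_model *p)
{
	assert(p != NULL);
	if (p->has_manager)
		return false;	/* someone else is managing */
	if (!(p->nr_pending > 0 && p->nr_running == 0 && p->nr_idle == 0))
		return false;	/* early exit, indistinguishable to the caller */
	p->nr_idle++;		/* assumption: creating a worker succeeds */
	return true;
}

/* True if the caller would execute work items with no idle worker left. */
bool last_worker_starts_working(struct pool_model *p)
{
	bool managed = old_manage_workers(p);

	return !managed && p->nr_idle == 0;
}
```

With nr_pending = 1, nr_idle = 0 and nr_running having just become 1,
the model takes the early exit and the last idle worker goes on to
execute work items while nobody holds the manager role.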

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.
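
A toy model of the post-patch contract, under the same caveats: the
names struct pool_model and new_manage_workers() are hypothetical,
locking is omitted, and worker creation is assumed to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct worker_pool. */
struct pool_model {
	int nr_pending;		/* queued work items */
	int nr_running;		/* runnable workers; may change asynchronously */
	int nr_idle;		/* idle workers */
	bool has_manager;	/* another worker already holds the manager role */
};

/* Toy model of the post-patch manager: false iff another manager exists. */
bool new_manage_workers(struct pool_model *p)
{
	assert(p != NULL);
	if (p->has_manager)
		return false;	/* the existing manager guarantees progress */
	p->nr_idle++;		/* no early exit; assumption: creation succeeds */
	return true;		/* caller rechecks instead of running work items */
}
```

In this model the value of nr_running no longer influences the return
value, so a concurrent change to it cannot send the last idle worker
into work item execution.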

We could leave the early exit condition alone and just ignore the
return value, but the only reason it was put there is that
manage_workers() used to perform both creation and destruction of
workers, and thus the function could be invoked while the pool was
trying to reduce the number of workers.  Now that manage_workers() is
called only when more workers are needed, the early exit condition is
triggered only by rare race conditions, rendering it pointless.

Tested with a simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
2015-01-16 14:21:16 -05:00
..
bpf bpf: verifier: add checks for BPF_ABS | BPF_IND instructions 2014-12-05 21:47:32 -08:00
configs x86: Add "make tinyconfig" to configure the tiniest possible kernel 2014-08-08 16:30:24 -07:00
debug kdb: replace strnicmp with strncasecmp 2014-10-14 02:18:25 +02:00
events Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-19 13:15:24 -08:00
gcov gcov: enable GCOV_PROFILE_ALL from ARCH Kconfigs 2014-12-13 12:42:51 -08:00
irq genirq: Prevent proc race against freeing of irq descriptors 2014-12-13 13:33:07 +01:00
locking locking/mutex: Don't assume TASK_RUNNING 2014-10-28 10:55:08 +01:00
power PM: Eliminate CONFIG_PM_RUNTIME 2014-12-19 22:55:06 +01:00
printk This code is a fork from the trace-3.19 pull as it needed the trace_seq 2014-12-13 14:04:41 -08:00
rcu Merge branches 'torture.2014.11.03a', 'cpu.2014.11.03a', 'doc.2014.11.13a', 'fixes.2014.11.13a', 'signal.2014.10.29a' and 'rt.2014.10.29a' into HEAD 2014-11-13 10:39:04 -08:00
sched sched_show_task: fix unsafe usage of ->real_parent 2014-12-10 17:41:09 -08:00
time Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-19 13:29:20 -08:00
trace More ACPI and power management updates for 3.19-rc1 2014-12-18 20:28:33 -08:00
.gitignore
acct.c acct: eliminate compile warning 2014-10-09 22:26:04 -04:00
async.c kernel/async.c: switch to pr_foo() 2014-10-09 22:26:04 -04:00
audit_tree.c fsnotify: unify inode and mount marks handling 2014-12-13 12:42:53 -08:00
audit_watch.c audit: invalid op= values for rules 2014-09-23 16:37:53 -04:00
audit.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2014-12-13 13:41:28 -08:00
audit.h audit: reduce scope of audit_log_fcaps 2014-09-23 16:37:51 -04:00
auditfilter.c Merge git://git.infradead.org/users/eparis/audit 2014-10-19 16:25:56 -07:00
auditsc.c new helper: audit_file() 2014-11-19 13:01:26 -05:00
backtracetest.c kernel/backtracetest.c: replace no level printk by pr_info() 2014-06-04 16:54:14 -07:00
bounds.c page-cgroup: get rid of NR_PCG_FLAGS 2014-08-08 15:57:18 -07:00
capability.c CAPABILITIES: remove undefined caps from all processes 2014-07-24 21:53:47 +10:00
cgroup_freezer.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
cgroup.c cgroup: implement cgroup_get_e_css() 2014-11-18 02:49:52 -05:00
compat.c compat: nanosleep: Clarify error handling 2014-09-06 12:58:18 +02:00
configs.c
context_tracking.c sched: stop the unbound recursion in preempt_schedule_context() 2014-10-28 10:46:05 +01:00
cpu_pm.c
cpu.c cpu: Avoid puts_pending overflow 2014-11-03 19:21:01 -08:00
cpuset.c Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-12-11 18:57:19 -08:00
crash_dump.c crash_dump: Make is_kdump_kernel() accessible from modules 2014-08-25 15:42:19 -07:00
cred.c
delayacct.c delayacct: Remove braindamaged type conversions 2014-07-23 10:18:06 -07:00
dma.c
elfcore.c
exec_domain.c kernel/exec_domain.c: code clean-up 2014-06-04 16:54:15 -07:00
exit.c TTY/Serial driver patches for 3.19-rc1 2014-12-14 15:23:32 -08:00
extable.c ftrace/x86/extable: Add is_ftrace_trampoline() function 2014-11-19 15:25:26 -05:00
fork.c mm: use new helper functions around the i_mmap_mutex 2014-12-13 12:42:45 -08:00
freezer.c freezer: remove obsolete comments in __thaw_task() 2014-10-21 23:44:20 +02:00
futex_compat.c
futex.c futex: Fix a race condition between REQUEUE_PI and task death 2014-10-26 16:16:18 +01:00
groups.c userns: Don't allow setgroups until a gid mapping has been setablished 2014-12-09 16:58:40 -06:00
hung_task.c kernel/hung_task.c: convert simple_strtoul to kstrtouint 2014-06-04 16:54:15 -07:00
irq_work.c percpu: Convert remaining __get_cpu_var uses in 3.18-rcX 2014-10-29 11:18:18 -04:00
jump_label.c
kallsyms.c kernel/kallsyms.c: use __seq_open_private() 2014-10-14 02:18:16 +02:00
kcmp.c kcmp: fix standard comparison bug 2014-09-10 15:42:12 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER 2014-07-16 14:57:13 +02:00
Kconfig.preempt
kexec.c kexec: remove unnecessary KERN_ERR from kexec.c 2014-12-13 12:42:51 -08:00
kmod.c usermodehelper: kill the kmod_thread_locker logic 2014-12-10 17:41:17 -08:00
kprobes.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux 2014-12-11 17:30:55 -08:00
ksysfs.c kobject: Make support for uevent_helper optional. 2014-04-25 12:00:49 -07:00
kthread.c kernel/kthread.c: partial revert of 81c98869fa ("kthread: ensure locality of task_struct allocations") 2014-10-09 22:25:51 -04:00
latencytop.c kernel/latencytop.c: convert seq_printf to seq_puts 2014-06-04 16:54:15 -07:00
Makefile kernel: res_counter: remove the unused API 2014-12-10 17:41:04 -08:00
module_signing.c
module-internal.h
module.c The exciting thing here is the getting rid of stop_machine on module 2014-12-18 20:55:41 -08:00
notifier.c kprobes, notifier: Use NOKPROBE_SYMBOL macro in notifier 2014-04-24 10:26:39 +02:00
nsproxy.c bury struct proc_ns in fs/proc 2014-12-04 14:34:54 -05:00
padata.c
panic.c kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
params.c param: do not set store func without write perm 2014-12-18 12:38:51 +10:30
pid_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
pid.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
profile.c kernel/profile.c: use static const char instead of static char 2014-06-06 16:08:13 -07:00
ptrace.c exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() 2014-12-10 17:41:10 -08:00
range.c
reboot.c kernel: add support for kernel restart handler call chain 2014-09-26 00:00:06 -07:00
relay.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
resource.c x86: optimize resource lookups for ioremap 2014-10-14 02:18:22 +02:00
seccomp.c Merge branch 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-14 02:27:06 +02:00
signal.c Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-10 09:34:43 -08:00
smp.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
smpboot.c sched, smp: Correctly deal with nested sleeps 2014-10-28 10:56:24 +01:00
smpboot.h
softirq.c rcu: Remove "cpu" argument to rcu_note_context_switch() 2014-11-03 19:20:34 -08:00
stacktrace.c stacktrace: introduce snprint_stack_trace for buffer output 2014-12-13 12:42:48 -08:00
stop_machine.c kernel/stop_machine.c: kernel-doc warning fix 2014-06-04 16:54:15 -07:00
sys_ni.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
sys.c x86, mpx: On-demand kernel allocation of bounds tables 2014-11-18 00:58:53 +01:00
sysctl_binary.c kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
sysctl.c As the merge window is still open, and this code was not as complex 2014-12-16 12:53:59 -08:00
system_certificates.S
system_keyring.c KEYS: validate certificate trust only with builtin keys 2014-07-17 09:35:17 -04:00
task_work.c
taskstats.c kill f_dentry uses 2014-11-19 13:01:25 -05:00
test_kprobes.c kernel/test_kprobes.c: use current logging functions 2014-08-08 15:57:18 -07:00
torture.c torture: Address race in module cleanup 2014-09-16 13:41:06 -07:00
tracepoint.c tracing: syscall_regfunc() should not skip kernel threads 2014-06-21 00:15:26 -04:00
tsacct.c sched: Make task->start_time nanoseconds based 2014-07-23 10:18:05 -07:00
uid16.c groups: Consolidate the setgroups permission checks 2014-12-05 17:19:27 -06:00
up.c
user_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
user-return-notifier.c scheduler: Replace __get_cpu_var with this_cpu_ptr 2014-08-26 13:45:45 -04:00
user.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
utsname_sysctl.c sysctl: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
utsname.c copy address of proc_ns_ops into ns_common 2014-12-04 14:34:47 -05:00
watchdog.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
workqueue_internal.h workqueue: rename manager_mutex to attach_mutex 2014-05-20 10:59:32 -04:00
workqueue.c workqueue: fix subtle pool management issue which can stall whole worker_pool 2015-01-16 14:21:16 -05:00