linux/kernel
Thomas Gleixner dbb26055de locking/rtmutex: Prevent dequeue vs. unlock race
David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutexes rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex is set as the owner again when the write takes place after the
successfull cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quick, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain that this got not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which allowed to decode this subtle problem.

Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Fixes: 23f78d4a03 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:26 +01:00
..
bpf bpf: fix range arithmetic for bpf map access 2016-11-16 13:21:45 -05:00
configs config: android: enable CONFIG_SECCOMP 2016-10-11 15:06:32 -07:00
debug
events perf/core: Fix address filter parser 2016-11-21 11:28:36 +01:00
gcov gcov: add support for gcc version >= 6 2016-07-15 14:54:27 +09:00
irq genirq: Use irq type from irqdata instead of irqdesc 2016-11-08 15:15:19 +01:00
livepatch livepatch/module: make TAINT_LIVEPATCH module-specific 2016-08-26 14:42:08 +02:00
locking locking/rtmutex: Prevent dequeue vs. unlock race 2016-12-02 11:13:26 +01:00
power PM / sleep: fix device reference leak in test_suspend 2016-11-02 05:10:04 +01:00
printk Revert "printk: make reading the kernel log flush pending lines" 2016-11-14 09:31:52 -08:00
rcu This adds a new gcc plugin named "latent_entropy". It is designed to 2016-10-15 10:03:15 -07:00
sched sched/autogroup: Do not use autogroup->tg in zombie threads 2016-11-22 12:33:43 +01:00
time timers: Prevent base clock corruption when forwarding 2016-10-25 16:32:50 +02:00
trace ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records 2016-11-14 16:31:49 -05:00
.gitignore
acct.c
async.c
audit_fsnotify.c
audit_tree.c audit: cleanup prune_tree_thread 2016-04-04 09:46:47 -04:00
audit_watch.c Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-09-01 15:55:56 -07:00
audit.c Merge branch 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit 2016-10-04 14:21:41 -07:00
audit.h Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-07-29 17:54:17 -07:00
auditfilter.c audit: add fields to exclude filter by reusing user filter 2016-06-27 11:01:00 -04:00
auditsc.c Merge branch 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit 2016-10-04 14:21:41 -07:00
backtracetest.c
bounds.c
capability.c kernel: Add noaudit variant of ns_capable() 2016-06-06 20:16:18 +10:00
cgroup_freezer.c
cgroup_pids.c cgroup: Use lld instead of ld when printing pids controller events_limit 2016-06-21 15:03:36 -04:00
cgroup.c Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2016-10-14 12:18:50 -07:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c cpu/hotplug: Use distinct name for cpu_hotplug.dep_map 2016-10-16 11:09:32 +02:00
cpuset.c Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2016-10-14 12:18:50 -07:00
crash_dump.c
cred.c cred: Reject inodes with invalid ids in set_create_file_as() 2016-06-30 18:05:09 -05:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c sched/autogroup: Do not use autogroup->tg in zombie threads 2016-11-22 12:33:43 +01:00
extable.c
fork.c fork: Add task stack refcounting sanity check and prevent premature task stack freeing 2016-11-01 07:39:17 +01:00
freezer.c freezer, oom: check TIF_MEMDIE on the correct task 2016-07-28 16:07:41 -07:00
futex_compat.c
futex.c futex: Add some more function commentry 2016-09-05 17:20:18 +02:00
groups.c cred: simpler, 1D supplementary groups 2016-10-07 18:46:30 -07:00
hung_task.c hung_task: allow hung_task_panic when hung_task_warnings is 0 2016-10-11 15:06:33 -07:00
irq_work.c
jump_label.c powerpc updates for 4.8 #2 2016-08-05 09:00:54 -04:00
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: properly check if we are in an interrupt 2016-10-27 18:43:42 -07:00
kexec_core.c kexec: add restriction on kexec_load() segment sizes 2016-08-02 19:35:31 -04:00
kexec_file.c kexec: fix double-free when failing to relocate the purgatory 2016-09-01 17:52:01 -07:00
kexec_internal.h
kexec.c kexec: allow architectures to override boot mapping 2016-08-02 19:35:27 -04:00
kmod.c
kprobes.c kprobes: include <asm/sections.h> instead of <asm-generic/sections.h> 2016-10-11 15:06:31 -07:00
ksysfs.c kexec: add a kexec_crash_loaded() function 2016-08-02 19:35:30 -04:00
kthread.c kthread: better support freezable kthread workers 2016-10-11 15:06:33 -07:00
latencytop.c
Makefile userns: Add per user namespace sysctls. 2016-08-08 13:18:58 -05:00
membarrier.c
memremap.c mm: fix cache mode of dax pmd mappings 2016-09-09 17:34:46 -07:00
module_signing.c KEYS: Move the point of trust determination to __key_link() 2016-04-11 22:43:43 +01:00
module-internal.h
module.c livepatch/module: make TAINT_LIVEPATCH module-specific 2016-08-26 14:42:08 +02:00
notifier.c
nsproxy.c
padata.c padata: Convert to hotplug state machine 2016-09-19 21:44:30 +02:00
panic.c x86/panic: replace smp_send_stop() with kdump friendly version in panic path 2016-10-11 15:06:32 -07:00
params.c
pid_namespace.c Merge branch 'nsfs-ioctls' into HEAD 2016-09-22 20:00:36 -05:00
pid.c remove lots of IS_ERR_VALUE abuses 2016-05-27 15:26:11 -07:00
profile.c profile: Convert to hotplug state machine 2016-07-15 10:41:42 +02:00
ptrace.c mm: replace access_process_vm() write parameter with gup_flags 2016-10-19 08:31:25 -07:00
range.c
reboot.c
relay.c relay: Use irq_work instead of plain timer for deferred wakeup 2016-10-11 15:06:32 -07:00
resource.c /proc/iomem: only expose physical resource addresses to privileged users 2016-04-14 12:56:09 -07:00
seccomp.c seccomp: Fix tracer exit notifications during fatal signals 2016-08-30 16:12:46 -07:00
signal.c x86/signal: Add SA_{X32,IA32}_ABI sa_flags 2016-09-14 21:28:11 +02:00
smp.c smp: Allocate smp_call_on_cpu() workqueue on stack too 2016-09-22 14:49:10 +02:00
smpboot.c kthread/smpboot: do not park in kthread_create_on_cpu() 2016-10-11 15:06:33 -07:00
smpboot.h
softirq.c softirq: Display IRQ_POLL for irq-poll statistics 2016-10-21 15:45:47 -06:00
stacktrace.c
stop_machine.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-10-03 13:39:00 -07:00
sys_ni.c x86/pkeys: Fix pkeys build breakage for some non-x86 arches 2016-09-13 14:41:36 +02:00
sys.c prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable 2016-05-23 17:04:14 -07:00
sysctl_binary.c kernel/sysctl_binary.c: use generic UUID library 2016-05-20 17:58:30 -07:00
sysctl.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 13:04:49 -07:00
task_work.c task_work: use READ_ONCE/lockless_dereference, avoid pi_lock if !task_works 2016-08-02 19:35:02 -04:00
taskstats.c taskstats: fix the length of cgroupstats_cmd_get_policy 2016-11-03 16:55:58 -04:00
test_kprobes.c
torture.c torture: Convert torture_shutdown() to hrtimer 2016-08-22 10:01:49 -07:00
tracepoint.c
tsacct.c
ucount.c mntns: Add a limit on the number of mount namespaces. 2016-08-31 07:28:35 -05:00
uid16.c cred: simpler, 1D supplementary groups 2016-10-07 18:46:30 -07:00
up.c smp: Add function to execute a function synchronously on a CPU 2016-09-05 13:52:39 +02:00
user_namespace.c Merge branch 'nsfs-ioctls' into HEAD 2016-09-22 20:00:36 -05:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c Merge branch 'nsfs-ioctls' into HEAD 2016-09-22 20:00:36 -05:00
watchdog.c Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86" 2016-07-10 20:58:36 +02:00
workqueue_internal.h
workqueue.c kthread: rename probe_kthread_data() to kthread_probe_data() 2016-10-11 15:06:33 -07:00