linux/kernel
Joel Savitz d477f8c202 cpuset: restore sanity to cpuset_cpus_allowed_fallback()
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,fffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off > /sys/devices/system/cpu/smt/control
	# echo on > /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-12 11:00:08 -07:00
..
bpf SPDX update for 5.2-rc2, round 1 2019-05-21 12:33:38 -07:00
cgroup cpuset: restore sanity to cpuset_cpus_allowed_fallback() 2019-06-12 11:00:08 -07:00
configs
debug treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
dma treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
events mm/mmu_notifier: use correct mmu_notifier events for each invalidation 2019-05-14 09:47:49 -07:00
gcov treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
irq treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
livepatch treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 2019-05-21 11:28:45 +02:00
locking locking/lock_events: Use this_cpu_add() when necessary 2019-05-24 14:17:18 -07:00
power treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
printk treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 2019-05-21 11:28:45 +02:00
rcu treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
sched treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
time treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
trace Make the GCC 9 warning for sub struct memset go away. 2019-05-26 13:49:40 -07:00
.gitignore Provide in-kernel headers to make extending kernel easier 2019-04-29 16:48:03 +02:00
acct.c acct_on(): don't mess with freeze protection 2019-04-04 21:04:13 -04:00
async.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
audit_fsnotify.c audit_compare_dname_path(): switch to const struct qstr * 2019-04-28 20:33:43 -04:00
audit_tree.c fsnotify: switch send_to_group() and ->handle_event to const struct qstr * 2019-04-26 13:51:03 -04:00
audit_watch.c audit_compare_dname_path(): switch to const struct qstr * 2019-04-28 20:33:43 -04:00
audit.c
audit.h audit_compare_dname_path(): switch to const struct qstr * 2019-04-28 20:33:43 -04:00
auditfilter.c Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
auditsc.c Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
backtracetest.c backtrace-test: Simplify stack trace handling 2019-04-29 12:37:47 +02:00
bounds.c
capability.c
compat.c kernel/compat.c: mark expected switch fall-throughs 2019-05-15 08:16:14 -07:00
configs.c
context_tracking.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
cpu_pm.c
cpu.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-06 14:50:46 -07:00
crash_core.c
crash_dump.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
cred.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
delayacct.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25 2019-05-21 11:52:39 +02:00
dma.c
elfcore.c
exec_domain.c
exit.c cgroup: Call cgroup_release() before __exit_signal() 2019-05-31 10:38:57 -07:00
extable.c
fail_function.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
fork.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
freezer.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
futex.c mm/gup: change GUP fast to use flags rather than a write 'bool' 2019-05-14 09:47:46 -07:00
gen_ikh_data.sh Provide in-kernel headers to make extending kernel easier 2019-04-29 16:48:03 +02:00
groups.c
hung_task.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
iomem.c mm/resource: Use resource_overlaps() to simplify region_intersects() 2019-04-19 12:59:36 +02:00
irq_work.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
jump_label.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
kallsyms.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
kcmp.c
Kconfig.freezer treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
Kconfig.hz treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
Kconfig.locks treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
Kconfig.preempt treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
kcov.c
kexec_core.c power/suspend: Add function to disable secondaries for suspend 2019-05-03 19:42:41 +02:00
kexec_file.c mm: memblock: make keeping memblock memory opt-in rather than opt-out 2019-05-14 09:47:50 -07:00
kexec_internal.h
kexec.c
kheaders.c Provide in-kernel headers to make extending kernel easier 2019-04-29 16:48:03 +02:00
kmod.c
kprobes.c kprobes: Fix error check when reusing optimized probes 2019-04-16 09:38:16 +02:00
ksysfs.c
kthread.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
latencytop.c kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing 2019-05-14 19:52:49 -07:00
Makefile kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executable 2019-05-14 19:52:47 -07:00
memremap.c kernel/memremap.c: remove the unused device_private_entry_fault() export 2019-05-14 09:47:51 -07:00
module_signing.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
module-internal.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
module.c Modules updates for v5.2 2019-05-14 10:55:54 -07:00
notifier.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
nsproxy.c
padata.c padata: Replace padata_attr_type default_attrs field with groups 2019-04-25 22:06:11 +02:00
panic.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
params.c
pid_namespace.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
pid.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
profile.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
ptrace.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
range.c
reboot.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
relay.c
resource.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
rseq.c rseq: Remove superfluous rseq_len from task_struct 2019-04-19 12:39:32 +02:00
seccomp.c audit/stable-5.2 PR 20190507 2019-05-07 19:06:04 -07:00
signal.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
smp.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
smpboot.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
smpboot.h
softirq.c
stackleak.c
stacktrace.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
stop_machine.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 38 2019-05-24 17:27:11 +02:00
sys_ni.c signal: support CLONE_PIDFD with pidfd_send_signal 2019-05-07 14:31:03 +02:00
sys.c kernel/sys.c: prctl: fix false positive in validate_prctl_map() 2019-05-14 09:47:44 -07:00
sysctl_binary.c
sysctl.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
task_work.c
taskstats.c genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
test_kprobes.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25 2019-05-21 11:52:39 +02:00
torture.c torture: Don't try to offline the last CPU 2019-03-26 14:42:53 -07:00
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
up.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
user_namespace.c
user-return-notifier.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
user.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
utsname_sysctl.c
utsname.c
watchdog_hld.c kernel/watchdog_hld.c: hard lockup message should end with a newline 2019-04-19 09:46:05 -07:00
watchdog.c watchdog: Fix typo in comment 2019-04-18 14:05:51 +02:00
workqueue_internal.h sched/core, workqueues: Distangle worker accounting from rq lock 2019-04-16 16:55:15 +02:00
workqueue.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00