linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-10 22:21:40 +00:00

Odin Ugedal 0258bdfaff sched/fair: Fix unfairness caused by missing load decay

This fixes an issue where old load on a cfs_rq is not properly decayed, resulting in strange behavior where fairness can decrease drastically. Real workloads with equally weighted control groups have ended up getting a respective 99% and 1%(!!) of cpu time.

When an idle task is attached to a cfs_rq by attaching a pid to a cgroup, the old load of the task is attached to the new cfs_rq and sched_entity by attach_entity_cfs_rq. If the task is then moved to another cpu (and therefore cfs_rq) before being enqueued/woken up, the load will be moved to cfs_rq->removed from the sched_entity. Such a move will happen when enforcing a cpuset on the task (eg. via a cgroup) that forces it to move.

The load will however not be removed from the task_group itself, making it look like there is a constant load on that cfs_rq. This causes the vruntime of tasks on other sibling cfs_rq's to increase faster than they are supposed to; causing severe fairness issues. If no other task is started on the given cfs_rq, and due to the cpuset it would not happen, this load would never be properly unloaded. With this patch the load will be properly removed inside update_blocked_averages. This also applies to tasks moved to the fair scheduling class and moved to another cpu, and this path will also fix that. For fork, the entity is queued right away, so this problem does not affect that.

This applies to cases where the new process is the first in the cfs_rq, issue introduced `3d30544f02` ("sched/fair: Apply more PELT fixes"), and when there has previously been load on the cgroup but the cgroup was removed from the leaflist due to having null PELT load, introduced in `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path").

For a simple cgroup hierarchy (as seen below) with two equally weighted groups, that in theory should get 50/50 of cpu time each, it often leads to a load of 60/40 or 70/30.

    parent/
      cg-1/
        cpu.weight: 100
        cpuset.cpus: 1
      cg-2/
        cpu.weight: 100
        cpuset.cpus: 1

If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2 equally weighted, they should still get a 50/50 balance of cpu time. This however sometimes results in a balance of 10/90 or 1/99(!!) between the task groups.

    $ ps u -C stress
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
    root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1

    parent/
      cg-1/
        cpu.weight: 100
        sub-group/
          cpu.weight: 1
          cpuset.cpus: 1
      cg-2/
        cpu.weight: 100
        sub-group/
          cpu.weight: 10000
          cpuset.cpus: 1

This can be reproduced by attaching an idle process to a cgroup and moving it to a given cpuset before it wakes up. The issue is evident in many (if not most) container runtimes, and has been reproduced with both crun and runc (and therefore docker and all its "derivatives"), and with both cgroup v1 and v2.

Fixes: `3d30544f02` ("sched/fair: Apply more PELT fixes")
Fixes: `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al

2021-05-06 15:33:27 +02:00
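The commit message's claim that the deeper hierarchy should still balance 50/50 can be checked with a little arithmetic. The following is a minimal Python sketch, not kernel code: `Group` and `ideal_shares` are hypothetical names introduced here, and the model assumes every leaf group holds one always-runnable task on the same CPU, with each parent's share split among its children in proportion to `cpu.weight`. It ignores PELT and the stale load described above, so it shows only the intended outcome, not the buggy one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Group:
    """Hypothetical stand-in for a cgroup: a name, its cpu.weight, and children."""
    name: str
    weight: int
    children: List["Group"] = field(default_factory=list)


def ideal_shares(group: Group, share: float = 1.0, prefix: str = "") -> Dict[str, float]:
    """Return the ideal CPU share of every leaf group below `group`.

    Assumption: all leaves are busy and compete for one CPU, so a parent's
    share is divided among its children in proportion to their weights.
    """
    path = prefix + group.name
    if not group.children:
        return {path: share}
    total = sum(child.weight for child in group.children)
    shares: Dict[str, float] = {}
    for child in group.children:
        shares.update(ideal_shares(child, share * child.weight / total, path + "/"))
    return shares


# The deeper hierarchy from the commit message.
deep = Group("parent", 100, [
    Group("cg-1", 100, [Group("sub-group", 1)]),
    Group("cg-2", 100, [Group("sub-group", 10000)]),
])

print(ideal_shares(deep))
# {'parent/cg-1/sub-group': 0.5, 'parent/cg-2/sub-group': 0.5}
```

Each `sub-group` is the only child of its parent, so its own weight (1 or 10000) has no sibling to compete against and the split is decided entirely by the equal weights of cg-1 and cg-2. The 1.1% versus 99.3% seen in the `ps` output is therefore far from what the weights alone would predict.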
..
bpf	selinux/stable-5.13 PR 20210426	2021-04-27 13:42:11 -07:00
cgroup	cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io()	2021-04-16 16:49:37 -04:00
configs	staging: ION: remove some references to CONFIG_ION	2021-01-06 17:39:38 +01:00
debug	printk changes for 5.13	2021-04-27 18:09:44 -07:00
dma	Merge branch 'stable/for-linus-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb	2021-02-26 13:59:32 -08:00
entry	A trivial cleanup of typo fixes.	2021-04-26 09:41:15 -07:00
events	perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE	2021-04-19 20:03:29 +02:00
gcov	Revert "gcov: clang: fix clang-11+ build"	2021-04-19 15:08:49 -07:00
irq	The usual updates from the irq departement:	2021-04-26 09:43:16 -07:00
kcsan	kcsan: Fix printk format string	2021-04-22 14:36:03 +02:00
livepatch	Livepatching changes for 5.13	2021-04-27 18:14:38 -07:00
locking	Locking changes for this cycle were:	2021-04-28 12:37:53 -07:00
power	PM: sleep: fix typos in comments	2021-04-08 19:37:21 +02:00
printk	kernel/printk.c: Fixed mundane typos	2021-03-30 15:34:17 +02:00
rcu	Merge branches 'bitmaprange.2021.03.08a', 'fixes.2021.03.15a', 'kvfree_rcu.2021.03.08a', 'mmdumpobj.2021.03.08a', 'nocb.2021.03.15a', 'poll.2021.03.24a', 'rt.2021.03.08a', 'tasks.2021.03.08a', 'torture.2021.03.08a' and 'torturescript.2021.03.22a' into HEAD	2021-03-24 17:20:18 -07:00
sched	sched/fair: Fix unfairness caused by missing load decay	2021-05-06 15:33:27 +02:00
time	Power management updates for 5.13-rc1	2021-04-26 15:10:25 -07:00
trace	The usual updates from the irq departement:	2021-04-26 09:43:16 -07:00
.gitignore
acct.c	kernel/acct.c: use #elif instead of #end and #elif	2020-12-15 22:46:15 -08:00
async.c	treewide: Remove uninitialized_var() usage	2020-07-16 12:35:15 -07:00
audit_fsnotify.c	audit_alloc_mark(): don't open-code ERR_CAST()	2021-02-23 10:25:27 -05:00
audit_tree.c	fsnotify: generalize handle_inode_event()	2020-12-03 14:58:35 +01:00
audit_watch.c	fsnotify: generalize handle_inode_event()	2020-12-03 14:58:35 +01:00
audit.c	lsm: separate security_task_getsecid() into subjective and objective variants	2021-03-22 15:23:32 -04:00
audit.h	audit: avoid -Wempty-body warning	2021-03-24 12:11:48 -04:00
auditfilter.c	lsm: separate security_task_getsecid() into subjective and objective variants	2021-03-22 15:23:32 -04:00
auditsc.c	audit/stable-5.13 PR 20210426	2021-04-27 13:50:58 -07:00
backtracetest.c	treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()	2020-07-30 11:15:58 -07:00
bounds.c
capability.c	capability: handle idmapped mounts	2021-01-24 14:27:16 +01:00
cfi.c	add support for Clang CFI	2021-04-08 16:04:20 -07:00
compat.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
configs.c
context_tracking.c
cpu_pm.c	notifier: Fix broken error handling pattern	2020-09-01 09:58:03 +02:00
cpu.c	cpumask/hotplug: Fix cpu_dying() state tracking	2021-04-21 13:55:43 +02:00
crash_core.c	kdump: append uts_namespace.name offset to VMCOREINFO	2020-12-15 22:46:18 -08:00
crash_dump.c
cred.c
delayacct.c
dma.c
exec_domain.c
exit.c	signal: Allow tasks to cache one sigqueue struct	2021-04-14 18:04:08 +02:00
extable.c
fail_function.c	fault-injection: handle EI_ETYPE_TRUE	2020-12-15 22:46:19 -08:00
fork.c	for-5.13/io_uring-2021-04-27	2021-04-28 14:56:09 -07:00
freezer.c	Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"	2021-03-27 14:09:10 -06:00
futex.c	Linux 5.12-rc5	2021-03-29 15:56:48 +02:00
gen_kheaders.sh
groups.c	groups: simplify struct group_info allocation	2021-02-26 09:41:03 -08:00
hung_task.c	kernel/hung_task.c: make type annotations consistent	2020-11-02 12:14:19 -08:00
iomem.c
irq_work.c	irq_work: Optimize irq_work_single()	2020-11-24 16:47:49 +01:00
jump_label.c	static_call: Fix static_call_update() sanity check	2021-03-19 13:16:44 +01:00
kallsyms.c	kallsyms: strip ThinLTO hashes from static functions	2021-04-08 16:04:21 -07:00
kcmp.c	Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2020-12-15 19:36:48 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt	preempt: Introduce CONFIG_PREEMPT_DYNAMIC	2021-02-17 14:12:24 +01:00
kcov.c	kernel: make kcov_common_handle consider the current context	2020-11-02 18:00:20 -08:00
kexec_core.c	Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-21 09:29:23 -08:00
kexec_elf.c
kexec_file.c	ima: Free IMA measurement buffer after kexec syscall	2021-02-10 15:49:38 -05:00
kexec_internal.h	kexec: move machine_kexec_post_load() to public interface	2021-02-22 12:33:26 +00:00
kexec.c	LSM: Introduce kernel_post_load_data() hook	2020-10-05 13:37:03 +02:00
kheaders.c
kmod.c	kmod: remove redundant "be an" in the comment	2020-08-12 10:58:01 -07:00
kprobes.c	kprobes: Fix to delay the kprobes jump optimization	2021-02-19 14:57:12 -05:00
ksysfs.c
kthread.c	Scheduler updates for this cycle are:	2021-04-28 13:33:57 -07:00
latencytop.c
Makefile	add support for Clang CFI	2021-04-08 16:04:20 -07:00
module_signature.c	module: harden ELF info handling	2021-01-19 10:24:45 +01:00
module_signing.c	module: harden ELF info handling	2021-01-19 10:24:45 +01:00
module-internal.h
module.c	add support for Clang CFI	2021-04-08 16:04:20 -07:00
notifier.c	notifier: Fix broken error handling pattern	2020-09-01 09:58:03 +02:00
nsproxy.c	fixes-v5.11	2020-12-14 16:40:27 -08:00
padata.c	padata: fix possible padata_works_lock deadlock	2020-09-04 17:51:55 +10:00
panic.c	panic: don't dump stack twice on warn	2020-11-14 11:26:04 -08:00
params.c	Modules updates for v5.11	2020-12-17 13:01:31 -08:00
pid_namespace.c	fixes-v5.11	2020-12-14 16:40:27 -08:00
pid.c	Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2020-12-15 19:36:48 -08:00
profile.c	kernel: Initialize cpumask before parsing	2021-04-10 13:35:54 +02:00
ptrace.c	Linux 5.12-rc8	2021-04-20 10:13:58 +02:00
range.c	kernel.h: split out min()/max() et al. helpers	2020-10-16 11:11:19 -07:00
reboot.c	Revert "PM: ACPI: reboot: Use S5 for reboot"	2021-03-18 16:58:02 +01:00
regset.c	regset: kill ->get()	2020-07-27 14:31:12 -04:00
relay.c	relay: allow the use of const callback structs	2020-12-15 22:46:18 -08:00
resource_kunit.c	resource: provide meaningful MODULE_LICENSE() in test suite	2020-11-25 18:52:35 +01:00
resource.c	resource: Move devmem revoke code to resource framework	2021-01-12 14:26:31 +01:00
rseq.c	rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()	2021-04-14 18:04:09 +02:00
scftorture.c	scftorture: Add debug output for wrong-CPU warning	2021-01-04 13:53:41 -08:00
scs.c	scs: switch to vmapped shadow stacks	2020-12-01 10:30:28 +00:00
seccomp.c	seccomp: Fix "cacheable" typo in comments	2021-03-30 22:34:30 -07:00
signal.c	Scheduler updates for this cycle are:	2021-04-28 13:33:57 -07:00
smp.c	Merge branch 'locking/core' into x86/mm, to resolve conflict	2021-03-06 13:00:58 +01:00
smpboot.c	kthread: Extract KTHREAD_IS_PER_CPU	2021-01-22 15:09:42 +01:00
smpboot.h
softirq.c	RCU changes for this cycle were:	2021-04-28 12:00:13 -07:00
stackleak.c	stackleak: let stack_erasing_sysctl take a kernel pointer buffer	2020-09-19 13:13:39 -07:00
stacktrace.c	stacktrace: Remove reliable argument from arch_stack_walk() callback	2020-09-18 14:24:16 +01:00
static_call.c	static_call: Fix unused variable warn w/o MODULE	2021-04-09 13:22:12 +02:00
stop_machine.c	stop_machine: Add caller debug info to queue_stop_cpus_work	2021-03-23 16:01:58 +01:00
sys_ni.c	quota: wire up quotactl_path	2021-03-17 15:51:17 +01:00
sys.c	arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)	2021-04-13 17:31:44 +01:00
sysctl-test.c
sysctl.c	\n	2021-04-29 11:06:13 -07:00
task_work.c	task_work: add helper for more targeted task_work canceling	2021-04-11 19:30:25 -06:00
taskstats.c	treewide: rename nla_strlcpy to nla_strscpy.	2020-11-16 08:08:54 -08:00
test_kprobes.c
torture.c	torture: Replace torture_init_begin string with %s	2021-03-08 14:22:28 -08:00
tracepoint.c	tracepoints: Code clean up	2021-02-09 12:27:29 -05:00
tsacct.c
ucount.c	fanotify: configurable limits via sysfs	2021-03-16 16:49:31 +01:00
uid16.c
uid16.h
umh.c	usermodehelper: reset umask to default before executing user process	2020-10-06 10:31:52 -07:00
up.c	smp: Inline on_each_cpu_cond() and on_each_cpu()	2021-03-06 12:59:10 +01:00
user_namespace.c	capabilities: require CAP_SETFCAP to map uid 0	2021-04-20 14:28:33 -07:00
user-return-notifier.c
user.c	user: Use generic ns_common::count	2020-08-19 14:14:12 +02:00
usermode_driver.c	bpf: Fix umd memory leak in copy_process()	2021-03-19 22:23:19 +01:00
utsname_sysctl.c
utsname.c	uts: Use generic ns_common::count	2020-08-19 14:13:20 +02:00
watch_queue.c	watch_queue: rectify kernel-doc for init_watch()	2021-01-26 11:16:34 +00:00
watchdog_hld.c
watchdog.c	workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()	2021-04-04 13:26:49 -04:00
workqueue_internal.h
workqueue.c	CFI on arm64 series for v5.13-rc1	2021-04-27 10:16:46 -07:00