linux/kernel
Odin Ugedal 0258bdfaff sched/fair: Fix unfairness caused by missing load decay
This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting a respective 99% and 1%(!!) of cpu time.

When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
the old load of the task is attached to the new cfs_rq and sched_entity by
attach_entity_cfs_rq. If the task is then moved to another cpu (and
therefore cfs_rq) before being enqueued/woken up, the load will be moved
to cfs_rq->removed from the sched_entity. Such a move will happen when
enforcing a cpuset on the task (eg. via a cgroup) that force it to move.

The load will however not be removed from the task_group itself, making
it look like there is a constant load on that cfs_rq. This causes the
vruntime of tasks on other sibling cfs_rq's to increase faster than they
are supposed to; causing severe fairness issues. If no other task is
started on the given cfs_rq, and due to the cpuset it would not happen,
this load would never be properly unloaded. With this patch the load
will be properly removed inside update_blocked_averages. This also
applies to tasks moved to the fair scheduling class and moved to another
cpu, and this path will also fix that. For fork, the entity is queued
right away, so this problem does not affect that.

This applies to cases where the new process is the first in the cfs_rq,
issue introduced 3d30544f02 ("sched/fair: Apply more PELT fixes"), and
when there has previously been load on the cgroup but the cgroup was
removed from the leaflist due to having null PELT load, indroduced
in 039ae8bcf7 ("sched/fair: Fix O(nr_cgroups) in the load balancing
path").

For a simple cgroup hierarchy (as seen below) with two equally weighted
groups, that in theory should get 50/50 of cpu time each, it often leads
to a load of 60/40 or 70/30.

parent/
  cg-1/
    cpu.weight: 100
    cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    cpuset.cpus: 1

If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This however sometimes results in a balance of 10/90 or 1/99(!!) between
the task groups.

$ ps u -C stress
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1

parent/
  cg-1/
    cpu.weight: 100
    sub-group/
      cpu.weight: 1
      cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    sub-group/
      cpu.weight: 10000
      cpuset.cpus: 1

This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced
with both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.

Fixes: 3d30544f02 ("sched/fair: Apply more PELT fixes")
Fixes: 039ae8bcf7 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al
2021-05-06 15:33:27 +02:00
..
bpf selinux/stable-5.13 PR 20210426 2021-04-27 13:42:11 -07:00
cgroup cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io() 2021-04-16 16:49:37 -04:00
configs staging: ION: remove some references to CONFIG_ION 2021-01-06 17:39:38 +01:00
debug printk changes for 5.13 2021-04-27 18:09:44 -07:00
dma Merge branch 'stable/for-linus-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb 2021-02-26 13:59:32 -08:00
entry A trivial cleanup of typo fixes. 2021-04-26 09:41:15 -07:00
events perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE 2021-04-19 20:03:29 +02:00
gcov Revert "gcov: clang: fix clang-11+ build" 2021-04-19 15:08:49 -07:00
irq The usual updates from the irq departement: 2021-04-26 09:43:16 -07:00
kcsan kcsan: Fix printk format string 2021-04-22 14:36:03 +02:00
livepatch Livepatching changes for 5.13 2021-04-27 18:14:38 -07:00
locking Locking changes for this cycle were: 2021-04-28 12:37:53 -07:00
power PM: sleep: fix typos in comments 2021-04-08 19:37:21 +02:00
printk kernel/printk.c: Fixed mundane typos 2021-03-30 15:34:17 +02:00
rcu Merge branches 'bitmaprange.2021.03.08a', 'fixes.2021.03.15a', 'kvfree_rcu.2021.03.08a', 'mmdumpobj.2021.03.08a', 'nocb.2021.03.15a', 'poll.2021.03.24a', 'rt.2021.03.08a', 'tasks.2021.03.08a', 'torture.2021.03.08a' and 'torturescript.2021.03.22a' into HEAD 2021-03-24 17:20:18 -07:00
sched sched/fair: Fix unfairness caused by missing load decay 2021-05-06 15:33:27 +02:00
time Power management updates for 5.13-rc1 2021-04-26 15:10:25 -07:00
trace The usual updates from the irq departement: 2021-04-26 09:43:16 -07:00
.gitignore
acct.c kernel/acct.c: use #elif instead of #end and #elif 2020-12-15 22:46:15 -08:00
async.c treewide: Remove uninitialized_var() usage 2020-07-16 12:35:15 -07:00
audit_fsnotify.c audit_alloc_mark(): don't open-code ERR_CAST() 2021-02-23 10:25:27 -05:00
audit_tree.c fsnotify: generalize handle_inode_event() 2020-12-03 14:58:35 +01:00
audit_watch.c fsnotify: generalize handle_inode_event() 2020-12-03 14:58:35 +01:00
audit.c lsm: separate security_task_getsecid() into subjective and objective variants 2021-03-22 15:23:32 -04:00
audit.h audit: avoid -Wempty-body warning 2021-03-24 12:11:48 -04:00
auditfilter.c lsm: separate security_task_getsecid() into subjective and objective variants 2021-03-22 15:23:32 -04:00
auditsc.c audit/stable-5.13 PR 20210426 2021-04-27 13:50:58 -07:00
backtracetest.c treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD() 2020-07-30 11:15:58 -07:00
bounds.c
capability.c capability: handle idmapped mounts 2021-01-24 14:27:16 +01:00
cfi.c add support for Clang CFI 2021-04-08 16:04:20 -07:00
compat.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
configs.c
context_tracking.c
cpu_pm.c notifier: Fix broken error handling pattern 2020-09-01 09:58:03 +02:00
cpu.c cpumask/hotplug: Fix cpu_dying() state tracking 2021-04-21 13:55:43 +02:00
crash_core.c kdump: append uts_namespace.name offset to VMCOREINFO 2020-12-15 22:46:18 -08:00
crash_dump.c
cred.c
delayacct.c
dma.c
exec_domain.c
exit.c signal: Allow tasks to cache one sigqueue struct 2021-04-14 18:04:08 +02:00
extable.c
fail_function.c fault-injection: handle EI_ETYPE_TRUE 2020-12-15 22:46:19 -08:00
fork.c for-5.13/io_uring-2021-04-27 2021-04-28 14:56:09 -07:00
freezer.c Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing" 2021-03-27 14:09:10 -06:00
futex.c Linux 5.12-rc5 2021-03-29 15:56:48 +02:00
gen_kheaders.sh
groups.c groups: simplify struct group_info allocation 2021-02-26 09:41:03 -08:00
hung_task.c kernel/hung_task.c: make type annotations consistent 2020-11-02 12:14:19 -08:00
iomem.c
irq_work.c irq_work: Optimize irq_work_single() 2020-11-24 16:47:49 +01:00
jump_label.c static_call: Fix static_call_update() sanity check 2021-03-19 13:16:44 +01:00
kallsyms.c kallsyms: strip ThinLTO hashes from static functions 2021-04-08 16:04:21 -07:00
kcmp.c Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:36:48 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt preempt: Introduce CONFIG_PREEMPT_DYNAMIC 2021-02-17 14:12:24 +01:00
kcov.c kernel: make kcov_common_handle consider the current context 2020-11-02 18:00:20 -08:00
kexec_core.c Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-21 09:29:23 -08:00
kexec_elf.c
kexec_file.c ima: Free IMA measurement buffer after kexec syscall 2021-02-10 15:49:38 -05:00
kexec_internal.h kexec: move machine_kexec_post_load() to public interface 2021-02-22 12:33:26 +00:00
kexec.c LSM: Introduce kernel_post_load_data() hook 2020-10-05 13:37:03 +02:00
kheaders.c
kmod.c kmod: remove redundant "be an" in the comment 2020-08-12 10:58:01 -07:00
kprobes.c kprobes: Fix to delay the kprobes jump optimization 2021-02-19 14:57:12 -05:00
ksysfs.c
kthread.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
latencytop.c
Makefile add support for Clang CFI 2021-04-08 16:04:20 -07:00
module_signature.c module: harden ELF info handling 2021-01-19 10:24:45 +01:00
module_signing.c module: harden ELF info handling 2021-01-19 10:24:45 +01:00
module-internal.h
module.c add support for Clang CFI 2021-04-08 16:04:20 -07:00
notifier.c notifier: Fix broken error handling pattern 2020-09-01 09:58:03 +02:00
nsproxy.c fixes-v5.11 2020-12-14 16:40:27 -08:00
padata.c padata: fix possible padata_works_lock deadlock 2020-09-04 17:51:55 +10:00
panic.c panic: don't dump stack twice on warn 2020-11-14 11:26:04 -08:00
params.c Modules updates for v5.11 2020-12-17 13:01:31 -08:00
pid_namespace.c fixes-v5.11 2020-12-14 16:40:27 -08:00
pid.c Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:36:48 -08:00
profile.c kernel: Initialize cpumask before parsing 2021-04-10 13:35:54 +02:00
ptrace.c Linux 5.12-rc8 2021-04-20 10:13:58 +02:00
range.c kernel.h: split out min()/max() et al. helpers 2020-10-16 11:11:19 -07:00
reboot.c Revert "PM: ACPI: reboot: Use S5 for reboot" 2021-03-18 16:58:02 +01:00
regset.c regset: kill ->get() 2020-07-27 14:31:12 -04:00
relay.c relay: allow the use of const callback structs 2020-12-15 22:46:18 -08:00
resource_kunit.c resource: provide meaningful MODULE_LICENSE() in test suite 2020-11-25 18:52:35 +01:00
resource.c resource: Move devmem revoke code to resource framework 2021-01-12 14:26:31 +01:00
rseq.c rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs() 2021-04-14 18:04:09 +02:00
scftorture.c scftorture: Add debug output for wrong-CPU warning 2021-01-04 13:53:41 -08:00
scs.c scs: switch to vmapped shadow stacks 2020-12-01 10:30:28 +00:00
seccomp.c seccomp: Fix "cacheable" typo in comments 2021-03-30 22:34:30 -07:00
signal.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
smp.c Merge branch 'locking/core' into x86/mm, to resolve conflict 2021-03-06 13:00:58 +01:00
smpboot.c kthread: Extract KTHREAD_IS_PER_CPU 2021-01-22 15:09:42 +01:00
smpboot.h
softirq.c RCU changes for this cycle were: 2021-04-28 12:00:13 -07:00
stackleak.c stackleak: let stack_erasing_sysctl take a kernel pointer buffer 2020-09-19 13:13:39 -07:00
stacktrace.c stacktrace: Remove reliable argument from arch_stack_walk() callback 2020-09-18 14:24:16 +01:00
static_call.c static_call: Fix unused variable warn w/o MODULE 2021-04-09 13:22:12 +02:00
stop_machine.c stop_machine: Add caller debug info to queue_stop_cpus_work 2021-03-23 16:01:58 +01:00
sys_ni.c quota: wire up quotactl_path 2021-03-17 15:51:17 +01:00
sys.c arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS) 2021-04-13 17:31:44 +01:00
sysctl-test.c
sysctl.c \n 2021-04-29 11:06:13 -07:00
task_work.c task_work: add helper for more targeted task_work canceling 2021-04-11 19:30:25 -06:00
taskstats.c treewide: rename nla_strlcpy to nla_strscpy. 2020-11-16 08:08:54 -08:00
test_kprobes.c
torture.c torture: Replace torture_init_begin string with %s 2021-03-08 14:22:28 -08:00
tracepoint.c tracepoints: Code clean up 2021-02-09 12:27:29 -05:00
tsacct.c
ucount.c fanotify: configurable limits via sysfs 2021-03-16 16:49:31 +01:00
uid16.c
uid16.h
umh.c usermodehelper: reset umask to default before executing user process 2020-10-06 10:31:52 -07:00
up.c smp: Inline on_each_cpu_cond() and on_each_cpu() 2021-03-06 12:59:10 +01:00
user_namespace.c capabilities: require CAP_SETFCAP to map uid 0 2021-04-20 14:28:33 -07:00
user-return-notifier.c
user.c user: Use generic ns_common::count 2020-08-19 14:14:12 +02:00
usermode_driver.c bpf: Fix umd memory leak in copy_process() 2021-03-19 22:23:19 +01:00
utsname_sysctl.c
utsname.c uts: Use generic ns_common::count 2020-08-19 14:13:20 +02:00
watch_queue.c watch_queue: rectify kernel-doc for init_watch() 2021-01-26 11:16:34 +00:00
watchdog_hld.c
watchdog.c workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog() 2021-04-04 13:26:49 -04:00
workqueue_internal.h
workqueue.c CFI on arm64 series for v5.13-rc1 2021-04-27 10:16:46 -07:00