linux

Suravee Suthikulpanit 051f3ca02e sched/topology: Introduce NUMA identity node sched domain	2017-10-10 11:45:28 +02:00

On AMD Family17h-based (EPYC) system, a logical NUMA node can contain
up to 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain up to 4 NUMA nodes, and a system can support
up to 2 sockets. With full system configuration, current scheduler
creates 4 sched domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent cpus spanning a logical NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case1: When running 8 tasks, a properly balanced system should
       schedule a task per logical NUMA node. This is not the case for
       the current scheduler.

Case2: In some cases, threads are scheduled on the same cpu, while other
       cpus are idle. This results in run-to-run inconsistency. For example:

         taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                                 --cpu-max-prime=100000 run

       Total execution time ranges from 25.1s to 33.5s depending on threads
       placement, where 25.1s is when all 8 threads are balanced properly
       on 8 cpus.

Introducing NUMA identity node sched domain, which is based on how
SRAT/SLIT table define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NODE      (span a logical NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
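
The domain hierarchy described in the commit message above can be illustrated with a small model. This is only a sketch, not kernel code. It assumes the full configuration from the commit message (2 sockets, 4 NUMA nodes per socket, 2 L3 caches per node, 4 cores per L3, 2 threads per core) and a hypothetical contiguous, thread-first CPU numbering. Real firmware may enumerate CPUs differently, and the names `LEVELS`, `domain_spans`, and `without_node_level` are invented for this example.

```python
# Hypothetical model of the topology in the commit message:
# 2 sockets x 4 NUMA nodes x 2 L3 caches x 4 cores x 2 SMT threads.
# ASSUMPTION: CPU ids are contiguous and thread-first; real systems
# may number sibling threads and cores differently.
THREADS_PER_CORE = 2
CORES_PER_L3 = 4
L3_PER_NODE = 2
NODES_PER_SOCKET = 4
SOCKETS = 2

# Build (name, cpus_spanned) pairs, innermost level first.
_smt = THREADS_PER_CORE
_mc = _smt * CORES_PER_L3
_node = _mc * L3_PER_NODE
_socket = _node * NODES_PER_SOCKET
_system = _socket * SOCKETS

LEVELS = [
    ("SMT", _smt),      # span a core
    ("MC", _mc),        # span a last-level cache
    ("NODE", _node),    # span a logical NUMA node (added by the commit)
    ("NUMA", _socket),  # span a socket: 4 nodes
    ("NUMA", _system),  # span a system: 8 nodes
]


def domain_spans(cpu, levels=LEVELS):
    """Return [(name, range_of_cpus)] for each domain containing `cpu`."""
    spans = []
    for name, width in levels:
        start = (cpu // width) * width  # align down to the level's width
        spans.append((name, range(start, start + width)))
    return spans


def without_node_level(levels=LEVELS):
    """Hierarchy as described before the commit: no NODE level."""
    return [lvl for lvl in levels if lvl[0] != "NODE"]


if __name__ == "__main__":
    for i, (name, span) in enumerate(domain_spans(0)):
        print(f"domain{i} {name:4s} cpus {span.start}-{span.stop - 1}")
```

Under these assumptions, CPU 0 sits in five nested domains spanning 2, 8, 16, 64, and 128 CPUs. Passing `without_node_level()` as `levels` drops the 16-CPU level, which mirrors the gap the commit message points out: nothing between the L3 span and the socket span represents a single logical NUMA node.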
..
autogroup.c	sched/autogroup: Fix error reporting printk text in autogroup_create()	2017-08-10 17:06:03 +02:00
autogroup.h	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/autogroup.h>	2017-03-02 08:42:28 +01:00
clock.c	sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable()	2017-05-24 09:10:00 +02:00
completion.c	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2017-09-04 11:52:29 -07:00
core.c	sched/fair: Use reweight_entity() for set_user_nice()	2017-09-29 19:35:14 +02:00
cpuacct.c	sched/cputime: Convert kcpustat to nsecs	2017-02-01 09:13:47 +01:00
cpuacct.h	sched/cpuacct: Simplify the cpuacct code	2016-03-21 11:00:28 +01:00
cpudeadline.c	sched/deadline: Change return value of cpudl_find()	2017-08-10 12:18:17 +02:00
cpudeadline.h	sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear()	2016-09-05 13:29:43 +02:00
cpufreq_schedutil.c	Merge branch 'pm-cpufreq-sched'	2017-09-04 00:05:22 +02:00
cpufreq.c	cpufreq / sched: Pass flags to cpufreq_update_util()	2016-08-16 22:14:55 +02:00
cpupri.c	sched/cpupri: Don't re-initialize 'struct cpupri'	2017-08-10 12:18:14 +02:00
cpupri.h
cputime.c	sched/cputime: Don't use smp_processor_id() in preemptible context	2017-07-14 10:27:15 +02:00
deadline.c	sched/deadline: Rename __dl_clear() to __dl_sub()	2017-10-10 11:45:26 +02:00
debug.c	sched/fair: Propagate an effective runnable_load_avg	2017-09-29 19:35:15 +02:00
fair.c	Merge branch 'sched/urgent' into sched/core, to pick up fixes	2017-10-10 11:30:59 +02:00
features.h	sched/core: Address more wake_affine() regressions	2017-10-10 10:14:03 +02:00
idle_task.c	sched/core: Add wrappers for lockdep_(un)pin_lock()	2017-01-14 11:29:30 +01:00
idle.c	sched/idle: Move quiet_vmstate() into the NOHZ code	2017-10-10 11:43:29 +02:00
loadavg.c	sched/loadavg: Generalize "_idle" naming to "_nohz"	2017-06-22 11:30:01 +02:00
Makefile	membarrier: Provide expedited private command	2017-08-17 07:28:05 -07:00
membarrier.c	membarrier: Provide expedited private command	2017-08-17 07:28:05 -07:00
rt.c	sched: cpufreq: Allow remote cpufreq callbacks	2017-08-01 14:24:53 +02:00
sched-pelt.h	sched/fair: Move the PELT constants into a generated header	2017-04-14 10:26:37 +02:00
sched.h	sched/deadline: Rename __dl_clear() to __dl_sub()	2017-10-10 11:45:26 +02:00
stats.c
stats.h	sched/headers: Move cputime functionality from <linux/sched.h> and <linux/cputime.h> into <linux/sched/cputime.h>	2017-03-03 01:45:22 +01:00
stop_task.c	sched/core: Add wrappers for lockdep_(un)pin_lock()	2017-01-14 11:29:30 +01:00
swait.c	sched/wait: Remove the lockless swait_active() check in swake_up*()	2017-08-10 12:28:53 +02:00
topology.c	sched/topology: Introduce NUMA identity node sched domain	2017-10-10 11:45:28 +02:00
wait_bit.c	sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming	2017-06-20 12:19:14 +02:00
wait.c	sched/wait: Introduce wakeup boomark in wake_up_page_bit	2017-09-14 09:56:18 -07:00