sched/topology: Introduce NUMA identity node sched domain
On an AMD Family17h-based (EPYC) system, a logical NUMA node can contain
up to 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain up to 4 NUMA nodes, and a system can support
up to 2 sockets. With the full system configuration, the current
scheduler creates 4 sched domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent the cpus spanning a logical
NUMA node. With this hierarchy of sched domains, the scheduler does
not balance properly in the following cases:

Case1:
 When running 8 tasks, a properly balanced system should
 schedule a task per logical NUMA node. This is not the case for
 the current scheduler.

Case2:
 In some cases, threads are scheduled on the same cpu, while other
 cpus are idle. This results in run-to-run inconsistency. For example:

   taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                           --cpu-max-prime=100000 run

 Total execution time ranges from 25.1s to 33.5s depending on thread
 placement, where 25.1s is when all 8 threads are balanced properly
 on 8 cpus.

Introduce a NUMA identity node sched domain, which is based on how the
SRAT/SLIT tables define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NODE      (span a logical NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
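
To make the span of each domain in the new hierarchy concrete, here is a minimal user-space sketch. It is not kernel code. It models only the fully populated system described above (2 threads per core, 4 cores per L3, 2 L3 caches per node, 4 nodes per socket, 2 sockets) and assumes, purely for illustration, that CPUs are numbered contiguously with SMT siblings adjacent; real CPU enumeration may differ. The names `fanout`, `span_cpus` and `span_first_cpu` are hypothetical.

```c
#include <assert.h>

/* Hypothetical model of the five domain levels after this change. */
enum level { LVL_SMT, LVL_MC, LVL_NODE, LVL_SOCKET, LVL_SYSTEM, LVL_COUNT };

/* How many units of the level below make up one unit of each level. */
const int fanout[LVL_COUNT] = {
	2,	/* SMT:    threads per core */
	4,	/* MC:     cores per last-level cache */
	2,	/* NODE:   L3 caches per logical NUMA node */
	4,	/* NUMA:   nodes per socket */
	2,	/* NUMA:   sockets per system */
};

/* Number of CPUs spanned by one domain at level 'lvl'. */
int span_cpus(int lvl)
{
	int n = 1;

	assert(lvl >= 0 && lvl < LVL_COUNT);
	for (int i = 0; i <= lvl; i++)
		n *= fanout[i];
	return n;
}

/*
 * First CPU of the level-'lvl' domain that contains 'cpu', under the
 * contiguous-numbering assumption stated above.
 */
int span_first_cpu(int cpu, int lvl)
{
	int n = span_cpus(lvl);

	assert(cpu >= 0 && cpu < span_cpus(LVL_SYSTEM));
	return cpu - cpu % n;
}
```

Under these assumptions, `span_cpus(LVL_NODE)` gives the 16 threads of one logical NUMA node, which is the span that had no domain of its own before this change.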
commit 051f3ca02e
parent ed4ad1ca08
@@ -1332,6 +1332,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1379,8 +1383,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1441,10 +1444,19 @@ void sched_init_numa(void)
 	for (i = 0; sched_domain_topology[i].mask; i++)
 		tl[i] = sched_domain_topology[i];
 
+	/*
+	 * Add the NUMA identity distance, aka single NODE.
+	 */
+	tl[i++] = (struct sched_domain_topology_level){
+		.mask = sd_numa_mask,
+		.numa_level = 0,
+		SD_INIT_NAME(NODE)
+	};
+
 	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
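
The effect of the hunks above can be sketched outside the kernel. The following is a simplified user-space model, not the kernel implementation: it collects the unique distances of a node distance table with the identity distance stored at level 0, then builds a list of topology levels with one NODE entry followed by NUMA entries starting at `j = 1`. The types and functions (`struct topo_level`, `unique_distances`, `build_levels`) are hypothetical, and the model assumes the identity distance `dist[0]` is the smallest value in the table.

```c
#include <assert.h>

#define MAX_LEVELS 16

enum kind { KIND_BASE, KIND_NODE, KIND_NUMA };

/* Hypothetical stand-in for struct sched_domain_topology_level. */
struct topo_level {
	enum kind kind;
	int numa_level;		/* -1 for non-NUMA (base) levels */
};

/*
 * Store the unique values of the n x n table 'dist' in ascending order.
 * out[0] is the identity distance (node 0 to itself), so the returned
 * count includes the identity level.
 */
int unique_distances(const int *dist, int n, int *out)
{
	int level = 0;
	int curr = dist[0];

	assert(n > 0);
	out[level++] = curr;	/* identity distance at level 0 */

	for (;;) {
		int next = curr;

		/* Find the smallest distance strictly greater than curr. */
		for (int k = 0; k < n * n; k++) {
			int d = dist[k];

			if (d > curr && (next == curr || d < next))
				next = d;
		}
		if (next == curr)
			break;	/* no larger distance left */

		assert(level < MAX_LEVELS);
		out[level++] = next;
		curr = next;
	}
	return level;
}

/*
 * Copy 'nbase' base levels (e.g. SMT, MC), add one NODE level for the
 * identity distance, then append NUMA levels 1..level-1.
 * Returns the total number of levels written to 'tl'.
 */
int build_levels(int nbase, int level, struct topo_level *tl)
{
	int i, j;

	for (i = 0; i < nbase; i++)
		tl[i] = (struct topo_level){ .kind = KIND_BASE, .numa_level = -1 };

	tl[i++] = (struct topo_level){ .kind = KIND_NODE, .numa_level = 0 };

	for (j = 1; j < level; i++, j++)
		tl[i] = (struct topo_level){ .kind = KIND_NUMA, .numa_level = j };

	return i;
}
```

For a table with three distinct distances (local, same socket, other socket) and two base levels, this model yields five levels: two base levels, NODE, and two NUMA levels, which mirrors the domain0 to domain4 hierarchy listed in the commit message.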