mirror of
https://github.com/torvalds/linux.git
synced 2024-11-10 14:11:52 +00:00
sched/numa: Fix mm numa_scan_seq based unconditional scan
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic"),
NUMA Balancing updates PTEs to trap NUMA hinting faults only if the
task has previously accessed the VMA. However, unconditional scanning of
VMAs is allowed during the initial phase after VMA creation, until the
process's mm numa_scan_seq reaches 2, even if the current task has not
accessed the VMA.
Rationale:
- Without an initial scan, subsequent PTE updates may never happen.
- Give every VMA a fair opportunity to be scanned, so that the access
pattern of every VMA can subsequently be learned.
But this has a corner case: if a VMA is created some time after the
process starts, the process's mm numa_scan_seq may already be 2 or greater.
For example, the values of mm numa_scan_seq at VMA creation time, observed
while briefly running the mmtests autonuma benchmark, look like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
This results in no unconditional PTE updates for those VMAs created after
some time.
Fix:
- Record the initial value of mm numa_scan_seq in the per-VMA start_scan_seq.
- Allow unconditional scanning until numa_scan_seq reaches start_scan_seq + 2.
Result:
SUT: AMD EPYC Milan with 2 NUMA nodes and 256 CPUs.
Base kernel: upstream 6.6-rc6 with Mel's patches [1] applied.
kernbench
==========
                                base                  patched %gain
Amean elsp-128                  165.09 ( 0.00%)       164.78 * 0.19%*
Duration User                   41404.28              41375.08
Duration System                 9862.22               9768.48
Duration Elapsed                519.87                518.72
Ops NUMA PTE updates            1041416.00            831536.00
Ops NUMA hint faults            263296.00             220966.00
Ops NUMA pages migrated         258021.00             212769.00
Ops AutoNUMA cost               1328.67               1114.69
autonumabench
NUMA01_THREADLOCAL
==================
                                base                  patched %gain
Amean elsp-NUMA01_THREADLOCAL   81.79 (0.00%)         67.74 * 17.18%*
Duration User                   54832.73              47379.67
Duration System                 75.00                 185.75
Duration Elapsed                576.72                476.09
Ops NUMA PTE updates            394429.00             11121044.00
Ops NUMA hint faults            1001.00               8906404.00
Ops NUMA pages migrated         288.00                2998694.00
Ops AutoNUMA cost               7.77                  44666.84
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
parent d6111cf45c
commit 84db47ca71
@@ -600,6 +600,9 @@ struct vma_numab_state {
 	 */
 	unsigned long pids_active[2];
 
+	/* MM scan sequence ID when scan first started after VMA creation */
+	int start_scan_seq;
+
 	/*
 	 * MM scan sequence ID when the VMA was last completely scanned.
 	 * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
@@ -3164,7 +3164,7 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+	if ((READ_ONCE(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2)
 		return true;
 
 	pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
@@ -3307,6 +3307,8 @@ retry_pids:
 			if (!vma->numab_state)
 				continue;
 
+			vma->numab_state->start_scan_seq = mm->numa_scan_seq;
+
 			vma->numab_state->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 