linux/mm
Mel Gorman b795854b1f sched/numa: Set preferred NUMA node based on number of private faults
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality in the general case. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are a high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix complication error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:35 +02:00
..
backing-dev.c mm/backing-dev.c: check user buffer length before copying data to the related user buffer 2013-09-11 15:58:03 -07:00
balloon_compaction.c mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
bootmem.c mm: kill free_all_bootmem_node() 2013-07-03 16:07:39 -07:00
bounce.c mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored 2013-09-30 14:31:02 -07:00
cleancache.c mm: cleancache: clean up cleancache_enabled 2013-04-30 17:04:01 -07:00
compaction.c mm/compaction.c: periodically schedule when freeing pages 2013-09-30 14:31:01 -07:00
debug-pagealloc.c
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
failslab.c
filemap_xip.c lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
filemap.c mm: cleanup add_to_page_cache_locked() 2013-09-12 15:38:03 -07:00
fremap.c mm: save soft-dirty bits on file pages 2013-08-13 17:57:48 -07:00
frontswap.c frontswap: fix incorrect zeroing and allocation size for frontswap_map 2013-06-12 16:29:46 -07:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
hugetlb_cgroup.c cgroup: pass around cgroup_subsys_state instead of cgroup in file methods 2013-08-08 20:11:24 -04:00
hugetlb.c mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movable 2013-09-11 15:57:49 -07:00
hwpoison-inject.c mm/hwpoison: fix the lack of one reference count against poisoned page 2013-09-30 14:31:03 -07:00
init-mm.c
internal.h mm: vmscan: fix do_try_to_free_pages() livelock 2013-09-11 15:58:01 -07:00
interval_tree.c
Kconfig powerpc: Fix memory hotplug with sparse vmemmap 2013-10-03 17:21:38 +10:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
ksm.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
list_lru.c list_lru: dynamically adjust node arrays 2013-09-10 18:56:32 -04:00
maccess.c
madvise.c mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood 2013-09-30 14:31:02 -07:00
Makefile list: add a new LRU list type 2013-09-10 18:56:30 -04:00
memblock.c memblock, numa: binary search node id 2013-09-11 15:57:51 -07:00
memcontrol.c revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code" 2013-09-24 17:00:26 -07:00
memory_hotplug.c ACPI and power management fixes for 3.12-rc1 2013-09-12 11:22:45 -07:00
memory-failure.c mm/hwpoison: fix false report on 2nd attempt at page recovery 2013-09-30 14:31:02 -07:00
memory.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
mempolicy.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
mempool.c mm/mempool.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) 2013-09-11 15:58:14 -07:00
migrate.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
mincore.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
mlock.c mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration 2013-09-30 14:31:02 -07:00
mm_init.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
mmap.c mm/mmap: remove unnecessary assignment 2013-09-11 15:58:13 -07:00
mmu_context.c mm: remove old aio use_mm() comment 2013-05-07 18:38:27 -07:00
mmu_notifier.c treewide: relase -> release 2013-06-28 14:34:33 +02:00
mmzone.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
mprotect.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
mremap.c mm/mremap.c: call pud_free() after fail calling pmd_alloc() 2013-09-11 15:58:03 -07:00
msync.c
nobootmem.c mm: concentrate modification of totalram_pages into the mm core 2013-07-03 16:07:33 -07:00
nommu.c mm: remove free_area_cache 2013-07-10 18:11:34 -07:00
oom_kill.c mm: memcg: do not trap chargers with full callstack on OOM 2013-09-12 15:38:02 -07:00
page_alloc.c sched/numa: Set preferred NUMA node based on number of private faults 2013-10-09 12:40:35 +02:00
page_cgroup.c memcontrol: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:32 -08:00
page_io.c aio: Kill aio_rw_vect_retry() 2013-07-30 11:53:12 -04:00
page_isolation.c mm: memory-hotplug: enable memory hotplug to handle hugepage 2013-09-11 15:57:48 -07:00
page-writeback.c memcg: add per cgroup writeback pages accounting 2013-09-12 15:38:02 -07:00
pagewalk.c mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas 2013-05-24 16:22:53 -07:00
percpu-km.c
percpu-vm.c
percpu.c mm, percpu: Make sure percpu_alloc early parameter has an argument 2012-12-02 06:23:04 -08:00
pgtable-generic.c mm: move pgtable related functions to right place 2013-09-11 15:57:30 -07:00
process_vm_access.c Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys 2013-03-12 11:05:45 -07:00
quicklist.c
readahead.c readahead: make context readahead more conservative 2013-09-11 15:57:39 -07:00
rmap.c thp: account anon transparent huge pages into NR_ANON_PAGES 2013-09-12 15:38:03 -07:00
shmem.c initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabled 2013-09-11 15:59:37 -07:00
slab_common.c mm/sl[aou]b: Move kmallocXXX functions to common code 2013-09-04 20:51:33 +03:00
slab.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
slab.h memcg: check that kmem_cache has memcg_params before accessing it 2013-08-28 19:26:38 -07:00
slob.c mm/sl[aou]b: Move kmallocXXX functions to common code 2013-09-04 20:51:33 +03:00
slub.c Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2013-09-15 07:15:06 -04:00
sparse-vmemmap.c sparse-vmemmap: specify vmemmap population range in bytes 2013-04-29 15:54:35 -07:00
sparse.c mm/sparse: introduce alloc_usemap_and_memmap 2013-09-11 15:58:01 -07:00
swap_state.c lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt 2013-09-11 15:59:36 -07:00
swap.c mm: make lru_add_drain_all() selective 2013-09-12 15:38:02 -07:00
swapfile.c swap: make cluster allocation per-cpu 2013-09-11 15:57:17 -07:00
truncate.c truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
util.c swap: clean-up #ifdef in page_mapping() 2013-09-11 15:57:31 -07:00
vmalloc.c mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area 2013-09-11 15:58:02 -07:00
vmpressure.c Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-09-03 18:25:03 -07:00
vmscan.c mm: avoid reinserting isolated balloon pages into LRU lists 2013-09-30 14:31:02 -07:00
vmstat.c mm: vmscan: fix do_try_to_free_pages() livelock 2013-09-11 15:58:01 -07:00
zbud.c mm/zbud: fix some trivial typos in comments 2013-09-11 15:57:35 -07:00
zswap.c mm/zswap: use postorder iteration when destroying rbtree 2013-09-11 15:59:21 -07:00