linux/mm
Donet Tom 133d04b1ee mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
commit bda420b985 ("numa balancing: migrate on fault among multiple
bound nodes") added support for migrate on protnone reference with
MPOL_BIND memory policy.  This allowed numa fault migration when the
executing node is part of the policy mask for MPOL_BIND.  This patch
extends migration support to MPOL_PREFERRED_MANY policy.

Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
MPOL_F_NUMA_BALANCING.  This causes issues when we want to use
NUMA_BALANCING_MEMORY_TIERING.  To effectively use the slow memory tier,
the kernel should not allocate pages from the slower memory tier via
allocation control zonelist fallback.  Instead, we should move cold pages
from the faster memory node via memory demotion.  For a page allocation,
kswapd is only woken up after we try to allocate pages from all nodes in
the allocation zone list.  This implies that, without using memory
policies, we will end up allocating hot pages in the slower memory tier.

MPOL_PREFERRED_MANY was added by commit b27abaccf8 ("mm/mempolicy: add
MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
allocation control when we have memory tiers in the system.  With
MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
of faster memory nodes.  When we fail to allocate pages from the faster
memory node, kswapd would be woken up, allowing demotion of cold pages to
slower memory nodes.

With the current kernel, such usage of memory policies implies we can't do
page promotion from a slower memory tier to a faster memory tier using
numa fault.  This patch fixes this issue.

For MPOL_PREFERRED_MANY, if the executing node is in the policy node mask,
we allow numa migration to the executing nodes.  If the executing node is
not in the policy node mask, we do not allow numa migration.

Example:
On a 2-sockets system, NUMA node N0, N1 and N2 are in socket 0,
N3 in socket 1. N0, N1 and N3 have fast memory and CPU, while
N2 has slow memory and no CPU.  For a workload, we may use
MPOL_PREFERRED_MANY with nodemask N0 and N1 set because the workload
runs on CPUs of socket 0 at most times. Then, even if the workload
runs on CPUs of N3 occasionally, we will not try to migrate the workload
pages from N2 to N3 because users may want to avoid cross-socket access
as much as possible in the long term.

In below table, Process is the Process executing node and
Curr Loc Pgs is the numa node where page present(folio node)
===========================================================
Process  Policy  Curr Loc Pgs     Observation
-----------------------------------------------------------
N0       N0 N1      N1         Pages Migrated from N1 to N0
N0       N0 N1      N2         Pages Migrated from N2 to N0
N0       N0 N1      N3	       Pages Migrated from N3 to N0

N3       N0 N1      N0         Pages NOT Migrated  to N3
N3       N0 N1      N1         Pages NOT Migrated  to N3
N3       N0 N1      N2	       Pages NOT Migrated  to N3
------------------------------------------------------------

Link: https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:49 -07:00
..
damon mm: madvise: pageout: ignore references rather than clearing young 2024-03-04 17:01:18 -08:00
kasan - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
kfence KFENCE: cleanup kfence_guarded_alloc() after CONFIG_SLAB removal 2023-12-05 11:17:58 +01:00
kmsan mm: kmsan: remove runtime checks from kmsan_unpoison_memory() 2024-02-22 10:24:41 -08:00
backing-dev.c vfs-6.9.misc 2024-03-11 09:38:17 -07:00
balloon_compaction.c
bootmem_info.c bootmem: use kmemleak_free_part_phys in put_page_bootmem 2023-10-25 16:47:13 -07:00
cma_debug.c
cma_sysfs.c mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
cma.c mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
cma.h mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
compaction.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
debug_page_alloc.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: fix BUG_ON with pud advanced test 2024-02-23 17:27:13 -08:00
debug.c mm: make dump_page() take a const argument 2024-03-06 13:04:18 -08:00
dmapool_test.c
dmapool.c mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs 2023-12-05 11:17:58 +01:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c
filemap.c mm: cachestat: fix two shmem bugs 2024-03-26 11:07:20 -07:00
folio-compat.c mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictable 2023-12-29 11:58:27 -08:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h
gup.c mm/treewide: replace pXd_huge() with pXd_leaf() 2024-04-25 20:55:46 -07:00
highmem.c x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs 2023-12-29 12:22:28 -08:00
hmm.c mm/treewide: replace pXd_huge() with pXd_leaf() 2024-04-25 20:55:46 -07:00
huge_memory.c mm/mempolicy: use numa_node_id() instead of cpu_to_node() 2024-04-25 20:55:48 -07:00
hugetlb_cgroup.c mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER 2023-10-18 14:34:17 -07:00
hugetlb_vmemmap.c mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range() 2023-12-12 10:57:08 -08:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hugetlb.c mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() 2024-04-25 10:07:27 -07:00
hwpoison-inject.c
init-mm.c mm: Deprecate pasid field 2023-12-12 10:11:32 +01:00
internal.h mm/mempolicy: use numa_node_id() instead of cpu_to_node() 2024-04-25 20:55:48 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
Kconfig Kbuild updates for v6.9 2024-03-21 14:41:00 -07:00
Kconfig.debug mm/slub: unify all sl[au]b parameters with "slab_$param" 2024-01-22 10:31:08 +01:00
khugepaged.c mm: convert free_swap_cache() to take a folio 2024-03-04 17:01:26 -08:00
kmemleak.c kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers 2023-12-12 10:57:07 -08:00
ksm.c mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]() 2023-12-29 11:58:56 -08:00
list_lru.c mm/zswap: stop lru list shrinking when encounter warm region 2024-02-22 10:24:54 -08:00
maccess.c
madvise.c mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE) 2024-04-25 20:55:43 -07:00
Makefile kbuild: make -Woverride-init warnings more consistent 2024-03-31 11:32:26 +09:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c cxl fixes for 6.8-rc6 2024-02-24 15:53:40 -08:00
memcontrol.c mm: memcg: add NULL check to obj_cgroup_put() 2024-04-25 20:55:43 -07:00
memfd.c mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins() 2024-03-04 17:01:21 -08:00
memory_hotplug.c mm/memory_hotplug: export mhp_supports_memmap_on_memory() 2024-02-22 10:24:40 -08:00
memory-failure.c mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled 2024-04-16 15:39:50 -07:00
memory-tiers.c mm/demotion: print demotion targets 2024-02-22 10:24:55 -08:00
memory.c mm/mempolicy: use numa_node_id() instead of cpu_to_node() 2024-04-25 20:55:48 -07:00
mempolicy.c mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy 2024-04-25 20:55:49 -07:00
mempool.c mempool: kvmalloc pool 2024-03-13 18:38:13 -04:00
memremap.c mm: remove stale example from comment 2023-12-29 11:58:26 -08:00
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-03-13 12:12:21 -07:00
migrate_device.c mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]() 2023-12-29 11:58:56 -08:00
migrate.c merge mm-hotfixes-stable into mm-nonmm-stable to pick up stackdepot changes 2024-02-23 17:28:43 -08:00
mincore.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
mlock.c mm: make folios_put() the basis of release_pages() 2024-03-04 17:01:22 -08:00
mm_init.c Author: Gang Li padata: dispatch works on 2024-03-06 13:04:17 -08:00
mm_slot.h
mmap_lock.c
mmap.c RISC-V Patches for the 6.9 Merge Window 2024-03-22 10:41:13 -07:00
mmu_gather.c mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing 2024-02-22 15:27:17 -08:00
mmu_notifier.c mmu_notifiers: rename invalidate_range notifier 2023-08-18 10:12:41 -07:00
mmzone.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
mprotect.c mprotect: use pfn_swap_entry_folio 2024-02-21 16:00:03 -08:00
mremap.c mm: abstract VMA merge and extend into vma_merge_extend() helper 2023-10-18 14:34:18 -07:00
msync.c
nommu.c mm/vmalloc: remove vmap_area_list 2024-02-23 17:48:19 -08:00
oom_kill.c mm: update mark_victim tracepoints fields 2024-03-04 17:01:16 -08:00
page_alloc.c mm: page_alloc: control latency caused by zone PCP draining 2024-04-25 20:55:44 -07:00
page_counter.c
page_ext.c mm/page_ext: move functions around for minor cleanups to page_ext 2023-08-18 10:12:31 -07:00
page_idle.c
page_io.c zswap: memcontrol: implement zswap writeback disabling 2023-12-29 20:22:11 -08:00
page_isolation.c mm: add alloc_contig_migrate_range allocation statistics 2024-03-04 17:01:27 -08:00
page_owner.c mm,page_owner: defer enablement of static branch 2024-04-16 15:39:51 -07:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
page_reporting.h
page_table_check.c mm: convert page_table_check_pte_set() to page_table_check_ptes_set() 2023-08-24 16:20:18 -07:00
page_vma_mapped.c mm: thp: introduce multi-size THP sysfs interface 2023-12-20 14:48:12 -08:00
page-writeback.c writeback: remove a use of write_cache_pages() from do_writepages() 2024-02-23 17:48:38 -08:00
pagewalk.c mm: pagewalk: assert write mmap lock only for walking the user page tables 2023-12-10 16:51:53 -08:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: Introduce flush_cache_vmap_early() 2023-12-14 00:23:17 -08:00
pgalloc-track.h
pgtable-generic.c mm/pgtable: notes on pte_offset_map[_lock]() 2023-08-18 10:12:25 -07:00
process_vm_access.c mm: fix process_vm_rw page counts 2023-12-10 16:51:39 -08:00
ptdump.c mm: ptdump: add check_wx_pages debugfs attribute 2024-02-22 10:24:47 -08:00
readahead.c mm: support order-1 folios in the page cache 2024-03-04 17:01:19 -08:00
rmap.c rmap: replace two calls to compound_order with folio_order 2024-02-22 15:27:20 -08:00
rodata_test.c
secretmem.c mm/secretmem: use a folio in secretmem_fault() 2023-08-21 13:38:02 -07:00
shmem_quota.c tmpfs: fix race on handling dquot rbtree 2024-03-26 11:07:23 -07:00
shmem.c mm/shmem: inline shmem_is_huge() for disabled transparent hugepages 2024-04-16 15:39:51 -07:00
show_mem.c mm, treewide: introduce NR_PAGE_ORDERS 2024-01-08 15:27:15 -08:00
shrinker_debug.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shrinker.c mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() 2024-01-05 09:58:32 -08:00
shuffle.c
shuffle.h mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
slab_common.c - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
slab.h Merge branch 'slab/for-6.9/slab-flag-cleanups' into slab/for-linus 2024-03-12 10:16:56 +01:00
slub.c Merge branch 'slab/for-6.9/slab-flag-cleanups' into slab/for-linus 2024-03-12 10:16:56 +01:00
sparse-vmemmap.c mm/vmemmap: allow architectures to override how vmemmap optimization works 2023-08-18 10:12:53 -07:00
sparse.c mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers 2024-02-21 16:00:01 -08:00
swap_cgroup.c
swap_slots.c mm/zswap: invalidate zswap entry when swap entry free 2024-02-22 10:24:54 -08:00
swap_state.c mm: convert free_swap_cache() to take a folio 2024-03-04 17:01:26 -08:00
swap.c mm: fix list corruption in put_pages_list 2024-03-12 13:07:16 -07:00
swap.h mm/swap: fix race when skipping swapcache 2024-02-20 14:20:48 -08:00
swapfile.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
truncate.c fs: convert error_remove_page to error_remove_folio 2023-12-10 16:51:42 -08:00
usercopy.c
userfaultfd.c userfaultfd: fix deadlock warning when locking src and dst VMAs 2024-03-26 11:07:23 -07:00
util.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
vmalloc.c mm: vmalloc: fix lockdep warning 2024-04-05 11:21:30 -07:00
vmpressure.c eventfd: simplify eventfd_signal() 2023-11-28 14:08:38 +01:00
vmscan.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
vmstat.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
workingset.c mm: move mapping_set_update out of <linux/swap.h> 2024-02-21 11:36:50 +05:30
z3fold.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zbud.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zpool.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zsmalloc.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zswap.c mm: zswap: remove unnecessary check in zswap_find_zpool() 2024-04-25 20:55:48 -07:00