linux/mm
Michal Hocko a49bd4d716 mm, numa: rework do_pages_move
Patch series "unclutter thp migration"

Motivation:

THP migration is hacked into the generic migration with rather
surprising semantic.  The migration allocation callback is supposed to
check whether the THP can be migrated at once and if that is not the
case then it allocates a simple page to migrate.  unmap_and_move then
fixes that up by splitting the THP into small pages while moving the
head page to the newly allocated order-0 page.  Remaining pages are
moved to the LRU list by split_huge_page.  The same happens if the THP
allocation fails.  This is really ugly and error prone [2].

I also believe that split_huge_page to the LRU lists is inherently wrong
because all tail pages are not migrated.  Some callers will just work
around that by retrying (e.g.  memory hotplug).  There are other pfn
walkers which are simply broken though.  e.g. madvise_inject_error will
migrate head and then advances next pfn by the huge page size.
do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
will simply split the THP before migration if the THP migration is not
supported then falls back to single page migration but it doesn't handle
tail pages if the THP migration path is not able to allocate a fresh THP
so we end up with ENOMEM and fail the whole migration which is a
questionable behavior.  Page compaction doesn't try to migrate large
pages so it should be immune.

The first patch reworks do_pages_move which relies on a very ugly
calling semantic when the return status is pushed to the migration path
via private pointer.  It uses pre allocated fixed size batching to
achieve that.  We simply cannot do the same if a THP is to be split
during the migration path which is done in the patch 3.  Patch 2 is
follow up cleanup which removes the mentioned return status calling
convention ugliness.

On a side note:

There are some semantic issues I have encountered on the way when
working on patch 1 but I am not addressing them here.  E.g. trying to
move THP tail pages will result in either success or EBUSY (the later
one more likely once we isolate head from the LRU list).  Hugetlb
reports EACCESS on tail pages.  Some errors are reported via status
parameter but migration failures are not even though the original
`reason' argument suggests there was an intention to do so.  From a
quick look into git history this never worked.  I have tried to keep the
semantic unchanged.

Then there is a relatively minor thing that the page isolation might
fail because of pages not being on the LRU - e.g. because they are
sitting on the per-cpu LRU caches.  Easily fixable.

This patch (of 3):

do_pages_move is supposed to move user defined memory (an array of
addresses) to the user defined numa nodes (an array of nodes one for
each address).  The user provided status array then contains resulting
numa node for each address or an error.  The semantic of this function
is little bit confusing because only some errors are reported back.
Notably migrate_pages error is only reported via the return value.  This
patch doesn't try to address these semantic nuances but rather change
the underlying implementation.

Currently we are processing user input (which can be really large) in
batches which are stored to a temporarily allocated page.  Each address
is resolved to its struct page and stored to page_to_node structure
along with the requested target numa node.  The array of these
structures is then conveyed down the page migration path via private
argument.  new_page_node then finds the corresponding structure and
allocates the proper target page.

What is the problem with the current implementation and why to change
it? Apart from being quite ugly it also doesn't cope with unexpected
pages showing up on the migration list inside migrate_pages path.  That
doesn't happen currently but the follow up patch would like to make the
thp migration code more clear and that would need to split a THP into
the list for some cases.

How does the new implementation work? Well, instead of batching into a
fixed size array we simply batch all pages that should be migrated to
the same node and isolate all of them into a linked list which doesn't
require any additional storage.  This should work reasonably well
because page migration usually migrates larger ranges of memory to a
specific node.  So the common case should work equally well as the
current implementation.  Even if somebody constructs an input where the
target numa nodes would be interleaved we shouldn't see a large
performance impact because page migration alone doesn't really benefit
from batching.  mmap_sem batching for the lookup is quite questionable
and isolate_lru_page which would benefit from batching is not using it
even in the current implementation.

Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
..
kasan slab, slub: skip unnecessary kasan_cache_shutdown() 2018-04-05 21:36:24 -07:00
backing-dev.c mm/vmscan: don't mess with pgdat->flags in memcg reclaim 2018-04-11 10:28:30 -07:00
balloon_compaction.c virtio_balloon: fix deadlock on OOM 2017-11-14 23:57:38 +02:00
bootmem.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
cleancache.c fs: switch ->s_uuid to uuid_t 2017-06-05 16:59:12 +02:00
cma_debug.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
cma.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
cma.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compaction.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
debug_page_ref.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
debug.c mm/debug.c: provide useful debugging information for VM_BUG 2018-01-04 16:45:09 -08:00
dmapool.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
early_ioremap.c mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep 2017-12-11 14:54:44 +01:00
fadvise.c mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64() 2018-04-02 20:16:10 +02:00
failslab.c mm: make should_failslab always available for fault injection 2018-04-05 21:36:26 -07:00
filemap.c mm/filemap.c: remove include of hardirq.h 2018-01-31 17:18:36 -08:00
frame_vector.c mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()' 2017-12-14 16:00:48 -08:00
frontswap.c mm, frontswap: convert frontswap_enabled to static key 2016-07-26 16:19:19 -07:00
gup_benchmark.c mm: add infrastructure for get_user_pages_fast() benchmarking 2017-11-17 16:10:04 -08:00
gup.c mm/gup.c: fix coding style issues. 2018-04-05 21:36:26 -07:00
highmem.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
hmm.c mm/hmm.c: remove superfluous RCU protection around radix tree lookup 2018-04-11 10:28:31 -07:00
huge_memory.c memcg, thp: do not invoke oom killer on thp charges 2018-04-11 10:28:31 -07:00
hugetlb_cgroup.c mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size 2016-05-20 17:58:30 -07:00
hugetlb.c mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct 2018-04-05 21:36:26 -07:00
hwpoison-inject.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
init-mm.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
internal.h mm, numa: rework do_pages_move 2018-04-11 10:28:32 -07:00
interval_tree.c mm/interval_tree.c: use vma_pages() helper 2018-01-31 17:18:37 -08:00
Kconfig treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
Kconfig.debug kmemcheck: rip it out 2017-11-15 18:21:05 -08:00
khugepaged.c memcg, thp: do not invoke oom killer on thp charges 2018-04-11 10:28:31 -07:00
kmemleak-test.c
kmemleak.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
ksm.c mm/ksm.c: fix inconsistent accounting of zero pages 2018-04-11 10:28:31 -07:00
list_lru.c mm: make counting of list_lru_one::nr_items lockless 2018-04-05 21:36:27 -07:00
maccess.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
madvise.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
Makefile mm/swap_slots.c: use conditional compilation 2018-04-05 21:36:24 -07:00
memblock.c powerpc updates for 4.17 2018-04-07 12:08:19 -07:00
memcontrol.c memcg: fix per_node_info cleanup 2018-04-11 10:28:32 -07:00
memory_hotplug.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
memory-failure.c mm: hwpoison: disable memory error handling on 1GB hugepage 2018-04-05 21:36:25 -07:00
memory.c mm: swap: unify cluster-based and vma-based swap readahead 2018-04-05 21:36:25 -07:00
mempolicy.c mm, numa: rework do_pages_move 2018-04-11 10:28:32 -07:00
mempool.c kasan: detect invalid frees for large mempool objects 2018-02-06 18:32:43 -08:00
memtest.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
migrate.c mm, numa: rework do_pages_move 2018-04-11 10:28:32 -07:00
mincore.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mlock.c mm, mlock, vmscan: no more skipping pagevecs 2018-02-21 15:35:42 -08:00
mm_init.c
mmap.c mm: always print RLIMIT_DATA warning 2018-04-05 21:36:24 -07:00
mmu_context.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
mmu_notifier.c mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks 2018-01-31 17:18:38 -08:00
mmzone.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mprotect.c sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages 2018-04-11 10:28:31 -07:00
mremap.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
msync.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
nobootmem.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
nommu.c mm/nommu: remove description of alloc_vm_area 2018-04-05 21:36:26 -07:00
oom_kill.c mm,oom_reaper: check for MMF_OOM_SKIP before complaining 2018-04-05 21:36:27 -07:00
page_alloc.c mm: treat indirectly reclaimable memory as available in MemAvailable 2018-04-11 10:28:29 -07:00
page_counter.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
page_ext.c mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it 2018-01-31 17:18:39 -08:00
page_idle.c mm: thp: fix potential clearing to referenced flag in page_idle_clear_pte_refs_one() 2018-04-05 21:36:25 -07:00
page_io.c block: convert to bio_first_bvec_all & bio_first_page_all 2018-01-06 09:18:00 -07:00
page_isolation.c mm/page_isolation.c: make start_isolate_page_range() fail if already isolated 2018-04-05 21:36:27 -07:00
page_owner.c mm/page_owner.c: make early_page_owner_param() __init 2018-04-05 21:36:26 -07:00
page_poison.c mm/page_poison.c: make early_page_poison_param() __init 2018-04-05 21:36:26 -07:00
page_vma_mapped.c mm, page_vma_mapped: Introduce pfn_in_hpage() 2018-01-22 12:15:57 -08:00
page-writeback.c Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" 2017-11-29 18:40:43 -08:00
pagewalk.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
percpu-internal.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
percpu-km.c percpu: allow select gfp to be passed to underlying allocators 2018-02-18 05:33:01 -08:00
percpu-stats.c mm: reuse DEFINE_SHOW_ATTRIBUTE() macro 2018-04-05 21:36:25 -07:00
percpu-vm.c percpu: allow select gfp to be passed to underlying allocators 2018-02-18 05:33:01 -08:00
percpu.c arch: remove obsolete architecture ports 2018-04-02 20:20:12 -07:00
pgtable-generic.c mm: do not lose dirty and accessed bits in pmdp_invalidate() 2018-01-31 17:18:38 -08:00
process_vm_access.c mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors 2018-02-06 18:32:48 -08:00
quicklist.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
readahead.c mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead() 2018-04-02 20:16:12 +02:00
rmap.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
rodata_test.c mm: fix RODATA_TEST failure "rodata_test: test data was not read only" 2017-10-03 17:54:24 -07:00
shmem.c mm: swap: unify cluster-based and vma-based swap readahead 2018-04-05 21:36:25 -07:00
slab_common.c mm: make should_failslab always available for fault injection 2018-04-05 21:36:26 -07:00
slab.c slab, slub: skip unnecessary kasan_cache_shutdown() 2018-04-05 21:36:24 -07:00
slab.h slab, slub: skip unnecessary kasan_cache_shutdown() 2018-04-05 21:36:24 -07:00
slob.c slab, slub, slob: add slab_flags_t 2017-11-15 18:21:01 -08:00
slub.c slab, slub: skip unnecessary kasan_cache_shutdown() 2018-04-05 21:36:24 -07:00
sparse-vmemmap.c mm: merge vmem_altmap_alloc into altmap_alloc_block_buf 2018-01-08 11:46:23 -08:00
sparse.c mm/memory_hotplug: optimize memory hotplug 2018-04-05 21:36:25 -07:00
swap_cgroup.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
swap_slots.c mm/swap_slots.c: use conditional compilation 2018-04-05 21:36:24 -07:00
swap_state.c mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static 2018-04-05 21:36:27 -07:00
swap.c mm/swap.c: remove @cold parameter description for release_pages() 2018-04-05 21:36:26 -07:00
swapfile.c mm/swapfile.c: make pointer swap_avail_heads static 2018-04-11 10:28:32 -07:00
truncate.c mm: add unmap_mapping_pages() 2018-01-31 17:18:37 -08:00
usercopy.c usercopy: WARN() on slab cache usercopy region violations 2018-01-15 12:07:48 -08:00
userfaultfd.c mm/userfaultfd.c: remove duplicate include 2018-02-06 18:32:47 -08:00
util.c mm: treat indirectly reclaimable memory as free in overcommit logic 2018-04-11 10:28:29 -07:00
vmacache.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
vmalloc.c vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems 2018-02-21 15:35:43 -08:00
vmpressure.c mm, vmpressure: pass-through notification support 2017-07-10 16:32:31 -07:00
vmscan.c mm: memcg: make sure memory.events is uptodate when waking pollers 2018-04-11 10:28:31 -07:00
vmstat.c mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES 2018-04-11 10:28:29 -07:00
workingset.c mm, truncate: do not check mapping for every page being truncated 2017-11-15 18:21:06 -08:00
z3fold.c mm/z3fold.c: use gfpflags_allow_blocking 2018-04-11 10:28:31 -07:00
zbud.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
zpool.c mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc 2018-02-21 15:35:43 -08:00
zsmalloc.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
zswap.c mm, swap, frontswap: fix THP swap if frontswap enabled 2018-02-21 15:35:43 -08:00