linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-27 05:11:48 +00:00

Vlastimil Babka accf62422b mm, kswapd: replace kswapd compaction with waking up kcompactd

Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and compaction to attempt making memory allocation of given order available. The details differ from direct reclaim e.g. in having high watermark as a goal. The code involved in kswapd's reclaim/compaction decisions has evolved to be quite complex. Testing reveals that it doesn't actually work in at least one scenario, and closer inspection suggests that it could be greatly simplified without compromising on the goal (make high-order page available) or efficiency (don't reclaim too much). The simplification relies on doing all compaction in kcompactd, which is simply woken up when high watermarks are reached by kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests test stress-highalloc configured to attempt order-9 allocations without direct reclaim, just waking up kswapd. There was no compaction attempt from kswapd during the whole test. Some added instrumentation shows what happens:

- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it merely checks if high watermarks were reached for base pages. This is true, so no reclaim is attempted. For DMA, testorder=0 wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no compaction happens due to the condition sc.nr_reclaimed > nr_attempted being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0; pgdat_balanced() is false as only the small zone DMA appears balanced (curiously in that check, watermark appears OK and compaction_suitable() returns COMPACT_PARTIAL, because a lower classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0) is also false. The condition really should use >= as the comment suggests. Then there is a mismatch in the check for setting pgdat_needs_compaction to false using low watermark, while the rest uses high watermark, and who knows what other subtlety. Hopefully this demonstrates that this is unsustainable.

Luckily we can simplify this a lot. The reclaim/compaction decisions make sense for direct reclaim scenario, but in kswapd, our primary goal is to reach high watermark in order-0 pages. Afterwards we can attempt compaction just once. Unlike direct reclaim, we don't reclaim extra pages (over the high watermark), the current code already disallows it for good reasons.

After this patch, we simply wake up kcompactd to process the pgdat, after we have either succeeded or failed to reach the high watermarks in kswapd, which goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply the same criteria to determine which zones are worth compacting. Note that we use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which can include higher zones that kswapd tried to balance too, but didn't consider them in pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to adjust how it determines the zones to be balanced. The key element here is adding a "highorder" parameter to zone_balanced, which, when set to false, makes it consider only order-0 watermark instead of the desired higher order (this was done previously by kswapd_shrink_zone(), but not elsewhere). This false is passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true to make sure kswapd and thus kcompactd are woken up for a high-order allocation failure.

The last thing is to decide what to do with pageblock_skip bitmap handling. Compaction maintains a pageblock_skip bitmap to record pageblocks where isolation recently failed. This bitmap can be reset in three ways:

1) direct compaction is restarting after going through the full deferred cycle
2) kswapd goes to sleep, and some other direct compaction has previously finished scanning the whole zone and set zone->compact_blockskip_flush. Note that a successful direct compaction clears this flag.
3) compaction was invoked manually via trigger in /proc

Case 2) is somewhat fuzzy to begin with, but after introducing kcompactd we should update it. The check for direct compaction in 1), and to set the flush flag in 2) use current_is_kswapd(), which doesn't work for kcompactd. Thus, this patch adds bool direct_compaction to compact_control to use in 2). For case 1) we remove the check completely - unlike the former kswapd compaction, kcompactd does use the deferred compaction functionality, so flushing tied to restarting from deferred compaction makes sense here.

Note that when kswapd goes to sleep, kcompactd is woken up, so it will see the flushed pageblock_skip bits. This is different from when the former kswapd compaction observed the bits and I believe it makes more sense. Kcompactd can afford to be more thorough than a direct compaction trying to limit allocation latency, or kswapd whose primary goal is to reclaim.
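The effect of the "highorder" parameter described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel's code: struct toy_zone, TOY_NR_ORDERS, toy_free_pages() and toy_zone_balanced() are hypothetical names, and the real zone_balanced() relies on the kernel's watermark helpers rather than the simplified checks used here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: number of free-list orders tracked by this toy model. */
#define TOY_NR_ORDERS 11

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct zone. */
struct toy_zone {
	unsigned long high_wmark;             /* high watermark, in base pages */
	unsigned long nr_free[TOY_NR_ORDERS]; /* number of free blocks per order */
};

/* Total free base pages, summed over all orders. */
static unsigned long toy_free_pages(const struct toy_zone *z)
{
	unsigned long total = 0;

	for (int o = 0; o < TOY_NR_ORDERS; o++)
		total += z->nr_free[o] << o;
	return total;
}

/*
 * Toy balance check. With highorder == false only the order-0 high
 * watermark is considered. With highorder == true a free block of at
 * least the requested order must exist as well.
 */
static bool toy_zone_balanced(const struct toy_zone *z, int order, bool highorder)
{
	assert(order >= 0 && order < TOY_NR_ORDERS);

	if (toy_free_pages(z) < z->high_wmark)
		return false;
	if (!highorder || order == 0)
		return true;

	for (int o = order; o < TOY_NR_ORDERS; o++)
		if (z->nr_free[o] > 0)
			return true;
	return false;
}
```

In this model, a fragmented zone with 2000 free order-0 blocks, 500 free order-1 blocks and a high watermark of 1000 pages counts as balanced when highorder is false (the pgdat_balanced()-style check, after which kswapd can sleep), but not for an order-9 request when highorder is true (the wakeup_kswapd()-style check, so a high-order failure still triggers a wakeup).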
For testing, I used stress-highalloc configured to do order-9 allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as usual):

stress-highalloc
                        4.5-rc1+before      4.5-rc1+after
                             -nodirect          -nodirect
Success 1 Min           1.00 ( 0.00%)       5.00 (-66.67%)
Success 1 Mean          1.40 ( 0.00%)       6.20 (-55.00%)
Success 1 Max           2.00 ( 0.00%)       7.00 (-16.67%)
Success 2 Min           1.00 ( 0.00%)       5.00 (-66.67%)
Success 2 Mean          1.80 ( 0.00%)       6.40 (-52.38%)
Success 2 Max           3.00 ( 0.00%)       7.00 (-16.67%)
Success 3 Min          34.00 ( 0.00%)      62.00 ( 1.59%)
Success 3 Mean         41.80 ( 0.00%)      63.80 ( 1.24%)
Success 3 Max          53.00 ( 0.00%)      65.00 ( 2.99%)

User                         3166.67            3181.09
System                       1153.37            1158.25
Elapsed                      1768.53            1799.37

                              4.5-rc1+before   4.5-rc1+after
                                   -nodirect       -nodirect
Direct pages scanned                   32938           32797
Kswapd pages scanned                 2183166         2202613
Kswapd pages reclaimed               2152359         2143524
Direct pages reclaimed                 32735           32545
Percentage direct scans                   1%              1%
THP fault alloc                          579             612
THP collapse alloc                       304             316
THP splits                                 0               0
THP fault fallback                       793             778
THP collapse fail                         11              16
Compaction stalls                       1013            1007
Compaction success                        92              67
Compaction failures                      920             939
Page migrate success                  238457          721374
Page migrate failure                   23021           23469
Compaction pages isolated             504695         1479924
Compaction migrate scanned            661390         8812554
Compaction free scanned             13476658        84327916
Compaction cost                          262             838

After this patch we see improvements in allocation success rate (especially for phase 3) along with increased compaction activity. The compaction stalls (direct compaction) in the interfering kernel builds (probably THPs) also decreased somewhat thanks to kcompactd activity, yet THP alloc successes improved a bit.

Note that elapsed and user time aren't so useful for this benchmark, because of the background interference being unpredictable. They are just there to quickly spot some major unexpected differences. System time is somewhat more useful and that didn't increase.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake                    2547781         2269241
Time kcompactd awake                       0          119253
Time direct compacting                939937          557649
Time kswapd compacting                     0               0
Time kcompactd compacting                  0          119099

The decrease of overall time spent compacting appears to not match the increased compaction stats. I suspect the tasks get rescheduled and since the ftrace monitor doesn't see that, the reported time is wall time, not CPU time. But arguably direct compactors care about overall latency anyway, whether busy compacting or waiting for CPU doesn't matter. And that latency seems to have almost halved.

It's also interesting how much time kswapd spent awake just going through all the priorities and failing to even try compacting, over and over.

We can also configure stress-highalloc to perform both direct reclaim/compaction and wake up kswapd/kcompactd, by using GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                        4.5-rc1+before      4.5-rc1+after
                               -direct            -direct
Success 1 Min           4.00 ( 0.00%)       9.00 (-50.00%)
Success 1 Mean          8.00 ( 0.00%)      10.00 (-19.05%)
Success 1 Max          12.00 ( 0.00%)      11.00 ( 15.38%)
Success 2 Min           4.00 ( 0.00%)       9.00 (-50.00%)
Success 2 Mean          8.20 ( 0.00%)      10.00 (-16.28%)
Success 2 Max          13.00 ( 0.00%)      11.00 ( 8.33%)
Success 3 Min          75.00 ( 0.00%)      74.00 ( 1.33%)
Success 3 Mean         75.60 ( 0.00%)      75.20 ( 0.53%)
Success 3 Max          77.00 ( 0.00%)      76.00 ( 0.00%)

User                         3344.73            3246.04
System                       1194.24            1172.29
Elapsed                      1838.04            1836.76

                              4.5-rc1+before   4.5-rc1+after
                                     -direct         -direct
Direct pages scanned                  125146          120966
Kswapd pages scanned                 2119757         2135012
Kswapd pages reclaimed               2073183         2108388
Direct pages reclaimed                124909          120577
Percentage direct scans                   5%              5%
THP fault alloc                          599             652
THP collapse alloc                       323             354
THP splits                                 0               0
THP fault fallback                       806             793
THP collapse fail                         17              16
Compaction stalls                       2457            2025
Compaction success                       906             518
Compaction failures                     1551            1507
Page migrate success                 2031423         2360608
Page migrate failure                   32845           40852
Compaction pages isolated            4129761         4802025
Compaction migrate scanned          11996712        21750613
Compaction free scanned            214970969       344372001
Compaction cost                         2271            2694

In this scenario, this patch doesn't change the overall success rate as direct compaction already tries all it can. There is, however, a significant reduction in direct compaction stalls (that is, the number of allocations that went into direct compaction). The number of successes (i.e. direct compaction stalls that ended up with successful allocation) is reduced by the same number. This means the offload to kcompactd is working as expected, and direct compaction is reduced either due to detecting contention, or compaction deferred by kcompactd. In the previous version of this patchset there was some apparent reduction of success rate, but the changes in this version (such as using sync compaction only), new baseline kernel, and/or averaging results from 5 executions (my bet), made this go away.

Ftrace-based stats seem to roughly agree:

Time kswapd awake                    2532984         2326824
Time kcompactd awake                       0          257916
Time direct compacting                864839          735130
Time kswapd compacting                     0               0
Time kcompactd compacting                  0          257585

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-03-17 15:09:34 -07:00
..
kasan	kasan: add functions to clear stack poison	2016-03-09 15:43:42 -08:00
backing-dev.c	mm/backing-dev.c: fix error path in wb_init()	2016-02-11 18:35:48 -08:00
balloon_compaction.c	virtio_balloon: fix race between migration and ballooning	2016-01-12 20:47:06 +02:00
bootmem.c	x86/mm: Introduce max_possible_pfn	2015-12-06 12:46:31 +01:00
cleancache.c	cleancache: constify cleancache_ops structure	2016-01-27 09:09:57 -05:00
cma_debug.c	mm/cma_debug: correct size input to bitmap function	2015-07-17 16:39:54 -07:00
cma.c	mm/cma.c: suppress warning	2015-11-05 19:34:48 -08:00
cma.h	mm: cma: mark cma_bitmap_maxno() inline in header	2015-08-14 15:56:32 -07:00
compaction.c	mm, kswapd: replace kswapd compaction with waking up kcompactd	2016-03-17 15:09:34 -07:00
debug.c	mm, debug: move bad flags printing to bad_page()	2016-03-15 16:55:16 -07:00
dmapool.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
early_ioremap.c	mm/early_ioremap: use offset_in_page macro	2015-11-05 19:34:48 -08:00
fadvise.c	writeback: implement and use inode_congested()	2015-06-02 08:33:35 -06:00
failslab.c	mm: fault-inject take over bootstrap kmem_cache check	2016-03-15 16:55:16 -07:00
filemap.c	mm: remove unnecessary uses of lock_page_memcg()	2016-03-15 16:55:16 -07:00
frame_vector.c	mm: fix docbook comment for get_vaddr_frames()	2015-11-05 19:34:48 -08:00
frontswap.c	frontswap: allow multiple backends	2015-06-24 17:49:45 -07:00
gup.c	mm: retire GUP WARN_ON_ONCE that outlived its usefulness	2016-02-03 08:57:14 -08:00
highmem.c
huge_memory.c	thp: cleanup split_huge_page()	2016-03-15 16:55:16 -07:00
hugetlb_cgroup.c	mm: make compound_head() robust	2015-11-06 17:50:42 -08:00
hugetlb.c	mm/hugetlb: use EOPNOTSUPP in hugetlb sysctl handlers	2016-03-09 15:43:42 -08:00
hwpoison-inject.c	hwpoison: use page_cgroup_ino for filtering by memcg	2015-09-10 13:29:01 -07:00
init-mm.c
internal.h	mm, kswapd: replace kswapd compaction with waking up kcompactd	2016-03-17 15:09:34 -07:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
Kconfig	mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT	2016-02-05 18:10:40 -08:00
Kconfig.debug	mm/page_poisoning.c: allow for zero poisoning	2016-03-15 16:55:16 -07:00
kmemcheck.c	mm: kmemcheck skip object if slab allocation failed	2016-03-15 16:55:16 -07:00
kmemleak-test.c
kmemleak.c	Revert "gfp: add __GFP_NOACCOUNT"	2016-01-14 16:00:49 -08:00
ksm.c	mm/ksm.c: mark stable page dirty	2016-01-15 17:56:32 -08:00
list_lru.c	mm: memcontrol: move kmem accounting code to CONFIG_MEMCG	2016-01-20 17:09:18 -08:00
maccess.c	mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe	2015-11-05 19:34:48 -08:00
madvise.c	mm/madvise: update comment on sys_madvise()	2016-03-15 16:55:16 -07:00
Makefile	mm/page_poison.c: enable PAGE_POISONING as a separate option	2016-03-15 16:55:16 -07:00
memblock.c	mm/memblock.c: remove unnecessary memblock_type variable	2016-03-15 16:55:16 -07:00
memcontrol.c	mm: memcontrol: report kernel stack usage in cgroup2 memory.stat	2016-03-17 15:09:34 -07:00
memory_hotplug.c	mm, memory hotplug: small cleanup in online_pages()	2016-03-17 15:09:34 -07:00
memory-failure.c	mm/memory-failure.c: remove useless "undef"s	2016-03-15 16:55:16 -07:00
memory.c	Merge branch 'akpm' (patches from Andrew)	2016-03-16 11:51:08 -07:00
mempolicy.c	mm/mempolicy.c: skip VM_HUGETLB and VM_MIXEDMAP VMA for lazy mbind	2016-03-15 16:55:16 -07:00
mempool.c	mm/mempool: avoid KASAN marking mempool poison checks as use-after-free	2016-03-11 16:17:47 -08:00
memtest.c	memtest: remove unused header files	2015-09-08 15:35:28 -07:00
migrate.c	mm: migrate: consolidate mem_cgroup_migrate() calls	2016-03-15 16:55:16 -07:00
mincore.c	thp: change pmd_trans_huge_lock() interface to return ptl	2016-01-21 17:20:51 -08:00
mlock.c	mm: fix mlock accouting	2016-01-21 17:20:51 -08:00
mm_init.c	mm: meminit: remove mminit_verify_page_links	2015-06-30 19:44:56 -07:00
mmap.c	Linux 4.5-rc7	2016-03-07 09:27:30 +01:00
mmu_context.c
mmu_notifier.c	mmu-notifier: add clear_young callback	2015-09-10 13:29:01 -07:00
mmzone.c	mm/mmzone.c: memmap_valid_within() can be boolean	2016-01-14 16:00:49 -08:00
mprotect.c	mm, dax: check for pmd_none() after split_huge_pmd()	2016-02-11 18:35:48 -08:00
mremap.c	mm, dax: check for pmd_none() after split_huge_pmd()	2016-02-11 18:35:48 -08:00
msync.c	mm/msync: use offset_in_page macro	2015-11-05 19:34:48 -08:00
nobootmem.c	x86/mm: Introduce max_possible_pfn	2015-12-06 12:46:31 +01:00
nommu.c	kmemcg: account certain kmem allocations to memcg	2016-01-14 16:00:49 -08:00
oom_kill.c	mm: oom_kill: don't ignore oom score on exiting tasks	2016-03-17 15:09:34 -07:00
page_alloc.c	mm, compaction: introduce kcompactd	2016-03-17 15:09:34 -07:00
page_counter.c	mm: page_counter: let page_counter_try_charge() return bool	2015-11-05 19:34:48 -08:00
page_ext.c	mm/page_poisoning.c: allow for zero poisoning	2016-03-15 16:55:16 -07:00
page_idle.c	mm: add page_check_address_transhuge() helper	2016-01-15 17:56:32 -08:00
page_io.c	fs: use helper bio_add_page() instead of open coding on bi_io_vec	2015-08-13 12:32:00 -06:00
page_isolation.c	mm/page_isolation: do some cleanup in "undo_isolate_page_range"	2016-01-15 17:56:32 -08:00
page_owner.c	mm, page_owner: dump page owner info from dump_page()	2016-03-15 16:55:16 -07:00
page_poison.c	mm/page_poisoning.c: allow for zero poisoning	2016-03-15 16:55:16 -07:00
page-writeback.c	mm: remove unnecessary uses of lock_page_memcg()	2016-03-15 16:55:16 -07:00
pagewalk.c	thp: rename split_huge_page_pmd() to split_huge_pmd()	2016-01-15 17:56:32 -08:00
percpu-km.c
percpu-vm.c
percpu.c	tree wide: use kvfree() than conditional kfree()/vfree()	2016-01-22 17:02:18 -08:00
pgtable-generic.c	mm,thp: fix spellos in describing __HAVE_ARCH_FLUSH_PMD_TLB_RANGE	2016-02-11 18:35:48 -08:00
process_vm_access.c	ptrace: use fsuid, fsgid, effective creds for fs access checks	2016-01-20 17:09:18 -08:00
quicklist.c
readahead.c	mm: move lru_to_page to mm_inline.h	2016-01-14 16:00:49 -08:00
rmap.c	mm: simplify lock_page_memcg()	2016-03-15 16:55:16 -07:00
shmem.c	mm: migrate: do not touch page->mem_cgroup of live pages	2016-03-15 16:55:16 -07:00
slab_common.c	mm: new API kfree_bulk() for SLAB+SLUB allocators	2016-03-15 16:55:16 -07:00
slab.c	mm: memcontrol: report slab usage in cgroup2 memory.stat	2016-03-17 15:09:34 -07:00
slab.h	mm: memcontrol: report slab usage in cgroup2 memory.stat	2016-03-17 15:09:34 -07:00
slob.c	mm: slab: free kmem_cache_node after destroy sysfs file	2016-02-18 16:23:24 -08:00
slub.c	mm/slub: query dynamic DEBUG_PAGEALLOC setting	2016-03-17 15:09:34 -07:00
sparse-vmemmap.c	x86, mm: introduce vmem_altmap to augment vmemmap_populate()	2016-01-15 17:56:32 -08:00
sparse.c	x86, mm: introduce vmem_altmap to augment vmemmap_populate()	2016-01-15 17:56:32 -08:00
swap_cgroup.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
swap_state.c	mm: memcontrol: charge swap to cgroup2	2016-01-20 17:09:18 -08:00
swap.c	mm, x86: get_user_pages() for dax mappings	2016-01-15 17:56:32 -08:00
swapfile.c	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
truncate.c	mm: remove unnecessary uses of lock_page_memcg()	2016-03-15 16:55:16 -07:00
userfaultfd.c	memcg: adjust to support new THP refcounting	2016-01-15 17:56:32 -08:00
util.c	proc: revert /proc/<pid>/maps [stack:TID] annotation	2016-02-03 08:28:43 -08:00
vmacache.c	mm/vmacache: inline vmacache_valid_mm()	2015-11-05 19:34:48 -08:00
vmalloc.c	mm/vmalloc: query dynamic DEBUG_PAGEALLOC setting	2016-03-17 15:09:34 -07:00
vmpressure.c	mm/vmpressure.c: fix subtree pressure detection	2016-02-03 08:28:43 -08:00
vmscan.c	mm, kswapd: replace kswapd compaction with waking up kcompactd	2016-03-17 15:09:34 -07:00
vmstat.c	mm, compaction: introduce kcompactd	2016-03-17 15:09:34 -07:00
workingset.c	mm: simplify lock_page_memcg()	2016-03-15 16:55:16 -07:00
zbud.c	mm/zbud.c: use list_last_entry() instead of list_tail_entry()	2016-01-15 11:40:52 -08:00
zpool.c	mm: zsmalloc: constify struct zs_pool name	2015-11-06 17:50:42 -08:00
zsmalloc.c	zsmalloc: fix migrate_zspage-zs_free race condition	2016-01-20 17:09:18 -08:00
zswap.c	mm/zswap: change incorrect strncmp use to strcmp	2015-12-18 14:25:40 -08:00