linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-27 21:33:00 +00:00


Mel Gorman f77cf4e4cc mm, page_alloc: delete the zonelist_cache

The zonelist cache (zlc) was introduced to skip over zones that were recently known to be full. This avoided expensive operations such as the cpuset checks, watermark calculations and zone_reclaim. The situation today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general are a lot cheaper.

2) zone_reclaim is now disabled by default and I suspect that was a large source of the cost that zlc wanted to avoid. When it is enabled, it's known to be a major source of stalling when nodes fill up and it's unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order allocation requests. Later patches in this series will reduce the cost of the watermark checking.

4) The most important issue is that in the current implementation it is possible for a failed THP allocation to mark a zone full for order-0 allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the benefit of zlc is questionable, it is better to remove it. If stalls due to zone_reclaim are ever reported then an alternative would be to introduce deferring logic based on a timeout inside zone_reclaim itself and leave the page allocator fast paths alone.

The impact on page-allocator microbenchmarks is negligible as they don't hit the paths where the zlc comes into play. Most page-reclaim related workloads showed no noticeable difference as a result of the removal.

The impact was noticeable in a workload called "stutter". One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. In an ideal world the latency application would not notice the mmap latency.
On a 2-node machine the results of this patch are

stutter
                             4.3.0-rc1              4.3.0-rc1
                              baseline               nozlc-v4
Min         mmap     20.9243 (  0.00%)      20.7716 (  0.73%)
1st-qrtle   mmap     22.0612 (  0.00%)      22.0680 ( -0.03%)
2nd-qrtle   mmap     22.3291 (  0.00%)      22.3809 ( -0.23%)
3rd-qrtle   mmap     25.2244 (  0.00%)      25.2396 ( -0.06%)
Max-90%     mmap     48.0995 (  0.00%)      28.3713 ( 41.02%)
Max-93%     mmap     52.5557 (  0.00%)      36.0170 ( 31.47%)
Max-95%     mmap     55.8173 (  0.00%)      47.3163 ( 15.23%)
Max-99%     mmap     67.3781 (  0.00%)      70.1140 ( -4.06%)
Max         mmap  24447.6375 (  0.00%)   12915.1356 ( 47.17%)
Mean        mmap     33.7883 (  0.00%)      27.7944 ( 17.74%)
Best99%Mean mmap     27.7825 (  0.00%)      25.2767 (  9.02%)
Best95%Mean mmap     26.3912 (  0.00%)      23.7994 (  9.82%)
Best90%Mean mmap     24.9886 (  0.00%)      23.2251 (  7.06%)
Best50%Mean mmap     22.0157 (  0.00%)      22.0261 ( -0.05%)
Best10%Mean mmap     21.6705 (  0.00%)      21.6083 (  0.29%)
Best5%Mean  mmap     21.5581 (  0.00%)      21.4611 (  0.45%)
Best1%Mean  mmap     21.3079 (  0.00%)      21.1631 (  0.68%)

Note that the maximum stall latency went from 24 seconds to 12 which is still bad but an improvement. The mileage varies considerably: a 2-node machine on an earlier test went from 494 seconds to 47 seconds and a 4-node machine that tested an earlier version of this patch went from a worst case stall time of 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable as it is hammering the system and the mileage will vary between machines.

There is a secondary impact with potentially more direct reclaim because zones are now being considered instead of being skipped by zlc. In this particular test run it did not occur so will not be described. However, in at least one test the following was observed

1. Direct reclaim rates were higher. This was likely due to direct reclaim being entered instead of the zlc disabling a zone and busy looping. Busy looping may have the effect of allowing kswapd to make more progress and in some cases may be better overall. If this is found then the correct action is to put direct reclaimers to sleep on a waitqueue and allow kswapd to make forward progress. Busy looping on the zlc is even worse than when the allocator used to blindly call congestion_wait().

2. There was higher swap activity as direct reclaim was active.

3. Direct reclaim efficiency was lower. This is related to 1 as more scanning activity also encountered more pages that could not be immediately reclaimed.

In that case, the direct page scan and reclaim rates are noticeable but it is not considered a problem for a few reasons

1. The test is primarily concerned with latency. The mmap attempts are also faulted which means there are THP allocation requests. The ZLC could cause zones to be disabled causing the process to busy loop instead of reclaiming. This looks like elevated direct reclaim activity but it's the correct action to take based on what processes requested.

2. The test hammers reclaim and compaction heavily. The number of successful THP faults is highly variable but affects the reclaim stats. It's not a realistic or reasonable measure of page reclaim activity.

3. No other page-reclaim intensive workload that was tested showed a problem.

4. If a workload is identified that benefited from the busy looping then it should be fixed by having direct reclaimers sleep on a wait queue until woken by kswapd instead of busy looping. We had this class of problem before when congestion_wait() with a fixed timeout was a brain damaged decision but happened to benefit some workloads.

If a workload is identified that relied on the zlc to busy loop then it should be fixed correctly and have a direct reclaimer sleep on a waitqueue until woken by kswapd.
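
The fix recommended above (a direct reclaimer sleeps on a waitqueue until kswapd wakes it, rather than busy looping) can be illustrated in userspace. The following is only a sketch under loud assumptions: it uses a POSIX condition variable as a stand-in for a kernel waitqueue, and `struct zone_sketch`, `kswapd_sketch()`, `wait_for_kswapd()` and `run_waitqueue_demo()` are hypothetical names invented for this illustration. None of it is kernel code or part of this commit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a zone; not a kernel structure. */
struct zone_sketch {
    pthread_mutex_t lock;
    pthread_cond_t  reclaim_wq;  /* plays the role of a kernel waitqueue */
    long free_pages;
    long watermark;
};

/* Stand-in for kswapd: frees pages in batches of 32 and wakes sleepers. */
static void *kswapd_sketch(void *arg)
{
    struct zone_sketch *z = arg;
    bool done = false;

    while (!done) {
        pthread_mutex_lock(&z->lock);
        z->free_pages += 32;                     /* pretend a batch was reclaimed */
        done = z->free_pages >= z->watermark;
        pthread_cond_broadcast(&z->reclaim_wq);  /* wake anyone sleeping */
        pthread_mutex_unlock(&z->lock);
    }
    return NULL;
}

/*
 * Stand-in for a direct reclaimer: sleeps (no spinning) until the
 * watermark is met, then returns the free page count it observed.
 */
long wait_for_kswapd(struct zone_sketch *z)
{
    long seen;

    pthread_mutex_lock(&z->lock);
    while (z->free_pages < z->watermark)
        pthread_cond_wait(&z->reclaim_wq, &z->lock);
    seen = z->free_pages;
    pthread_mutex_unlock(&z->lock);
    return seen;
}

/* Runs one kswapd stand-in and one sleeper; returns what the sleeper saw. */
long run_waitqueue_demo(long start_free, long watermark)
{
    struct zone_sketch z = { .free_pages = start_free, .watermark = watermark };
    pthread_t kswapd;
    long seen;
    int rc;

    pthread_mutex_init(&z.lock, NULL);
    pthread_cond_init(&z.reclaim_wq, NULL);

    rc = pthread_create(&kswapd, NULL, kswapd_sketch, &z);
    assert(rc == 0);
    (void)rc;

    seen = wait_for_kswapd(&z);
    pthread_join(kswapd, NULL);

    pthread_cond_destroy(&z.reclaim_wq);
    pthread_mutex_destroy(&z.lock);
    return seen;
}
```

Calling `run_waitqueue_demo(0, 100)` starts below the watermark, so the sleeper blocks until the kswapd stand-in has added enough batches to reach it. The point of the design is that the waiting thread consumes no CPU while the reclaiming thread makes progress, which is the property the busy loop lacked.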

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
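
The commit message above mentions, as a possible future alternative, "deferring logic based on a timeout inside zone_reclaim itself". The commit does not implement this, so the following is a minimal sketch of what such a timeout-based deferral might look like. Everything here is an assumption made for illustration: `struct zone_defer`, `zone_reclaim_deferred()`, the tick counter passed as `now`, and the two stub callbacks are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone deferral state; field names are illustrative. */
struct zone_defer {
    unsigned long defer_until;  /* tick before which reclaim is skipped */
    unsigned long defer_ticks;  /* back-off length after a failed attempt */
};

enum reclaim_result { RECLAIM_DEFERRED, RECLAIM_FAILED, RECLAIM_OK };

/* Stub callbacks standing in for the expensive reclaim work. */
bool reclaim_always_fails(void)    { return false; }
bool reclaim_always_succeeds(void) { return true; }

/*
 * Attempt reclaim unless a recent failure is still inside its back-off
 * window. The plain "<" comparison ignores counter wraparound, which
 * real code working with a wrapping tick counter would have to handle.
 */
enum reclaim_result zone_reclaim_deferred(struct zone_defer *z,
                                          unsigned long now,
                                          bool (*try_reclaim)(void))
{
    assert(z != NULL && try_reclaim != NULL);

    if (now < z->defer_until)
        return RECLAIM_DEFERRED;   /* recently failed: skip the cost */

    if (try_reclaim())
        return RECLAIM_OK;

    /* Failure: remember it so callers back off for a while. */
    z->defer_until = now + z->defer_ticks;
    return RECLAIM_FAILED;
}
```

With `defer_ticks` set to 10, a failed attempt at tick 100 causes a call at tick 105 to return `RECLAIM_DEFERRED` without doing any work, while a call at tick 110 tries again. Keeping this state inside the reclaim function, as the message suggests, leaves the allocator fast path free of any per-zone "full" bookkeeping.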
kasan	kasan: always taint kernel on report	2015-11-05 19:34:48 -08:00
backing-dev.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
balloon_compaction.c	mm: page migration trylock newpage at same level as oldpage	2015-11-05 19:34:48 -08:00
bootmem.c	bootmem: avoid freeing to bootmem after bootmem is done	2015-09-08 15:35:28 -07:00
cleancache.c	cleancache: remove limit on the number of cleancache enabled filesystems	2015-04-14 16:49:03 -07:00
cma_debug.c	mm/cma_debug: correct size input to bitmap function	2015-07-17 16:39:54 -07:00
cma.c	mm/cma.c: suppress warning	2015-11-05 19:34:48 -08:00
cma.h	mm: cma: mark cma_bitmap_maxno() inline in header	2015-08-14 15:56:32 -07:00
compaction.c	mm, compaction: distinguish contended status in tracepoints	2015-11-05 19:34:48 -08:00
debug-pagealloc.c	mm/debug-pagealloc: make debug-pagealloc boottime configurable	2014-12-13 12:42:48 -08:00
debug.c	mm: introduce VM_LOCKONFAULT	2015-11-05 19:34:48 -08:00
dmapool.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
early_ioremap.c	mm/early_ioremap: use offset_in_page macro	2015-11-05 19:34:48 -08:00
fadvise.c	writeback: implement and use inode_congested()	2015-06-02 08:33:35 -06:00
failslab.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
filemap.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
frame_vector.c	mm: fix docbook comment for get_vaddr_frames()	2015-11-05 19:34:48 -08:00
frontswap.c	frontswap: allow multiple backends	2015-06-24 17:49:45 -07:00
gup.c	mm: introduce VM_LOCKONFAULT	2015-11-05 19:34:48 -08:00
highmem.c	mm/highmem: make kmap cache coloring aware	2014-08-06 18:01:22 -07:00
huge_memory.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
hugetlb_cgroup.c	mm: page_counter: let page_counter_try_charge() return bool	2015-11-05 19:34:48 -08:00
hugetlb.c	mm: introduce VM_LOCKONFAULT	2015-11-05 19:34:48 -08:00
hwpoison-inject.c	hwpoison: use page_cgroup_ino for filtering by memcg	2015-09-10 13:29:01 -07:00
init-mm.c
internal.h	mm, page_alloc: remove unnecessary recalculations for dirty zone balancing	2015-11-06 17:50:42 -08:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
Kconfig	media updates for v4.3-rc1	2015-09-11 16:42:39 -07:00
Kconfig.debug	mm/debug_pagealloc: remove obsolete Kconfig options	2015-01-08 15:10:52 -08:00
kmemcheck.c	mm/slab_common: move kmem_cache definition to internal header	2014-10-09 22:25:50 -04:00
kmemleak-test.c	mm/kmemleak-test.c: use pr_fmt for logging	2014-06-06 16:08:18 -07:00
kmemleak.c	mm/kmemleak.c: remove unneeded initialization of object to NULL	2015-11-05 19:34:48 -08:00
ksm.c	ksm: unstable_tree_search_insert error checking cleanup	2015-11-05 19:34:48 -08:00
list_lru.c	memcg: simplify and inline __mem_cgroup_from_kmem	2015-11-05 19:34:48 -08:00
maccess.c	mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe	2015-11-05 19:34:48 -08:00
madvise.c	mm: madvise allow remove operation for hugetlbfs	2015-09-08 15:35:28 -07:00
Makefile	media updates for v4.3-rc1	2015-09-11 16:42:39 -07:00
memblock.c	mm/memblock: make memblock_remove_range() static	2015-11-05 19:34:48 -08:00
memcontrol.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
memory_hotplug.c	mm/page_alloc: remove unused parameter in init_currently_empty_zone()	2015-11-05 19:34:48 -08:00
memory-failure.c	mm: hwpoison: ratelimit messages from unpoison_memory()	2015-11-05 19:34:48 -08:00
memory.c	mm, dax: fix DAX deadlocks	2015-10-16 11:42:28 -07:00
mempolicy.c	mm: rename alloc_pages_exact_node() to __alloc_pages_node()	2015-09-08 15:35:28 -07:00
mempool.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
memtest.c	memtest: remove unused header files	2015-09-08 15:35:28 -07:00
migrate.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
mincore.c	mm/mincore: use offset_in_page macro	2015-11-05 19:34:48 -08:00
mlock.c	mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage	2015-11-05 19:34:48 -08:00
mm_init.c	mm: meminit: remove mminit_verify_page_links	2015-06-30 19:44:56 -07:00
mmap.c	mm: introduce VM_LOCKONFAULT	2015-11-05 19:34:48 -08:00
mmu_context.c	sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm	2014-02-21 08:50:17 +01:00
mmu_notifier.c	mmu-notifier: add clear_young callback	2015-09-10 13:29:01 -07:00
mmzone.c	mm: microoptimize zonelist operations	2015-02-11 17:06:02 -08:00
mprotect.c	userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx	2015-09-04 16:54:41 -07:00
mremap.c	mm/mremap: use offset_in_page macro	2015-11-05 19:34:48 -08:00
msync.c	mm/msync: use offset_in_page macro	2015-11-05 19:34:48 -08:00
nobootmem.c	mm: page_alloc: pass PFN to __free_pages_bootmem	2015-06-30 19:44:55 -07:00
nommu.c	mm/nommu.c: drop unlikely inside BUG_ON()	2015-11-05 19:34:48 -08:00
oom_kill.c	mm/oom_kill.c: introduce is_sysrq_oom helper	2015-11-06 17:50:42 -08:00
page_alloc.c	mm, page_alloc: delete the zonelist_cache	2015-11-06 17:50:42 -08:00
page_counter.c	mm: page_counter: let page_counter_try_charge() return bool	2015-11-05 19:34:48 -08:00
page_ext.c	mm: introduce idle page tracking	2015-09-10 13:29:01 -07:00
page_idle.c	mm: introduce idle page tracking	2015-09-10 13:29:01 -07:00
page_io.c	fs: use helper bio_add_page() instead of open coding on bi_io_vec	2015-08-13 12:32:00 -06:00
page_isolation.c	mm, page_isolation: make set/unset_migratetype_isolate() file-local	2015-09-08 15:35:28 -07:00
page_owner.c	mm/page_owner: set correct gfp_mask on page_owner	2015-07-17 16:39:54 -07:00
page-writeback.c	writeback: fix incorrect calculation of available memory for memcg domains	2015-10-12 10:31:13 -06:00
pagewalk.c	mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers	2015-03-25 16:20:30 -07:00
percpu-km.c	percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated	2014-09-02 14:46:05 -04:00
percpu-vm.c	percpu: move region iterations out of pcpu_[de]populate_chunk()	2014-09-02 14:46:02 -04:00
percpu.c	mm/percpu: use offset_in_page macro	2015-11-05 19:34:48 -08:00
pgtable-generic.c	mm,thp: introduce flush_pmd_tlb_range	2015-10-17 17:48:20 +05:30
process_vm_access.c	process_vm_access: switch to {compat_,}import_iovec()	2015-04-11 22:27:12 -04:00
quicklist.c
readahead.c	mm: use only per-device readahead limit	2015-11-05 19:34:48 -08:00
rmap.c	mm: page migration use migration entry for swapcache too	2015-11-05 19:34:48 -08:00
shmem.c	tmpfs: avoid a little creat and stat slowdown	2015-11-05 19:34:48 -08:00
slab_common.c	mm/slab_common.c: initialize kmem_cache pointer to NULL	2015-11-05 19:34:48 -08:00
slab.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
slab.h	memcg: unify slab and other kmem pages charging	2015-11-05 19:34:48 -08:00
slob.c	mm: rename alloc_pages_exact_node() to __alloc_pages_node()	2015-09-08 15:35:28 -07:00
slub.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
sparse-vmemmap.c
sparse.c	mm: use macros from compiler.h instead of __attribute__((...))	2014-04-07 16:35:54 -07:00
swap_cgroup.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
swap_state.c	mm: swap: zswap: maybe_preload & refactoring	2015-09-08 15:35:28 -07:00
swap.c	mm: introduce idle page tracking	2015-09-10 13:29:01 -07:00
swapfile.c	mm: /proc/pid/smaps:: show proportional swap share of the mapping	2015-09-08 15:35:28 -07:00
truncate.c	memcg: add per cgroup dirty page accounting	2015-06-02 08:33:33 -06:00
userfaultfd.c	userfaultfd: avoid mmap_sem read recursion in mcopy_atomic	2015-09-04 16:54:41 -07:00
util.c	mm/util: use offset_in_page macro	2015-11-05 19:34:48 -08:00
vmacache.c	mm/vmacache: inline vmacache_valid_mm()	2015-11-05 19:34:48 -08:00
vmalloc.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
vmpressure.c	mm/vmpressure.c: fix race in vmpressure_work_fn()	2014-12-02 17:32:07 -08:00
vmscan.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
vmstat.c	mm/vmstat.c: uninline node_page_state()	2015-11-05 19:34:48 -08:00
workingset.c	list_lru: add helpers to isolate items	2015-02-12 18:54:10 -08:00
zbud.c	mm: zbud: constify the zbud_ops	2015-09-08 15:35:28 -07:00
zpool.c	zpool: add zpool_has_pool()	2015-09-10 13:29:01 -07:00
zsmalloc.c	mm: zpool: constify the zpool_ops	2015-09-08 15:35:28 -07:00
zswap.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00