Huang Ying ed43af1097 swap: try to scan more free slots even when fragmented
The scalability of the swap code drops significantly when the swap
device becomes fragmented, because swap slot allocation batching stops
working.  To solve this problem, this patch tries to scan a few more
swap slots, with strictly limited effort, so that swap slot allocation
can still be batched even when the swap device is fragmented.  Tests
show that the benchmark score can increase by up to 37.1% with this
patch.  Details are as follows.

The swap code has a per-CPU cache of swap slots.  This batches swap
space allocations to improve swap subsystem scalability.  In the
following code path,

  add_to_swap()
    get_swap_page()
      refill_swap_slots_cache()
        get_swap_pages()
          scan_swap_map_slots()

scan_swap_map_slots() and get_swap_pages() can return multiple swap
slots per call.  These slots are kept in the per-CPU swap slots cache,
so that subsequent swap slot requests can be fulfilled from the cache,
avoiding lock contention in the lower-level swap space
allocation/freeing code path.
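
As a rough illustration of why this batching matters, here is a
user-space analogy only, not the mm/swap_slots.c code: slot_cache,
refill_batch() and get_slot() are invented names.  Each thread serves
most requests from its own cache and takes the shared lock once per
batch instead of once per slot.

  #include <pthread.h>
  #include <stdio.h>

  #define CACHE_SIZE 64

  static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
  static long next_slot;                /* shared allocator state */

  static __thread long slot_cache[CACHE_SIZE];
  static __thread int cache_nr;         /* slots left in this cache */

  /* One lock round trip refills a whole batch of slots. */
  static void refill_batch(void)
  {
          pthread_mutex_lock(&pool_lock);
          for (int i = 0; i < CACHE_SIZE; i++)
                  slot_cache[i] = next_slot++;
          cache_nr = CACHE_SIZE;
          pthread_mutex_unlock(&pool_lock);
  }

  /* Most calls are served from the per-thread cache, without locking. */
  static long get_slot(void)
  {
          if (!cache_nr)
                  refill_batch();
          return slot_cache[--cache_nr];
  }

  int main(void)
  {
          for (int i = 0; i < 5; i++)
                  printf("got slot %ld\n", get_slot());
          return 0;
  }

With 64 slots per refill, the shared lock is taken once per 64
allocations instead of once per allocation; if fragmentation forces
one-slot refills, every allocation hits the lock again.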

But this only works when there are free swap clusters.  If a swap
device becomes so fragmented that there are no free swap clusters,
scan_swap_map_slots() and get_swap_pages() will return only one swap
slot per call in the above code path.  Effectively, this falls back to
the situation before the swap slots cache was introduced: heavy
contention on the swap-related locks kills scalability.

Why does it work this way?  Because the swap device may be large and
scanning for free swap slots can be quite time consuming, the
conservative method was chosen to avoid spending too much time
scanning.

This can be improved by scanning a few more free slots with strictly
restricted effort, which is what this patch implements.  In
scan_swap_map_slots(), after the first free swap slot is found, we keep
scanning for more free slots, but only as long as we have not scanned
too many slots (< LATENCY_LIMIT).  That way, the added scanning latency
is strictly bounded, as the sketch below illustrates.
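
The sketch below is a stand-alone user-space model of this idea, not
the kernel code: slot_free[], scan_slots_bounded() and the value chosen
here for LATENCY_LIMIT are invented for illustration (the real constant
lives in mm/swapfile.c).  It shows the core behaviour: scanning before
the first hit is unbounded, while extra scanning after the first hit
stops once the latency budget is spent.

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  #define NR_SLOTS      64
  #define LATENCY_LIMIT 8     /* illustrative budget for extra scanning */

  static bool slot_free[NR_SLOTS];

  /* Collect up to nr_wanted free slots into out[], with bounded effort. */
  static size_t scan_slots_bounded(size_t *out, size_t nr_wanted)
  {
          size_t found = 0, extra_scanned = 0;

          for (size_t i = 0; i < NR_SLOTS && found < nr_wanted; i++) {
                  if (slot_free[i]) {
                          slot_free[i] = false;
                          out[found++] = i;
                          continue;
                  }
                  /*
                   * Before the first hit we must keep looking; after it,
                   * stop once the latency budget is spent.
                   */
                  if (found && ++extra_scanned >= LATENCY_LIMIT)
                          break;
          }
          return found;
  }

  int main(void)
  {
          size_t out[16];

          /* Fragmented layout: only every fifth slot is free. */
          for (size_t i = 0; i < NR_SLOTS; i++)
                  slot_free[i] = (i % 5 == 0);

          printf("batched %zu slots in one call\n",
                 scan_slots_bounded(out, 16));
          return 0;
  }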

To test the patch, we have run 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores.  Multiple ram disks are
configured as the swap devices.  The pmbench working-set size is much
larger than the available memory so that swapping is triggered.  The
memory read/write ratio is 80/20 and the accessing pattern is random, so
the swap space becomes highly fragmented during the test.  In the
original implementation, contention on the swap-related locks is very
heavy.  The perf profiling data for the lock contention code paths is
as follows:

 _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:             21.03
 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    1.92
 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      1.72
 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       0.69

After applying this patch, it becomes:

 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    4.89
 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      3.85
 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       1.1
 _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88

That is, the heavy contention on the swap-related locks is eliminated.

The pmbench score increases by 37.1%.  Swapin throughput increases by
45.7%, from 2.02 GB/s to 2.94 GB/s, and swapout throughput increases by
45.3%, from 2.04 GB/s to 2.97 GB/s.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00