linux

Andrea Arcangeli 2c653d0ee2 ksm: introduce ksm_max_page_sharing per page deduplication limit

Without a max deduplication limit for each KSM page, the list of the rmap_items associated to each stable_node can grow infinitely large. During the rmap walk each entry can take up to ~10usec to process because of IPIs for the TLB flushing (both for the primary MMU and the secondary MMUs with the MMU notifier). With only 16GB of address space shared in the same KSM page, that would amount to dozens of seconds of kernel runtime. A ~256 max deduplication factor will reduce the latencies of the rmap walks on KSM pages to the order of a few msec.

Just doing the cond_resched() during the rmap walks is not enough, the list size must have a limit too, otherwise the caller could get blocked in (schedule friendly) kernel computations for seconds, unexpectedly.

There's room for optimization to significantly reduce the IPI delivery cost during the page_referenced(), but at least for page_migration in the KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it may be inevitable to send lots of IPIs if each rmap_item->mm is active on a different CPU and there are lots of CPUs. Even if we ignore the IPI delivery cost, we still have to walk the whole KSM rmap list, so we can't allow millions or billions (an unlimited number) of entries in the KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the stable rbtree. So there are three types of stable_nodes: the regular ones (identical to before, living in the first flat dimension of the stable rbtree), the "chains" and the "dups". Every "chain" and all "dups" linked into a "chain" enforce the invariant that they represent the same write protected memory content, even if each "dup" will be pointed to by a different KSM page copy of that content. This way the stable rbtree lookup computational complexity is unaffected compared to an unlimited max_sharing_limit.
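The latency figures quoted in the commit message can be checked with a small back-of-the-envelope calculation. The sketch below is not kernel code: the function names are hypothetical, the 4 KiB page size is an assumption (the message does not state one), and the ~10usec per entry is the worst-case figure from the message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of rmap_items needed when `bytes` of address space are all
 * deduplicated into a single KSM page. `page_size` is an assumption
 * supplied by the caller (4096 on most configurations).
 */
static inline uint64_t sketch_rmap_entries(uint64_t bytes, uint64_t page_size)
{
	assert(page_size > 0);
	return bytes / page_size;
}

/*
 * Worst-case duration of one rmap walk in microseconds, given the
 * per-entry cost (the commit message quotes up to ~10 usec).
 */
static inline uint64_t sketch_walk_usec(uint64_t entries, uint64_t usec_per_entry)
{
	return entries * usec_per_entry;
}
```

Under these assumptions, 16GB shared in one KSM page gives 4194304 entries and roughly 42 seconds per walk, while a limit of 256 entries gives 2560 usec, which matches the "dozens of seconds" and "few msec" figures above.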
It is still enforced that there cannot be KSM page content duplicates in the stable rbtree itself.

Adding the second dimension to the stable rbtree only after the max_page_sharing limit hits provides for a zero memory footprint increase on 64bit archs. The memory overhead of the per-KSM page stable_tree and per virtual mapping rmap_item is unchanged. Only after the max_page_sharing limit hits, we need to allocate a stable_tree "chain" and rb_replace() the "regular" stable_node with the newly allocated stable_node "chain". After that we simply add the "regular" stable_node to the chain as a stable_node "dup" by linking hlist_dup in the stable_node_chain->hlist. This way the "regular" (flat) stable_node is converted to a stable_node "dup" living in the second dimension of the stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used elsewhere in any stable_node->head/node to avoid clashes with the stable_node->node.rb_parent_color pointer, and different from &migrate_nodes. So the second field of &migrate_nodes is picked and verified as always safe with a BUILD_BUG_ON in case the list_head implementation changes in the future.

The STABLE_NODE_CHAIN is picked as a random negative value in stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is aliased in a union with a time field used to rate limit the stable_node_chain->hlist prunes.
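The two sentinel checks described above can be illustrated with a stripped-down userspace sketch. Everything prefixed `sketch_` or `SKETCH_` is hypothetical, the struct carries only the two fields the message mentions, the negative marker value is arbitrary, and the `_Static_assert` merely stands in for the BUILD_BUG_ON the message refers to. It is not the kernel's definition of `struct stable_node`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

/* Only the two fields used for identification; the real struct has more. */
struct sketch_stable_node {
	struct list_head *head;  /* DUP sentinel, or ordinary list membership */
	int rmap_hlist_len;      /* rmap_item count, or the negative chain marker */
};

static struct list_head migrate_nodes = { &migrate_nodes, &migrate_nodes };

/* Arbitrary negative value: a real rmap_hlist_len never goes below zero. */
#define SKETCH_NODE_CHAIN (-1024)
/* Address of the second field of migrate_nodes: valid, unique, != &migrate_nodes. */
#define SKETCH_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)

/* Guard against a list_head layout where the second field sits at offset 0. */
_Static_assert(offsetof(struct list_head, prev) != 0,
	       "dup sentinel must differ from &migrate_nodes");

static inline bool sketch_is_chain(const struct sketch_stable_node *n)
{
	return n->rmap_hlist_len == SKETCH_NODE_CHAIN;
}

static inline bool sketch_is_dup(const struct sketch_stable_node *n)
{
	return n->head == SKETCH_NODE_DUP_HEAD;
}

static inline void sketch_mark_chain(struct sketch_stable_node *n)
{
	n->rmap_hlist_len = SKETCH_NODE_CHAIN;
}

static inline void sketch_mark_dup(struct sketch_stable_node *n)
{
	assert(!sketch_is_chain(n)); /* a chain is pure metadata, never a dup */
	n->head = SKETCH_NODE_DUP_HEAD;
}
```

Reusing fields that already exist for the markers, rather than adding a type tag, is what keeps the per-node footprint unchanged in this sketch, in line with the zero-overhead claim in the message.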
The garbage collection of the stable_node_chain happens lazily during stable rbtree lookups (as for all other kinds of stable_nodes), or while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the entire stable rbtree.

While the "regular" stable_nodes and the stable_node "dups" must wait for their underlying tree_page to be freed before they can be freed themselves, the stable_node "chains" can be freed immediately if the stable_node->hlist turns empty. This is because the "chains" are never pointed to by any page->mapping and they're effectively stable rbtree KSM self-contained metadata.

[akpm@linux-foundation.org: fix non-NUMA build]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>		2017-07-06 16:24:31 -07:00
..
.gitignore
00-INDEX	Documentation: vm, add hugetlbfs reservation overview	2017-05-03 15:52:11 -07:00
active_mm.txt
balance	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
cleancache.txt	cleancache: forbid overriding cleancache_ops	2015-04-14 16:49:03 -07:00
frontswap.txt
highmem.txt
hugetlbfs_reserv.txt	Documentation: vm, add hugetlbfs reservation overview	2017-05-03 15:52:11 -07:00
hugetlbpage.txt	Documentation: vm: Spelling s/paltform/platform/g	2016-05-14 10:15:10 -06:00
hwpoison.txt	mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO)	2014-06-04 16:54:13 -07:00
idle_page_tracking.txt	mm: introduce idle page tracking	2015-09-10 13:29:01 -07:00
ksm.txt	ksm: introduce ksm_max_page_sharing per page deduplication limit	2017-07-06 16:24:31 -07:00
numa	docs: fix locations of several documents that got moved	2016-10-24 08:12:35 -02:00
numa_memory_policy.txt	Documenation: update cgroup's document path	2016-08-03 15:43:58 -06:00
overcommit-accounting	mm: add overcommit_kbytes sysctl variable	2014-01-21 16:19:44 -08:00
page_frags	mm: add documentation for page fragment APIs	2017-01-10 18:31:55 -08:00
page_migration	Three fixes for the docs build, including removing an annoying warning on	2016-08-07 10:23:17 -04:00
page_owner.txt	mm, page_owner: convert page_owner_inited to static key	2016-03-15 16:55:16 -07:00
pagemap.txt	Documentation typo: wrong page flag bit for KPF_HUGE	2016-04-15 15:47:10 -06:00
remap_file_pages.txt	mm: replace remap_file_pages() syscall with emulation	2015-02-10 14:30:30 -08:00
slub.txt	slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS	2016-03-15 16:55:16 -07:00
soft-dirty.txt
split_page_table_lock	mm: make compound_head() robust	2015-11-06 17:50:42 -08:00
transhuge.txt	Documentation/vm/transhuge.txt: fix trivial typos	2017-05-08 17:15:14 -07:00
unevictable-lru.txt	Three fixes for the docs build, including removing an annoying warning on	2016-08-07 10:23:17 -04:00
userfaultfd.txt	userfaultfd: non-cooperative: rollback userfaultfd_exit	2017-03-09 17:01:09 -08:00
z3fold.txt	z3fold: the 3-fold allocator for compressed pages	2016-05-20 17:58:30 -07:00
zsmalloc.txt	zsmalloc: zsmalloc documentation	2015-04-15 16:35:21 -07:00
zswap.txt	zswap: update docs for runtime-changeable attributes	2015-09-10 13:29:01 -07:00