2c653d0ee2
Without a max deduplication limit for each KSM page, the list of the rmap_items associated to each stable_node can grow infinitely large. During the rmap walk each entry can take up to ~10usec to process because of IPIs for the TLB flushing (both for the primary MMU and the secondary MMUs with the MMU notifier). With only 16GB of address space shared in the same KSM page, that would amount to dozens of seconds of kernel runtime. A ~256 max deduplication factor will reduce the latencies of the rmap walks on KSM pages to order of a few msec. Just doing the cond_resched() during the rmap walks is not enough, the list size must have a limit too, otherwise the caller could get blocked in (schedule friendly) kernel computations for seconds, unexpectedly. There's room for optimization to significantly reduce the IPI delivery cost during the page_referenced(), but at least for page_migration in the KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it may be inevitable to send lots of IPIs if each rmap_item->mm is active on a different CPU and there are lots of CPUs. Even if we ignore the IPI delivery cost, we've still to walk the whole KSM rmap list, so we can't allow millions or billions (ulimited) number of entries in the KSM stable_node rmap_item lists. The limit is enforced efficiently by adding a second dimension to the stable rbtree. So there are three types of stable_nodes: the regular ones (identical as before, living in the first flat dimension of the stable rbtree), the "chains" and the "dups". Every "chain" and all "dups" linked into a "chain" enforce the invariant that they represent the same write protected memory content, even if each "dup" will be pointed by a different KSM page copy of that content. This way the stable rbtree lookup computational complexity is unaffected if compared to an unlimited max_sharing_limit. It is still enforced that there cannot be KSM page content duplicates in the stable rbtree itself. Adding the second dimension to the stable rbtree only after the max_page_sharing limit hits, provides for a zero memory footprint increase on 64bit archs. The memory overhead of the per-KSM page stable_tree and per virtual mapping rmap_item is unchanged. Only after the max_page_sharing limit hits, we need to allocate a stable_tree "chain" and rb_replace() the "regular" stable_node with the newly allocated stable_node "chain". After that we simply add the "regular" stable_node to the chain as a stable_node "dup" by linking hlist_dup in the stable_node_chain->hlist. This way the "regular" (flat) stable_node is converted to a stable_node "dup" living in the second dimension of the stable rbtree. During stable rbtree lookups the stable_node "chain" is identified as stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka is_stable_node_chain()). When dropping stable_nodes, the stable_node "dup" is identified as stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()). The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used elsewhere in any stable_node->head/node to avoid a clashes with the stable_node->node.rb_parent_color pointer, and different from &migrate_nodes. So the second field of &migrate_nodes is picked and verified as always safe with a BUILD_BUG_ON in case the list_head implementation changes in the future. The STABLE_NODE_DUP is picked as a random negative value in stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when it's a "regular" stable_node or a stable_node "dup". The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is aliased in a union with a time field used to rate limit the stable_node_chain->hlist prunes. The garbage collection of the stable_node_chain happens lazily during stable rbtree lookups (as for all other kind of stable_nodes), or while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the entire stable rbtree. While the "regular" stable_nodes and the stable_node "dups" must wait for their underlying tree_page to be freed before they can be freed themselves, the stable_node "chains" can be freed immediately if the stable_node->hlist turns empty. This is because the "chains" are never pointed by any page->mapping and they're effectively stable rbtree KSM self contained metadata. [akpm@linux-foundation.org: fix non-NUMA build] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
---|---|---|
.. | ||
ABI | ||
accounting | ||
acpi | ||
admin-guide | ||
aoe | ||
arm | ||
arm64 | ||
auxdisplay | ||
backlight | ||
blackfin | ||
block | ||
blockdev | ||
bus-devices | ||
cdrom | ||
cgroup-v1 | ||
cma | ||
connector | ||
console | ||
core-api | ||
cpu-freq | ||
cpuidle | ||
cris | ||
crypto | ||
dev-tools | ||
device-mapper | ||
devicetree | ||
dmaengine | ||
doc-guide | ||
driver-api | ||
driver-model | ||
early-userspace | ||
EDID | ||
extcon | ||
fault-injection | ||
fb | ||
features | ||
filesystems | ||
firmware_class | ||
fmc | ||
fpga | ||
frv | ||
gpio | ||
gpu | ||
hid | ||
hwmon | ||
i2c | ||
ia64 | ||
ide | ||
iio | ||
infiniband | ||
input | ||
ioctl | ||
isdn | ||
kbuild | ||
kdump | ||
kernel-hacking | ||
laptops | ||
leds | ||
lightnvm | ||
livepatch | ||
locking | ||
m68k | ||
md | ||
media | ||
memory-devices | ||
metag | ||
mic | ||
mips | ||
misc-devices | ||
mmc | ||
mn10300 | ||
mtd | ||
namespaces | ||
netlabel | ||
networking | ||
nfc | ||
nios2 | ||
nvdimm | ||
nvmem | ||
parisc | ||
PCI | ||
pcmcia | ||
perf | ||
phy | ||
platform | ||
power | ||
powerpc | ||
pps | ||
process | ||
pti | ||
ptp | ||
rapidio | ||
RCU | ||
s390 | ||
scheduler | ||
scsi | ||
security | ||
serial | ||
sh | ||
sound | ||
sparc | ||
sphinx | ||
sphinx-static | ||
spi | ||
sysctl | ||
target | ||
thermal | ||
timers | ||
trace | ||
translations | ||
usb | ||
userspace-api | ||
virtual | ||
vm | ||
w1 | ||
watchdog | ||
wimax | ||
x86 | ||
xtensa | ||
.gitignore | ||
00-INDEX | ||
bcache.txt | ||
bt8xxgpio.txt | ||
btmrvl.txt | ||
bus-virt-phys-mapping.txt | ||
cachetlb.txt | ||
cgroup-v2.txt | ||
Changes | ||
circular-buffers.txt | ||
clk.txt | ||
CodingStyle | ||
conf.py | ||
cpu-load.txt | ||
cputopology.txt | ||
crc32.txt | ||
dcdbas.txt | ||
debugging-modules.txt | ||
debugging-via-ohci1394.txt | ||
dell_rbu.txt | ||
digsig.txt | ||
DMA-API-HOWTO.txt | ||
DMA-API.txt | ||
DMA-attributes.txt | ||
DMA-ISA-LPC.txt | ||
docutils.conf | ||
dontdiff | ||
efi-stub.txt | ||
eisa.txt | ||
flexible-arrays.txt | ||
futex-requeue-pi.txt | ||
gcc-plugins.txt | ||
highuid.txt | ||
hw_random.txt | ||
hwspinlock.txt | ||
index.rst | ||
intel_txt.txt | ||
Intel-IOMMU.txt | ||
io_ordering.txt | ||
io-mapping.txt | ||
iostats.txt | ||
IPMI.txt | ||
IRQ-affinity.txt | ||
IRQ-domain.txt | ||
IRQ.txt | ||
irqflags-tracing.txt | ||
isa.txt | ||
isapnp.txt | ||
kernel-doc-nano-HOWTO.txt | ||
kernel-per-CPU-kthreads.txt | ||
kobject.txt | ||
kprobes.txt | ||
kref.txt | ||
kselftest.txt | ||
ldm.txt | ||
lockup-watchdogs.txt | ||
logo.gif | ||
logo.txt | ||
lsm.txt | ||
lzo.txt | ||
mailbox.txt | ||
Makefile | ||
memory-barriers.txt | ||
memory-hotplug.txt | ||
men-chameleon-bus.txt | ||
nommu-mmap.txt | ||
ntb.txt | ||
numastat.txt | ||
padata.txt | ||
parport-lowlevel.txt | ||
percpu-rw-semaphore.txt | ||
phy.txt | ||
pi-futex.txt | ||
pinctrl.txt | ||
pnp.txt | ||
preempt-locking.txt | ||
printk-formats.txt | ||
pwm.txt | ||
rbtree.txt | ||
remoteproc.txt | ||
rfkill.txt | ||
robust-futex-ABI.txt | ||
robust-futexes.txt | ||
rpmsg.txt | ||
rtc.txt | ||
SAK.txt | ||
sgi-ioc4.txt | ||
siphash.txt | ||
SM501.txt | ||
smsc_ece1099.txt | ||
static-keys.txt | ||
SubmittingPatches | ||
svga.txt | ||
switchtec.txt | ||
sync_file.txt | ||
tee.txt | ||
this_cpu_ops.txt | ||
unaligned-memory-access.txt | ||
vfio-mediated-device.txt | ||
vfio.txt | ||
video-output.txt | ||
xillybus.txt | ||
xz.txt | ||
zorro.txt |