linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-13 23:51:39 +00:00

Stefan Roesch d7597f59d1 mm: add new api to enable ksm per process

Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

Use case 1:
The madvise call is not available in the programming language. An example of this is programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available.

In addition, the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these types of workloads.

Use case 2:
The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis.

Use case 3:
With the madvise call, sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higher level entity like a job scheduler or container can know for certain if it's running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls.

Security concerns:
In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, it's the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable.

Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.

Here are the metrics from an instagram workload (taken from a machine with 64GB main memory):

  full_scans: 445
  general_profit: 20158298048
  max_page_sharing: 256
  merge_across_nodes: 1
  pages_shared: 129547
  pages_sharing: 5119146
  pages_to_scan: 4000
  pages_unshared: 1760924
  pages_volatile: 10761341
  run: 1
  sleep_millisecs: 20
  stable_node_chains: 167
  stable_node_chains_prune_millisecs: 2000
  stable_node_dups: 2751
  use_zero_pages: 0
  zero_pages_sharing: 0

After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM.

Detailed changes:

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. The first one enables KSM at the process level and the second one queries the setting.

   The setting will be inherited by child processes.

   With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate over all the VMAs and enable KSM for the eligible VMAs.

   When forking a process that has KSM enabled, the setting will be inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests. This allows testing madvise and prctl KSM.

   It also adds two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by child processes.

This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. The first one enables KSM at the process level and the second one queries the setting.

   The setting will be inherited by child processes.

   With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate over all the VMAs and enable KSM for the eligible VMAs.

   When forking a process that has KSM enabled, the setting will be inherited by the new child process.

   1) Introduce new MMF_VM_MERGE_ANY flag
      This introduces the new MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all VMAs of a process.

   2) Setting VM_MERGEABLE on VMA creation
      When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA.

   3) support disabling of ksm for a process
      This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl.

   4) add new prctl option to get and set ksm for a process
      This adds two new options to the prctl system call
      - enable ksm for all vmas of a process (if the vmas support it).
      - query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
   In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled.

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2023-04-21 14:52:03 -07:00
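The two prctl options described in the commit message above could be used from user space roughly as sketched below. The commit text does not name the options; the names `PR_SET_MEMORY_MERGE` and `PR_GET_MEMORY_MERGE` and the fallback values 67 and 68 are taken, as an assumption, from `linux/prctl.h` on kernels that carry this series. The helper names `ksm_set_process_merge` and `ksm_get_process_merge` are hypothetical. On kernels without the feature, or built without KSM, the calls are expected to fail and set `errno`.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers. The values are an
 * assumption based on linux/prctl.h in kernels that include this series. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/* Hypothetical helper: enable (non-zero) or disable (zero) KSM for all
 * eligible VMAs of the calling process. Returns 0 on success, or -1 with
 * errno set, exactly as prctl() reports it. Unused arguments are zero. */
static int ksm_set_process_merge(int enable)
{
    return prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: query the per-process setting. Returns 1 if
 * enabled, 0 if disabled, or -1 with errno set on failure. */
static int ksm_get_process_merge(void)
{
    return prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: print the current state, or the error if the
 * running kernel does not support the option. */
static void ksm_report(void)
{
    int state = ksm_get_process_merge();

    if (state < 0)
        printf("KSM per-process query failed: %s\n", strerror(errno));
    else
        printf("KSM per-process merging: %s\n", state ? "on" : "off");
}
```

A job launcher would call `ksm_set_process_merge(1)` in the seed process before forking workers, since the commit message states that children inherit the setting. Merging still depends on the KSM daemon being active (the `run` value shown in the metrics above).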
..
bpf	bpf: Adjust insufficient default bpf_jit_limit	2023-03-21 12:43:05 -07:00
cgroup	cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"	2023-04-18 16:29:49 -07:00
configs	mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED	2022-12-01 00:09:20 +01:00
debug	kdb: use srcu console list iterator	2022-12-02 11:25:00 +01:00
dma	mm, treewide: redefine MAX_ORDER sanely	2023-04-05 19:42:46 -07:00
entry	entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up	2023-03-21 15:13:15 +01:00
events	mm, treewide: redefine MAX_ORDER sanely	2023-04-05 19:42:46 -07:00
futex	- Prevent the leaking of a debug timer in futex_waitv()	2023-01-01 11:15:05 -08:00
gcov	gcov: add support for checksum field	2022-12-21 14:31:52 -08:00
irq	A set of updates for the interrupt subsystem:	2023-03-05 11:19:16 -08:00
kcsan	printk: export console trace point for kcsan/kasan/kfence/kmsan	2023-04-18 16:30:11 -07:00
livepatch	Livepatching changes for 6.3	2023-02-23 14:00:10 -08:00
locking	RCU pull request for v6.3	2023-02-21 10:45:51 -08:00
module	modules-6.3-rc1	2023-02-23 14:05:08 -08:00
power	Merge branches 'powercap', 'pm-domains', 'pm-em' and 'pm-opp'	2023-02-15 20:06:26 +01:00
printk	printk: export console trace point for kcsan/kasan/kfence/kmsan	2023-04-18 16:30:11 -07:00
rcu	Merge branch 'stall.2023.01.09a' into HEAD	2023-02-02 16:40:07 -08:00
sched	sched/numa: use hash_32 to mix up PIDs accessing VMA	2023-04-05 20:03:03 -07:00
time	Updates for timekeeping, timers and clockevent/source drivers:	2023-02-21 09:45:13 -08:00
trace	Tracing fixes for 6.3:	2023-03-19 10:46:02 -07:00
.gitignore
acct.c	acct: fix potential integer overflow in encode_comp_t()	2022-11-30 16:13:18 -08:00
async.c
audit_fsnotify.c	audit: fix potential double free on error path from fsnotify_add_inode_mark	2022-08-22 18:50:06 -04:00
audit_tree.c	audit: use fsnotify group lock helpers	2022-04-25 14:37:28 +02:00
audit_watch.c	audit_init_parent(): constify path	2022-09-01 17:39:30 -04:00
audit.c	audit: use time_after to compare time	2022-08-29 19:47:03 -04:00
audit.h	audit: remove selinux_audit_rule_update() declaration	2022-09-07 11:30:15 -04:00
auditfilter.c
auditsc.c	capability: just use a 'u64' instead of a 'u32[2]' array	2023-03-01 10:01:22 -08:00
backtracetest.c
bounds.c	mm: multi-gen LRU: minimal implementation	2022-09-26 19:46:09 -07:00
capability.c	capability: just use a 'u64' instead of a 'u32[2]' array	2023-03-01 10:01:22 -08:00
cfi.c	cfi: Switch to -fsanitize=kcfi	2022-09-26 10:13:13 -07:00
compat.c	sched_getaffinity: don't assume 'cpumask_size()' is fully initialized	2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c	context_tracking: Fix noinstr vs KASAN	2023-01-13 11:48:18 +01:00
cpu_pm.c	cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()	2023-01-13 11:48:15 +01:00
cpu.c	lazy tlb: introduce lazy tlb mm refcount helper functions	2023-03-28 16:20:08 -07:00
crash_core.c	mm, treewide: redefine MAX_ORDER sanely	2023-04-05 19:42:46 -07:00
crash_dump.c
cred.c	cred: Do not default to init_cred in prepare_kernel_cred()	2022-11-01 10:04:52 -07:00
delayacct.c	delayacct: support re-entrance detection of thrashing accounting	2022-09-26 19:46:07 -07:00
dma.c
exec_domain.c
exit.c	lazy tlb: introduce lazy tlb mm refcount helper functions	2023-03-28 16:20:08 -07:00
extable.c	context_tracking: Take NMI eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
fail_function.c	kernel/fail_function: fix memory leak with using debugfs_lookup()	2023-02-08 13:36:22 +01:00
fork.c	sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes	2023-04-18 14:53:49 -07:00
freezer.c	freezer,sched: Rewrite core freezer logic	2022-09-07 21:53:50 +02:00
gen_kheaders.sh	kheaders: use standard naming for the temporary directory	2023-01-22 23:43:34 +09:00
groups.c	security: Add LSM hook to setgroups() syscall	2022-07-15 18:21:49 +00:00
hung_task.c	hung_task: print message when hung_task_warnings gets down to zero.	2023-02-09 17:03:20 -08:00
iomem.c
irq_work.c	irq_work: use kasan_record_aux_stack_noalloc() record callstack	2022-04-15 14:49:55 -07:00
jump_label.c	jump_label: Prevent key->enabled int overflow	2022-12-01 15:53:05 -08:00
kallsyms_internal.h	kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[]	2022-11-12 18:47:36 -08:00
kallsyms_selftest.c	kallsyms: Fix scheduling with interrupts disabled in self-test	2023-01-13 15:09:08 -08:00
kallsyms_selftest.h	kallsyms: Add self-test facility	2022-11-15 00:42:02 -08:00
kallsyms.c	kallsyms: Add self-test facility	2022-11-15 00:42:02 -08:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt	Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"	2022-03-31 10:36:55 +02:00
kcov.c	mm: replace vma->vm_flags direct modifications with modifier calls	2023-02-09 16:51:39 -08:00
kexec_core.c	There is no particular theme here - mainly quick hits all over the tree.	2023-02-23 17:55:40 -08:00
kexec_elf.c
kexec_file.c	kexec: introduce sysctl parameters kexec_load_limit_*	2023-02-02 22:50:05 -08:00
kexec_internal.h	panic, kexec: make __crash_kexec() NMI safe	2022-09-11 21:55:06 -07:00
kexec.c	kexec: introduce sysctl parameters kexec_load_limit_*	2023-02-02 22:50:05 -08:00
kheaders.c
kmod.c
kprobes.c	x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range	2023-02-21 08:49:16 +09:00
ksysfs.c	kernels/ksysfs.c: export kernel address bits	2023-01-20 14:30:45 +01:00
kthread.c	lazy tlb: introduce lazy tlb mm refcount helper functions	2023-03-28 16:20:08 -07:00
latencytop.c	latencytop: use the last element of latency_record of system	2022-09-11 21:55:12 -07:00
Makefile	kernel hardening fixes for v6.2-rc1	2022-12-23 12:00:24 -08:00
module_signature.c
notifier.c	kernel/notifier: Remove CONFIG_SRCU	2023-02-02 16:26:06 -08:00
nsproxy.c	fs/exec: switch timens when a task gets a new mm	2022-10-25 15:15:52 -07:00
padata.c	Kbuild updates for v6.2	2022-12-19 12:33:32 -06:00
panic.c	panic: fix the panic_print NMI backtrace setting	2023-03-02 21:54:23 -08:00
params.c	kernel/params.c: Use kstrtobool() instead of strtobool()	2023-01-25 14:07:21 -08:00
pid_namespace.c	- Daniel Verkamp has contributed a memfd series ("mm/memfd: add	2023-02-23 17:09:35 -08:00
pid_sysctl.h	mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC	2023-01-18 17:12:37 -08:00
pid.c	gfs2: Add glockfd debugfs file	2022-06-29 13:07:16 +02:00
profile.c	kernel/profile.c: simplify duplicated code in profile_setup()	2022-09-11 21:55:12 -07:00
ptrace.c	rseq: Introduce extensible rseq ABI	2022-12-27 12:52:10 +01:00
range.c
reboot.c	kernel/reboot: Add SYS_OFF_MODE_RESTART_PREPARE mode	2022-10-04 15:59:36 +02:00
regset.c
relay.c	mm: replace vma->vm_flags direct modifications with modifier calls	2023-02-09 16:51:39 -08:00
resource_kunit.c
resource.c	dax/kmem: Fix leak of memory-hotplug resources	2023-02-17 14:58:01 -08:00
rseq.c	rseq: Extend struct rseq with per-memory-map concurrency ID	2022-12-27 12:52:12 +01:00
scftorture.c	scftorture: Fix distribution of short handler delays	2022-04-11 17:07:29 -07:00
scs.c	scs: add support for dynamic shadow call stacks	2022-11-09 18:06:35 +00:00
seccomp.c	seccomp: fix kernel-doc function name warning	2023-01-13 17:01:06 -08:00
signal.c	sched: Introduce per-memory-map concurrency ID	2022-12-27 12:52:11 +01:00
smp.c	bitmap patches for v6.1-rc1	2022-10-10 12:49:34 -07:00
smpboot.c	smpboot: use atomic_try_cmpxchg in cpu_wait_death and cpu_report_death	2022-09-11 21:55:10 -07:00
smpboot.h
softirq.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
stackleak.c	stackleak: add on/off stack variants	2022-05-08 01:33:09 -07:00
stacktrace.c
static_call_inline.c	static_call: Add call depth tracking support	2022-10-17 16:41:16 +02:00
static_call.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
stop_machine.c	Scheduler changes in this cycle were:	2022-05-24 11:11:13 -07:00
sys_ni.c	kernel/sys_ni: add compat entry for fadvise64_64	2022-08-20 15:17:45 -07:00
sys.c	mm: add new api to enable ksm per process	2023-04-21 14:52:03 -07:00
sysctl-test.c	kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of i_{zero/one_hundred}	2022-09-08 16:56:45 -07:00
sysctl.c	sysctl: fix proc_dobool() usability	2023-02-21 13:34:07 -08:00
task_work.c	task_work: use try_cmpxchg in task_work_add, task_work_cancel_match and task_work_run	2022-09-11 21:55:10 -07:00
taskstats.c	genetlink: start to validate reserved header bytes	2022-08-29 12:47:15 +01:00
torture.c	torture: Fix hang during kthread shutdown phase	2023-01-05 12:10:35 -08:00
tracepoint.c	tracepoint: Allow livepatch module add trace event	2023-02-18 14:34:36 -05:00
tsacct.c	taskstats: version 12 with thread group and exe info	2022-04-29 14:38:03 -07:00
ucount.c	ucounts: Split rlimit and ucount values and max values	2022-05-18 18:24:57 -05:00
uid16.c
uid16.h
umh.c	umh: simplify the capability pointer logic	2023-03-03 16:18:19 -08:00
up.c
user_namespace.c	userns: fix a struct's kernel-doc notation	2023-02-02 22:50:04 -08:00
user-return-notifier.c
user.c	kernel/user: Allow user_struct::locked_vm to be usable for iommufd	2022-11-30 20:16:49 -04:00
usermode_driver.c	blob_to_mnt(): kern_unmount() is needed to undo kern_mount()	2022-05-19 23:25:47 -04:00
utsname_sysctl.c	kernel/utsname_sysctl.c: Fix hostname polling	2022-10-23 12:01:01 -07:00
utsname.c
watch_queue.c	watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths	2023-03-08 11:44:45 +01:00
watchdog_hld.c	Revert "printk: add functions to prefer direct printing"	2022-06-23 18:41:40 +02:00
watchdog.c	powerpc updates for 6.0	2022-08-06 16:38:17 -07:00
workqueue_internal.h
workqueue.c	workqueue: Fold rebind_worker() within rebind_workers()	2023-01-13 07:50:40 -10:00