linux/kernel
Rik van Riel 5beb493052 mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach into the tens of
thousands.  Real workloads are still a factor of 10 less process-intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parent's anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
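
The fork-time linking described above can be pictured with a small
userland model.  This is only a sketch: every name in it (toy_vma,
toy_anon_vma, toy_link, toy_fork, MAX_LINKS) is hypothetical and is not
taken from the patch, and a fixed-size array stands in for whatever list
structure the kernel really uses.  It shows a child VMA inheriting all of
the parent's anon_vma links and then gaining a fresh anon_vma of its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy userland model; all names are hypothetical, not the kernel's. */
#define MAX_LINKS 8

struct toy_anon_vma {
    int owner_pid;                 /* process whose COWed pages land here */
};

struct toy_vma {
    int pid;
    struct toy_anon_vma *links[MAX_LINKS]; /* every anon_vma this VMA is attached to */
    int nlinks;
    struct toy_anon_vma *own;      /* anon_vma that receives this VMA's new pages */
};

/* Attach vma to av.  Can fail, mirroring the point that linking now needs memory. */
static int toy_link(struct toy_vma *vma, struct toy_anon_vma *av)
{
    if (vma->nlinks == MAX_LINKS)
        return -1;
    vma->links[vma->nlinks++] = av;
    return 0;
}

/* Fork: inherit every anon_vma link of the parent, then add a fresh one. */
static int toy_fork(const struct toy_vma *parent, struct toy_vma *child,
                    int child_pid)
{
    struct toy_anon_vma *fresh;
    int i;

    assert(parent != NULL && child != NULL);
    child->pid = child_pid;
    child->nlinks = 0;
    child->own = NULL;

    /* Non-COWed pages may still sit in any of the parent's anon_vmas. */
    for (i = 0; i < parent->nlinks; i++)
        if (toy_link(child, parent->links[i]) != 0)
            return -1;

    /* The child's own COWed pages will be instantiated in this one. */
    fresh = malloc(sizeof(*fresh));
    if (fresh == NULL)
        return -1;
    fresh->owner_pid = child_pid;
    if (toy_link(child, fresh) != 0) {
        free(fresh);
        return -1;
    }
    child->own = fresh;
    return 0;
}
```

Because toy_fork() can return -1, its callers need an error path, which
is the same kind of knock-on effect the commit message describes for
vma_adjust and its callers.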

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to about 2 VMAs per page.
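
The "about 2" figure can be checked with a little arithmetic.  The sketch
below is a toy cost model, not kernel code, and its function names are
made up for illustration.  It assumes one parent and N children, each
with the same number of anonymous pages, that every child page has been
COWed, and that scanning a page costs one unit per VMA linked to that
page's anon_vma.

```c
#include <assert.h>

/*
 * Old scheme: all pages share one anon_vma that is linked to the parent
 * VMA and to every child VMA, so each scan visits N + 1 VMAs.
 */
static double avg_scan_cost_shared(long nchildren)
{
    assert(nchildren > 0);
    return (double)(nchildren + 1);
}

/*
 * New scheme: a child's COWed page lives in the child's own anon_vma
 * (1 VMA visited); only the parent's pages still need N + 1 visits.
 */
static double avg_scan_cost_per_child(long nchildren, long pages_per_proc)
{
    double child_pages, parent_pages, total_visits;

    assert(nchildren > 0 && pages_per_proc > 0);
    child_pages = (double)nchildren * (double)pages_per_proc;
    parent_pages = (double)pages_per_proc;
    total_visits = child_pages * 1.0
                 + parent_pages * (double)(nchildren + 1);
    return total_visits / (child_pages + parent_pages);
}
```

Under these assumptions the average works out to (2N + 1) / (N + 1),
which stays just below 2 however large N grows, while the shared scheme
grows linearly with N.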

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail if it needs to attach a VMA to anon_vma structures, which in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
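
The bit-sharing arrangement can be sketched as follows.  This is an
illustration only: the TOY_ names, the TOY_CONFIG_MMU switch, the bit
value and the toy_rmap_may_walk() helper are all assumptions made for the
example and are not copied from mm.h.  The point is that the preprocessor
picks exactly one meaning for the shared bit, and the build stops if both
are ever defined.

```c
/* Hypothetical stand-in for an MMU build option; remove to model NOMMU. */
#define TOY_CONFIG_MMU 1

/* Illustrative bit value, shared by the two mutually exclusive flags. */
#define TOY_SHARED_BIT 0x01000000UL

#ifdef TOY_CONFIG_MMU
#define TOY_VM_LOCK_RMAP   TOY_SHARED_BIT  /* rmap must not follow this VMA yet */
#else
#define TOY_VM_MAPPED_COPY TOY_SHARED_BIT  /* NOMMU-only meaning of the bit */
#endif

/* Refuse to build if both symbolic names exist for the same bit. */
#if defined(TOY_VM_LOCK_RMAP) && defined(TOY_VM_MAPPED_COPY)
#error "both meanings of the shared flag bit are defined"
#endif

/* Returns 1 if a reverse-map walk may use a VMA with these flags. */
static int toy_rmap_may_walk(unsigned long vm_flags)
{
#ifdef TOY_VM_LOCK_RMAP
    return (vm_flags & TOY_VM_LOCK_RMAP) == 0;
#else
    (void)vm_flags;  /* no rmap lock bit exists in this configuration */
    return 1;
#endif
}
```

In this model a VMA that is still being linked would carry the lock bit,
and a walker calling toy_rmap_may_walk() would leave it alone until the
bit is cleared.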

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time; there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:26 -08:00
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq sparseirq: Use radix_tree instead of ptrs array 2010-02-17 17:27:20 -08:00
power PM / Hibernate: Fix preallocating of memory 2010-02-26 20:39:13 +01:00
time Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-01 08:48:25 -08:00
trace Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2010-03-03 07:34:18 -08:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-15 08:53:10 -08:00
async.c
audit_tree.c new helper: iterate_mounts() 2010-03-03 14:07:57 -05:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h
auditfilter.c
auditsc.c Lose the first argument of audit_inode_child() 2010-02-08 14:38:36 -05:00
backtracetest.c
bounds.c kbuild: move bounds.h to include/generated 2009-12-12 13:08:14 +01:00
capability.c capabilities: Use RCU to protect task lookup in sys_capget 2009-12-10 09:42:48 +11:00
cgroup_freezer.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
cgroup.c sched, cgroups: Fix module export 2010-02-25 12:02:13 +01:00
compat.c
configs.c
cpu.c sched: Correct printk whitespace in warning from cpu down task check 2010-01-28 06:59:55 +01:00
cpuset.c sched: Fix balance vs hotplug race 2009-12-06 21:10:56 +01:00
cred-internals.h
cred.c kernel/cred.c: use kmem_cache_free 2010-02-03 10:21:57 +11:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
early_res.c early_res: Need to save the allocation name in drop_range_partial() 2010-03-01 23:23:02 -08:00
exec_domain.c
exit.c mm: avoid false sharing of mm_counter 2010-03-06 11:26:24 -08:00
extable.c
fork.c mm: change anon_vma linking to fix multi-process server scalability issue 2010-03-06 11:26:26 -08:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex_compat.c futex: Protect pid lookup in compat code with RCU 2009-12-09 14:22:14 +01:00
futex.c futex: Handle futex value corruption gracefully 2010-02-03 15:13:22 +01:00
groups.c
hrtimer.c hrtimers: Convert to raw_spinlocks 2009-12-14 23:55:34 +01:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c perf: Make bp_len type to u64 generic across the arch 2010-02-04 01:07:12 +01:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c hw-breakpoints: Fix broken hw-breakpoint sample module 2009-11-10 11:23:29 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
kexec.c percpu: add __percpu sparse annotations to core kernel subsystems 2010-02-17 11:17:38 +09:00
kfifo.c kfifo: Don't use integer as NULL pointer 2010-02-16 15:11:08 -08:00
kgdb.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-04 16:07:41 -08:00
kmod.c kmod: fix resource leak in call_usermodehelper_pipe() 2010-01-11 09:34:04 -08:00
kprobes.c kprobes: Jump optimization sysctl interface 2010-02-25 17:49:25 +01:00
ksysfs.c sched: Remove USER_SCHED 2010-01-21 13:40:18 +01:00
kthread.c kthread, sched: Remove reference to kthread_create_on_cpu 2010-02-09 11:47:39 +01:00
latencytop.c
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
lockdep.c rcu: Make lockdep_rcu_dereference() message less alarmist 2010-02-26 08:20:46 +01:00
Makefile Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-03 08:15:05 -08:00
module.c Merge branch 'master' into percpu 2010-02-02 14:38:15 +09:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h locking: Implement new raw_spinlock 2009-12-14 23:55:32 +01:00
mutex.c mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
mutex.h
notifier.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c
padata.c padata: Allocate the cpumask for the padata instance 2010-03-04 13:30:22 +08:00
panic.c kmsg_dump: Dump on crash_kexec as well 2009-12-31 19:45:04 +00:00
params.c tree-wide: convert open calls to remove spaces to skip_spaces() lib function 2009-12-15 08:53:32 -08:00
perf_event.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:20:25 -08:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pid.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
pm_qos_params.c pm_qos: clean up racy global "name" variable 2009-10-14 15:31:10 +02:00
posix-cpu-timers.c posix-cpu-timers: optimize and document timer_create callback 2009-11-18 12:36:05 +01:00
posix-timers.c posix-timers.c: Don't export local functions 2010-02-05 14:54:10 +01:00
printk.c Merge branch 'next' into for-linus 2010-03-01 09:36:31 +11:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c ptrace: Fix ptrace_regset() comments and diagnose errors specifically 2010-02-23 13:45:26 -08:00
range.c x86: Change range end to start+size 2010-02-10 17:47:17 -08:00
rcupdate.c rcu: Export rcu_scheduler_active 2010-02-26 08:20:46 +01:00
rcutiny.c rcu: Eliminate unneeded function wrapping 2009-11-22 18:58:16 +01:00
rcutorture.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2010-03-03 07:34:18 -08:00
rcutree_plugin.h rcu: Fix accelerated GPs for last non-dynticked CPU 2010-02-27 09:53:53 +01:00
rcutree_trace.c rcu: Stop overflowing signed integers 2010-02-25 10:34:57 +01:00
rcutree.c rcu: Fix accelerated grace periods for last non-dynticked CPU 2010-02-27 09:53:52 +01:00
rcutree.h rcu: Fix accelerated grace periods for last non-dynticked CPU 2010-02-27 09:53:52 +01:00
relay.c const: constify remaining pipe_buf_operations 2009-12-16 07:20:05 -08:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-03 09:11:02 -08:00
rtmutex_common.h
rtmutex-debug.c sched: Convert pi_lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutes: Convert rtmutex.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex.h
rwsem.c
sched_clock.c sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2009-12-15 09:04:36 +01:00
sched_cpupri.c bitops: rename for_each_bit() to for_each_set_bit() 2010-03-06 11:26:23 -08:00
sched_cpupri.h sched: Convert cpupri lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_debug.c sched: Convert rq->lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_fair.c sched: Fix SCHED_MC regression caused by change in sched cpu_power 2010-02-26 15:45:13 +01:00
sched_features.h sched: Discard some old bits 2009-12-09 10:03:07 +01:00
sched_idletask.c sched: Remove the sched_class load_balance methods 2010-01-21 13:40:09 +01:00
sched_rt.c sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] 2010-02-04 09:57:32 +01:00
sched_stats.h
sched.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2010-03-03 07:34:18 -08:00
seccomp.c
semaphore.c
signal.c Prioritize synchronous signals over 'normal' signals 2010-03-03 19:21:10 -08:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 2009-12-08 07:38:50 -08:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data 2010-01-18 09:02:59 +01:00
softirq.c hrtimer, softirq: Fix hrtimer->softirq trampoline 2010-02-03 18:17:40 +01:00
softlockup.c softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume 2010-02-01 08:22:32 +01:00
spinlock.c locking: Cleanup the name space completely 2009-12-14 23:55:33 +01:00
srcu.c rcu: Introduce lockdep-based checking to RCU read-side primitives 2010-02-25 09:40:59 +01:00
stacktrace.c
stop_machine.c percpu: add __percpu sparse annotations to core kernel subsystems 2010-02-17 11:17:38 +09:00
sys_ni.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-12-08 07:55:01 -08:00
sys.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:31:01 -08:00
sysctl_binary.c Switch may_open() and break_lease() to passing O_... 2010-03-03 13:00:21 -05:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
sysctl.c Merge branch 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-05 10:50:22 -08:00
taskstats.c const: struct nla_policy 2010-02-18 14:30:18 -08:00
test_kprobes.c
time.c Revert "time: Remove xtime_cache" 2009-12-22 14:10:37 -08:00
timeconst.pl
timer.c perf: Fix perf_event_do_pending() fallback callsite 2010-01-21 13:40:39 +01:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c mm: clean up mm_counter 2010-03-06 11:26:23 -08:00
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c sched: Remove USER_SCHED 2010-01-21 13:40:18 +01:00
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
utsname.c
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2009-12-10 09:35:44 -08:00