linux/mm
Nick Piggin 8174c430e4 x86: lockless get_user_pages_fast()
Implement get_user_pages_fast without locking in the fastpath on x86.

Do an optimistic lockless pagetable walk, without taking mmap_sem or any
page table locks.  Page table existence is guaranteed by turning interrupts
off; combined with the fact that we're always looking up the current mm,
this means we can do the lockless page table walk within the constraints
of the TLB shootdown design.  Basically we can do this lockless pagetable
walk in much the same way that the CPU's pagetable walker does not have to
take any locks to find present ptes.

This patch (combined with the subsequent ones to convert direct IO to use
it) was found to give about 10% performance improvement on a 2 socket 8
core Intel Xeon system running an OLTP workload on DB2 v9.5:

 "To test the effects of the patch, an OLTP workload was run on an IBM
  x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
  2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel.  Comparing
  runs with and without the patch resulted in an overall performance
  benefit of ~9.8%.  Correspondingly, oprofiles showed that samples from
  __up_read and __down_read routines that is seen during thread contention
  for system resources was reduced from 2.8% down to .05%.  Monitoring the
  /proc/vmstat output from the patched run showed that the counter for
  fast_gup contained a very high number while the fast_gup_slow value was
  zero."

(fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
counter we had for the number of times the slowpath was invoked).

The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO.  Direct-IO uses get_user_pages, and thus the threads
contend on the mmap_sem cacheline, and can also contend on page table locks.

I would anticipate larger performance gains on larger systems; however, I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we are stuck with "only" a 10% gain.

The downside of using get_user_pages_fast is that if there is not a pte
with the correct permissions for the access, we end up falling back to
get_user_pages, so the get_user_pages_fast attempt is a bit of wasted
extra work.  However, this should not be the common case in most
performance-critical code.
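
The fallback behaviour described above can be illustrated with a small
userspace toy model.  This is only a sketch under stated assumptions, not
the kernel implementation: the names toy_pte, toy_table, toy_gup_fast and
toy_gup_slow are hypothetical, the "page table" is a flat array, and no
real locking, interrupt masking or page pinning takes place.  It only shows
the control flow: try the lockless walk first, stop at the first pte that
is not present or lacks the needed permission, then hand the remainder to
the slow path.  The two counters reuse the fast_gup and fast_gup_slow names
quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Hypothetical toy pte: only the two bits the fast path checks. */
struct toy_pte {
	bool present;
	bool writable;
};

struct toy_pte toy_table[TOY_NR_PAGES];
unsigned long fast_gup;		/* pages handled by the fast path */
unsigned long fast_gup_slow;	/* calls that had to fall back */

/*
 * Stand-in for the slow path (get_user_pages in the real kernel).
 * Here it simply "faults in" each page with the requested permission.
 */
int toy_gup_slow(size_t start, int nr, bool write)
{
	int i;

	for (i = 0; i < nr && start + i < TOY_NR_PAGES; i++) {
		toy_table[start + i].present = true;
		if (write)
			toy_table[start + i].writable = true;
	}
	return i;
}

/*
 * Stand-in for the fast path: walk the table without taking any lock
 * and stop at the first pte that cannot satisfy the access.  Whatever
 * is left over is handed to the slow path, so a failed fast attempt
 * costs some extra work on top of the slow path itself.
 */
int toy_gup_fast(size_t start, int nr, bool write)
{
	int done = 0;

	assert(nr >= 0);

	while (done < nr && start + done < TOY_NR_PAGES) {
		const struct toy_pte *pte = &toy_table[start + done];

		if (!pte->present || (write && !pte->writable))
			break;
		done++;
	}
	fast_gup += done;

	if (done < nr) {
		fast_gup_slow++;
		done += toy_gup_slow(start + done, nr - done, write);
	}
	return done;
}
```

In this model, a write request over pages that are already present and
writable never touches toy_gup_slow, which corresponds to the fast_gup_slow
value of zero reported in the test above.  A request that hits a read-only
or missing pte bumps fast_gup_slow once and then succeeds on the fast path
the next time around.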

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: Kconfig fix]
[akpm@linux-foundation.org: Makefile fix/cleanup]
[akpm@linux-foundation.org: warning fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:06 -07:00
..
allocpercpu.c Merge commit 'v2.6.26-rc9' into cpus4096 2008-07-06 14:23:39 +02:00
backing-dev.c mm: bdi: fix race in bdi_class device creation 2008-05-20 13:31:53 -07:00
bootmem.c bootmem: replace node_boot_start in struct bootmem_data 2008-07-24 10:47:20 -07:00
bounce.c block: Initial support for data-less (or empty) barrier support 2007-10-16 11:03:56 +02:00
dmapool.c dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too 2008-04-28 08:58:20 -07:00
fadvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
filemap_xip.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
filemap.c memcg: remove refcnt from page_cgroup 2008-07-25 10:53:37 -07:00
fremap.c mm: fix various kernel-doc comments 2008-03-19 18:53:35 -07:00
highmem.c highmem: Export totalhigh_pages. 2008-07-19 22:39:46 -07:00
hugetlb.c hugetlb: fix CONFIG_SYSCTL=n build 2008-07-26 12:00:01 -07:00
internal.h mm: export prep_compound_page to mm 2008-07-24 10:47:17 -07:00
Kconfig x86: lockless get_user_pages_fast() 2008-07-26 12:00:06 -07:00
maccess.c kgdb: fix optional arch functions and probe_kernel_* 2008-04-17 20:05:39 +02:00
madvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
Makefile mm: remove mm_init compilation dependency on CONFIG_DEBUG_MEMORY_INIT 2008-07-24 10:47:17 -07:00
memcontrol.c memcg: limit change shrink usage 2008-07-25 10:53:37 -07:00
memory_hotplug.c memory-hotplug: add sysfs removable attribute for hotplug memory remove 2008-07-24 10:47:21 -07:00
memory.c hugetlb: introduce pud_huge 2008-07-24 10:47:18 -07:00
mempolicy.c hugetlb: modular state for hugetlb page size 2008-07-24 10:47:17 -07:00
mempool.c spelling fixes: mm/ 2007-10-20 01:27:18 +02:00
migrate.c memcg: remove refcnt from page_cgroup 2008-07-25 10:53:37 -07:00
mincore.c mm: remove nopage 2008-04-28 08:58:18 -07:00
mlock.c
mm_init.c mm: create /sys/kernel/mm 2008-07-24 10:47:17 -07:00
mmap.c hugetlb: modular state for hugetlb page size 2008-07-24 10:47:17 -07:00
mmzone.c mm: filter based on a nodemask as well as a gfp_mask 2008-04-28 08:58:19 -07:00
mprotect.c mm: record MAP_NORESERVE status on vmas and fix small page mprotect reservations 2008-07-24 10:47:16 -07:00
mremap.c sparse pointer use of zero as null 2007-10-18 14:37:31 -07:00
msync.c
nommu.c nommu: Correct kobjsize() page validity checks. 2008-06-12 07:56:17 -07:00
oom_kill.c oom_kill: remove unused parameter in badness() 2008-04-28 08:58:26 -07:00
page_alloc.c memory hotplug: small fixes to bootmem freeing for memory hotremove 2008-07-24 10:47:21 -07:00
page_io.c mm: fix PageUptodate data race 2008-02-05 09:44:19 -08:00
page_isolation.c memory hotremove: unset migrate type "ISOLATE" after removal 2007-11-14 18:45:38 -08:00
page-writeback.c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2008-07-15 08:36:38 -07:00
pagewalk.c pagemap: pass mm into pagewalkers 2008-06-12 18:05:41 -07:00
pdflush.c pdflush: use time_after() instead of open-coding it 2008-07-25 10:53:28 -07:00
prio_tree.c spelling fixes: mm/ 2007-10-20 01:27:18 +02:00
quicklist.c quicklists: Only consider memory that can be used with GFP_KERNEL 2008-01-14 08:52:22 -08:00
readahead.c mm: bdi: export BDI attributes in sysfs 2008-04-30 08:29:49 -07:00
rmap.c memcg: remove refcnt from page_cgroup 2008-07-25 10:53:37 -07:00
shmem_acl.c
shmem.c memcg: helper function for relcaim from shmem. 2008-07-25 10:53:37 -07:00
slab.c Merge branch 'generic-ipi' into generic-ipi-for-linus 2008-07-15 21:55:59 +02:00
slob.c slob: record page flag overlays explicitly 2008-07-24 10:47:15 -07:00
slub.c slub: record page flag overlays explicitly 2008-07-24 10:47:15 -07:00
sparse-vmemmap.c Christoph has moved 2008-07-04 10:40:04 -07:00
sparse.c memory hotplug: allocate usemap on the section with pgdat 2008-07-24 10:47:21 -07:00
swap_state.c mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
swap.c mm: remove initialization of static per-cpu variables 2008-07-24 10:47:21 -07:00
swapfile.c mm: fix ever-decreasing swap priority 2008-07-24 10:47:21 -07:00
thrash.c
tiny-shmem.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2008-03-25 08:57:47 -07:00
truncate.c fix invalidate_inode_pages2_range() to not clear ret 2008-04-28 08:58:18 -07:00
util.c uninline arch_pick_mmap_layout() 2008-07-26 12:00:01 -07:00
vmalloc.c vmallocinfo: add NUMA information 2008-07-24 10:47:17 -07:00
vmscan.c per-task-delay-accounting: add memory reclaim delay 2008-07-25 10:53:47 -07:00
vmstat.c mm/vmstat.c: proper externs 2008-07-24 10:47:14 -07:00