linux/mm
KAMEZAWA Hiroyuki f0c0b2b808 change zonelist order: zonelist order selection logic
Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * lower zone will be exhansted before higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.

Pros.
 * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
 * memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.

Thanks to Lee Schermerhorn, he gives me much help and codes.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
..
allocpercpu.c [PATCH] Allow NULL pointers in percpu_free 2006-12-07 08:39:22 -08:00
backing-dev.c [PATCH] nfs: fix congestion control 2007-03-16 19:25:05 -07:00
bootmem.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
bounce.c block: blk_max_pfn is somtimes wrong 2007-03-27 08:52:47 +02:00
fadvise.c [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_path 2006-12-08 08:28:43 -08:00
filemap_xip.c xip sendfile removal 2007-07-10 08:04:15 +02:00
filemap.c sendfile: kill generic_file_sendfile() 2007-07-10 08:04:14 +02:00
filemap.h Remove all inclusions of <linux/config.h> 2006-10-04 03:38:54 -04:00
fremap.c [PATCH] mm: more rmap debugging 2006-12-22 08:55:49 -08:00
highmem.c [PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pages 2007-05-02 19:27:15 +02:00
hugetlb.c Rework ptep_set_access_flags and fix sun4c 2007-06-16 13:16:16 -07:00
internal.h Make page->private usable in compound pages 2007-05-07 12:12:53 -07:00
Kconfig sh64: generic quicklist support. 2007-05-14 09:55:35 +09:00
madvise.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
Makefile Quicklists for page table pages 2007-05-07 12:12:54 -07:00
memory_hotplug.c memory hotplug: fix unnecessary calling of init_currenty_empty_zone() 2007-06-01 08:18:29 -07:00
memory.c Rework ptep_set_access_flags and fix sun4c 2007-06-16 13:16:16 -07:00
mempolicy.c [PATCH] Page migration: Fix vma flag checking 2007-03-05 07:57:51 -08:00
mempool.c [PATCH] Numerous fixes to kernel-doc info in source files. 2007-02-11 10:51:32 -08:00
migrate.c page migration: fix NR_FILE_PAGES accounting 2007-04-24 08:23:08 -07:00
mincore.c [PATCH] mincore: vma crossing fix 2007-02-15 09:57:03 -08:00
mlock.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
mmap.c security: Protection for exploiting null dereference using mmap 2007-07-11 22:52:29 -04:00
mmzone.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
mprotect.c [PATCH] paravirt: lazy mmu mode hooks.patch 2006-10-01 00:39:33 -07:00
mremap.c security: Protection for exploiting null dereference using mmap 2007-07-11 22:52:29 -04:00
msync.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
nommu.c security: Protection for exploiting null dereference using mmap 2007-07-11 22:52:29 -04:00
oom_kill.c oom: fix constraint deadlock 2007-05-07 12:12:55 -07:00
page_alloc.c change zonelist order: zonelist order selection logic 2007-07-16 09:05:35 -07:00
page_io.c [PATCH] swsusp: use block device offsets to identify swap locations 2006-12-07 08:39:27 -08:00
page-writeback.c consolidate generic_writepages and mpage_writepages 2007-05-11 08:29:35 -07:00
pdflush.c [PATCH] Add include/linux/freezer.h and move definitions from sched.h 2006-12-07 08:39:27 -08:00
prio_tree.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
quicklist.c Quicklists for page table pages 2007-05-07 12:12:54 -07:00
readahead.c readahead: code cleanup 2007-05-07 12:12:52 -07:00
rmap.c mm: kill validate_anon_vma to avoid mapcount BUG 2007-06-28 11:34:53 -07:00
shmem_acl.c [PATCH] Fix typos in mm/shmem_acl.c 2006-10-11 11:14:23 -07:00
shmem.c shmem: convert to using splice instead of sendfile() 2007-07-10 08:04:15 +02:00
slab.c Fix slab redzone alignment 2007-07-05 15:54:13 -07:00
slob.c Remove SLAB_CTOR_CONSTRUCTOR 2007-05-17 05:23:04 -07:00
slub.c slub: remove useless EXPORT_SYMBOL 2007-07-06 11:45:11 -07:00
sparse.c Move three functions that are only needed for CONFIG_MEMORY_HOTPLUG 2007-06-08 17:23:33 -07:00
swap_state.c [PATCH] lockdep: locking init debugging improvement 2006-07-03 15:27:02 -07:00
swap.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
swapfile.c mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
thrash.c Bug in mm/thrash.c function grab_swap_token() 2007-05-11 08:29:32 -07:00
tiny-shmem.c [PATCH] mm/{,tiny-}shmem.c cleanups 2007-03-01 14:53:35 -08:00
truncate.c fs: convert core functions to zero_user_page 2007-05-09 12:30:55 -07:00
util.c [PATCH] slab: clean up leak tracking ifdefs a little bit 2006-10-04 07:55:13 -07:00
vmalloc.c Make __vunmap static 2007-05-17 05:23:04 -07:00
vmscan.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
vmstat.c mm: fixup /proc/vmstat output 2007-07-06 10:26:50 -07:00