linux/include
Johannes Weiner 81c0a2bb51 mm: page_alloc: fair zone allocator policy
Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size.  Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in.  Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.

But exactly this happens right now because of the way the page allocator
and kswapd interact.  The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page.  When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries.  kswapd reclaims from zones that are below their high
watermark, while a zone can be allocated from as soon as it is above its
low watermark.  The allocator may therefore keep kswapd running while
kswapd reclaim ensures that the page allocator can keep allocating from
the first zone in the zonelist for extended periods of time.  Meanwhile,
the other zones rarely see new allocations and are aged much more slowly.
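
The interaction described above can be reproduced in a small toy model.
This is only a sketch under stated assumptions, not kernel code: the names
toy_state, toy_tick and toy_run are hypothetical, both zones share the same
watermarks, and kswapd is assumed to reclaim exactly one page per tick from
every zone below the high watermark.

```c
#include <assert.h>

#define NZONES 2

/* Hypothetical toy model of two zones; index 0 is the preferred zone. */
struct toy_state {
    long free[NZONES];    /* free pages per zone */
    long allocs[NZONES];  /* pages handed out per zone so far */
    long low, high;       /* watermarks, shared by both zones (assumption) */
    int kswapd_awake;     /* nonzero while background reclaim is running */
};

/* One time step: background reclaim first, then one allocation attempt. */
static void toy_tick(struct toy_state *s)
{
    int i;

    /* kswapd: reclaim one page from every zone below the high watermark,
     * and go back to sleep once no zone needed reclaim. */
    if (s->kswapd_awake) {
        int busy = 0;

        for (i = 0; i < NZONES; i++) {
            if (s->free[i] < s->high) {
                s->free[i]++;
                busy = 1;
            }
        }
        s->kswapd_awake = busy;
    }

    /* Allocator: take the first zone in the list above the low watermark. */
    for (i = 0; i < NZONES; i++) {
        if (s->free[i] > s->low) {
            s->free[i]--;
            s->allocs[i]++;
            return;
        }
    }

    /* No zone qualified: wake kswapd, the allocation is retried next tick. */
    s->kswapd_awake = 1;
}

/* Run the model for a number of ticks. */
static void toy_run(struct toy_state *s, long ticks)
{
    assert(ticks >= 0);
    while (ticks-- > 0)
        toy_tick(s);
}
```

Starting with 100 free pages in each zone, low = 10 and high = 20, running
toy_run() for 1181 ticks hands out 1090 pages from zone 0 and only 90 from
zone 1.  Once kswapd is awake, zone 0 hovers just above its low watermark
and absorbs every allocation, while zone 1 refills to its high watermark
and is never allocated from again.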

The result is that the occasional page placed in lower zones gets
relatively more time in memory and is even promoted to the active list
after its peers have long been evicted.  Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is.  In this scenario, no single
page should be able to stay in memory long enough to get referenced twice and
activated, but activation happens in spades:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 0
      nr_active_file 8
      nr_inactive_file 1582
      nr_active_file 11994
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 70
      nr_inactive_file 258753
      nr_active_file 443214
      nr_inactive_file 149793
      nr_active_file 12021

Fix this with a very simple round robin allocator.  Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full.  The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim.  Allocation and reclaim are now spread fairly across all
available/allowable zones:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 174
      nr_active_file 4865
      nr_inactive_file 53
      nr_active_file 860
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 666622
      nr_active_file 4988
      nr_inactive_file 190969
      nr_active_file 937
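
The round-robin policy can be sketched as follows.  This is a minimal
illustration, not the code of this patch: toy_zone, reset_batches,
toy_alloc_page and BATCH_DIVISOR are hypothetical names, the batch is
assumed to be a fixed fraction of the zone size, and the slowpath with its
kswapd wakeup is reduced to a plain batch reset.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed fraction: each round a zone may hand out size / BATCH_DIVISOR pages. */
#define BATCH_DIVISOR 4

/* Hypothetical, simplified zone; not the kernel's struct zone. */
struct toy_zone {
    long size;   /* zone size in pages */
    long free;   /* pages still free */
    long batch;  /* allocations still allowed in the current round */
};

/* Refill every zone's batch in proportion to its size. */
static void reset_batches(struct toy_zone *zones, size_t n)
{
    size_t i;

    assert(zones != NULL);
    for (i = 0; i < n; i++)
        zones[i].batch = zones[i].size / BATCH_DIVISOR;
}

/*
 * Allocate one page.  Returns the index of the zone used, or -1 if no
 * zone has a free page left.
 */
static int toy_alloc_page(struct toy_zone *zones, size_t n)
{
    int pass;
    size_t i;

    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < n; i++) {
            struct toy_zone *z = &zones[i];

            /* A zone with an exhausted batch is treated as full. */
            if (z->batch <= 0 || z->free <= 0)
                continue;
            z->batch--;
            z->free--;
            return (int)i;
        }
        /*
         * All zones have been tried.  The real allocator would enter
         * the slowpath and kick kswapd here; the sketch only resets
         * the batches and retries once.
         */
        reset_batches(zones, n);
    }
    return -1;
}
```

With two zones of 100 and 300 pages and batches initialized through
reset_batches(), 200 calls to toy_alloc_page() take 50 pages from the first
zone and 150 from the second, matching the 1:3 size ratio.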

When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:23 -07:00
acpi Merge branch 'acpi-hotplug' 2013-08-30 14:14:25 +02:00
asm-generic Merge branch 'for-v3.12' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping 2013-09-09 10:26:33 -07:00
clocksource ARM: SoC cleanups for 3.12 2013-09-06 13:21:16 -07:00
crypto crypto: scatterwalk - Add support for calculating number of SG elements 2013-08-21 21:27:58 +10:00
drm Merge tag 'drm-intel-fixes-2013-09-06' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes 2013-09-10 12:36:55 +10:00
dt-bindings Device tree core updates for v3.12 2013-09-10 13:53:52 -07:00
keys
kvm ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256 2013-08-30 16:12:39 +03:00
linux mm: page_alloc: fair zone allocator policy 2013-09-11 15:57:23 -07:00
math-emu
media Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-09-05 14:54:29 -07:00
memory
misc
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-09-05 14:54:29 -07:00
pcmcia
ras
rdma Merge branches 'cxgb4', 'flowsteer', 'ipoib', 'iser', 'mlx4', 'ocrdma' and 'qib' into for-next 2013-09-03 09:01:08 -07:00
rxrpc
scsi [SCSI] Generate uevents on certain unit attention codes 2013-08-26 18:52:27 +04:00
sound Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-09-05 11:55:59 -07:00
target
trace mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint() 2013-09-11 15:57:19 -07:00
uapi Add the ability to collect I/O statistics on user-defined regions of a 2013-09-10 13:06:15 -07:00
video fbdev changes for 3.12: 2013-09-05 09:49:32 -07:00
xen Features: 2013-09-04 17:45:39 -07:00
Kbuild