linux/Documentation
Dan Williams e900a918b0 mm: shuffle initial free memory to improve memory-side-cache utilization
Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache.  Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms.  In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e5
("x86/numa_emulation: Introduce uniform split capability").  With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes.  A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable.  While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts.  Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived cased used the numa_emulation capability to force an
   instance of the benchmark to be run in two of the near-memory sized
   numa nodes.  If both instances were placed on the same emulated they
   would fit and cause zero conflicts.  While on separate emulated nodes
   without randomization they underutilized the cache and conflicted
   unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.  The cache conflict rate was 8%
   for the first run and degraded to 21% after page allocator aging.  With
   randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of some
   improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads.  For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization.  Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects.  The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages.  Pages are freed in physical address order when first
onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time.  Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.  10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions.  It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages.  At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent.  As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
  Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
  Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
..
ABI Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-14 07:57:29 -07:00
accelerators
accounting psi: introduce psi monitor 2019-05-14 19:52:48 -07:00
acpi/dsd LED updates for 5.2-rc1. 2019-05-07 18:02:51 -07:00
admin-guide mm: shuffle initial free memory to improve memory-side-cache utilization 2019-05-14 19:52:48 -07:00
aoe
arm ARM: 8833/1: Ensure that NEON code always compiles with Clang 2019-02-12 15:20:09 +00:00
arm64 Merge branch 'for-next/perf' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into for-next/core 2019-05-03 10:18:08 +01:00
auxdisplay
backlight
block block: null: Add documentation for "zone_nr_conv" param 2019-04-09 08:18:23 -06:00
blockdev zram: idle writeback fixes and cleanup 2019-01-08 17:15:10 -08:00
bpf docs/btf: fix the missing section marks 2019-05-09 15:59:59 -07:00
bus-devices
cdrom
cgroup-v1 A fairly routine cycle for docs - lots of typo fixes, some new documents, 2019-03-09 09:56:17 -08:00
cma
connector
console
core-api A reasonably busy cycle for docs, including: 2019-05-08 12:42:50 -07:00
cpu-freq Documentation: cpu-freq: Frequencies aren't always sorted 2018-11-07 13:29:04 +01:00
crypto crypto: shash - remove shash_desc::flags 2019-04-25 15:38:12 +08:00
dev-tools A reasonably busy cycle for docs, including: 2019-05-08 12:42:50 -07:00
device-mapper dm cache: add support for discard passdown to the origin device 2019-03-05 14:53:52 -05:00
devicetree - Fix-ups 2019-05-14 10:45:03 -07:00
doc-guide docs: doc-guide: remove the extension from .rst files 2019-04-19 12:46:27 -06:00
driver-api This is the bulk of the GPIO changes for the v5.2 kernel cycle: 2019-05-11 10:54:43 -04:00
driver-model drm-misc-next for 5.2: 2019-03-25 11:05:12 +01:00
early-userspace Correct gen_init_cpio tool's documentation 2018-11-25 12:25:53 -07:00
EDID Docs/EDID: Calculate CRC while building the code 2018-11-06 07:36:22 -07:00
extcon
fault-injection doc: fault-injection: fix macro name in example 2019-01-07 15:36:11 -07:00
fb fbdev: fbmem: convert CONFIG_FB_LOGO_CENTER into a cmd line option 2019-01-16 17:42:35 +01:00
features Merge branch 'parisc-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2019-05-07 19:34:17 -07:00
filesystems Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:50:27 -07:00
firmware_class
firmware-guide Documentation: ACPI: move video_extension.txt to firmware-guide/acpi and convert to reST 2019-04-25 23:07:20 +02:00
fmc
fpga
gpio docs: gpio: convert docs to ReST and rename to *.rst 2019-04-23 23:30:07 +02:00
gpu drm: add drm_format_helper.c to kerneldoc 2019-04-17 09:39:22 +02:00
hid HID: doc: fix wrong data structure reference for UHID_OUTPUT 2018-12-18 14:55:22 +01:00
hwmon hwmon: (lm75) Add support for TMP75B 2019-05-03 13:16:18 -07:00
i2c i2c-piix4: Add Hygon Dhyana SMBus support 2019-05-03 16:47:54 +02:00
ia64
ide
iio
infiniband Documentation/infiniband: update from locked to pinned_vm 2019-02-07 12:56:23 -07:00
input doc: Change LXR references to elixir.bootlin.com 2019-02-01 16:05:03 -07:00
interconnect interconnect: Add generic on-chip interconnect API 2019-01-22 13:37:25 +01:00
ioctl seccomp: add a return code to trap to userspace 2018-12-11 16:28:41 -08:00
isdn
kbuild moduleparam: Save information about built-in modules in separate file 2019-05-07 21:50:24 +09:00
kdump kdump: Document kernel data exported in the vmcoreinfo note 2019-01-15 11:05:28 +01:00
kernel-hacking
laptops Documentation: fix lg-laptop.rst warnings 2019-02-11 08:27:47 -07:00
leds Documentation: Use "while" instead of "whilst" 2018-11-20 09:30:43 -07:00
lightnvm
livepatch docs/livepatch: Unify style of livepatch documentation in the ReST format 2019-05-07 16:06:28 -06:00
locking Documentation/locking/lockdep: Drop last two chars of sample states 2019-03-04 12:55:18 -07:00
m68k
maintainer
md
media media updates for v5.1-rc1 2019-05-08 11:13:17 -07:00
memory-devices
mic
mips
misc-devices Documentation: add ibmvmc to toctree(index) and fix warnings 2019-01-14 08:37:17 -07:00
mmc
mtd
namespaces
netlabel
networking Documentation: net: dsa: sja1105: Add info about supported traffic modes 2019-05-05 21:52:42 -07:00
nfc
nios2
nvdimm libnvdimm/security: Add documentation for nvdimm security support 2018-12-21 12:44:41 -08:00
nvmem Documentation: nvmem: document cell tables and lookup entries 2018-09-28 15:14:54 +02:00
openrisc
parisc
PCI pci-v4.20-changes 2018-10-25 06:50:48 -07:00
pcmcia
perf Documentation: perf: Add documentation for ThunderX2 PMU uncore driver 2018-12-06 12:29:47 +00:00
phy
platform
power PM/EM: Document the Energy Model framework 2019-01-27 12:29:37 +01:00
powerpc Documentation: powerpc: Expand the DAWR acronym 2019-05-03 02:54:58 +10:00
pps
process A reasonably busy cycle for docs, including: 2019-05-08 12:42:50 -07:00
pti
ptp
rapidio
RCU doc: Fix typos and otherwise modernize checklist.txt 2019-03-26 14:37:06 -07:00
riscv
s390 Documentation: Use "while" instead of "whilst" 2018-11-20 09:30:43 -07:00
scheduler sched/doc: Document Energy Aware Scheduling 2019-01-27 12:29:37 +01:00
scsi scsi: ufs-bsg: Allow reading descriptors 2019-02-27 09:00:02 -05:00
security doc: security: Add kern-doc for lsm_hooks.h 2019-02-22 08:54:09 -07:00
serial docs: serial: convert docs to ReST and rename to *.rst 2019-04-25 11:37:42 +02:00
sh sh: remove board_time_init() callback 2018-12-18 16:13:04 +01:00
sound ALSA: doc: my_chip has no element ioport 2019-04-03 11:55:47 +02:00
sparc docs: sparc: convert to ReST 2019-05-08 17:13:35 -07:00
sphinx
sphinx-static docs: improve readability for people with poorer eyesight 2018-10-07 09:16:50 -06:00
spi spi-summary: document set_cs_timing 2019-04-08 14:13:43 +07:00
sysctl userfaultfd/sysctl: add vm.unprivileged_userfaultfd 2019-05-14 09:47:45 -07:00
target scsi: target/core: Remove the write_pending_status() callback function 2019-02-04 21:23:59 -05:00
thermal docs: hwmon: Add an index file and rename docs to *.rst 2019-04-17 10:37:23 -07:00
timers Docs: Correct /proc/stat path 2019-02-22 08:50:17 -07:00
trace mm: move recent_rotated pages calculation to shrink_inactive_list() 2019-05-14 09:47:45 -07:00
translations Some late arriving documentation changes. In particular, this contains the 2019-05-10 13:24:53 -04:00
usb docs: usb: convert documents to ReST 2019-04-16 12:16:19 +02:00
userspace-api Documentation: seccomp: unify list indentation 2019-03-18 12:00:28 -06:00
virtual KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned size 2019-04-30 21:22:15 +02:00
vm mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays 2019-05-14 09:47:48 -07:00
w1
watchdog Documentation/watchdog: Add documentation mlx-wdt driver 2019-03-02 15:28:20 +01:00
wimax
x86 Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-14 07:57:29 -07:00
xilinx Documentation: xilinx: Add documentation for eemi APIs 2018-10-09 13:26:05 +02:00
xtensa xtensa: document boot parameter passing 2019-02-03 18:06:19 -08:00
.gitignore
atomic_bitops.txt docs: atomic_bitops.txt: add a title for this document 2019-04-11 12:37:02 -06:00
atomic_t.txt Documentation/atomic_t: Clarify signed vs unsigned 2019-03-18 10:27:52 -07:00
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
Changes
clearing-warn-once.txt A reasonably busy cycle for docs, including: 2019-05-08 12:42:50 -07:00
CodingStyle
conf.py This is a fairly typical cycle for documentation. There's some welcome 2018-10-24 18:01:11 +01:00
cpu-load.txt
cputopology.txt topology: Simplify cputopology.txt formatting and wording 2019-04-19 10:56:04 +02:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
digsig.txt
DMA-API-HOWTO.txt DMA mapping updates for 5.2 2019-05-09 08:40:55 -07:00
DMA-API.txt virtio: fixes, cleanups 2019-03-10 12:47:57 -07:00
DMA-attributes.txt
DMA-ISA-LPC.txt Documentation/DMA-ISA-LPC: fix an incorrect reference 2019-02-11 08:23:07 -07:00
docutils.conf
dontdiff A reasonably busy cycle for docs, including: 2019-05-08 12:42:50 -07:00
efi-stub.txt
eisa.txt
futex-requeue-pi.txt
gcc-plugins.txt
highuid.txt
hw_random.txt
hwspinlock.txt
index.rst Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-14 07:57:29 -07:00
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
isa.txt
isapnp.txt
kernel-per-CPU-kthreads.txt
kobject.txt kref/kobject: Improve documentation 2018-12-06 13:57:03 +01:00
kprobes.txt Merge branch 'parisc-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2019-05-07 19:34:17 -07:00
kref.txt
ldm.txt
lockup-watchdogs.txt
logo.gif
logo.txt
lsm.txt
lzo.txt lib/lzo: fix bugs for very short or empty input 2019-04-05 16:02:30 -10:00
mailbox.txt
Makefile docs: Makefile: use latexmk if available 2019-04-01 14:33:42 -06:00
memory-barriers.txt docs/memory-barriers.txt: Update I/O section to be clearer about CPU vs thread 2019-04-23 13:34:17 +01:00
men-chameleon-bus.txt
nommu-mmap.txt
ntb.txt docs: ntb.txt: add blank lines to clean up some Sphinx warnings 2019-04-11 12:37:03 -06:00
numastat.txt
packing.txt lib: Add support for generic packing operations 2019-05-03 10:49:17 -04:00
padata.txt
parport-lowlevel.txt
percpu-rw-semaphore.txt
phy.txt
pi-futex.txt
pnp.txt
preempt-locking.txt x86/fpu: Remove fpu__restore() 2019-04-09 19:27:42 +02:00
pwm.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt futex: Update comments and docs about return values of arch futex code 2019-04-26 13:57:55 +01:00
rpmsg.txt
rtc.txt Documentation: rtc: Correct location of rtctest.c 2019-03-25 10:34:55 -06:00
SAK.txt
sgi-ioc4.txt
siphash.txt
SM501.txt
smsc_ece1099.txt
speculation.txt docs: speculation.txt: mark example blocks as such 2019-04-11 12:37:03 -06:00
static-keys.txt static_keys.txt: Fix trivial spelling mistake 2019-02-06 16:44:16 -07:00
SubmittingPatches
svga.txt
switchtec.txt NTB: switchtec_ntb: Update switchtec documentation with prerequisites for NTB 2018-10-11 11:28:53 -05:00
sync_file.txt
tee.txt
this_cpu_ops.txt
unaligned-memory-access.txt docs: unaligned-memory-access.txt: use a lowercase title 2019-04-11 12:37:03 -06:00
vfio-mediated-device.txt
vfio.txt
video-output.txt docs: video-output.txt: convert it to ReST format 2019-04-11 12:37:03 -06:00
xillybus.txt
xz.txt
zorro.txt