mirror of
https://github.com/torvalds/linux.git
synced 2024-11-10 22:21:40 +00:00
docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
Now that the administrative information for transparent huge pages is nicely separated, move it to its own page under the admin guide. Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This commit is contained in:
parent
aa00eaa9af
commit
45c9a74f64
@ -4313,7 +4313,8 @@
|
|||||||
Format: [always|madvise|never]
|
Format: [always|madvise|never]
|
||||||
Can be used to control the default behavior of the system
|
Can be used to control the default behavior of the system
|
||||||
with respect to transparent hugepages.
|
with respect to transparent hugepages.
|
||||||
See Documentation/vm/transhuge.rst for more details.
|
See Documentation/admin-guide/mm/transhuge.rst
|
||||||
|
for more details.
|
||||||
|
|
||||||
tsc= Disable clocksource stability checks for TSC.
|
tsc= Disable clocksource stability checks for TSC.
|
||||||
Format: <string>
|
Format: <string>
|
||||||
|
@ -27,4 +27,5 @@ the Linux memory management.
|
|||||||
numa_memory_policy
|
numa_memory_policy
|
||||||
pagemap
|
pagemap
|
||||||
soft-dirty
|
soft-dirty
|
||||||
|
transhuge
|
||||||
userfaultfd
|
userfaultfd
|
||||||
|
418
Documentation/admin-guide/mm/transhuge.rst
Normal file
418
Documentation/admin-guide/mm/transhuge.rst
Normal file
@ -0,0 +1,418 @@
|
|||||||
|
.. _admin_guide_transhuge:
|
||||||
|
|
||||||
|
============================
|
||||||
|
Transparent Hugepage Support
|
||||||
|
============================
|
||||||
|
|
||||||
|
Objective
|
||||||
|
=========
|
||||||
|
|
||||||
|
Performance critical computing applications dealing with large memory
|
||||||
|
working sets are already running on top of libhugetlbfs and in turn
|
||||||
|
hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of
|
||||||
|
using huge pages for the backing of virtual memory with huge pages
|
||||||
|
that supports the automatic promotion and demotion of page sizes and
|
||||||
|
without the shortcomings of hugetlbfs.
|
||||||
|
|
||||||
|
Currently THP only works for anonymous memory mappings and tmpfs/shmem.
|
||||||
|
But in the future it can expand to other filesystems.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
in the examples below we presume that the basic page size is 4K and
|
||||||
|
the huge page size is 2M, although the actual numbers may vary
|
||||||
|
depending on the CPU architecture.
|
||||||
|
|
||||||
|
The reason applications are running faster is because of two
|
||||||
|
factors. The first factor is almost completely irrelevant and it's not
|
||||||
|
of significant interest because it'll also have the downside of
|
||||||
|
requiring larger clear-page copy-page in page faults which is a
|
||||||
|
potentially negative effect. The first factor consists in taking a
|
||||||
|
single page fault for each 2M virtual region touched by userland (so
|
||||||
|
reducing the enter/exit kernel frequency by a 512 times factor). This
|
||||||
|
only matters the first time the memory is accessed for the lifetime of
|
||||||
|
a memory mapping. The second long lasting and much more important
|
||||||
|
factor will affect all subsequent accesses to the memory for the whole
|
||||||
|
runtime of the application. The second factor consist of two
|
||||||
|
components:
|
||||||
|
|
||||||
|
1) the TLB miss will run faster (especially with virtualization using
|
||||||
|
nested pagetables but almost always also on bare metal without
|
||||||
|
virtualization)
|
||||||
|
|
||||||
|
2) a single TLB entry will be mapping a much larger amount of virtual
|
||||||
|
memory in turn reducing the number of TLB misses. With
|
||||||
|
virtualization and nested pagetables the TLB can be mapped of
|
||||||
|
larger size only if both KVM and the Linux guest are using
|
||||||
|
hugepages but a significant speedup already happens if only one of
|
||||||
|
the two is using hugepages just because of the fact the TLB miss is
|
||||||
|
going to run faster.
|
||||||
|
|
||||||
|
THP can be enabled system wide or restricted to certain tasks or even
|
||||||
|
memory ranges inside task's address space. Unless THP is completely
|
||||||
|
disabled, there is ``khugepaged`` daemon that scans memory and
|
||||||
|
collapses sequences of basic pages into huge pages.
|
||||||
|
|
||||||
|
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
|
||||||
|
interface and using madivse(2) and prctl(2) system calls.
|
||||||
|
|
||||||
|
Transparent Hugepage Support maximizes the usefulness of free memory
|
||||||
|
if compared to the reservation approach of hugetlbfs by allowing all
|
||||||
|
unused memory to be used as cache or other movable (or even unmovable
|
||||||
|
entities). It doesn't require reservation to prevent hugepage
|
||||||
|
allocation failures to be noticeable from userland. It allows paging
|
||||||
|
and all other advanced VM features to be available on the
|
||||||
|
hugepages. It requires no modifications for applications to take
|
||||||
|
advantage of it.
|
||||||
|
|
||||||
|
Applications however can be further optimized to take advantage of
|
||||||
|
this feature, like for example they've been optimized before to avoid
|
||||||
|
a flood of mmap system calls for every malloc(4k). Optimizing userland
|
||||||
|
is by far not mandatory and khugepaged already can take care of long
|
||||||
|
lived page allocations even for hugepage unaware applications that
|
||||||
|
deals with large amounts of memory.
|
||||||
|
|
||||||
|
In certain cases when hugepages are enabled system wide, application
|
||||||
|
may end up allocating more memory resources. An application may mmap a
|
||||||
|
large region but only touch 1 byte of it, in that case a 2M page might
|
||||||
|
be allocated instead of a 4k page for no good. This is why it's
|
||||||
|
possible to disable hugepages system-wide and to only have them inside
|
||||||
|
MADV_HUGEPAGE madvise regions.
|
||||||
|
|
||||||
|
Embedded systems should enable hugepages only inside madvise regions
|
||||||
|
to eliminate any risk of wasting any precious byte of memory and to
|
||||||
|
only run faster.
|
||||||
|
|
||||||
|
Applications that gets a lot of benefit from hugepages and that don't
|
||||||
|
risk to lose memory by using hugepages, should use
|
||||||
|
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
|
||||||
|
|
||||||
|
.. _thp_sysfs:
|
||||||
|
|
||||||
|
sysfs
|
||||||
|
=====
|
||||||
|
|
||||||
|
Global THP controls
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
Transparent Hugepage Support for anonymous memory can be entirely disabled
|
||||||
|
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
|
||||||
|
regions (to avoid the risk of consuming more memory resources) or enabled
|
||||||
|
system wide. This can be achieved with one of::
|
||||||
|
|
||||||
|
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
||||||
|
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
||||||
|
echo never >/sys/kernel/mm/transparent_hugepage/enabled
|
||||||
|
|
||||||
|
It's also possible to limit defrag efforts in the VM to generate
|
||||||
|
anonymous hugepages in case they're not immediately free to madvise
|
||||||
|
regions or to never try to defrag memory and simply fallback to regular
|
||||||
|
pages unless hugepages are immediately available. Clearly if we spend CPU
|
||||||
|
time to defrag memory, we would expect to gain even more by the fact we
|
||||||
|
use hugepages later instead of regular pages. This isn't always
|
||||||
|
guaranteed, but it may be more likely in case the allocation is for a
|
||||||
|
MADV_HUGEPAGE region.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
|
||||||
|
always
|
||||||
|
means that an application requesting THP will stall on
|
||||||
|
allocation failure and directly reclaim pages and compact
|
||||||
|
memory in an effort to allocate a THP immediately. This may be
|
||||||
|
desirable for virtual machines that benefit heavily from THP
|
||||||
|
use and are willing to delay the VM start to utilise them.
|
||||||
|
|
||||||
|
defer
|
||||||
|
means that an application will wake kswapd in the background
|
||||||
|
to reclaim pages and wake kcompactd to compact memory so that
|
||||||
|
THP is available in the near future. It's the responsibility
|
||||||
|
of khugepaged to then install the THP pages later.
|
||||||
|
|
||||||
|
defer+madvise
|
||||||
|
will enter direct reclaim and compaction like ``always``, but
|
||||||
|
only for regions that have used madvise(MADV_HUGEPAGE); all
|
||||||
|
other regions will wake kswapd in the background to reclaim
|
||||||
|
pages and wake kcompactd to compact memory so that THP is
|
||||||
|
available in the near future.
|
||||||
|
|
||||||
|
madvise
|
||||||
|
will enter direct reclaim like ``always`` but only for regions
|
||||||
|
that are have used madvise(MADV_HUGEPAGE). This is the default
|
||||||
|
behaviour.
|
||||||
|
|
||||||
|
never
|
||||||
|
should be self-explanatory.
|
||||||
|
|
||||||
|
By default kernel tries to use huge zero page on read page fault to
|
||||||
|
anonymous mapping. It's possible to disable huge zero page by writing 0
|
||||||
|
or enable it back by writing 1::
|
||||||
|
|
||||||
|
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||||
|
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||||
|
|
||||||
|
Some userspace (such as a test program, or an optimized memory allocation
|
||||||
|
library) may want to know the size (in bytes) of a transparent hugepage::
|
||||||
|
|
||||||
|
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
|
||||||
|
|
||||||
|
khugepaged will be automatically started when
|
||||||
|
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
|
||||||
|
be automatically shutdown if it's set to "never".
|
||||||
|
|
||||||
|
Khugepaged controls
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
khugepaged runs usually at low frequency so while one may not want to
|
||||||
|
invoke defrag algorithms synchronously during the page faults, it
|
||||||
|
should be worth invoking defrag at least in khugepaged. However it's
|
||||||
|
also possible to disable defrag in khugepaged by writing 0 or enable
|
||||||
|
defrag in khugepaged by writing 1::
|
||||||
|
|
||||||
|
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||||
|
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||||
|
|
||||||
|
You can also control how many pages khugepaged should scan at each
|
||||||
|
pass::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
||||||
|
|
||||||
|
and how many milliseconds to wait in khugepaged between each pass (you
|
||||||
|
can set this to 0 to run khugepaged at 100% utilization of one core)::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
||||||
|
|
||||||
|
and how many milliseconds to wait in khugepaged if there's an hugepage
|
||||||
|
allocation failure to throttle the next allocation attempt::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
||||||
|
|
||||||
|
The khugepaged progress can be seen in the number of pages collapsed::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
||||||
|
|
||||||
|
for each pass::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
||||||
|
|
||||||
|
``max_ptes_none`` specifies how many extra small pages (that are
|
||||||
|
not already mapped) can be allocated when collapsing a group
|
||||||
|
of small pages into one large page::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
|
||||||
|
|
||||||
|
A higher value leads to use additional memory for programs.
|
||||||
|
A lower value leads to gain less thp performance. Value of
|
||||||
|
max_ptes_none can waste cpu time very little, you can
|
||||||
|
ignore it.
|
||||||
|
|
||||||
|
``max_ptes_swap`` specifies how many pages can be brought in from
|
||||||
|
swap when collapsing a group of pages into a transparent huge page::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
|
||||||
|
|
||||||
|
A higher value can cause excessive swap IO and waste
|
||||||
|
memory. A lower value can prevent THPs from being
|
||||||
|
collapsed, resulting fewer pages being collapsed into
|
||||||
|
THPs, and lower memory access performance.
|
||||||
|
|
||||||
|
Boot parameter
|
||||||
|
==============
|
||||||
|
|
||||||
|
You can change the sysfs boot time defaults of Transparent Hugepage
|
||||||
|
Support by passing the parameter ``transparent_hugepage=always`` or
|
||||||
|
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
|
||||||
|
to the kernel command line.
|
||||||
|
|
||||||
|
Hugepages in tmpfs/shmem
|
||||||
|
========================
|
||||||
|
|
||||||
|
You can control hugepage allocation policy in tmpfs with mount option
|
||||||
|
``huge=``. It can have following values:
|
||||||
|
|
||||||
|
always
|
||||||
|
Attempt to allocate huge pages every time we need a new page;
|
||||||
|
|
||||||
|
never
|
||||||
|
Do not allocate huge pages;
|
||||||
|
|
||||||
|
within_size
|
||||||
|
Only allocate huge page if it will be fully within i_size.
|
||||||
|
Also respect fadvise()/madvise() hints;
|
||||||
|
|
||||||
|
advise
|
||||||
|
Only allocate huge pages if requested with fadvise()/madvise();
|
||||||
|
|
||||||
|
The default policy is ``never``.
|
||||||
|
|
||||||
|
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
|
||||||
|
``huge=never`` will not attempt to break up huge pages at all, just stop more
|
||||||
|
from being allocated.
|
||||||
|
|
||||||
|
There's also sysfs knob to control hugepage allocation policy for internal
|
||||||
|
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
|
||||||
|
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
|
||||||
|
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
|
||||||
|
|
||||||
|
In addition to policies listed above, shmem_enabled allows two further
|
||||||
|
values:
|
||||||
|
|
||||||
|
deny
|
||||||
|
For use in emergencies, to force the huge option off from
|
||||||
|
all mounts;
|
||||||
|
force
|
||||||
|
Force the huge option on for all - very useful for testing;
|
||||||
|
|
||||||
|
Need of application restart
|
||||||
|
===========================
|
||||||
|
|
||||||
|
The transparent_hugepage/enabled values and tmpfs mount option only affect
|
||||||
|
future behavior. So to make them effective you need to restart any
|
||||||
|
application that could have been using hugepages. This also applies to the
|
||||||
|
regions registered in khugepaged.
|
||||||
|
|
||||||
|
Monitoring usage
|
||||||
|
================
|
||||||
|
|
||||||
|
The number of anonymous transparent huge pages currently used by the
|
||||||
|
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
|
||||||
|
To identify what applications are using anonymous transparent huge pages,
|
||||||
|
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
|
||||||
|
for each mapping.
|
||||||
|
|
||||||
|
The number of file transparent huge pages mapped to userspace is available
|
||||||
|
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
||||||
|
To identify what applications are mapping file transparent huge pages, it
|
||||||
|
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
|
||||||
|
for each mapping.
|
||||||
|
|
||||||
|
Note that reading the smaps file is expensive and reading it
|
||||||
|
frequently will incur overhead.
|
||||||
|
|
||||||
|
There are a number of counters in ``/proc/vmstat`` that may be used to
|
||||||
|
monitor how successfully the system is providing huge pages for use.
|
||||||
|
|
||||||
|
thp_fault_alloc
|
||||||
|
is incremented every time a huge page is successfully
|
||||||
|
allocated to handle a page fault. This applies to both the
|
||||||
|
first time a page is faulted and for COW faults.
|
||||||
|
|
||||||
|
thp_collapse_alloc
|
||||||
|
is incremented by khugepaged when it has found
|
||||||
|
a range of pages to collapse into one huge page and has
|
||||||
|
successfully allocated a new huge page to store the data.
|
||||||
|
|
||||||
|
thp_fault_fallback
|
||||||
|
is incremented if a page fault fails to allocate
|
||||||
|
a huge page and instead falls back to using small pages.
|
||||||
|
|
||||||
|
thp_collapse_alloc_failed
|
||||||
|
is incremented if khugepaged found a range
|
||||||
|
of pages that should be collapsed into one huge page but failed
|
||||||
|
the allocation.
|
||||||
|
|
||||||
|
thp_file_alloc
|
||||||
|
is incremented every time a file huge page is successfully
|
||||||
|
allocated.
|
||||||
|
|
||||||
|
thp_file_mapped
|
||||||
|
is incremented every time a file huge page is mapped into
|
||||||
|
user address space.
|
||||||
|
|
||||||
|
thp_split_page
|
||||||
|
is incremented every time a huge page is split into base
|
||||||
|
pages. This can happen for a variety of reasons but a common
|
||||||
|
reason is that a huge page is old and is being reclaimed.
|
||||||
|
This action implies splitting all PMD the page mapped with.
|
||||||
|
|
||||||
|
thp_split_page_failed
|
||||||
|
is incremented if kernel fails to split huge
|
||||||
|
page. This can happen if the page was pinned by somebody.
|
||||||
|
|
||||||
|
thp_deferred_split_page
|
||||||
|
is incremented when a huge page is put onto split
|
||||||
|
queue. This happens when a huge page is partially unmapped and
|
||||||
|
splitting it would free up some memory. Pages on split queue are
|
||||||
|
going to be split under memory pressure.
|
||||||
|
|
||||||
|
thp_split_pmd
|
||||||
|
is incremented every time a PMD split into table of PTEs.
|
||||||
|
This can happen, for instance, when application calls mprotect() or
|
||||||
|
munmap() on part of huge page. It doesn't split huge page, only
|
||||||
|
page table entry.
|
||||||
|
|
||||||
|
thp_zero_page_alloc
|
||||||
|
is incremented every time a huge zero page is
|
||||||
|
successfully allocated. It includes allocations which where
|
||||||
|
dropped due race with other allocation. Note, it doesn't count
|
||||||
|
every map of the huge zero page, only its allocation.
|
||||||
|
|
||||||
|
thp_zero_page_alloc_failed
|
||||||
|
is incremented if kernel fails to allocate
|
||||||
|
huge zero page and falls back to using small pages.
|
||||||
|
|
||||||
|
thp_swpout
|
||||||
|
is incremented every time a huge page is swapout in one
|
||||||
|
piece without splitting.
|
||||||
|
|
||||||
|
thp_swpout_fallback
|
||||||
|
is incremented if a huge page has to be split before swapout.
|
||||||
|
Usually because failed to allocate some continuous swap space
|
||||||
|
for the huge page.
|
||||||
|
|
||||||
|
As the system ages, allocating huge pages may be expensive as the
|
||||||
|
system uses memory compaction to copy data around memory to free a
|
||||||
|
huge page for use. There are some counters in ``/proc/vmstat`` to help
|
||||||
|
monitor this overhead.
|
||||||
|
|
||||||
|
compact_stall
|
||||||
|
is incremented every time a process stalls to run
|
||||||
|
memory compaction so that a huge page is free for use.
|
||||||
|
|
||||||
|
compact_success
|
||||||
|
is incremented if the system compacted memory and
|
||||||
|
freed a huge page for use.
|
||||||
|
|
||||||
|
compact_fail
|
||||||
|
is incremented if the system tries to compact memory
|
||||||
|
but failed.
|
||||||
|
|
||||||
|
compact_pages_moved
|
||||||
|
is incremented each time a page is moved. If
|
||||||
|
this value is increasing rapidly, it implies that the system
|
||||||
|
is copying a lot of data to satisfy the huge page allocation.
|
||||||
|
It is possible that the cost of copying exceeds any savings
|
||||||
|
from reduced TLB misses.
|
||||||
|
|
||||||
|
compact_pagemigrate_failed
|
||||||
|
is incremented when the underlying mechanism
|
||||||
|
for moving a page failed.
|
||||||
|
|
||||||
|
compact_blocks_moved
|
||||||
|
is incremented each time memory compaction examines
|
||||||
|
a huge page aligned range of pages.
|
||||||
|
|
||||||
|
It is possible to establish how long the stalls were using the function
|
||||||
|
tracer to record how long was spent in __alloc_pages_nodemask and
|
||||||
|
using the mm_page_alloc tracepoint to identify which allocations were
|
||||||
|
for huge pages.
|
||||||
|
|
||||||
|
Optimizing the applications
|
||||||
|
===========================
|
||||||
|
|
||||||
|
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||||
|
memory region, the mmap region has to be hugepage naturally
|
||||||
|
aligned. posix_memalign() can provide that guarantee.
|
||||||
|
|
||||||
|
Hugetlbfs
|
||||||
|
=========
|
||||||
|
|
||||||
|
You can use hugetlbfs on a kernel that has transparent hugepage
|
||||||
|
support enabled just fine as always. No difference can be noted in
|
||||||
|
hugetlbfs other than there will be less overall fragmentation. All
|
||||||
|
usual features belonging to hugetlbfs are preserved and
|
||||||
|
unaffected. libhugetlbfs will also work fine as usual.
|
@ -4,418 +4,8 @@
|
|||||||
Transparent Hugepage Support
|
Transparent Hugepage Support
|
||||||
============================
|
============================
|
||||||
|
|
||||||
Objective
|
This document describes design principles Transparent Hugepage (THP)
|
||||||
=========
|
Support and its interaction with other parts of the memory management.
|
||||||
|
|
||||||
Performance critical computing applications dealing with large memory
|
|
||||||
working sets are already running on top of libhugetlbfs and in turn
|
|
||||||
hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of
|
|
||||||
using huge pages for the backing of virtual memory with huge pages
|
|
||||||
that supports the automatic promotion and demotion of page sizes and
|
|
||||||
without the shortcomings of hugetlbfs.
|
|
||||||
|
|
||||||
Currently THP only works for anonymous memory mappings and tmpfs/shmem.
|
|
||||||
But in the future it can expand to other filesystems.
|
|
||||||
|
|
||||||
.. note::
|
|
||||||
in the examples below we presume that the basic page size is 4K and
|
|
||||||
the huge page size is 2M, although the actual numbers may vary
|
|
||||||
depending on the CPU architecture.
|
|
||||||
|
|
||||||
The reason applications are running faster is because of two
|
|
||||||
factors. The first factor is almost completely irrelevant and it's not
|
|
||||||
of significant interest because it'll also have the downside of
|
|
||||||
requiring larger clear-page copy-page in page faults which is a
|
|
||||||
potentially negative effect. The first factor consists in taking a
|
|
||||||
single page fault for each 2M virtual region touched by userland (so
|
|
||||||
reducing the enter/exit kernel frequency by a 512 times factor). This
|
|
||||||
only matters the first time the memory is accessed for the lifetime of
|
|
||||||
a memory mapping. The second long lasting and much more important
|
|
||||||
factor will affect all subsequent accesses to the memory for the whole
|
|
||||||
runtime of the application. The second factor consist of two
|
|
||||||
components:
|
|
||||||
|
|
||||||
1) the TLB miss will run faster (especially with virtualization using
|
|
||||||
nested pagetables but almost always also on bare metal without
|
|
||||||
virtualization)
|
|
||||||
|
|
||||||
2) a single TLB entry will be mapping a much larger amount of virtual
|
|
||||||
memory in turn reducing the number of TLB misses. With
|
|
||||||
virtualization and nested pagetables the TLB can be mapped of
|
|
||||||
larger size only if both KVM and the Linux guest are using
|
|
||||||
hugepages but a significant speedup already happens if only one of
|
|
||||||
the two is using hugepages just because of the fact the TLB miss is
|
|
||||||
going to run faster.
|
|
||||||
|
|
||||||
THP can be enabled system wide or restricted to certain tasks or even
|
|
||||||
memory ranges inside task's address space. Unless THP is completely
|
|
||||||
disabled, there is ``khugepaged`` daemon that scans memory and
|
|
||||||
collapses sequences of basic pages into huge pages.
|
|
||||||
|
|
||||||
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
|
|
||||||
interface and using madivse(2) and prctl(2) system calls.
|
|
||||||
|
|
||||||
Transparent Hugepage Support maximizes the usefulness of free memory
|
|
||||||
if compared to the reservation approach of hugetlbfs by allowing all
|
|
||||||
unused memory to be used as cache or other movable (or even unmovable
|
|
||||||
entities). It doesn't require reservation to prevent hugepage
|
|
||||||
allocation failures to be noticeable from userland. It allows paging
|
|
||||||
and all other advanced VM features to be available on the
|
|
||||||
hugepages. It requires no modifications for applications to take
|
|
||||||
advantage of it.
|
|
||||||
|
|
||||||
Applications however can be further optimized to take advantage of
|
|
||||||
this feature, like for example they've been optimized before to avoid
|
|
||||||
a flood of mmap system calls for every malloc(4k). Optimizing userland
|
|
||||||
is by far not mandatory and khugepaged already can take care of long
|
|
||||||
lived page allocations even for hugepage unaware applications that
|
|
||||||
deals with large amounts of memory.
|
|
||||||
|
|
||||||
In certain cases when hugepages are enabled system wide, application
|
|
||||||
may end up allocating more memory resources. An application may mmap a
|
|
||||||
large region but only touch 1 byte of it, in that case a 2M page might
|
|
||||||
be allocated instead of a 4k page for no good. This is why it's
|
|
||||||
possible to disable hugepages system-wide and to only have them inside
|
|
||||||
MADV_HUGEPAGE madvise regions.
|
|
||||||
|
|
||||||
Embedded systems should enable hugepages only inside madvise regions
|
|
||||||
to eliminate any risk of wasting any precious byte of memory and to
|
|
||||||
only run faster.
|
|
||||||
|
|
||||||
Applications that gets a lot of benefit from hugepages and that don't
|
|
||||||
risk to lose memory by using hugepages, should use
|
|
||||||
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
|
|
||||||
|
|
||||||
.. _thp_sysfs:
|
|
||||||
|
|
||||||
sysfs
|
|
||||||
=====
|
|
||||||
|
|
||||||
Global THP controls
|
|
||||||
-------------------
|
|
||||||
|
|
||||||
Transparent Hugepage Support for anonymous memory can be entirely disabled
|
|
||||||
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
|
|
||||||
regions (to avoid the risk of consuming more memory resources) or enabled
|
|
||||||
system wide. This can be achieved with one of::
|
|
||||||
|
|
||||||
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
|
||||||
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
|
||||||
echo never >/sys/kernel/mm/transparent_hugepage/enabled
|
|
||||||
|
|
||||||
It's also possible to limit defrag efforts in the VM to generate
|
|
||||||
anonymous hugepages in case they're not immediately free to madvise
|
|
||||||
regions or to never try to defrag memory and simply fallback to regular
|
|
||||||
pages unless hugepages are immediately available. Clearly if we spend CPU
|
|
||||||
time to defrag memory, we would expect to gain even more by the fact we
|
|
||||||
use hugepages later instead of regular pages. This isn't always
|
|
||||||
guaranteed, but it may be more likely in case the allocation is for a
|
|
||||||
MADV_HUGEPAGE region.
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
|
||||||
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
|
|
||||||
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
|
||||||
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
|
||||||
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
|
||||||
|
|
||||||
always
|
|
||||||
means that an application requesting THP will stall on
|
|
||||||
allocation failure and directly reclaim pages and compact
|
|
||||||
memory in an effort to allocate a THP immediately. This may be
|
|
||||||
desirable for virtual machines that benefit heavily from THP
|
|
||||||
use and are willing to delay the VM start to utilise them.
|
|
||||||
|
|
||||||
defer
|
|
||||||
means that an application will wake kswapd in the background
|
|
||||||
to reclaim pages and wake kcompactd to compact memory so that
|
|
||||||
THP is available in the near future. It's the responsibility
|
|
||||||
of khugepaged to then install the THP pages later.
|
|
||||||
|
|
||||||
defer+madvise
|
|
||||||
will enter direct reclaim and compaction like ``always``, but
|
|
||||||
only for regions that have used madvise(MADV_HUGEPAGE); all
|
|
||||||
other regions will wake kswapd in the background to reclaim
|
|
||||||
pages and wake kcompactd to compact memory so that THP is
|
|
||||||
available in the near future.
|
|
||||||
|
|
||||||
madvise
|
|
||||||
will enter direct reclaim like ``always`` but only for regions
|
|
||||||
that are have used madvise(MADV_HUGEPAGE). This is the default
|
|
||||||
behaviour.
|
|
||||||
|
|
||||||
never
|
|
||||||
should be self-explanatory.
|
|
||||||
|
|
||||||
By default kernel tries to use huge zero page on read page fault to
|
|
||||||
anonymous mapping. It's possible to disable huge zero page by writing 0
|
|
||||||
or enable it back by writing 1::
|
|
||||||
|
|
||||||
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
|
||||||
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
|
||||||
|
|
||||||
Some userspace (such as a test program, or an optimized memory allocation
|
|
||||||
library) may want to know the size (in bytes) of a transparent hugepage::
|
|
||||||
|
|
||||||
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
|
|
||||||
|
|
||||||
khugepaged will be automatically started when
|
|
||||||
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
|
|
||||||
be automatically shutdown if it's set to "never".
|
|
||||||
|
|
||||||
Khugepaged controls
|
|
||||||
-------------------
|
|
||||||
|
|
||||||
khugepaged runs usually at low frequency so while one may not want to
|
|
||||||
invoke defrag algorithms synchronously during the page faults, it
|
|
||||||
should be worth invoking defrag at least in khugepaged. However it's
|
|
||||||
also possible to disable defrag in khugepaged by writing 0 or enable
|
|
||||||
defrag in khugepaged by writing 1::
|
|
||||||
|
|
||||||
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
|
||||||
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
|
||||||
|
|
||||||
You can also control how many pages khugepaged should scan at each
|
|
||||||
pass::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
|
||||||
|
|
||||||
and how many milliseconds to wait in khugepaged between each pass (you
|
|
||||||
can set this to 0 to run khugepaged at 100% utilization of one core)::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
|
||||||
|
|
||||||
and how many milliseconds to wait in khugepaged if there's an hugepage
|
|
||||||
allocation failure to throttle the next allocation attempt::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
|
||||||
|
|
||||||
The khugepaged progress can be seen in the number of pages collapsed::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
|
||||||
|
|
||||||
for each pass::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
|
||||||
|
|
||||||
``max_ptes_none`` specifies how many extra small pages (that are
|
|
||||||
not already mapped) can be allocated when collapsing a group
|
|
||||||
of small pages into one large page::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
|
|
||||||
|
|
||||||
A higher value leads to use additional memory for programs.
|
|
||||||
A lower value leads to gain less thp performance. Value of
|
|
||||||
max_ptes_none can waste cpu time very little, you can
|
|
||||||
ignore it.
|
|
||||||
|
|
||||||
``max_ptes_swap`` specifies how many pages can be brought in from
|
|
||||||
swap when collapsing a group of pages into a transparent huge page::
|
|
||||||
|
|
||||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
|
|
||||||
|
|
||||||
A higher value can cause excessive swap IO and waste
|
|
||||||
memory. A lower value can prevent THPs from being
|
|
||||||
collapsed, resulting fewer pages being collapsed into
|
|
||||||
THPs, and lower memory access performance.
|
|
||||||
|
|
||||||
Boot parameter
|
|
||||||
==============
|
|
||||||
|
|
||||||
You can change the sysfs boot time defaults of Transparent Hugepage
|
|
||||||
Support by passing the parameter ``transparent_hugepage=always`` or
|
|
||||||
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
|
|
||||||
to the kernel command line.
|
|
||||||
|
|
||||||
Hugepages in tmpfs/shmem
|
|
||||||
========================
|
|
||||||
|
|
||||||
You can control hugepage allocation policy in tmpfs with mount option
|
|
||||||
``huge=``. It can have following values:
|
|
||||||
|
|
||||||
always
|
|
||||||
Attempt to allocate huge pages every time we need a new page;
|
|
||||||
|
|
||||||
never
|
|
||||||
Do not allocate huge pages;
|
|
||||||
|
|
||||||
within_size
|
|
||||||
Only allocate huge page if it will be fully within i_size.
|
|
||||||
Also respect fadvise()/madvise() hints;
|
|
||||||
|
|
||||||
advise
|
|
||||||
Only allocate huge pages if requested with fadvise()/madvise();
|
|
||||||
|
|
||||||
The default policy is ``never``.
|
|
||||||
|
|
||||||
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
|
|
||||||
``huge=never`` will not attempt to break up huge pages at all, just stop more
|
|
||||||
from being allocated.
|
|
||||||
|
|
||||||
There's also sysfs knob to control hugepage allocation policy for internal
|
|
||||||
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
|
|
||||||
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
|
|
||||||
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
|
|
||||||
|
|
||||||
In addition to policies listed above, shmem_enabled allows two further
|
|
||||||
values:
|
|
||||||
|
|
||||||
deny
|
|
||||||
For use in emergencies, to force the huge option off from
|
|
||||||
all mounts;
|
|
||||||
force
|
|
||||||
Force the huge option on for all - very useful for testing;
|
|
||||||
|
|
||||||
Need of application restart
|
|
||||||
===========================
|
|
||||||
|
|
||||||
The transparent_hugepage/enabled values and tmpfs mount option only affect
|
|
||||||
future behavior. So to make them effective you need to restart any
|
|
||||||
application that could have been using hugepages. This also applies to the
|
|
||||||
regions registered in khugepaged.
|
|
||||||
|
|
||||||
Monitoring usage
|
|
||||||
================
|
|
||||||
|
|
||||||
The number of anonymous transparent huge pages currently used by the
|
|
||||||
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
|
|
||||||
To identify what applications are using anonymous transparent huge pages,
|
|
||||||
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
|
|
||||||
for each mapping.
|
|
||||||
|
|
||||||
The number of file transparent huge pages mapped to userspace is available
|
|
||||||
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
|
||||||
To identify what applications are mapping file transparent huge pages, it
|
|
||||||
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
|
|
||||||
for each mapping.
|
|
||||||
|
|
||||||
Note that reading the smaps file is expensive and reading it
|
|
||||||
frequently will incur overhead.
|
|
||||||
|
|
||||||
There are a number of counters in ``/proc/vmstat`` that may be used to
|
|
||||||
monitor how successfully the system is providing huge pages for use.
|
|
||||||
|
|
||||||
thp_fault_alloc
|
|
||||||
is incremented every time a huge page is successfully
|
|
||||||
allocated to handle a page fault. This applies to both the
|
|
||||||
first time a page is faulted and for COW faults.
|
|
||||||
|
|
||||||
thp_collapse_alloc
|
|
||||||
is incremented by khugepaged when it has found
|
|
||||||
a range of pages to collapse into one huge page and has
|
|
||||||
successfully allocated a new huge page to store the data.
|
|
||||||
|
|
||||||
thp_fault_fallback
|
|
||||||
is incremented if a page fault fails to allocate
|
|
||||||
a huge page and instead falls back to using small pages.
|
|
||||||
|
|
||||||
thp_collapse_alloc_failed
|
|
||||||
is incremented if khugepaged found a range
|
|
||||||
of pages that should be collapsed into one huge page but failed
|
|
||||||
the allocation.
|
|
||||||
|
|
||||||
thp_file_alloc
|
|
||||||
is incremented every time a file huge page is successfully
|
|
||||||
allocated.
|
|
||||||
|
|
||||||
thp_file_mapped
|
|
||||||
is incremented every time a file huge page is mapped into
|
|
||||||
user address space.
|
|
||||||
|
|
||||||
thp_split_page
|
|
||||||
is incremented every time a huge page is split into base
|
|
||||||
pages. This can happen for a variety of reasons but a common
|
|
||||||
reason is that a huge page is old and is being reclaimed.
|
|
||||||
This action implies splitting all PMD the page mapped with.
|
|
||||||
|
|
||||||
thp_split_page_failed
|
|
||||||
is incremented if kernel fails to split huge
|
|
||||||
page. This can happen if the page was pinned by somebody.
|
|
||||||
|
|
||||||
thp_deferred_split_page
|
|
||||||
is incremented when a huge page is put onto split
|
|
||||||
queue. This happens when a huge page is partially unmapped and
|
|
||||||
splitting it would free up some memory. Pages on split queue are
|
|
||||||
going to be split under memory pressure.
|
|
||||||
|
|
||||||
thp_split_pmd
|
|
||||||
is incremented every time a PMD split into table of PTEs.
|
|
||||||
This can happen, for instance, when application calls mprotect() or
|
|
||||||
munmap() on part of huge page. It doesn't split huge page, only
|
|
||||||
page table entry.
|
|
||||||
|
|
||||||
thp_zero_page_alloc
|
|
||||||
is incremented every time a huge zero page is
|
|
||||||
successfully allocated. It includes allocations which where
|
|
||||||
dropped due race with other allocation. Note, it doesn't count
|
|
||||||
every map of the huge zero page, only its allocation.
|
|
||||||
|
|
||||||
thp_zero_page_alloc_failed
|
|
||||||
is incremented if kernel fails to allocate
|
|
||||||
huge zero page and falls back to using small pages.
|
|
||||||
|
|
||||||
thp_swpout
|
|
||||||
is incremented every time a huge page is swapout in one
|
|
||||||
piece without splitting.
|
|
||||||
|
|
||||||
thp_swpout_fallback
|
|
||||||
is incremented if a huge page has to be split before swapout.
|
|
||||||
Usually because failed to allocate some continuous swap space
|
|
||||||
for the huge page.
|
|
||||||
|
|
||||||
As the system ages, allocating huge pages may be expensive as the
|
|
||||||
system uses memory compaction to copy data around memory to free a
|
|
||||||
huge page for use. There are some counters in ``/proc/vmstat`` to help
|
|
||||||
monitor this overhead.
|
|
||||||
|
|
||||||
compact_stall
|
|
||||||
is incremented every time a process stalls to run
|
|
||||||
memory compaction so that a huge page is free for use.
|
|
||||||
|
|
||||||
compact_success
|
|
||||||
is incremented if the system compacted memory and
|
|
||||||
freed a huge page for use.
|
|
||||||
|
|
||||||
compact_fail
|
|
||||||
is incremented if the system tries to compact memory
|
|
||||||
but failed.
|
|
||||||
|
|
||||||
compact_pages_moved
|
|
||||||
is incremented each time a page is moved. If
|
|
||||||
this value is increasing rapidly, it implies that the system
|
|
||||||
is copying a lot of data to satisfy the huge page allocation.
|
|
||||||
It is possible that the cost of copying exceeds any savings
|
|
||||||
from reduced TLB misses.
|
|
||||||
|
|
||||||
compact_pagemigrate_failed
|
|
||||||
is incremented when the underlying mechanism
|
|
||||||
for moving a page failed.
|
|
||||||
|
|
||||||
compact_blocks_moved
|
|
||||||
is incremented each time memory compaction examines
|
|
||||||
a huge page aligned range of pages.
|
|
||||||
|
|
||||||
It is possible to establish how long the stalls were using the function
|
|
||||||
tracer to record how long was spent in __alloc_pages_nodemask and
|
|
||||||
using the mm_page_alloc tracepoint to identify which allocations were
|
|
||||||
for huge pages.
|
|
||||||
|
|
||||||
Optimizing the applications
|
|
||||||
===========================
|
|
||||||
|
|
||||||
To be guaranteed that the kernel will map a 2M page immediately in any
|
|
||||||
memory region, the mmap region has to be hugepage naturally
|
|
||||||
aligned. posix_memalign() can provide that guarantee.
|
|
||||||
|
|
||||||
Hugetlbfs
|
|
||||||
=========
|
|
||||||
|
|
||||||
You can use hugetlbfs on a kernel that has transparent hugepage
|
|
||||||
support enabled just fine as always. No difference can be noted in
|
|
||||||
hugetlbfs other than there will be less overall fragmentation. All
|
|
||||||
usual features belonging to hugetlbfs are preserved and
|
|
||||||
unaffected. libhugetlbfs will also work fine as usual.
|
|
||||||
|
|
||||||
Design principles
|
Design principles
|
||||||
=================
|
=================
|
||||||
|
Loading…
Reference in New Issue
Block a user