Merge tag 'docs-4.18' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "There's been a fair amount of work in the docs tree this time around,
  including:

   - Extensive RST conversions and organizational work in the
     memory-management docs thanks to Mike Rapoport.

   - An update of Documentation/features from Andrea Parri and a script
     to keep it updated.

   - Various LICENSES updates from Thomas, along with a script to check
     SPDX tags.

   - Work to fix dangling references to documentation files; this
     involved a fair number of one-liner comment changes outside of
     Documentation/

  ... and the usual list of documentation improvements, typo fixes, etc"

* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
Documentation: document hung_task_panic kernel parameter
docs/admin-guide/mm: add high level concepts overview
docs/vm: move ksm and transhuge from "user" to "internals" section.
docs: Use the kerneldoc comments for memalloc_no*()
doc: document scope NOFS, NOIO APIs
docs: update kernel versions and dates in tables
docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
docs/vm: transhuge: minor updates
docs/vm: transhuge: change sections order
Documentation: arm: clean up Marvell Berlin family info
Documentation: gpio: driver: Fix a typo and some odd grammar
docs: ranoops.rst: fix location of ramoops.txt
scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
docs: uio-howto.rst: use a code block to solve a warning
mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
w1: w1_io.c: fix a kernel-doc warning
Documentation/process/posting: wrap text at 80 cols
docs: admin-guide: add cgroup-v2 documentation
Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
Documentation: refcount-vs-atomic: Update reference to LKMM doc.
...
@@ -64,8 +64,6 @@ auxdisplay/
 - misc. LCD driver documentation (cfag12864b, ks0108).
 backlight/
 - directory with info on controlling backlights in flat panel displays
-bcache.txt
-- Block-layer cache on fast SSDs to improve slow (raid) I/O performance.
 block/
 - info on the Block I/O (BIO) layer.
 blockdev/
@@ -78,18 +76,10 @@ bus-devices/
 - directory with info on TI GPMC (General Purpose Memory Controller)
 bus-virt-phys-mapping.txt
 - how to access I/O mapped memory from within device drivers.
-cachetlb.txt
-- describes the cache/TLB flushing interfaces Linux uses.
 cdrom/
 - directory with information on the CD-ROM drivers that Linux has.
 cgroup-v1/
 - cgroups v1 features, including cpusets and memory controller.
-cgroup-v2.txt
-- cgroups v2 features, including cpusets and memory controller.
-circular-buffers.txt
-- how to make use of the existing circular buffer infrastructure
-clk.txt
-- info on the common clock framework
 cma/
 - Continuous Memory Area (CMA) debugfs interface.
 conf.py
@@ -90,4 +90,4 @@ Date: December 2009
 Contact: Lee Schermerhorn <lee.schermerhorn@hp.com>
 Description:
 The node's huge page size control/query attributes.
-See Documentation/vm/hugetlbpage.txt
+See Documentation/admin-guide/mm/hugetlbpage.rst
@@ -12,4 +12,4 @@ Description:
 free_hugepages
 surplus_hugepages
 resv_hugepages
-See Documentation/vm/hugetlbpage.txt for details.
+See Documentation/admin-guide/mm/hugetlbpage.rst for details.
@@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
 sleep_millisecs: how many milliseconds ksm should sleep between
 scans.

-See Documentation/vm/ksm.txt for more information.
+See Documentation/vm/ksm.rst for more information.

 What: /sys/kernel/mm/ksm/merge_across_nodes
 Date: January 2013
@@ -37,7 +37,7 @@ Description:
 The alloc_calls file is read-only and lists the kernel code
 locations from which allocations for this cache were performed.
 The alloc_calls file only contains information if debugging is
-enabled for that cache (see Documentation/vm/slub.txt).
+enabled for that cache (see Documentation/vm/slub.rst).

 What: /sys/kernel/slab/cache/alloc_fastpath
 Date: February 2008
@@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
 Description:
 The free_calls file is read-only and lists the locations of
 object frees if slab debugging is enabled (see
-Documentation/vm/slub.txt).
+Documentation/vm/slub.rst).

 What: /sys/kernel/slab/cache/free_fastpath
 Date: February 2008
@@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking.
 :maxdepth: 1

 initrd
+cgroup-v2
 serial-console
 braille-console
 parport
@@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking.
 mono
 java
 ras
+bcache
 pm/index
 thunderbolt
 LSM/index
+mm/index

 .. only:: subproject and html

@@ -518,7 +518,7 @@
 those clocks in any way. This parameter is useful for
 debug and development, but should not be needed on a
 platform with proper driver support. For more
-information, see Documentation/clk.txt.
+information, see Documentation/driver-api/clk.rst.

 clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
 [Deprecated]
@@ -1341,12 +1341,21 @@
 x86-64 are 2M (when the CPU supports "pse") and 1G
 (when the CPU supports the "pdpe1gb" cpuinfo flag).

+hung_task_panic=
+[KNL] Should the hung task detector generate panics.
+Format: <integer>
+
+A nonzero value instructs the kernel to panic when a
+hung task is detected. The default value is controlled
+by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
+option. The value selected by this boot parameter can
+be changed later by the kernel.hung_task_panic sysctl.
+
 hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
 terminal devices. Valid values: 0..8
 hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
 If specified, z/VM IUCV HVC accepts connections
 from listed z/VM user IDs only.

 keep_bootcon [KNL]
 Do not unregister boot console at start. This is only
 useful for debugging when something happens in the window
@@ -3917,7 +3926,7 @@
 cache (risks via metadata attacks are mostly
 unchanged). Debug options disable merging on their
 own.
-For more information see Documentation/vm/slub.txt.
+For more information see Documentation/vm/slub.rst.

 slab_max_order= [MM, SLAB]
 Determines the maximum allowed order for slabs.
@@ -3931,7 +3940,7 @@
 slub_debug can create guard zones around objects and
 may poison objects when not in use. Also tracks the
 last alloc / free. For more information see
-Documentation/vm/slub.txt.
+Documentation/vm/slub.rst.

 slub_memcg_sysfs= [MM, SLUB]
 Determines whether to enable sysfs directories for
@@ -3945,7 +3954,7 @@
 Determines the maximum allowed order for slabs.
 A high setting may cause OOMs due to memory
 fragmentation. For more information see
-Documentation/vm/slub.txt.
+Documentation/vm/slub.rst.

 slub_min_objects= [MM, SLUB]
 The minimum number of objects per slab. SLUB will
@@ -3954,12 +3963,12 @@
 the number of objects indicated. The higher the number
 of objects the smaller the overhead of tracking slabs
 and the less frequently locks need to be acquired.
-For more information see Documentation/vm/slub.txt.
+For more information see Documentation/vm/slub.rst.

 slub_min_order= [MM, SLUB]
 Determines the minimum page order for slabs. Must be
 lower than slub_max_order.
-For more information see Documentation/vm/slub.txt.
+For more information see Documentation/vm/slub.rst.

 slub_nomerge [MM, SLUB]
 Same with slab_nomerge. This is supported for legacy.
@@ -4357,7 +4366,8 @@
 Format: [always|madvise|never]
 Can be used to control the default behavior of the system
 with respect to transparent hugepages.
-See Documentation/vm/transhuge.txt for more details.
+See Documentation/admin-guide/mm/transhuge.rst
+for more details.

 tsc= Disable clocksource stability checks for TSC.
 Format: <string>
Documentation/admin-guide/mm/concepts.rst (new normal file, 222 lines)
@@ -0,0 +1,222 @@
+.. _mm_concepts:
+
+=================
+Concepts overview
+=================
+
+The memory management in Linux is a complex system that evolved over the
+years and included more and more functionality to support a variety of
+systems from MMU-less microcontrollers to supercomputers. The memory
+management for systems without an MMU is called ``nommu`` and it
+definitely deserves a dedicated document, which hopefully will be
+eventually written. Yet, although some of the concepts are the same,
+here we assume that an MMU is available and a CPU can translate a virtual
+address to a physical address.
+
+.. contents:: :local:
+
+Virtual Memory Primer
+=====================
+
+The physical memory in a computer system is a limited resource and
+even for systems that support memory hotplug there is a hard limit on
+the amount of memory that can be installed. The physical memory is not
+necessarily contiguous; it might be accessible as a set of distinct
+address ranges. Besides, different CPU architectures, and even
+different implementations of the same architecture have different views
+of how these address ranges are defined.
+
+All this makes dealing directly with physical memory quite complex and
+to avoid this complexity a concept of virtual memory was developed.
+
+The virtual memory abstracts the details of physical memory from the
+application software, allows keeping only needed information in the
+physical memory (demand paging) and provides a mechanism for the
+protection and controlled sharing of data between processes.
+
+With virtual memory, each and every memory access uses a virtual
+address. When the CPU decodes an instruction that reads (or
+writes) from (or to) the system memory, it translates the `virtual`
+address encoded in that instruction to a `physical` address that the
+memory controller can understand.
+
+The physical system memory is divided into page frames, or pages. The
+size of each page is architecture specific. Some architectures allow
+selection of the page size from several supported values; this
+selection is performed at the kernel build time by setting an
+appropriate kernel configuration option.
+
+Each physical memory page can be mapped as one or more virtual
+pages. These mappings are described by page tables that allow
+translation from a virtual address used by programs to the physical
+memory address. The page tables are organized hierarchically.
+
+The tables at the lowest level of the hierarchy contain physical
+addresses of actual pages used by the software. The tables at higher
+levels contain physical addresses of the pages belonging to the lower
+levels. The pointer to the top level page table resides in a
+register. When the CPU performs the address translation, it uses this
+register to access the top level page table. The high bits of the
+virtual address are used to index an entry in the top level page
+table. That entry is then used to access the next level in the
+hierarchy with the next bits of the virtual address as the index to
+that level page table. The lowest bits in the virtual address define
+the offset inside the actual page.
+
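The way a virtual address is carved into per-level indices and a page offset can be sketched in a few lines. This is only an illustration: the layout below (4 KiB pages, four table levels, 9 index bits per level) is an assumption chosen to resemble common x86-64 4-level paging, and `split_virtual_address` is a hypothetical helper, not a kernel interface.

```python
# Assumed, illustrative layout: 4 KiB pages, 4 page-table levels,
# 9 index bits per level. Other architectures and configurations differ.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
LEVELS = 4


def split_virtual_address(vaddr):
    """Return (indices, offset); indices[0] indexes the top level table."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    # Walk from the highest bits (top level) down to the lowest level.
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


indices, offset = split_virtual_address(0x00007F1234567ABC)
print("table indices:", indices, "page offset:", hex(offset))
```

Each index selects one entry in the table of its level, and the offset selects the byte inside the final page, mirroring the walk described in the text.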
+Huge Pages
+==========
+
+The address translation requires several memory accesses and memory
+accesses are slow relative to CPU speed. To avoid spending precious
+processor cycles on the address translation, CPUs maintain a cache of
+such translations called Translation Lookaside Buffer (or
+TLB). Usually TLB is a pretty scarce resource and applications with
+a large memory working set will experience a performance hit because of
+TLB misses.
+
+Many modern CPU architectures allow mapping of the memory pages
+directly by the higher levels in the page table. For instance, on x86,
+it is possible to map 2M and even 1G pages using entries in the second
+and the third level page tables. In Linux such pages are called
+`huge`. Usage of huge pages significantly reduces pressure on TLB,
+improves TLB hit-rate and thus improves overall system performance.
+
+There are two mechanisms in Linux that enable mapping of the physical
+memory with the huge pages. The first one is `HugeTLB filesystem`, or
+hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
+store. For the files created in this filesystem the data resides in
+the memory and is mapped using huge pages. The hugetlbfs is described at
+:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
+
+Another, more recent, mechanism that enables use of the huge pages is
+called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
+requires users and/or system administrators to configure what parts of
+the system memory should and can be mapped by the huge pages, THP
+manages such mappings transparently to the user and hence the
+name. See
+:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
+for more details about THP.
+
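A rough way to see why huge pages relieve TLB pressure is to compute how much memory a fixed number of TLB entries can cover at each page size. The sketch below is plain arithmetic; the entry count of 64 is a made-up figure for illustration, since real TLB sizes vary by CPU and often differ per page size.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# Hypothetical TLB size, chosen only to make the arithmetic concrete.
TLB_ENTRIES = 64


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


for name, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", 1 * GIB)):
    print(f"{name} pages: {tlb_reach(TLB_ENTRIES, size) // MIB} MiB covered")
```

With the same number of entries, each step up in page size multiplies the covered memory, so a large working set causes fewer TLB misses.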
+Zones
+=====
+
+Often hardware poses restrictions on how different physical memory
+ranges can be accessed. In some cases, devices cannot perform DMA to
+all the addressable memory. In other cases, the size of the physical
+memory exceeds the maximal addressable size of virtual memory and
+special actions are required to access portions of the memory. Linux
+groups memory pages into `zones` according to their possible
+usage. For example, ZONE_DMA will contain memory that can be used by
+devices for DMA, ZONE_HIGHMEM will contain memory that is not
+permanently mapped into the kernel's address space and ZONE_NORMAL will
+contain normally addressed pages.
+
+The actual layout of the memory zones is hardware dependent as not all
+architectures define all zones, and requirements for DMA are different
+for different platforms.
+
+Nodes
+=====
+
+Many multi-processor machines are NUMA - Non-Uniform Memory Access -
+systems. In such systems the memory is arranged into banks that have
+different access latency depending on the "distance" from the
+processor. Each bank is referred to as a `node` and for each node Linux
+constructs an independent memory management subsystem. A node has its
+own set of zones, lists of free and used pages and various statistics
+counters. You can find more details about NUMA in
+:ref:`Documentation/vm/numa.rst <numa>` and in
+:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
+
+Page cache
+==========
+
+The physical memory is volatile and the common case for getting data
+into the memory is to read it from files. Whenever a file is read, the
+data is put into the `page cache` to avoid expensive disk access on
+the subsequent reads. Similarly, when one writes to a file, the data
+is placed in the page cache and eventually gets into the backing
+storage device. The written pages are marked as `dirty` and when Linux
+decides to reuse them for other purposes, it makes sure to synchronize
+the file contents on the device with the updated data.
+
+Anonymous Memory
+================
+
+The `anonymous memory` or `anonymous mappings` represent memory that
+is not backed by a filesystem. Such mappings are implicitly created
+for the program's stack and heap or by explicit calls to the mmap(2)
+system call. Usually, the anonymous mappings only define virtual memory
+areas that the program is allowed to access. The read accesses will result
+in creation of a page table entry that references a special physical
+page filled with zeroes. When the program performs a write, a regular
+physical page will be allocated to hold the written data. The page
+will be marked dirty and if the kernel decides to repurpose it,
+the dirty page will be swapped out.
+
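The user-visible side of an anonymous mapping can be observed with a short sketch. It uses Python's standard `mmap` module, where a file descriptor of `-1` requests memory that is not backed by any file; `anonymous_demo` is a hypothetical helper name. The code only shows that untouched anonymous memory reads as zeroes and that a write is retained. Whether the kernel serves the read from a shared zero page is an internal detail the program cannot see.

```python
import mmap


def anonymous_demo(npages=4):
    """Map anonymous memory, read it untouched, then write and read back."""
    length = npages * mmap.PAGESIZE
    # fileno=-1 asks for an anonymous mapping (no backing file).
    with mmap.mmap(-1, length) as mapping:
        before = mapping[:16]      # never written: reads back as zero bytes
        mapping[0:5] = b"hello"    # first write makes the kernel supply a real page
        after = mapping[:5]
    return before, after


before, after = anonymous_demo()
print(before, after)
```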
+Reclaim
+=======
+
+Throughout the system lifetime, a physical page can be used for storing
+different types of data. It can be kernel internal data structures,
+DMA'able buffers for device drivers' use, data read from a filesystem,
+memory allocated by user space processes, etc.
+
+Depending on the page usage it is treated differently by the Linux
+memory management. The pages that can be freed at any time, either
+because they cache the data available elsewhere, for instance, on a
+hard disk, or because they can be swapped out, again, to the hard
+disk, are called `reclaimable`. The most notable categories of the
+reclaimable pages are page cache and anonymous memory.
+
+In most cases, the pages holding internal kernel data and used as DMA
+buffers cannot be repurposed, and they remain pinned until freed by
+their user. Such pages are called `unreclaimable`. However, in certain
+circumstances, even pages occupied with kernel data structures can be
+reclaimed. For instance, in-memory caches of filesystem metadata can
+be re-read from the storage device and therefore it is possible to
+discard them from the main memory when the system is under memory
+pressure.
+
+The process of freeing the reclaimable physical memory pages and
+repurposing them is called (surprise!) `reclaim`. Linux can reclaim
+pages either asynchronously or synchronously, depending on the state
+of the system. When the system is not loaded, most of the memory is free
+and allocation requests will be satisfied immediately from the free
+pages supply. As the load increases, the amount of the free pages goes
+down and when it reaches a certain threshold (low watermark), an
+allocation request will awaken the ``kswapd`` daemon. It will
+asynchronously scan memory pages and either just free them if the data
+they contain is available elsewhere, or evict to the backing storage
+device (remember those dirty pages?). As memory usage increases even
+more and reaches another threshold - min watermark - an allocation
+will trigger the `direct reclaim`. In this case allocation is stalled
+until enough memory pages are reclaimed to satisfy the request.
+
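The two-threshold behaviour described for reclaim can be modeled as a small decision function. This is a deliberately simplified toy, not the kernel's allocator logic: `reclaim_mode`, `wake_threshold` and `min_watermark` are hypothetical names, and the real code tracks watermarks per zone and applies many more conditions.

```python
def reclaim_mode(free_pages, wake_threshold, min_watermark):
    """Classify an allocation attempt by how scarce free pages are.

    wake_threshold: free-page level at which background reclaim starts.
    min_watermark:  lower level at which the allocation itself must reclaim.
    """
    if min_watermark > wake_threshold:
        raise ValueError("min_watermark must not exceed wake_threshold")
    if free_pages <= min_watermark:
        return "direct reclaim"   # the allocating task stalls and reclaims
    if free_pages <= wake_threshold:
        return "wake kswapd"      # background, asynchronous reclaim
    return "no reclaim"           # plenty of free pages


for free in (10_000, 800, 100):
    print(free, "->", reclaim_mode(free, wake_threshold=1_000, min_watermark=250))
```

Ordering the checks from the most to the least severe condition keeps the function correct even when both thresholds have been crossed.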
+Compaction
+==========
+
+As the system runs, tasks allocate and free the memory and it becomes
+fragmented. Although with virtual memory it is possible to present
+scattered physical pages as a virtually contiguous range, sometimes it is
+necessary to allocate large physically contiguous memory areas. Such
+need may arise, for instance, when a device driver requires a large
+buffer for DMA, or when THP allocates a huge page. Memory `compaction`
+addresses the fragmentation issue. This mechanism moves occupied pages
+from the lower part of a memory zone to free pages in the upper part
+of the zone. When a compaction scan is finished free pages are grouped
+together at the beginning of the zone and allocations of large
+physically contiguous areas become possible.
+
+Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
+daemon or synchronously as a result of a memory allocation request.
+
+OOM killer
+==========
+
+It may happen that on a loaded machine memory will be exhausted. When
+the kernel detects that the system runs out of memory (OOM) it invokes
+the `OOM killer`. Its mission is simple: all it has to do is to select a
+task to sacrifice for the sake of the overall system health. The
+selected task is killed in the hope that after it exits enough memory
+will be freed to continue normal operation.
@@ -1,3 +1,11 @@
+.. _hugetlbpage:
+
+=============
+HugeTLB Pages
+=============
+
+Overview
+========

 The intent of this file is to give a brief summary of hugetlbpage support in
 the Linux kernel. This support is built on top of multiple page size support
@@ -18,16 +26,15 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.

-The /proc/meminfo file provides information about the total number of
+The ``/proc/meminfo`` file provides information about the total number of
 persistent hugetlb pages in the kernel's huge page pool. It also displays
 default huge page size and information about the number of free, reserved
 and surplus huge pages in the pool of huge pages of default size.
 The huge page size is needed for generating the proper alignment and
 size of the arguments to system calls that map huge page regions.

-The output of "cat /proc/meminfo" will include lines like:
+The output of ``cat /proc/meminfo`` will include lines like::

-.....
 HugePages_Total: uuu
 HugePages_Free: vvv
 HugePages_Rsvd: www
@@ -36,35 +43,42 @@ Hugepagesize: yyy kB
 Hugetlb: zzz kB

 where:
-HugePages_Total is the size of the pool of huge pages.
-HugePages_Free is the number of huge pages in the pool that are not yet
+
+HugePages_Total
+is the size of the pool of huge pages.
+HugePages_Free
+is the number of huge pages in the pool that are not yet
 allocated.
-HugePages_Rsvd is short for "reserved," and is the number of huge pages for
+HugePages_Rsvd
+is short for "reserved," and is the number of huge pages for
 which a commitment to allocate from the pool has been made,
 but no allocation has yet been made. Reserved huge pages
 guarantee that an application will be able to allocate a
 huge page from the pool of huge pages at fault time.
-HugePages_Surp is short for "surplus," and is the number of huge pages in
-the pool above the value in /proc/sys/vm/nr_hugepages. The
+HugePages_Surp
+is short for "surplus," and is the number of huge pages in
+the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
 maximum number of surplus huge pages is controlled by
-/proc/sys/vm/nr_overcommit_hugepages.
+``/proc/sys/vm/nr_overcommit_hugepages``.
-Hugepagesize is the default hugepage size (in Kb).
-Hugetlb is the total amount of memory (in kB), consumed by huge
+Hugepagesize
+is the default hugepage size (in Kb).
+Hugetlb
+is the total amount of memory (in kB), consumed by huge
 pages of all sizes.
 If huge pages of different sizes are in use, this number
-will exceed HugePages_Total * Hugepagesize. To get more
+will exceed HugePages_Total \* Hugepagesize. To get more
 detailed information, please, refer to
-/sys/kernel/mm/hugepages (described below).
+``/sys/kernel/mm/hugepages`` (described below).


-/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
-in the kernel.
+``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
+configured in the kernel.

-/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
+``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
 pages in the kernel's huge page pool. "Persistent" huge pages will be
 returned to the huge page pool when freed by a task. A user with root
 privileges can dynamically allocate more or free some persistent huge pages
-by increasing or decreasing the value of 'nr_hugepages'.
+by increasing or decreasing the value of ``nr_hugepages``.

 Pages that are used as huge pages are reserved inside the kernel and cannot
 be used for other purposes. Huge pages cannot be swapped out under
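The ``/proc/meminfo`` fields described in this hunk are easy to pick out programmatically. The sketch below parses a sample text rather than the live file so that it runs anywhere; the numbers in `SAMPLE` are invented for illustration and `parse_hugepage_fields` is a hypothetical helper name. On a Linux system the same function could be given the contents of ``/proc/meminfo``.

```python
# Invented sample values, laid out like the huge page lines of /proc/meminfo.
SAMPLE = """\
HugePages_Total:      20
HugePages_Free:       18
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:           40960 kB
"""

PREFIXES = ("HugePages_", "Hugepagesize", "Hugetlb")


def parse_hugepage_fields(text):
    """Return the huge page related fields as a dict of integers."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.startswith(PREFIXES):
            # The first token is the number; a trailing "kB" unit is dropped.
            fields[key] = int(rest.split()[0])
    return fields


info = parse_hugepage_fields(SAMPLE)
default_pool_kb = info["HugePages_Total"] * info["Hugepagesize"]
print("default-size pool:", default_pool_kb, "kB; Hugetlb:", info["Hugetlb"], "kB")
```

Comparing `default_pool_kb` with `Hugetlb` shows whether huge pages of a non-default size are also in use, as the description of the Hugetlb field notes.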
@@ -73,7 +87,7 @@ memory pressure.
 Once a number of huge pages have been pre-allocated to the kernel huge page
 pool, a user with appropriate privilege can use either the mmap system call
 or shared memory system calls to use the huge pages. See the discussion of
-Using Huge Pages, below.
+:ref:`Using Huge Pages <using_huge_pages>`, below.

 The administrator can allocate persistent huge pages on the kernel boot
 command line by specifying the "hugepages=N" parameter, where 'N' = the
@@ -86,10 +100,10 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must
 be specified in bytes with optional scale suffix [kKmMgG]. The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.

-When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
 Thus, one can use the following command to dynamically allocate/deallocate
-default sized persistent huge pages:
+default sized persistent huge pages::

 echo 20 > /proc/sys/vm/nr_hugepages

@@ -98,11 +112,12 @@ huge page pool to 20, allocating or freeing huge pages, as required.
|
|||||||
|
|
||||||
On a NUMA platform, the kernel will attempt to distribute the huge page pool
|
On a NUMA platform, the kernel will attempt to distribute the huge page pool
|
||||||
over all the set of allowed nodes specified by the NUMA memory policy of the
|
over all the set of allowed nodes specified by the NUMA memory policy of the
|
||||||
task that modifies nr_hugepages. The default for the allowed nodes--when the
|
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
|
||||||
task has default memory policy--is all on-line nodes with memory. Allowed
|
task has default memory policy--is all on-line nodes with memory. Allowed
|
||||||
nodes with insufficient available, contiguous memory for a huge page will be
|
nodes with insufficient available, contiguous memory for a huge page will be
|
||||||
silently skipped when allocating persistent huge pages. See the discussion
|
silently skipped when allocating persistent huge pages. See the
|
||||||
below of the interaction of task memory policy, cpusets and per node attributes
|
:ref:`discussion below <mem_policy_and_hp_alloc>`
|
||||||
|
of the interaction of task memory policy, cpusets and per node attributes
|
||||||
with the allocation and freeing of persistent huge pages.
|
with the allocation and freeing of persistent huge pages.
|
||||||
|
|
||||||
The success or failure of huge page allocation depends on the amount of
|
The success or failure of huge page allocation depends on the amount of
|
||||||
@@ -117,51 +132,52 @@ init files. This will enable the kernel to allocate huge pages early in
|
|||||||
the boot process when the possibility of getting physical contiguous pages
|
the boot process when the possibility of getting physical contiguous pages
|
||||||
is still very high. Administrators can verify the number of huge pages
|
is still very high. Administrators can verify the number of huge pages
|
||||||
actually allocated by checking the sysctl or meminfo. To check the per node
|
actually allocated by checking the sysctl or meminfo. To check the per node
|
||||||
distribution of huge pages in a NUMA system, use:
|
distribution of huge pages in a NUMA system, use::
|
||||||
|
|
||||||
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
|
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
|
||||||
|
|
||||||
/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
|
``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
|
||||||
huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
|
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
|
||||||
requested by applications. Writing any non-zero value into this file
|
requested by applications. Writing any non-zero value into this file
|
||||||
indicates that the hugetlb subsystem is allowed to try to obtain that
|
indicates that the hugetlb subsystem is allowed to try to obtain that
|
||||||
number of "surplus" huge pages from the kernel's normal page pool, when the
|
number of "surplus" huge pages from the kernel's normal page pool, when the
|
||||||
persistent huge page pool is exhausted. As these surplus huge pages become
|
persistent huge page pool is exhausted. As these surplus huge pages become
|
||||||
unused, they are freed back to the kernel's normal page pool.
|
unused, they are freed back to the kernel's normal page pool.
|
||||||
|
|
||||||
When increasing the huge page pool size via nr_hugepages, any existing surplus
|
When increasing the huge page pool size via ``nr_hugepages``, any existing
|
||||||
pages will first be promoted to persistent huge pages. Then, additional
|
surplus pages will first be promoted to persistent huge pages. Then, additional
|
||||||
huge pages will be allocated, if necessary and if possible, to fulfill
|
huge pages will be allocated, if necessary and if possible, to fulfill
|
||||||
the new persistent huge page pool size.
|
the new persistent huge page pool size.
|
||||||
|
|
||||||
The administrator may shrink the pool of persistent huge pages for
|
The administrator may shrink the pool of persistent huge pages for
|
||||||
the default huge page size by setting the nr_hugepages sysctl to a
|
the default huge page size by setting the ``nr_hugepages`` sysctl to a
|
||||||
smaller value. The kernel will attempt to balance the freeing of huge pages
|
smaller value. The kernel will attempt to balance the freeing of huge pages
|
||||||
across all nodes in the memory policy of the task modifying nr_hugepages.
|
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
|
||||||
Any free huge pages on the selected nodes will be freed back to the kernel's
|
Any free huge pages on the selected nodes will be freed back to the kernel's
|
||||||
normal page pool.
|
normal page pool.
|
||||||
|
|
||||||
Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
|
Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
|
||||||
it becomes less than the number of huge pages in use will convert the balance
|
it becomes less than the number of huge pages in use will convert the balance
|
||||||
of the in-use huge pages to surplus huge pages. This will occur even if
|
of the in-use huge pages to surplus huge pages. This will occur even if
|
||||||
the number of surplus pages it would exceed the overcommit value. As long as
|
the number of surplus pages would exceed the overcommit value. As long as
|
||||||
this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
|
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
|
||||||
increased sufficiently, or the surplus huge pages go out of use and are freed--
|
increased sufficiently, or the surplus huge pages go out of use and are freed--
|
||||||
no more surplus huge pages will be allowed to be allocated.
|
no more surplus huge pages will be allowed to be allocated.
|
||||||
|
|
||||||
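The caveat above can be modelled in a few lines. This is a simplified sketch, not the kernel's accounting: the function names are hypothetical, and it assumes that a further surplus page is permitted only while the surplus count is below the overcommit value.

```python
def surplus_after_shrink(in_use, new_nr_hugepages):
    """Pages converted to surplus when the pool shrinks below the in-use count."""
    return max(0, in_use - new_nr_hugepages)


def may_allocate_surplus(surplus, nr_overcommit_hugepages):
    """Assumption: another surplus page is allowed only below the overcommit value."""
    return surplus < nr_overcommit_hugepages


# 30 huge pages are in use and nr_hugepages is lowered to 20:
# the balance of 10 in-use pages becomes surplus.
surplus = surplus_after_shrink(in_use=30, new_nr_hugepages=20)
```

In this model, with an overcommit value of 5, no new surplus page can be allocated until the surplus pages are freed or the limits are raised.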
With support for multiple huge page pools at run-time available, much of
|
With support for multiple huge page pools at run-time available, much of
|
||||||
the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
|
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
|
||||||
The /proc interfaces discussed above have been retained for backwards
|
sysfs.
|
||||||
compatibility. The root huge page control directory in sysfs is:
|
The ``/proc`` interfaces discussed above have been retained for backwards
|
||||||
|
compatibility. The root huge page control directory in sysfs is::
|
||||||
|
|
||||||
/sys/kernel/mm/hugepages
|
/sys/kernel/mm/hugepages
|
||||||
|
|
||||||
For each huge page size supported by the running kernel, a subdirectory
|
For each huge page size supported by the running kernel, a subdirectory
|
||||||
will exist, of the form:
|
will exist, of the form::
|
||||||
|
|
||||||
hugepages-${size}kB
|
hugepages-${size}kB
|
||||||
|
|
||||||
Inside each of these directories, the same set of files will exist:
|
Inside each of these directories, the same set of files will exist::
|
||||||
|
|
||||||
nr_hugepages
|
nr_hugepages
|
||||||
nr_hugepages_mempolicy
|
nr_hugepages_mempolicy
|
||||||
@@ -172,37 +188,39 @@ Inside each of these directories, the same set of files will exist:
|
|||||||
|
|
||||||
which function as described above for the default huge page-sized case.
|
which function as described above for the default huge page-sized case.
|
||||||
|
|
||||||
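The per-size directory layout lends itself to a small reader. The sketch below assumes only the `hugepages-${size}kB` naming and the `nr_hugepages` file described above; the function name `read_pools` is hypothetical, and the `root` parameter exists so the code can be tried against a scratch directory.

```python
import os
import re

# Matches directory names of the form hugepages-<size>kB.
_POOL_DIR = re.compile(r"^hugepages-(\d+)kB$")


def read_pools(root="/sys/kernel/mm/hugepages"):
    """Return {size_kB: nr_hugepages} for each per-size directory under root."""
    pools = {}
    for name in os.listdir(root):
        match = _POOL_DIR.match(name)
        if match is None:
            continue  # ignore anything that is not a per-size directory
        with open(os.path.join(root, name, "nr_hugepages")) as f:
            pools[int(match.group(1))] = int(f.read().strip())
    return pools
```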
|
.. _mem_policy_and_hp_alloc:
|
||||||
|
|
||||||
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
|
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
|
||||||
===================================================================
|
===================================================================
|
||||||
|
|
||||||
Whether huge pages are allocated and freed via the /proc interface or
|
Whether huge pages are allocated and freed via the ``/proc`` interface or
|
||||||
the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
|
the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
|
||||||
nodes from which huge pages are allocated or freed are controlled by the
|
NUMA nodes from which huge pages are allocated or freed are controlled by the
|
||||||
NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
|
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
|
||||||
sysctl or attribute. When the nr_hugepages attribute is used, mempolicy
|
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
|
||||||
is ignored.
|
is ignored.
|
||||||
|
|
||||||
The recommended method to allocate or free huge pages to/from the kernel
|
The recommended method to allocate or free huge pages to/from the kernel
|
||||||
huge page pool, using the nr_hugepages example above, is:
|
huge page pool, using the ``nr_hugepages`` example above, is::
|
||||||
|
|
||||||
numactl --interleave <node-list> echo 20 \
|
numactl --interleave <node-list> echo 20 \
|
||||||
>/proc/sys/vm/nr_hugepages_mempolicy
|
>/proc/sys/vm/nr_hugepages_mempolicy
|
||||||
|
|
||||||
or, more succinctly:
|
or, more succinctly::
|
||||||
|
|
||||||
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
|
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
|
||||||
|
|
||||||
This will allocate or free abs(20 - nr_hugepages) to or from the nodes
|
This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
|
||||||
specified in <node-list>, depending on whether the number of persistent huge pages
|
specified in <node-list>, depending on whether the number of persistent huge pages
|
||||||
is initially less than or greater than 20, respectively. No huge pages will be
|
is initially less than or greater than 20, respectively. No huge pages will be
|
||||||
allocated nor freed on any node not included in the specified <node-list>.
|
allocated nor freed on any node not included in the specified <node-list>.
|
||||||
|
|
||||||
When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
|
When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
|
||||||
memory policy mode--bind, preferred, local or interleave--may be used. The
|
memory policy mode--bind, preferred, local or interleave--may be used. The
|
||||||
resulting effect on persistent huge page allocation is as follows:
|
resulting effect on persistent huge page allocation is as follows:
|
||||||
|
|
||||||
1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
|
#. Regardless of mempolicy mode [see
|
||||||
|
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
|
||||||
persistent huge pages will be distributed across the node or nodes
|
persistent huge pages will be distributed across the node or nodes
|
||||||
specified in the mempolicy as if "interleave" had been specified.
|
specified in the mempolicy as if "interleave" had been specified.
|
||||||
However, if a node in the policy does not contain sufficient contiguous
|
However, if a node in the policy does not contain sufficient contiguous
|
||||||
@@ -212,7 +230,7 @@ resulting effect on persistent huge page allocation is as follows:
|
|||||||
possibly, allocation of persistent huge pages on nodes not allowed by
|
possibly, allocation of persistent huge pages on nodes not allowed by
|
||||||
the task's memory policy.
|
the task's memory policy.
|
||||||
|
|
||||||
2) One or more nodes may be specified with the bind or interleave policy.
|
#. One or more nodes may be specified with the bind or interleave policy.
|
||||||
If more than one node is specified with the preferred policy, only the
|
If more than one node is specified with the preferred policy, only the
|
||||||
lowest numeric id will be used. Local policy will select the node where
|
lowest numeric id will be used. Local policy will select the node where
|
||||||
the task is running at the time the nodes_allowed mask is constructed.
|
the task is running at the time the nodes_allowed mask is constructed.
|
||||||
@@ -222,20 +240,20 @@ resulting effect on persistent huge page allocation is as follows:
|
|||||||
indeterminate. Thus, local policy is not very useful for this purpose.
|
indeterminate. Thus, local policy is not very useful for this purpose.
|
||||||
Any of the other mempolicy modes may be used to specify a single node.
|
Any of the other mempolicy modes may be used to specify a single node.
|
||||||
|
|
||||||
3) The nodes allowed mask will be derived from any non-default task mempolicy,
|
#. The nodes allowed mask will be derived from any non-default task mempolicy,
|
||||||
whether this policy was set explicitly by the task itself or one of its
|
whether this policy was set explicitly by the task itself or one of its
|
||||||
ancestors, such as numactl. This means that if the task is invoked from a
|
ancestors, such as numactl. This means that if the task is invoked from a
|
||||||
shell with non-default policy, that policy will be used. One can specify a
|
shell with non-default policy, that policy will be used. One can specify a
|
||||||
node list of "all" with numactl --interleave or --membind [-m] to achieve
|
node list of "all" with numactl --interleave or --membind [-m] to achieve
|
||||||
interleaving over all nodes in the system or cpuset.
|
interleaving over all nodes in the system or cpuset.
|
||||||
|
|
||||||
4) Any task mempolicy specified--e.g., using numactl--will be constrained by
|
#. Any task mempolicy specified--e.g., using numactl--will be constrained by
|
||||||
the resource limits of any cpuset in which the task runs. Thus, there will
|
the resource limits of any cpuset in which the task runs. Thus, there will
|
||||||
be no way for a task with non-default policy running in a cpuset with a
|
be no way for a task with non-default policy running in a cpuset with a
|
||||||
subset of the system nodes to allocate huge pages outside the cpuset
|
subset of the system nodes to allocate huge pages outside the cpuset
|
||||||
without first moving to a cpuset that contains all of the desired nodes.
|
without first moving to a cpuset that contains all of the desired nodes.
|
||||||
|
|
||||||
5) Boot-time huge page allocation attempts to distribute the requested number
|
#. Boot-time huge page allocation attempts to distribute the requested number
|
||||||
of huge pages over all on-line nodes with memory.
|
of huge pages over all on-line nodes with memory.
|
||||||
|
|
||||||
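The interleave-like distribution in the list above can be illustrated with a round-robin loop. This is a rough model, not the kernel's algorithm: the function name `interleave` and the `capacity` map (how many huge pages each allowed node can still supply) are assumptions made for the example.

```python
def interleave(count, capacity):
    """Distribute `count` huge pages round-robin over the nodes in `capacity`.

    `capacity` maps node id -> number of huge pages that node can still supply.
    Exhausted nodes are skipped, so fewer than `count` pages may be placed.
    """
    placed = {node: 0 for node in sorted(capacity)}
    remaining = count
    while remaining > 0:
        progressed = False
        for node in placed:
            if remaining == 0:
                break
            if placed[node] < capacity[node]:
                placed[node] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every allowed node is exhausted
    return placed
```

For example, `interleave(20, {0: 100, 1: 100})` places ten pages on each node, while a node that runs out of contiguous memory simply stops receiving pages and the rest go to the remaining allowed nodes.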
Per Node Hugepages Attributes
|
Per Node Hugepages Attributes
|
||||||
@@ -243,22 +261,22 @@ Per Node Hugepages Attributes
|
|||||||
|
|
||||||
A subset of the contents of the root huge page control directory in sysfs,
|
A subset of the contents of the root huge page control directory in sysfs,
|
||||||
described above, will be replicated under the system device of each
|
described above, will be replicated under the system device of each
|
||||||
NUMA node with memory in:
|
NUMA node with memory in::
|
||||||
|
|
||||||
/sys/devices/system/node/node[0-9]*/hugepages/
|
/sys/devices/system/node/node[0-9]*/hugepages/
|
||||||
|
|
||||||
Under this directory, the subdirectory for each supported huge page size
|
Under this directory, the subdirectory for each supported huge page size
|
||||||
contains the following attribute files:
|
contains the following attribute files::
|
||||||
|
|
||||||
nr_hugepages
|
nr_hugepages
|
||||||
free_hugepages
|
free_hugepages
|
||||||
surplus_hugepages
|
surplus_hugepages
|
||||||
|
|
||||||
The free_' and surplus_' attribute files are read-only. They return the number
|
The free\_' and surplus\_' attribute files are read-only. They return the number
|
||||||
of free and surplus [overcommitted] huge pages, respectively, on the parent
|
of free and surplus [overcommitted] huge pages, respectively, on the parent
|
||||||
node.
|
node.
|
||||||
|
|
||||||
The nr_hugepages attribute returns the total number of huge pages on the
|
The ``nr_hugepages`` attribute returns the total number of huge pages on the
|
||||||
specified node. When this attribute is written, the number of persistent huge
|
specified node. When this attribute is written, the number of persistent huge
|
||||||
pages on the parent node will be adjusted to the specified value, if sufficient
|
pages on the parent node will be adjusted to the specified value, if sufficient
|
||||||
resources exist, regardless of the task's mempolicy or cpuset constraints.
|
resources exist, regardless of the task's mempolicy or cpuset constraints.
|
||||||
@@ -267,43 +285,58 @@ Note that the number of overcommit and reserve pages remain global quantities,
|
|||||||
as we don't know until fault time, when the faulting task's mempolicy is
|
as we don't know until fault time, when the faulting task's mempolicy is
|
||||||
applied, from which node the huge page allocation will be attempted.
|
applied, from which node the huge page allocation will be attempted.
|
||||||
|
|
||||||
|
.. _using_huge_pages:
|
||||||
|
|
||||||
Using Huge Pages
|
Using Huge Pages
|
||||||
================
|
================
|
||||||
|
|
||||||
If the user applications are going to request huge pages using mmap system
|
If the user applications are going to request huge pages using mmap system
|
||||||
call, then it is required that the system administrator mount a file system of
|
call, then it is required that the system administrator mount a file system of
|
||||||
type hugetlbfs:
|
type hugetlbfs::
|
||||||
|
|
||||||
mount -t hugetlbfs \
|
mount -t hugetlbfs \
|
||||||
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
||||||
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
||||||
|
|
||||||
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
|
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
|
||||||
/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid
|
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.
|
||||||
options sets the owner and group of the root of the file system. By default
|
|
||||||
the uid and gid of the current process are taken. The mode option sets the
|
The ``uid`` and ``gid`` options sets the owner and group of the root of the
|
||||||
mode of root of file system to value & 01777. This value is given in octal.
|
file system. By default the ``uid`` and ``gid`` of the current process
|
||||||
By default the value 0755 is picked. If the platform supports multiple huge
|
are taken.
|
||||||
page sizes, the pagesize option can be used to specify the huge page size and
|
|
||||||
associated pool. pagesize is specified in bytes. If pagesize is not specified
|
The ``mode`` option sets the mode of root of file system to value & 01777.
|
||||||
the platform's default huge page size and associated pool will be used. The
|
This value is given in octal. By default the value 0755 is picked.
|
||||||
size option sets the maximum value of memory (huge pages) allowed for that
|
|
||||||
filesystem (/mnt/huge). The size option can be specified in bytes, or as a
|
If the platform supports multiple huge page sizes, the ``pagesize`` option can
|
||||||
percentage of the specified huge page pool (nr_hugepages). The size is
|
be used to specify the huge page size and associated pool. ``pagesize``
|
||||||
rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum
|
is specified in bytes. If ``pagesize`` is not specified the platform's
|
||||||
value of memory (huge pages) allowed for the filesystem. min_size can be
|
default huge page size and associated pool will be used.
|
||||||
specified in the same way as size, either bytes or a percentage of the
|
|
||||||
huge page pool. At mount time, the number of huge pages specified by
|
The ``size`` option sets the maximum value of memory (huge pages) allowed
|
||||||
min_size are reserved for use by the filesystem. If there are not enough
|
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
|
||||||
free huge pages available, the mount will fail. As huge pages are allocated
|
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
|
||||||
to the filesystem and freed, the reserve count is adjusted so that the sum
|
The size is rounded down to HPAGE_SIZE boundary.
|
||||||
of allocated and reserved huge pages is always at least min_size. The option
|
|
||||||
nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the
|
The ``min_size`` option sets the minimum value of memory (huge pages) allowed
|
||||||
size, min_size or nr_inodes option is not provided on command line then
|
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
|
||||||
no limits are set. For pagesize, size, min_size and nr_inodes options, you
|
either bytes or a percentage of the huge page pool.
|
||||||
can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K
|
At mount time, the number of huge pages specified by ``min_size`` are reserved
|
||||||
has the same meaning as size=2048.
|
for use by the filesystem.
|
||||||
|
If there are not enough free huge pages available, the mount will fail.
|
||||||
|
As huge pages are allocated to the filesystem and freed, the reserve count
|
||||||
|
is adjusted so that the sum of allocated and reserved huge pages is always
|
||||||
|
at least ``min_size``.
|
||||||
|
|
||||||
|
The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
|
||||||
|
can use.
|
||||||
|
|
||||||
|
If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
|
||||||
|
command line then no limits are set.
|
||||||
|
|
||||||
|
For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
|
||||||
|
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
|
||||||
|
For example, size=2K has the same meaning as size=2048.
|
||||||
|
|
||||||
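The suffix and rounding rules for the mount options can be sketched as follows. The helper names are hypothetical, the percentage form accepted by ``size`` and ``min_size`` is deliberately not handled, and the huge page size is passed in explicitly rather than read from the system.

```python
# Scale factors for the K/M/G suffixes (either case).
_SCALE = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_size(text):
    """Convert a byte count with an optional K/M/G suffix to bytes."""
    suffix = text[-1].lower()
    if suffix in _SCALE:
        return int(text[:-1]) * _SCALE[suffix]
    return int(text)


def round_down_to_huge_page(nbytes, huge_page_bytes):
    """Round a size down to a multiple of the huge page size."""
    return nbytes - (nbytes % huge_page_bytes)
```

Here `parse_size("2K")` and `parse_size("2048")` give the same value, and a 5M request on a filesystem with 2M huge pages rounds down to 4M.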
While read system calls are supported on files that reside on hugetlb
|
While read system calls are supported on files that reside on hugetlb
|
||||||
file systems, write system calls are not.
|
file systems, write system calls are not.
|
||||||
@@ -313,12 +346,12 @@ used to change the file attributes on hugetlbfs.
|
|||||||
|
|
||||||
Also, it is important to note that no such mount command is required if
|
Also, it is important to note that no such mount command is required if
|
||||||
applications are going to use only shmat/shmget system calls or mmap with
|
applications are going to use only shmat/shmget system calls or mmap with
|
||||||
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb
|
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
|
||||||
below.
|
:ref:`map_hugetlb <map_hugetlb>` below.
|
||||||
|
|
||||||
Users who wish to use hugetlb memory via shared memory segment should be a
|
Users who wish to use hugetlb memory via shared memory segment should be
|
||||||
member of a supplementary group and system admin needs to configure that gid
|
members of a supplementary group and system admin needs to configure that gid
|
||||||
into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different
|
into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different
|
||||||
applications to use any combination of mmaps and shm* calls, though the mount of
|
applications to use any combination of mmaps and shm* calls, though the mount of
|
||||||
filesystem will be required for using mmap calls without MAP_HUGETLB.
|
filesystem will be required for using mmap calls without MAP_HUGETLB.
|
||||||
|
|
||||||
@@ -332,20 +365,18 @@ a hugetlb page and the length is smaller than the hugepage size.
|
|||||||
Examples
|
Examples
|
||||||
========
|
========
|
||||||
|
|
||||||
1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c
|
.. _map_hugetlb:
|
||||||
|
|
||||||
2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c
|
``map_hugetlb``
|
||||||
|
see tools/testing/selftests/vm/map_hugetlb.c
|
||||||
|
|
||||||
3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c
|
``hugepage-shm``
|
||||||
|
see tools/testing/selftests/vm/hugepage-shm.c
|
||||||
|
|
||||||
4) The libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs) library
|
``hugepage-mmap``
|
||||||
provides a wide range of userspace tools to help with huge page usability,
|
see tools/testing/selftests/vm/hugepage-mmap.c
|
||||||
environment setup, and control.
|
|
||||||
|
|
||||||
Kernel development regression testing
|
The `libhugetlbfs`_ library provides a wide range of userspace tools
|
||||||
=====================================
|
to help with huge page usability, environment setup, and control.
|
||||||
|
|
||||||
The most complete set of hugetlb tests are in the libhugetlbfs repository.
|
.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
|
||||||
If you modify any hugetlb related code, use the libhugetlbfs test suite
|
|
||||||
to check for regressions. In addition, if you add any new hugetlb
|
|
||||||
functionality, please add appropriate tests to libhugetlbfs.
|
|
||||||
@@ -1,4 +1,11 @@
|
|||||||
MOTIVATION
|
.. _idle_page_tracking:
|
||||||
|
|
||||||
|
==================
|
||||||
|
Idle Page Tracking
|
||||||
|
==================
|
||||||
|
|
||||||
|
Motivation
|
||||||
|
==========
|
||||||
|
|
||||||
The idle page tracking feature makes it possible to track which memory pages are being
|
The idle page tracking feature makes it possible to track which memory pages are being
|
||||||
accessed by a workload and which are idle. This information can be useful for
|
accessed by a workload and which are idle. This information can be useful for
|
||||||
@@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster.
|
|||||||
|
|
||||||
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
|
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
|
||||||
|
|
||||||
USER API
|
.. _user_api:
|
||||||
|
|
||||||
The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently,
|
User API
|
||||||
it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap.
|
========
|
||||||
|
|
||||||
|
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
|
||||||
|
Currently, it consists of the only read-write file,
|
||||||
|
``/sys/kernel/mm/page_idle/bitmap``.
|
||||||
|
|
||||||
The file implements a bitmap where each bit corresponds to a memory page. The
|
The file implements a bitmap where each bit corresponds to a memory page. The
|
||||||
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
|
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
|
||||||
@@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
|
|||||||
set, the corresponding page is idle.
|
set, the corresponding page is idle.
|
||||||
|
|
||||||
A page is considered idle if it has not been accessed since it was marked idle
|
A page is considered idle if it has not been accessed since it was marked idle
|
||||||
(for more details on what "accessed" actually means see the IMPLEMENTATION
|
(for more details on what "accessed" actually means see the :ref:`Implementation
|
||||||
DETAILS section). To mark a page idle one has to set the bit corresponding to
|
Details <impl_details>` section).
|
||||||
|
To mark a page idle one has to set the bit corresponding to
|
||||||
the page by writing to the file. A value written to the file is OR-ed with the
|
the page by writing to the file. A value written to the file is OR-ed with the
|
||||||
current bitmap value.
|
current bitmap value.
|
||||||
|
|
||||||
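The PFN-to-bit mapping can be written down directly. The sketch below is illustrative only; the function names are hypothetical, and it handles a single 8-byte array element in native byte order rather than performing any file I/O.

```python
import struct


def bitmap_position(pfn):
    """Return (byte_offset, bit) locating `pfn` in the bitmap file."""
    return (pfn // 64) * 8, pfn % 64


def is_idle(word_bytes, bit):
    """Test one bit of an 8-byte array element read in native byte order."""
    (word,) = struct.unpack("=Q", word_bytes)
    return bool((word >> bit) & 1)


def idle_mask(bit):
    """Build the 8-byte value to write in order to mark one page idle."""
    return struct.pack("=Q", 1 << bit)
```

For PFN 130, `bitmap_position` gives byte offset 16 and bit 2. Because a written value is OR-ed with the current bitmap, writing `idle_mask(2)` at that offset would leave the other 63 bits of the element unchanged.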
@@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
|
|||||||
and hence such pages are never reported idle.
|
and hence such pages are never reported idle.
|
||||||
|
|
||||||
For huge pages the idle flag is set only on the head page, so one has to read
|
For huge pages the idle flag is set only on the head page, so one has to read
|
||||||
/proc/kpageflags in order to correctly count idle huge pages.
|
``/proc/kpageflags`` in order to correctly count idle huge pages.
|
||||||
|
|
||||||
Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return
|
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
|
||||||
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
|
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
|
||||||
if the size of the read/write is not a multiple of 8 bytes. Writing to
|
if the size of the read/write is not a multiple of 8 bytes. Writing to
|
||||||
this file beyond max PFN will return -ENXIO.
|
this file beyond max PFN will return -ENXIO.
|
||||||
@@ -41,21 +53,26 @@ That said, in order to estimate the amount of pages that are not used by a
|
|||||||
workload one should:
|
workload one should:
|
||||||
|
|
||||||
1. Mark all the workload's pages as idle by setting corresponding bits in
|
1. Mark all the workload's pages as idle by setting corresponding bits in
|
||||||
/sys/kernel/mm/page_idle/bitmap. The pages can be found by reading
|
``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
|
||||||
/proc/pid/pagemap if the workload is represented by a process, or by
|
``/proc/pid/pagemap`` if the workload is represented by a process, or by
|
||||||
filtering out alien pages using /proc/kpagecgroup in case the workload is
|
filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
|
||||||
placed in a memory cgroup.
|
is placed in a memory cgroup.
|
||||||
|
|
||||||
2. Wait until the workload accesses its working set.
|
2. Wait until the workload accesses its working set.
|
||||||
|
|
||||||
3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If
|
3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
|
||||||
one wants to ignore certain types of pages, e.g. mlocked pages since they
|
If one wants to ignore certain types of pages, e.g. mlocked pages since they
|
||||||
are not reclaimable, he or she can filter them out using /proc/kpageflags.
|
are not reclaimable, he or she can filter them out using
|
||||||
|
``/proc/kpageflags``.
|
||||||
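As a rough illustration of steps 1 and 3, the sketch below computes where a page's bit lives in the bitmap and counts the idle bits in a buffer read from it. It assumes the layout given in the User API section of this document (an array of native-endian 8-byte words, with the page at PFN i mapped to bit i % 64 of word i / 64). The helper names are hypothetical and not part of any kernel interface.

```python
import struct

WORD_BYTES = 8        # reads and writes must start on an 8-byte boundary
BITS_PER_WORD = 64    # each 8-byte word covers 64 consecutive PFNs


def bitmap_location(pfn):
    """Return (byte offset of the 8-byte word, bit index in it) for a PFN."""
    return (pfn // BITS_PER_WORD) * WORD_BYTES, pfn % BITS_PER_WORD


def count_idle(buf):
    """Count the set (idle) bits in a buffer read from the bitmap file."""
    if len(buf) % WORD_BYTES:
        raise ValueError("buffer length must be a multiple of 8 bytes")
    # "=" selects native byte order with standard sizes, "Q" is an
    # unsigned 8-byte integer.
    words = struct.unpack("=%dQ" % (len(buf) // WORD_BYTES), buf)
    return sum(bin(word).count("1") for word in words)
```

A caller would seek to the offset returned by ``bitmap_location()`` before reading or writing, which keeps every access aligned and avoids the -EINVAL case described above.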
|
|
||||||
See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap,
|
See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
|
||||||
/proc/kpageflags, and /proc/kpagecgroup.
|
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
|
||||||
|
``/proc/kpagecgroup``.
|
||||||
|
|
||||||
IMPLEMENTATION DETAILS
|
.. _impl_details:
|
||||||
|
|
||||||
|
Implementation Details
|
||||||
|
======================
|
||||||
|
|
||||||
The kernel internally keeps track of accesses to user memory pages in order to
|
The kernel internally keeps track of accesses to user memory pages in order to
|
||||||
reclaim unreferenced pages first on memory shortage conditions. A page is
|
reclaim unreferenced pages first on memory shortage conditions. A page is
|
||||||
@@ -77,7 +94,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
|
|||||||
exceeding the dirty memory limit, it is not marked referenced.
|
exceeding the dirty memory limit, it is not marked referenced.
|
||||||
|
|
||||||
The idle memory tracking feature adds a new page flag, the Idle flag. This flag
|
The idle memory tracking feature adds a new page flag, the Idle flag. This flag
|
||||||
is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API
|
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
|
||||||
|
:ref:`User API <user_api>`
|
||||||
section), and cleared automatically whenever a page is referenced as defined
|
section), and cleared automatically whenever a page is referenced as defined
|
||||||
above.
|
above.
|
||||||
|
|
||||||
36
Documentation/admin-guide/mm/index.rst
Normal file
@@ -0,0 +1,36 @@
|
|||||||
|
=================
|
||||||
|
Memory Management
|
||||||
|
=================
|
||||||
|
|
||||||
|
The Linux memory management subsystem is responsible, as the name implies,
|
||||||
|
for managing the memory in the system. This includes implementation of
|
||||||
|
virtual memory and demand paging, memory allocation both for kernel
|
||||||
|
internal structures and user space programs, mapping of files into
|
||||||
|
processes address space and many other cool things.
|
||||||
|
|
||||||
|
Linux memory management is a complex system with many configurable
|
||||||
|
settings. Most of these settings are available via the ``/proc``
|
||||||
|
filesystem and can be queried and adjusted using ``sysctl``. These APIs
|
||||||
|
are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
|
||||||
|
|
||||||
|
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
|
||||||
|
|
||||||
|
Linux memory management has its own jargon and if you are not yet
|
||||||
|
familiar with it, consider reading
|
||||||
|
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
|
||||||
|
|
||||||
|
Here we document in detail how to interact with various mechanisms in
|
||||||
|
the Linux memory management subsystem.
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
concepts
|
||||||
|
hugetlbpage
|
||||||
|
idle_page_tracking
|
||||||
|
ksm
|
||||||
|
numa_memory_policy
|
||||||
|
pagemap
|
||||||
|
soft-dirty
|
||||||
|
transhuge
|
||||||
|
userfaultfd
|
||||||
189
Documentation/admin-guide/mm/ksm.rst
Normal file
@@ -0,0 +1,189 @@
|
|||||||
|
.. _admin_guide_ksm:
|
||||||
|
|
||||||
|
=======================
|
||||||
|
Kernel Samepage Merging
|
||||||
|
=======================
|
||||||
|
|
||||||
|
Overview
|
||||||
|
========
|
||||||
|
|
||||||
|
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
|
||||||
|
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
|
||||||
|
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/.
|
||||||
|
|
||||||
|
KSM was originally developed for use with KVM (where it was known as
|
||||||
|
Kernel Shared Memory), to fit more virtual machines into physical memory,
|
||||||
|
by sharing the data common between them. But it can be useful to any
|
||||||
|
application which generates many instances of the same data.
|
||||||
|
|
||||||
|
The KSM daemon ksmd periodically scans those areas of user memory
|
||||||
|
which have been registered with it, looking for pages of identical
|
||||||
|
content which can be replaced by a single write-protected page (which
|
||||||
|
is automatically copied if a process later wants to update its
|
||||||
|
content). The number of pages that the KSM daemon scans in a single pass
|
||||||
|
and the time between the passes are configured using :ref:`sysfs
|
||||||
|
interface <ksm_sysfs>`.
|
||||||
|
|
||||||
|
KSM only merges anonymous (private) pages, never pagecache (file) pages.
|
||||||
|
KSM's merged pages were originally locked into kernel memory, but can now
|
||||||
|
be swapped out just like other user pages (but sharing is broken when they
|
||||||
|
are swapped back in: ksmd must rediscover their identity and merge again).
|
||||||
|
|
||||||
|
Controlling KSM with madvise
|
||||||
|
============================
|
||||||
|
|
||||||
|
KSM only operates on those areas of address space which an application
|
||||||
|
has advised to be likely candidates for merging, by using the madvise(2)
|
||||||
|
system call::
|
||||||
|
|
||||||
|
int madvise(addr, length, MADV_MERGEABLE)
|
||||||
|
|
||||||
|
The app may call
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
int madvise(addr, length, MADV_UNMERGEABLE)
|
||||||
|
|
||||||
|
to cancel that advice and restore unshared pages: whereupon KSM
|
||||||
|
unmerges whatever it merged in that range. Note: this unmerging call
|
||||||
|
may suddenly require more memory than is available - possibly failing
|
||||||
|
with EAGAIN, but more probably arousing the Out-Of-Memory killer.
|
||||||
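The two calls above can be tried from user space with a small sketch like the following. It assumes Python 3.8 or later on Linux, where ``mmap.madvise()`` is available, and it falls back to the generic Linux values 12 and 13 if the interpreter does not export the constants (an assumption that does not hold on every architecture). The function name is hypothetical.

```python
import mmap

# Assumption: generic Linux ABI values, used only if this Python build
# does not provide the constants itself.
MADV_MERGEABLE = getattr(mmap, "MADV_MERGEABLE", 12)
MADV_UNMERGEABLE = getattr(mmap, "MADV_UNMERGEABLE", 13)


def try_ksm_advice(length=16 * mmap.PAGESIZE):
    """Advise a private anonymous mapping as mergeable, then cancel it.

    Returns None if both calls succeed, otherwise the errno that
    madvise() reported.
    """
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region.madvise(MADV_MERGEABLE)     # register the range with KSM
        region.madvise(MADV_UNMERGEABLE)   # cancel the advice again
        return None
    except OSError as err:
        return err.errno
    finally:
        region.close()
```

The mapping is private and anonymous because those are the only pages KSM merges. Returning the errno rather than raising makes it easy to tell whether the running kernel accepted the advice.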
|
|
||||||
|
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
|
||||||
|
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
|
||||||
|
built with CONFIG_KSM=y, those calls will normally succeed: even if the
|
||||||
|
KSM daemon is not currently running, MADV_MERGEABLE still registers
|
||||||
|
the range for whenever the KSM daemon is started; even if the range
|
||||||
|
cannot contain any pages which KSM could actually merge; even if
|
||||||
|
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
|
||||||
|
|
||||||
|
If a region of memory must be split into at least one new MADV_MERGEABLE
|
||||||
|
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
|
||||||
|
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
|
||||||
|
|
||||||
|
Like other madvise calls, they are intended for use on mapped areas of
|
||||||
|
the user address space: they will report ENOMEM if the specified range
|
||||||
|
includes unmapped gaps (though working on the intervening mapped areas),
|
||||||
|
and might fail with EAGAIN if there is not enough memory for internal structures.
|
||||||
|
|
||||||
|
Applications should be considerate in their use of MADV_MERGEABLE,
|
||||||
|
restricting its use to areas likely to benefit. KSM's scans may use a lot
|
||||||
|
of processing power: some installations will disable KSM for that reason.
|
||||||
|
|
||||||
|
.. _ksm_sysfs:
|
||||||
|
|
||||||
|
KSM daemon sysfs interface
|
||||||
|
==========================
|
||||||
|
|
||||||
|
The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
|
||||||
|
readable by all but writable only by root:
|
||||||
|
|
||||||
|
pages_to_scan
|
||||||
|
how many pages to scan before ksmd goes to sleep
|
||||||
|
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
|
||||||
|
|
||||||
|
Default: 100 (chosen for demonstration purposes)
|
||||||
|
|
||||||
|
sleep_millisecs
|
||||||
|
how many milliseconds ksmd should sleep before next scan
|
||||||
|
e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
|
||||||
|
|
||||||
|
Default: 20 (chosen for demonstration purposes)
|
||||||
|
|
||||||
|
merge_across_nodes
|
||||||
|
specifies if pages from different NUMA nodes can be merged.
|
||||||
|
When set to 0, ksm merges only pages which physically reside
|
||||||
|
in the memory area of same NUMA node. That brings lower
|
||||||
|
latency to access of shared pages. Systems with more nodes, at
|
||||||
|
significant NUMA distances, are likely to benefit from the
|
||||||
|
lower latency of setting 0. Smaller systems, which need to
|
||||||
|
minimize memory usage, are likely to benefit from the greater
|
||||||
|
sharing of setting 1 (default). You may wish to compare how
|
||||||
|
your system performs under each setting, before deciding on
|
||||||
|
which to use. ``merge_across_nodes`` setting can be changed only
|
||||||
|
when there are no ksm shared pages in the system: set ``run`` to 2 to
|
||||||
|
unmerge pages first, then to 1 after changing
|
||||||
|
``merge_across_nodes``, to remerge according to the new setting.
|
||||||
|
|
||||||
|
Default: 1 (merging across nodes as in earlier releases)
|
||||||
|
|
||||||
|
run
|
||||||
|
* set to 0 to stop ksmd from running but keep merged pages,
|
||||||
|
* set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
|
||||||
|
* set to 2 to stop ksmd and unmerge all pages currently merged, but
|
||||||
|
leave mergeable areas registered for next run.
|
||||||
|
|
||||||
|
Default: 0 (must be changed to 1 to activate KSM, except if
|
||||||
|
CONFIG_SYSFS is disabled)
|
||||||
|
|
||||||
|
use_zero_pages
|
||||||
|
specifies whether empty pages (i.e. allocated pages that only
|
||||||
|
contain zeroes) should be treated specially. When set to 1,
|
||||||
|
empty pages are merged with the kernel zero page(s) instead of
|
||||||
|
with each other as it would happen normally. This can improve
|
||||||
|
the performance on architectures with coloured zero pages,
|
||||||
|
depending on the workload. Care should be taken when enabling
|
||||||
|
this setting, as it can potentially degrade the performance of
|
||||||
|
KSM for some workloads, for example if the checksums of pages
|
||||||
|
candidate for merging match the checksum of an empty
|
||||||
|
page. This setting can be changed at any time; it is only
|
||||||
|
effective for pages merged after the change.
|
||||||
|
|
||||||
|
Default: 0 (normal KSM behaviour as in earlier releases)
|
||||||
|
|
||||||
|
max_page_sharing
|
||||||
|
Maximum sharing allowed for each KSM page. This enforces a
|
||||||
|
deduplication limit to avoid high latency for virtual memory
|
||||||
|
operations that involve traversal of the virtual mappings that
|
||||||
|
share the KSM page. The minimum value is 2 as a newly created
|
||||||
|
KSM page will have at least two sharers. The higher this value
|
||||||
|
the faster KSM will merge the memory and the higher the
|
||||||
|
deduplication factor will be, but the slower the worst case
|
||||||
|
virtual mappings traversal could be for any given KSM
|
||||||
|
page. Slowing down this traversal means there will be higher
|
||||||
|
latency for certain virtual memory operations happening during
|
||||||
|
swapping, compaction, NUMA balancing and page migration, in
|
||||||
|
turn decreasing responsiveness for the caller of those virtual
|
||||||
|
memory operations. The scheduler latency of other tasks not
|
||||||
|
involved with the VM operations doing the virtual mappings
|
||||||
|
traversal is not affected by this parameter as these
|
||||||
|
traversals are always schedule friendly themselves.
|
||||||
|
|
||||||
|
stable_node_chains_prune_millisecs
|
||||||
|
specifies how frequently KSM checks the metadata of the pages
|
||||||
|
that hit the deduplication limit for stale information.
|
||||||
|
Smaller millisecs values will free up the KSM metadata with
|
||||||
|
lower latency, but they will make ksmd use more CPU during the
|
||||||
|
scan. It's a no-op if not a single KSM page has hit the
|
||||||
|
``max_page_sharing`` yet.
|
||||||
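The sequence described under ``merge_across_nodes`` (unmerge with ``run`` set to 2, change the setting, then set ``run`` back to 1) can be sketched as below. This is an illustration only: the helper names are hypothetical, the ``base`` parameter exists so the logic can be exercised against a scratch directory, and writing to the real files requires root.

```python
import os

KSM_DIR = "/sys/kernel/mm/ksm"


def write_knob(name, value, base=KSM_DIR):
    """Write a value to one of the KSM control files."""
    with open(os.path.join(base, name), "w") as knob:
        knob.write(str(value))


def set_merge_across_nodes(value, base=KSM_DIR):
    """Change merge_across_nodes using the unmerge/remerge sequence."""
    write_knob("run", 2, base)                     # stop ksmd, unmerge pages
    write_knob("merge_across_nodes", value, base)  # now allowed to change
    write_knob("run", 1, base)                     # restart ksmd to remerge
```

Unmerging can need considerably more memory than is currently in use, so the first write deserves the same caution as MADV_UNMERGEABLE.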
|
|
||||||
|
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
|
||||||
|
|
||||||
|
pages_shared
|
||||||
|
how many shared pages are being used
|
||||||
|
pages_sharing
|
||||||
|
how many more sites are sharing them, i.e. how much is saved
|
||||||
|
pages_unshared
|
||||||
|
how many pages are unique but repeatedly checked for merging
|
||||||
|
pages_volatile
|
||||||
|
how many pages changing too fast to be placed in a tree
|
||||||
|
full_scans
|
||||||
|
how many times all mergeable areas have been scanned
|
||||||
|
stable_node_chains
|
||||||
|
the number of KSM pages that hit the ``max_page_sharing`` limit
|
||||||
|
stable_node_dups
|
||||||
|
number of duplicated KSM pages
|
||||||
|
|
||||||
|
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
|
||||||
|
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
|
||||||
|
indicates wasted effort. ``pages_volatile`` embraces several
|
||||||
|
different kinds of activity, but a high proportion there would also
|
||||||
|
indicate poor use of madvise MADV_MERGEABLE.
|
||||||
|
|
||||||
|
The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
|
||||||
|
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
|
||||||
|
be increased accordingly.
|
||||||
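The two ratios discussed above can be computed from the counter files with a sketch along these lines. The function names are hypothetical, and a ratio is reported as ``None`` when its denominator is zero (for example before ksmd has merged anything).

```python
import os

COUNTERS = ("pages_shared", "pages_sharing", "pages_unshared",
            "pages_volatile")


def read_ksm_counters(base="/sys/kernel/mm/ksm"):
    """Read the KSM effectiveness counters into a dict of integers."""
    counters = {}
    for name in COUNTERS:
        with open(os.path.join(base, name)) as counter_file:
            counters[name] = int(counter_file.read())
    return counters


def sharing_ratios(counters):
    """Return (pages_sharing / pages_shared, pages_unshared / pages_sharing)."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return (ratio(counters["pages_sharing"], counters["pages_shared"]),
            ratio(counters["pages_unshared"], counters["pages_sharing"]))
```

A large first value suggests good sharing, while a large second value suggests that ksmd is spending effort on pages that never merge.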
|
|
||||||
|
--
|
||||||
|
Izik Eidus,
|
||||||
|
Hugh Dickins, 17 Nov 2009
|
||||||
495
Documentation/admin-guide/mm/numa_memory_policy.rst
Normal file
@@ -0,0 +1,495 @@
|
|||||||
|
.. _numa_memory_policy:
|
||||||
|
|
||||||
|
==================
|
||||||
|
NUMA Memory Policy
|
||||||
|
==================
|
||||||
|
|
||||||
|
What is NUMA Memory Policy?
|
||||||
|
============================
|
||||||
|
|
||||||
|
In the Linux kernel, "memory policy" determines from which node the kernel will
|
||||||
|
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
|
||||||
|
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
|
||||||
|
The current memory policy support was added to Linux 2.6 around May 2004. This
|
||||||
|
document attempts to describe the concepts and APIs of the 2.6 memory policy
|
||||||
|
support.
|
||||||
|
|
||||||
|
Memory policies should not be confused with cpusets
|
||||||
|
(``Documentation/cgroup-v1/cpusets.txt``)
|
||||||
|
which is an administrative mechanism for restricting the nodes from which
|
||||||
|
memory may be allocated by a set of processes. Memory policies are a
|
||||||
|
programming interface that a NUMA-aware application can take advantage of. When
|
||||||
|
both cpusets and policies are applied to a task, the restrictions of the cpuset
|
||||||
|
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
|
||||||
|
below for more details.
|
||||||
|
|
||||||
|
Memory Policy Concepts
|
||||||
|
======================
|
||||||
|
|
||||||
|
Scope of Memory Policies
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
The Linux kernel supports *scopes* of memory policy, described here from
|
||||||
|
most general to most specific:
|
||||||
|
|
||||||
|
System Default Policy
|
||||||
|
this policy is "hard coded" into the kernel. It is the policy
|
||||||
|
that governs all page allocations that aren't controlled by
|
||||||
|
one of the more specific policy scopes discussed below. When
|
||||||
|
the system is "up and running", the system default policy will
|
||||||
|
use "local allocation" described below. However, during boot
|
||||||
|
up, the system default policy will be set to interleave
|
||||||
|
allocations across all nodes with "sufficient" memory, so as
|
||||||
|
not to overload the initial boot node with boot-time
|
||||||
|
allocations.
|
||||||
|
|
||||||
|
Task/Process Policy
|
||||||
|
this is an optional, per-task policy. When defined for a
|
||||||
|
specific task, this policy controls all page allocations made
|
||||||
|
by or on behalf of the task that aren't controlled by a more
|
||||||
|
specific scope. If a task does not define a task policy, then
|
||||||
|
all page allocations that would have been controlled by the
|
||||||
|
task policy "fall back" to the System Default Policy.
|
||||||
|
|
||||||
|
The task policy applies to the entire address space of a task. Thus,
|
||||||
|
it is inheritable, and indeed is inherited, across both fork()
|
||||||
|
[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
|
||||||
|
to establish the task policy for a child task exec()'d from an
|
||||||
|
executable image that has no awareness of memory policy. See the
|
||||||
|
:ref:`Memory Policy APIs <memory_policy_apis>` section,
|
||||||
|
below, for an overview of the system call
|
||||||
|
that a task may use to set/change its task/process policy.
|
||||||
|
|
||||||
|
In a multi-threaded task, task policies apply only to the thread
|
||||||
|
[Linux kernel task] that installs the policy and any threads
|
||||||
|
subsequently created by that thread. Any sibling threads existing
|
||||||
|
at the time a new task policy is installed retain their current
|
||||||
|
policy.
|
||||||
|
|
||||||
|
A task policy applies only to pages allocated after the policy is
|
||||||
|
installed. Any pages already faulted in by the task when the task
|
||||||
|
changes its task policy remain where they were allocated based on
|
||||||
|
the policy at the time they were allocated.
|
||||||
|
|
||||||
|
.. _vma_policy:
|
||||||
|
|
||||||
|
VMA Policy
|
||||||
|
A "VMA" or "Virtual Memory Area" refers to a range of a task's
|
||||||
|
virtual address space. A task may define a specific policy for a range
|
||||||
|
of its virtual address space. See the
|
||||||
|
:ref:`Memory Policy APIs <memory_policy_apis>` section,
|
||||||
|
below, for an overview of the mbind() system call used to set a VMA
|
||||||
|
policy.
|
||||||
|
|
||||||
|
A VMA policy will govern the allocation of pages that back
|
||||||
|
this region of the address space. Any regions of the task's
|
||||||
|
address space that don't have an explicit VMA policy will fall
|
||||||
|
back to the task policy, which may itself fall back to the
|
||||||
|
System Default Policy.
|
||||||
|
|
||||||
|
VMA policies have a few complicating details:
|
||||||
|
|
||||||
|
* VMA policy applies ONLY to anonymous pages. These include
|
||||||
|
pages allocated for anonymous segments, such as the task
|
||||||
|
stack and heap, and any regions of the address space
|
||||||
|
mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
|
||||||
|
applied to a file mapping, it will be ignored if the mapping
|
||||||
|
used the MAP_SHARED flag. If the file mapping used the
|
||||||
|
MAP_PRIVATE flag, the VMA policy will only be applied when
|
||||||
|
an anonymous page is allocated on an attempt to write to the
|
||||||
|
mapping-- i.e., at Copy-On-Write.
|
||||||
|
|
||||||
|
* VMA policies are shared between all tasks that share a
|
||||||
|
virtual address space--a.k.a. threads--independent of when
|
||||||
|
the policy is installed; and they are inherited across
|
||||||
|
fork(). However, because VMA policies refer to a specific
|
||||||
|
region of a task's address space, and because the address
|
||||||
|
space is discarded and recreated on exec*(), VMA policies
|
||||||
|
are NOT inheritable across exec(). Thus, only NUMA-aware
|
||||||
|
applications may use VMA policies.
|
||||||
|
|
||||||
|
* A task may install a new VMA policy on a sub-range of a
|
||||||
|
previously mmap()ed region. When this happens, Linux splits
|
||||||
|
the existing virtual memory area into 2 or 3 VMAs, each with
|
||||||
|
its own policy.
|
||||||
|
|
||||||
|
* By default, VMA policy applies only to pages allocated after
|
||||||
|
the policy is installed. Any pages already faulted into the
|
||||||
|
VMA range remain where they were allocated based on the
|
||||||
|
policy at the time they were allocated. However, since
|
||||||
|
2.6.16, Linux supports page migration via the mbind() system
|
||||||
|
call, so that page contents can be moved to match a newly
|
||||||
|
installed policy.
|
||||||
|
|
||||||
|
Shared Policy
|
||||||
|
Conceptually, shared policies apply to "memory objects" mapped
|
||||||
|
shared into one or more tasks' distinct address spaces. An
|
||||||
|
application installs shared policies the same way as VMA
|
||||||
|
policies--using the mbind() system call specifying a range of
|
||||||
|
virtual addresses that map the shared object. However, unlike
|
||||||
|
VMA policies, which can be considered to be an attribute of a
|
||||||
|
range of a task's address space, shared policies apply
|
||||||
|
directly to the shared object. Thus, all tasks that attach to
|
||||||
|
the object share the policy, and all pages allocated for the
|
||||||
|
shared object, by any task, will obey the shared policy.
|
||||||
|
|
||||||
|
As of 2.6.22, only shared memory segments, created by shmget() or
|
||||||
|
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
|
||||||
|
policy support was added to Linux, the associated data structures were
|
||||||
|
added to hugetlbfs shmem segments. At the time, hugetlbfs did not
|
||||||
|
support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
|
||||||
|
shmem segments were never "hooked up" to the shared policy support.
|
||||||
|
Although hugetlbfs segments now support lazy allocation, their support
|
||||||
|
for shared policy has not been completed.
|
||||||
|
|
||||||
|
As mentioned above in :ref:`VMA policies <vma_policy>` section,
|
||||||
|
allocations of page cache pages for regular files mmap()ed
|
||||||
|
with MAP_SHARED ignore any VMA policy installed on the virtual
|
||||||
|
address range backed by the shared file mapping. Rather,
|
||||||
|
shared page cache pages, including pages backing private
|
||||||
|
mappings that have not yet been written by the task, follow
|
||||||
|
task policy, if any, else System Default Policy.
|
||||||
|
|
||||||
|
The shared policy infrastructure supports different policies on subset
|
||||||
|
ranges of the shared object. However, Linux still splits the VMA of
|
||||||
|
the task that installs the policy for each range of distinct policy.
|
||||||
|
Thus, different tasks that attach to a shared memory segment can have
|
||||||
|
different VMA configurations mapping that one shared object. This
|
||||||
|
can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
|
||||||
|
a shared memory region, when one task has installed shared policy on
|
||||||
|
one or more ranges of the region.
|
||||||
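The fallback order between the VMA, task and system default scopes can be summarized with a toy model. This is purely illustrative and not how the kernel represents policies: policies are plain strings here, ``None`` stands for "no policy installed at this scope", and shared policies are left out for brevity.

```python
def effective_policy(vma_policy=None, task_policy=None,
                     system_default="local allocation"):
    """Pick the most specific policy that is actually installed."""
    for policy in (vma_policy, task_policy):
        if policy is not None:
            return policy
    # Neither scope defines a policy: fall back to the system default.
    return system_default
```

For example, a range with no VMA policy inside a task that installed an interleave policy resolves to the task's interleave policy.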
|
|
||||||
|
Components of Memory Policies
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
|
A NUMA memory policy consists of a "mode", optional mode flags, and
|
||||||
|
an optional set of nodes. The mode determines the behavior of the
|
||||||
|
policy, the optional mode flags determine the behavior of the mode,
|
||||||
|
and the optional set of nodes can be viewed as the arguments to the
|
||||||
|
policy behavior.
|
||||||
|
|
||||||
|
Internally, memory policies are implemented by a reference counted
|
||||||
|
structure, struct mempolicy. Details of this structure will be
|
||||||
|
discussed in context, below, as required to explain the behavior.
|
||||||
|
|
||||||
|
NUMA memory policy supports the following 4 behavioral modes:
|
||||||
|
|
||||||
|
Default Mode--MPOL_DEFAULT
|
||||||
|
This mode is only used in the memory policy APIs. Internally,
|
||||||
|
MPOL_DEFAULT is converted to the NULL memory policy in all
|
||||||
|
policy scopes. Any existing non-default policy will simply be
|
||||||
|
removed when MPOL_DEFAULT is specified. As a result,
|
||||||
|
MPOL_DEFAULT means "fall back to the next most specific policy
|
||||||
|
scope."
|
||||||
|
|
||||||
|
For example, a NULL or default task policy will fall back to the
|
||||||
|
system default policy. A NULL or default vma policy will fall
|
||||||
|
back to the task policy.
|
||||||
|
|
||||||
|
When specified in one of the memory policy APIs, the Default mode
|
||||||
|
does not use the optional set of nodes.
|
||||||
|
|
||||||
|
It is an error for the set of nodes specified for this policy to
|
||||||
|
be non-empty.
|
||||||
|
|
||||||
|
MPOL_BIND
|
||||||
|
This mode specifies that memory must come from the set of
|
||||||
|
nodes specified by the policy. Memory will be allocated from
|
||||||
|
the node in the set with sufficient free memory that is
|
||||||
|
closest to the node where the allocation takes place.
|
||||||
|
|
||||||
|
MPOL_PREFERRED
|
||||||
|
This mode specifies that the allocation should be attempted
|
||||||
|
from the single node specified in the policy. If that
|
||||||
|
allocation fails, the kernel will search other nodes, in order
|
||||||
|
of increasing distance from the preferred node based on
|
||||||
|
information provided by the platform firmware.
|
||||||
|
|
||||||
|
Internally, the Preferred policy uses a single node--the
|
||||||
|
preferred_node member of struct mempolicy. When the internal
|
||||||
|
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
|
||||||
|
and the policy is interpreted as local allocation. "Local"
|
||||||
|
allocation policy can be viewed as a Preferred policy that
|
||||||
|
starts at the node containing the cpu where the allocation
|
||||||
|
takes place.
|
||||||
|
|
||||||
|
It is possible for the user to specify that local allocation
|
||||||
|
is always preferred by passing an empty nodemask with this
|
||||||
|
mode. If an empty nodemask is passed, the policy cannot use
|
||||||
|
the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
|
||||||
|
described below.
|
||||||
|
|
||||||
|
MPOL_INTERLEAVE
|
||||||
|
This mode specifies that page allocations be interleaved, on a
|
||||||
|
page granularity, across the nodes specified in the policy.
|
||||||
|
This mode also behaves slightly differently, based on the
|
||||||
|
context where it is used:
|
||||||
|
|
||||||
|
For allocation of anonymous pages and shared memory pages,
|
||||||
|
Interleave mode indexes the set of nodes specified by the
|
||||||
|
policy using the page offset of the faulting address into the
|
||||||
|
segment [VMA] containing the address modulo the number of
|
||||||
|
nodes specified by the policy. It then attempts to allocate a
|
||||||
|
page, starting at the selected node, as if the node had been
|
||||||
|
specified by a Preferred policy or had been selected by a
|
||||||
|
local allocation. That is, allocation will follow the per
|
||||||
|
node zonelist.
|
||||||
|
|
||||||
|
For allocation of page cache pages, Interleave mode indexes
|
||||||
|
the set of nodes specified by the policy using a node counter
|
||||||
|
maintained per task. This counter wraps around to the lowest
|
||||||
|
specified node after it reaches the highest specified node.
|
||||||
|
This will tend to spread the pages out over the nodes
|
||||||
|
specified by the policy based on the order in which they are
|
||||||
|
allocated, rather than based on any page offset into an
|
||||||
|
address range or file. During system boot up, the temporary
|
||||||
|
interleaved system default policy works in this mode.
|
||||||
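The two ways Interleave mode picks a starting node can be modelled as follows. This is a simplified sketch with hypothetical names: it only selects the node where the allocation starts and does not model the fallback along the node's zonelist.

```python
def interleave_node_by_offset(nodes, page_offset):
    """Anonymous and shared memory: index the node set by page offset."""
    ordered = sorted(nodes)
    return ordered[page_offset % len(ordered)]


class TaskInterleaveCounter:
    """Page cache: a per-task counter that walks the node set in order."""

    def __init__(self, nodes):
        self.ordered = sorted(nodes)
        self.next_index = 0

    def next_node(self):
        node = self.ordered[self.next_index]
        # Wrap around to the lowest node after the highest one.
        self.next_index = (self.next_index + 1) % len(self.ordered)
        return node
```

With nodes 1, 3 and 5, page offset 4 selects node 3 in the first scheme, whereas the counter yields 1, 3, 5, 1 and so on regardless of any offset.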
|
|
||||||
|
NUMA memory policy supports the following optional mode flags:
|
||||||
|
|
||||||
|
MPOL_F_STATIC_NODES
|
||||||
|
This flag specifies that the nodemask passed by
|
||||||
|
the user should not be remapped if the task or VMA's set of allowed
|
||||||
|
nodes changes after the memory policy has been defined.
|
||||||
|
|
||||||
|
Without this flag, any time a mempolicy is rebound because of a
|
||||||
|
change in the set of allowed nodes, the node (Preferred) or
|
||||||
|
nodemask (Bind, Interleave) is remapped to the new set of
|
||||||
|
allowed nodes. This may result in nodes being used that were
|
||||||
|
previously undesired.
|
||||||
|
|
||||||
|
With this flag, if the user-specified nodes overlap with the
|
||||||
|
nodes allowed by the task's cpuset, then the memory policy is
|
||||||
|
applied to their intersection. If the two sets of nodes do not
|
||||||
|
overlap, the Default policy is used.
|
||||||
|
|
||||||
|
For example, consider a task that is attached to a cpuset with
|
||||||
|
mems 1-3 that sets an Interleave policy over the same set. If
|
||||||
|
the cpuset's mems change to 3-5, the Interleave will now occur
|
||||||
|
over nodes 3, 4, and 5. With this flag, however, since only node
|
||||||
|
3 is allowed from the user's nodemask, the "interleave" only
|
||||||
|
occurs over that node. If no nodes from the user's nodemask are
|
||||||
|
now allowed, the Default behavior is used.
|
||||||
|
|
||||||
|
MPOL_F_STATIC_NODES cannot be combined with the
|
||||||
|
MPOL_F_RELATIVE_NODES flag. It also cannot be used for
|
||||||
|
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||||
|
(local allocation).
|
||||||
|
|
||||||
|
MPOL_F_RELATIVE_NODES
|
||||||
|
This flag specifies that the nodemask passed
|
||||||
|
by the user will be mapped relative to the task or VMA's
|
||||||
|
set of allowed nodes. The kernel stores the user-passed nodemask,
|
||||||
|
and if the allowed nodes change, then that original nodemask will
|
||||||
|
be remapped relative to the new set of allowed nodes.
|
||||||
|
|
||||||
|
Without this flag (and without MPOL_F_STATIC_NODES), anytime a
|
||||||
|
mempolicy is rebound because of a change in the set of allowed
|
||||||
|
nodes, the node (Preferred) or nodemask (Bind, Interleave) is
|
||||||
|
remapped to the new set of allowed nodes. That remap may not
|
||||||
|
preserve the relative nature of the user's passed nodemask to its
|
||||||
|
set of allowed nodes upon successive rebinds: a nodemask of
|
||||||
|
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
|
||||||
|
allowed nodes is restored to its original state.
|
||||||
|
|
||||||
|
With this flag, the remap is done so that the node numbers from
|
||||||
|
the user's passed nodemask are relative to the set of allowed
|
||||||
|
nodes. In other words, if nodes 0, 2, and 4 are set in the user's
|
||||||
|
nodemask, the policy will be effected over the first (and in the
|
||||||
|
Bind or Interleave case, the third and fifth) nodes in the set of
|
||||||
|
allowed nodes. The nodemask passed by the user represents nodes
|
||||||
|
relative to the task or VMA's set of allowed nodes.
|
||||||
|
|
||||||
|
If the user's nodemask includes nodes that are outside the range
|
||||||
|
of the new set of allowed nodes (for example, node 5 is set in
|
||||||
|
the user's nodemask when the set of allowed nodes is only 0-3),
|
||||||
|
then the remap wraps around to the beginning of the nodemask and,
|
||||||
|
if not already set, sets the node in the mempolicy nodemask.
|
||||||
|
|
||||||
|
For example, consider a task that is attached to a cpuset with
|
||||||
|
mems 2-5 that sets an Interleave policy over the same set with
|
||||||
|
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
|
||||||
|
interleave now occurs over nodes 3,5-7. If the cpuset's mems
|
||||||
|
then change to 0,2-3,5, then the interleave occurs over nodes
|
||||||
|
0,2-3,5.
|
||||||
|
|
||||||
|
Thanks to the consistent remapping, applications preparing
|
||||||
|
nodemasks to specify memory policies using this flag should
|
||||||
|
disregard their current, actual cpuset imposed memory placement
|
||||||
|
and prepare the nodemask as if they were always located on
|
||||||
|
memory nodes 0 to N-1, where N is the number of memory nodes the
|
||||||
|
policy is intended to manage. Let the kernel then remap to the
|
||||||
|
set of memory nodes allowed by the task's cpuset, as that may
|
||||||
|
change over time.
|
||||||
|
|
||||||
|
MPOL_F_RELATIVE_NODES cannot be combined with the
|
||||||
|
MPOL_F_STATIC_NODES flag. It also cannot be used for
|
||||||
|
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||||
|
(local allocation).
|
||||||
|
|
||||||
|
Memory Policy Reference Counting
|
||||||
|
================================
|
||||||
|
|
||||||
|
To resolve use/free races, struct mempolicy contains an atomic reference
|
||||||
|
count field. Internal interfaces, mpol_get()/mpol_put() increment and
|
||||||
|
decrement this reference count, respectively. mpol_put() will only free
|
||||||
|
the structure back to the mempolicy kmem cache when the reference count
|
||||||
|
goes to zero.
|
||||||
|
|
||||||
|
When a new memory policy is allocated, its reference count is initialized
|
||||||
|
to '1', representing the reference held by the task that is installing the
|
||||||
|
new policy. When a pointer to a memory policy structure is stored in another
|
||||||
|
structure, another reference is added, as the task's reference will be dropped
|
||||||
|
on completion of the policy installation.
|
||||||
|
|
||||||
|
During run-time "usage" of the policy, we attempt to minimize atomic operations
|
||||||
|
on the reference count, as this can lead to cache lines bouncing between cpus
|
||||||
|
and NUMA nodes. "Usage" here means one of the following:
|
||||||
|
|
||||||
|
1) querying of the policy, either by the task itself [using the get_mempolicy()
|
||||||
|
API discussed below] or by another task using the /proc/<pid>/numa_maps
|
||||||
|
interface.
|
||||||
|
|
||||||
|
2) examination of the policy to determine the policy mode and associated node
|
||||||
|
or node lists, if any, for page allocation. This is considered a "hot
|
||||||
|
path". Note that for MPOL_BIND, the "usage" extends across the entire
|
||||||
|
allocation process, which may sleep during page reclamation, because the
|
||||||
|
BIND policy nodemask is used, by reference, to filter ineligible nodes.
|
||||||
|
|
||||||
|
We can avoid taking an extra reference during the usages listed above as
|
||||||
|
follows:
|
||||||
|
|
||||||
|
1) we never need to get/free the system default policy as this is never
|
||||||
|
changed nor freed, once the system is up and running.
|
||||||
|
|
||||||
|
2) for querying the policy, we do not need to take an extra reference on the
|
||||||
|
target task's task policy nor vma policies because we always acquire the
|
||||||
|
task's mm's mmap_sem for read during the query. The set_mempolicy() and
|
||||||
|
mbind() APIs [see below] always acquire the mmap_sem for write when
|
||||||
|
installing or replacing task or vma policies. Thus, there is no possibility
|
||||||
|
of a task or thread freeing a policy while another task or thread is
|
||||||
|
querying it.
|
||||||
|
|
||||||
|
3) Page allocation usage of task or vma policy occurs in the fault path where
|
||||||
|
we hold the mmap_sem for read. Again, because replacing the task or vma
|
||||||
|
policy requires that the mmap_sem be held for write, the policy can't be
|
||||||
|
freed out from under us while we're using it for page allocation.
|
||||||
|
|
||||||
|
4) Shared policies require special consideration. One task can replace a
|
||||||
|
shared memory policy while another task, with a distinct mmap_sem, is
|
||||||
|
querying or allocating a page based on the policy. To resolve this
|
||||||
|
potential race, the shared policy infrastructure adds an extra reference
|
||||||
|
to the shared policy during lookup while holding a spin lock on the shared
|
||||||
|
policy management structure. This requires that we drop this extra
|
||||||
|
reference when we're finished "using" the policy. We must drop the
|
||||||
|
extra reference on shared policies in the same query/allocation paths
|
||||||
|
used for non-shared policies. For this reason, shared policies are marked
|
||||||
|
as such, and the extra reference is dropped "conditionally"--i.e., only
|
||||||
|
for shared policies.
|
||||||
|
|
||||||
|
Because of this extra reference counting, and because we must lookup
|
||||||
|
shared policies in a tree structure under spinlock, shared policies are
|
||||||
|
more expensive to use in the page allocation path. This is especially
|
||||||
|
true for shared policies on shared memory regions shared by tasks running
|
||||||
|
on different NUMA nodes. This extra overhead can be avoided by always
|
||||||
|
falling back to task or system default policy for shared memory regions,
|
||||||
|
or by prefaulting the entire shared memory region into memory and locking
|
||||||
|
it down. However, this might not be appropriate for all applications.
|
||||||
|
|
||||||
|
.. _memory_policy_apis:
|
||||||
|
|
||||||
|
Memory Policy APIs
|
||||||
|
==================
|
||||||
|
|
||||||
|
Linux supports 3 system calls for controlling memory policy. These APIs
|
||||||
|
always affect only the calling task, the calling task's address space, or
|
||||||
|
some shared object mapped into the calling task's address space.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
the headers that define these APIs and the parameter data types for
|
||||||
|
user space applications reside in a package that is not part of the
|
||||||
|
Linux kernel. The kernel system call interfaces, with the 'sys\_'
|
||||||
|
prefix, are defined in <linux/syscalls.h>; the mode and flag
|
||||||
|
definitions are defined in <linux/mempolicy.h>.
|
||||||
|
|
||||||
|
Set [Task] Memory Policy::
|
||||||
|
|
||||||
|
long set_mempolicy(int mode, const unsigned long *nmask,
|
||||||
|
unsigned long maxnode);
|
||||||
|
|
||||||
|
Sets the calling task's "task/process memory policy" to the mode
|
||||||
|
specified by the 'mode' argument and the set of nodes defined by
|
||||||
|
'nmask'. 'nmask' points to a bit mask of node ids containing at least
|
||||||
|
'maxnode' ids. Optional mode flags may be passed by combining the
|
||||||
|
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
|
||||||
|
MPOL_F_STATIC_NODES).
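
As a rough illustration of the arguments described above, the following Python sketch builds the node bit mask as an array of unsigned-long-sized words and combines a mode with an optional flag. The helper names are hypothetical, the numeric constants are assumed to match ``<linux/mempolicy.h>`` and should be checked against that header, and the system call itself is deliberately not invoked here.

```python
import ctypes

# Assumed values; verify against <linux/mempolicy.h> before relying on them.
MPOL_INTERLEAVE = 3
MPOL_F_RELATIVE_NODES = 1 << 14
MPOL_F_STATIC_NODES = 1 << 15

BITS_PER_LONG = ctypes.sizeof(ctypes.c_ulong) * 8


def nodemask_words(nodes, maxnode):
    """Return the mask as a list of words wide enough to hold maxnode ids."""
    nwords = (maxnode + BITS_PER_LONG - 1) // BITS_PER_LONG
    words = [0] * nwords
    for node in nodes:
        if not 0 <= node < maxnode:
            raise ValueError("node %d does not fit in maxnode %d" % (node, maxnode))
        words[node // BITS_PER_LONG] |= 1 << (node % BITS_PER_LONG)
    return words


def mode_with_flags(mode, flags=0):
    """OR optional mode flags into the mode argument."""
    both = MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES
    if flags & both == both:
        # The text above states that these two flags cannot be combined.
        raise ValueError("STATIC_NODES and RELATIVE_NODES are mutually exclusive")
    return mode | flags
```

For example, ``nodemask_words([0, 2, 4], BITS_PER_LONG)`` yields a single word with bits 0, 2 and 4 set, and ``mode_with_flags(MPOL_INTERLEAVE, MPOL_F_STATIC_NODES)`` gives the value that would be passed as 'mode'.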
|
||||||
|
|
||||||
|
See the set_mempolicy(2) man page for more details.
|
||||||
|
|
||||||
|
|
||||||
|
Get [Task] Memory Policy or Related Information::
|
||||||
|
|
||||||
|
long get_mempolicy(int *mode,
|
||||||
|
const unsigned long *nmask, unsigned long maxnode,
|
||||||
|
void *addr, int flags);
|
||||||
|
|
||||||
|
Queries the "task/process memory policy" of the calling task, or the
|
||||||
|
policy or location of a specified virtual address, depending on the
|
||||||
|
'flags' argument.
|
||||||
|
|
||||||
|
See the get_mempolicy(2) man page for more details.
|
||||||
|
|
||||||
|
|
||||||
|
Install VMA/Shared Policy for a Range of Task's Address Space::
|
||||||
|
|
||||||
|
long mbind(void *start, unsigned long len, int mode,
|
||||||
|
const unsigned long *nmask, unsigned long maxnode,
|
||||||
|
unsigned flags);
|
||||||
|
|
||||||
|
mbind() installs the policy specified by (mode, nmask, maxnode) as a
|
||||||
|
VMA policy for the range of the calling task's address space specified
|
||||||
|
by the 'start' and 'len' arguments. Additional actions may be
|
||||||
|
requested via the 'flags' argument.
|
||||||
|
|
||||||
|
See the mbind(2) man page for more details.
|
||||||
|
|
||||||
|
Memory Policy Command Line Interface
|
||||||
|
====================================
|
||||||
|
|
||||||
|
Although not strictly part of the Linux implementation of memory policy,
|
||||||
|
a command line tool, numactl(8), exists that allows one to:
|
||||||
|
|
||||||
|
+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
|
||||||
|
exec(2)
|
||||||
|
|
||||||
|
+ set the shared policy for a shared memory segment via mbind(2)
|
||||||
|
|
||||||
|
The numactl(8) tool is packaged with the run-time version of the library
|
||||||
|
containing the memory policy system call wrappers. Some distributions
|
||||||
|
package the headers and compile-time libraries in a separate development
|
||||||
|
package.
|
||||||
|
|
||||||
|
.. _mem_pol_and_cpusets:
|
||||||
|
|
||||||
|
Memory Policies and cpusets
|
||||||
|
===========================
|
||||||
|
|
||||||
|
Memory policies work within cpusets as described above. For memory policies
|
||||||
|
that require a node or set of nodes, the nodes are restricted to the set of
|
||||||
|
nodes whose memories are allowed by the cpuset constraints. If the nodemask
|
||||||
|
specified for the policy contains nodes that are not allowed by the cpuset and
|
||||||
|
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
|
||||||
|
specified for the policy and the set of nodes with memory is used. If the
|
||||||
|
result is the empty set, the policy is considered invalid and cannot be
|
||||||
|
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
|
||||||
|
onto and folded into the task's set of allowed nodes as previously described.
|
||||||
|
|
||||||
|
The interaction of memory policies and cpusets can be problematic when tasks
|
||||||
|
in two cpusets share access to a memory region, such as shared memory segments
|
||||||
|
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If
|
||||||
|
any of the tasks install shared policy on the region, only nodes whose
|
||||||
|
memories are allowed in both cpusets may be used in the policies. Obtaining
|
||||||
|
this information requires "stepping outside" the memory policy APIs to use the
|
||||||
|
cpuset information and requires that one know in what cpusets other tasks might
|
||||||
|
be attaching to the shared region. Furthermore, if the cpusets' allowed
|
||||||
|
memory sets are disjoint, "local" allocation is the only valid policy.
|
||||||
@@ -1,21 +1,25 @@
|
|||||||
pagemap, from the userspace perspective
|
.. _pagemap:
|
||||||
---------------------------------------
|
|
||||||
|
=============================
|
||||||
|
Examining Process Page Tables
|
||||||
|
=============================
|
||||||
|
|
||||||
pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
|
pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
|
||||||
userspace programs to examine the page tables and related information by
|
userspace programs to examine the page tables and related information by
|
||||||
reading files in /proc.
|
reading files in ``/proc``.
|
||||||
|
|
||||||
There are four components to pagemap:
|
There are four components to pagemap:
|
||||||
|
|
||||||
* /proc/pid/pagemap. This file lets a userspace process find out which
|
* ``/proc/pid/pagemap``. This file lets a userspace process find out which
|
||||||
physical frame each virtual page is mapped to. It contains one 64-bit
|
physical frame each virtual page is mapped to. It contains one 64-bit
|
||||||
value for each virtual page, containing the following data (from
|
value for each virtual page, containing the following data (from
|
||||||
fs/proc/task_mmu.c, above pagemap_read):
|
``fs/proc/task_mmu.c``, above pagemap_read):
|
||||||
|
|
||||||
* Bits 0-54 page frame number (PFN) if present
|
* Bits 0-54 page frame number (PFN) if present
|
||||||
* Bits 0-4 swap type if swapped
|
* Bits 0-4 swap type if swapped
|
||||||
* Bits 5-54 swap offset if swapped
|
* Bits 5-54 swap offset if swapped
|
||||||
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
|
* Bit 55 pte is soft-dirty (see
|
||||||
|
:ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
|
||||||
* Bit 56 page exclusively mapped (since 4.2)
|
* Bit 56 page exclusively mapped (since 4.2)
|
||||||
* Bits 57-60 zero
|
* Bits 57-60 zero
|
||||||
* Bit 61 page is file-page or shared-anon (since 3.5)
|
* Bit 61 page is file-page or shared-anon (since 3.5)
|
||||||
@@ -33,17 +37,17 @@ There are four components to pagemap:
|
|||||||
precisely which pages are mapped (or in swap) and comparing mapped
|
precisely which pages are mapped (or in swap) and comparing mapped
|
||||||
pages between processes.
|
pages between processes.
|
||||||
|
|
||||||
Efficient users of this interface will use /proc/pid/maps to
|
Efficient users of this interface will use ``/proc/pid/maps`` to
|
||||||
determine which areas of memory are actually mapped and llseek to
|
determine which areas of memory are actually mapped and llseek to
|
||||||
skip over unmapped regions.
|
skip over unmapped regions.
|
||||||
|
|
||||||
* /proc/kpagecount. This file contains a 64-bit count of the number of
|
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
|
||||||
times each page is mapped, indexed by PFN.
|
times each page is mapped, indexed by PFN.
|
||||||
|
|
||||||
* /proc/kpageflags. This file contains a 64-bit set of flags for each
|
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
|
||||||
page, indexed by PFN.
|
page, indexed by PFN.
|
||||||
|
|
||||||
The flags are (from fs/proc/page.c, above kpageflags_read):
|
The flags are (from ``fs/proc/page.c``, above kpageflags_read):
|
||||||
|
|
||||||
0. LOCKED
|
0. LOCKED
|
||||||
1. ERROR
|
1. ERROR
|
||||||
@@ -72,98 +76,111 @@ There are four components to pagemap:
|
|||||||
24. ZERO_PAGE
|
24. ZERO_PAGE
|
||||||
25. IDLE
|
25. IDLE
|
||||||
|
|
||||||
* /proc/kpagecgroup. This file contains a 64-bit inode number of the
|
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
|
||||||
memory cgroup each page is charged to, indexed by PFN. Only available when
|
memory cgroup each page is charged to, indexed by PFN. Only available when
|
||||||
CONFIG_MEMCG is set.
|
CONFIG_MEMCG is set.
|
||||||
|
|
||||||
Short descriptions to the page flags:
|
Short descriptions to the page flags
|
||||||
|
====================================
|
||||||
|
|
||||||
0. LOCKED
|
0 - LOCKED
|
||||||
page is being locked for exclusive access, eg. by undergoing read/write IO
|
page is being locked for exclusive access, e.g. by undergoing read/write IO
|
||||||
|
7 - SLAB
|
||||||
7. SLAB
|
|
||||||
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
|
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
|
||||||
When compound page is used, SLUB/SLQB will only set this flag on the head
|
When compound page is used, SLUB/SLQB will only set this flag on the head
|
||||||
page; SLOB will not flag it at all.
|
page; SLOB will not flag it at all.
|
||||||
|
10 - BUDDY
|
||||||
10. BUDDY
|
|
||||||
a free memory block managed by the buddy system allocator
|
a free memory block managed by the buddy system allocator
|
||||||
The buddy system organizes free memory in blocks of various orders.
|
The buddy system organizes free memory in blocks of various orders.
|
||||||
An order N block has 2^N physically contiguous pages, with the BUDDY flag
|
An order N block has 2^N physically contiguous pages, with the BUDDY flag
|
||||||
set for and _only_ for the first page.
|
set for and _only_ for the first page.
|
||||||
|
15 - COMPOUND_HEAD
|
||||||
15. COMPOUND_HEAD
|
|
||||||
16. COMPOUND_TAIL
|
|
||||||
A compound page with order N consists of 2^N physically contiguous pages.
|
A compound page with order N consists of 2^N physically contiguous pages.
|
||||||
A compound page with order 2 takes the form of "HTTT", where H denotes its
|
A compound page with order 2 takes the form of "HTTT", where H denotes its
|
||||||
head page and T denotes its tail page(s). The major consumers of compound
|
head page and T denotes its tail page(s). The major consumers of compound
|
||||||
pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
|
pages are hugeTLB pages
|
||||||
memory allocators and various device drivers. However in this interface,
|
(:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
|
||||||
only huge/giga pages are made visible to end users.
|
the SLUB etc. memory allocators and various device drivers.
|
||||||
17. HUGE
|
However in this interface, only huge/giga pages are made visible
|
||||||
|
to end users.
|
||||||
|
16 - COMPOUND_TAIL
|
||||||
|
A compound page tail (see description above).
|
||||||
|
17 - HUGE
|
||||||
this is an integral part of a HugeTLB page
|
this is an integral part of a HugeTLB page
|
||||||
|
19 - HWPOISON
|
||||||
19. HWPOISON
|
|
||||||
hardware detected memory corruption on this page: don't touch the data!
|
hardware detected memory corruption on this page: don't touch the data!
|
||||||
|
20 - NOPAGE
|
||||||
20. NOPAGE
|
|
||||||
no page frame exists at the requested address
|
no page frame exists at the requested address
|
||||||
|
21 - KSM
|
||||||
21. KSM
|
|
||||||
identical memory pages dynamically shared between one or more processes
|
identical memory pages dynamically shared between one or more processes
|
||||||
|
22 - THP
|
||||||
22. THP
|
|
||||||
contiguous pages which construct transparent hugepages
|
contiguous pages which construct transparent hugepages
|
||||||
|
23 - BALLOON
|
||||||
23. BALLOON
|
|
||||||
balloon compaction page
|
balloon compaction page
|
||||||
|
24 - ZERO_PAGE
|
||||||
24. ZERO_PAGE
|
|
||||||
zero page for pfn_zero or huge_zero page
|
zero page for pfn_zero or huge_zero page
|
||||||
|
25 - IDLE
|
||||||
25. IDLE
|
|
||||||
page has not been accessed since it was marked idle (see
|
page has not been accessed since it was marked idle (see
|
||||||
Documentation/vm/idle_page_tracking.txt). Note that this flag may be
|
:ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
|
||||||
stale in case the page was accessed via a PTE. To make sure the flag
|
Note that this flag may be stale in case the page was accessed via
|
||||||
is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first.
|
a PTE. To make sure the flag is up-to-date one has to read
|
||||||
|
``/sys/kernel/mm/page_idle/bitmap`` first.
|
||||||
|
|
||||||
[IO related page flags]
|
IO related page flags
|
||||||
1. ERROR IO error occurred
|
---------------------
|
||||||
3. UPTODATE page has up-to-date data
|
|
||||||
|
1 - ERROR
|
||||||
|
IO error occurred
|
||||||
|
3 - UPTODATE
|
||||||
|
page has up-to-date data
|
||||||
ie. for file backed page: (in-memory data revision >= on-disk one)
|
ie. for file backed page: (in-memory data revision >= on-disk one)
|
||||||
4. DIRTY page has been written to, hence contains new data
|
4 - DIRTY
|
||||||
ie. for file backed page: (in-memory data revision > on-disk one)
|
page has been written to, hence contains new data
|
||||||
8. WRITEBACK page is being synced to disk
|
i.e. for file backed page: (in-memory data revision > on-disk one)
|
||||||
|
8 - WRITEBACK
|
||||||
|
page is being synced to disk
|
||||||
|
|
||||||
[LRU related page flags]
|
LRU related page flags
|
||||||
5. LRU page is in one of the LRU lists
|
----------------------
|
||||||
6. ACTIVE page is in the active LRU list
|
|
||||||
18. UNEVICTABLE page is in the unevictable (non-)LRU list
|
5 - LRU
|
||||||
It is somehow pinned and not a candidate for LRU page reclaims,
|
page is in one of the LRU lists
|
||||||
eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
|
6 - ACTIVE
|
||||||
2. REFERENCED page has been referenced since last LRU list enqueue/requeue
|
page is in the active LRU list
|
||||||
9. RECLAIM page will be reclaimed soon after its pageout IO completed
|
18 - UNEVICTABLE
|
||||||
11. MMAP a memory mapped page
|
page is in the unevictable (non-)LRU list. It is somehow pinned and
|
||||||
12. ANON a memory mapped page that is not part of a file
|
not a candidate for LRU page reclaims, e.g. ramfs pages,
|
||||||
13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry
|
shmctl(SHM_LOCK) and mlock() memory segments
|
||||||
14. SWAPBACKED page is backed by swap/RAM
|
2 - REFERENCED
|
||||||
|
page has been referenced since last LRU list enqueue/requeue
|
||||||
|
9 - RECLAIM
|
||||||
|
page will be reclaimed soon after its pageout IO completed
|
||||||
|
11 - MMAP
|
||||||
|
a memory mapped page
|
||||||
|
12 - ANON
|
||||||
|
a memory mapped page that is not part of a file
|
||||||
|
13 - SWAPCACHE
|
||||||
|
page is mapped to swap space, i.e. has an associated swap entry
|
||||||
|
14 - SWAPBACKED
|
||||||
|
page is backed by swap/RAM
|
||||||
|
|
||||||
The page-types tool in the tools/vm directory can be used to query the
|
The page-types tool in the tools/vm directory can be used to query the
|
||||||
above flags.
|
above flags.
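
For readers who prefer to decode the flags themselves, here is a minimal Python sketch that turns one 64-bit word from ``/proc/kpageflags`` into the flag names listed in this document. The helper names are hypothetical; the bit positions are the ones given in the list above.

```python
# Flag names indexed by bit number, as listed in this document (bits 0-25).
KPF_NAMES = [
    "LOCKED", "ERROR", "REFERENCED", "UPTODATE", "DIRTY", "LRU", "ACTIVE",
    "SLAB", "WRITEBACK", "RECLAIM", "BUDDY", "MMAP", "ANON", "SWAPCACHE",
    "SWAPBACKED", "COMPOUND_HEAD", "COMPOUND_TAIL", "HUGE", "UNEVICTABLE",
    "HWPOISON", "NOPAGE", "KSM", "THP", "BALLOON", "ZERO_PAGE", "IDLE",
]


def decode_kpageflags(word):
    """Return the names of all documented flags set in a kpageflags word."""
    return [name for bit, name in enumerate(KPF_NAMES) if (word >> bit) & 1]


def kpage_offset(pfn):
    """Byte offset of the 64-bit entry for pfn in kpageflags or kpagecount."""
    return pfn * 8
```

Reading the word itself means seeking to ``kpage_offset(pfn)`` in the file and reading 8 bytes, which normally requires root privileges.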
|
||||||
|
|
||||||
Using pagemap to do something useful:
|
Using pagemap to do something useful
|
||||||
|
====================================
|
||||||
|
|
||||||
The general procedure for using pagemap to find out about a process' memory
|
The general procedure for using pagemap to find out about a process' memory
|
||||||
usage goes like this:
|
usage goes like this:
|
||||||
|
|
||||||
1. Read /proc/pid/maps to determine which parts of the memory space are
|
1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
|
||||||
mapped to what.
|
mapped to what.
|
||||||
2. Select the maps you are interested in -- all of them, or a particular
|
2. Select the maps you are interested in -- all of them, or a particular
|
||||||
library, or the stack or the heap, etc.
|
library, or the stack or the heap, etc.
|
||||||
3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
|
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
|
||||||
4. Read a u64 for each page from pagemap.
|
4. Read a u64 for each page from pagemap.
|
||||||
5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
|
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
|
||||||
read, seek to that entry in the file, and read the data you want.
|
just read, seek to that entry in the file, and read the data you want.
|
||||||
|
|
||||||
For example, to find the "unique set size" (USS), which is the amount of
|
For example, to find the "unique set size" (USS), which is the amount of
|
||||||
memory that a process is using that is not shared with any other process,
|
memory that a process is using that is not shared with any other process,
|
||||||
@@ -171,7 +188,8 @@ you can go through every map in the process, find the PFNs, look those up
|
|||||||
in kpagecount, and tally up the number of pages that are only referenced
|
in kpagecount, and tally up the number of pages that are only referenced
|
||||||
once.
|
once.
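
The procedure above might be sketched in Python as follows. This is an illustration under stated assumptions, not a reference tool: the helper names are hypothetical, the page size is assumed to be 4096 bytes, and bits 62 (swapped) and 63 (present) are taken from the comment in ``fs/proc/task_mmu.c`` rather than from the bit list excerpted in this diff.

```python
import struct

ENTRY_SIZE = 8                      # one 64-bit value per virtual page
PFN_MASK = (1 << 55) - 1            # bits 0-54


def pagemap_offset(vaddr, page_size=4096):
    """Byte offset in /proc/pid/pagemap of the entry covering vaddr."""
    return (vaddr // page_size) * ENTRY_SIZE


def decode_pagemap_entry(raw):
    """Decode the 8 bytes read from pagemap for one virtual page."""
    (value,) = struct.unpack("=Q", raw)          # native byte order
    entry = {
        "present": bool((value >> 63) & 1),      # assumed: bit 63
        "swapped": bool((value >> 62) & 1),      # assumed: bit 62
        "file_or_shared_anon": bool((value >> 61) & 1),
        "exclusive": bool((value >> 56) & 1),
        "soft_dirty": bool((value >> 55) & 1),
    }
    if entry["present"]:
        entry["pfn"] = value & PFN_MASK
    elif entry["swapped"]:
        entry["swap_type"] = value & 0x1F                     # bits 0-4
        entry["swap_offset"] = (value >> 5) & ((1 << 50) - 1)  # bits 5-54
    return entry


def unique_set_pages(pfns, mapcount_of):
    """Count pages mapped exactly once; mapcount_of(pfn) is a caller-supplied
    lookup, for example one backed by reads of /proc/kpagecount."""
    return sum(1 for pfn in pfns if mapcount_of(pfn) == 1)
```

Multiplying the result of ``unique_set_pages()`` by the page size gives the USS in bytes. Keeping the kpagecount lookup as a callback lets the same tally run against a real file or against canned data.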
|
||||||
|
|
||||||
Other notes:
|
Other notes
|
||||||
|
===========
|
||||||
|
|
||||||
Reading from any of the files will return -EINVAL if you are not starting
|
Reading from any of the files will return -EINVAL if you are not starting
|
||||||
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
|
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
|
||||||
@@ -1,18 +1,22 @@
|
|||||||
SOFT-DIRTY PTEs
|
.. _soft_dirty:
|
||||||
|
|
||||||
|
===============
|
||||||
|
Soft-Dirty PTEs
|
||||||
|
===============
|
||||||
|
|
||||||
The soft-dirty is a bit on a PTE which helps to track which pages a task
|
The soft-dirty is a bit on a PTE which helps to track which pages a task
|
||||||
writes to. In order to do this tracking one should
|
writes to. In order to do this tracking one should
|
||||||
|
|
||||||
1. Clear soft-dirty bits from the task's PTEs.
|
1. Clear soft-dirty bits from the task's PTEs.
|
||||||
|
|
||||||
This is done by writing "4" into the /proc/PID/clear_refs file of the
|
This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
|
||||||
task in question.
|
task in question.
|
||||||
|
|
||||||
2. Wait some time.
|
2. Wait some time.
|
||||||
|
|
||||||
3. Read soft-dirty bits from the PTEs.
|
3. Read soft-dirty bits from the PTEs.
|
||||||
|
|
||||||
This is done by reading from the /proc/PID/pagemap. The bit 55 of the
|
This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
|
||||||
64-bit qword is the soft-dirty one. If set, the respective PTE was
|
64-bit qword is the soft-dirty one. If set, the respective PTE was
|
||||||
written to since step 1.
|
written to since step 1.
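
Step 3 can be illustrated with a short Python sketch that scans a buffer of pagemap entries and reports which pages have bit 55 set. The function name is hypothetical, and the buffer is assumed to have been read from ``/proc/PID/pagemap`` starting at an 8-byte-aligned offset.

```python
import struct

SOFT_DIRTY_BIT = 55


def soft_dirty_pages(pagemap_bytes, first_page=0):
    """Return the page indexes whose pagemap entry has the soft-dirty bit set.

    pagemap_bytes holds consecutive 64-bit entries; first_page is the
    virtual page number that the first entry describes.
    """
    count = len(pagemap_bytes) // 8
    values = struct.unpack("=%dQ" % count, pagemap_bytes[:count * 8])
    return [first_page + i
            for i, value in enumerate(values)
            if (value >> SOFT_DIRTY_BIT) & 1]
```

A tracker would write "4" to ``clear_refs``, wait, read the relevant slice of pagemap, and pass it to ``soft_dirty_pages()`` to learn which pages were written in the interval.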
|
||||||
|
|
||||||
418
Documentation/admin-guide/mm/transhuge.rst
Normal file
@@ -0,0 +1,418 @@
|
|||||||
|
.. _admin_guide_transhuge:
|
||||||
|
|
||||||
|
============================
|
||||||
|
Transparent Hugepage Support
|
||||||
|
============================
|
||||||
|
|
||||||
|
Objective
|
||||||
|
=========
|
||||||
|
|
||||||
|
Performance critical computing applications dealing with large memory
|
||||||
|
working sets are already running on top of libhugetlbfs and in turn
|
||||||
|
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
|
||||||
|
using huge pages to back virtual memory
|
||||||
|
that supports the automatic promotion and demotion of page sizes and
|
||||||
|
without the shortcomings of hugetlbfs.
|
||||||
|
|
||||||
|
Currently THP only works for anonymous memory mappings and tmpfs/shmem.
|
||||||
|
But in the future it can expand to other filesystems.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
in the examples below we presume that the basic page size is 4K and
|
||||||
|
the huge page size is 2M, although the actual numbers may vary
|
||||||
|
depending on the CPU architecture.
|
||||||
|
|
||||||
|
The reason applications are running faster is because of two
|
||||||
|
factors. The first factor is almost completely irrelevant and it's not
|
||||||
|
of significant interest because it'll also have the downside of
|
||||||
|
requiring larger clear-page copy-page in page faults which is a
|
||||||
|
potentially negative effect. The first factor consists in taking a
|
||||||
|
single page fault for each 2M virtual region touched by userland (so
|
||||||
|
reducing the enter/exit kernel frequency by a 512 times factor). This
|
||||||
|
only matters the first time the memory is accessed for the lifetime of
|
||||||
|
a memory mapping. The second long lasting and much more important
|
||||||
|
factor will affect all subsequent accesses to the memory for the whole
|
||||||
|
runtime of the application. The second factor consists of two
|
||||||
|
components:
|
||||||
|
|
||||||
|
1) the TLB miss will run faster (especially with virtualization using
|
||||||
|
nested pagetables but almost always also on bare metal without
|
||||||
|
virtualization)
|
||||||
|
|
||||||
|
2) a single TLB entry will be mapping a much larger amount of virtual
|
||||||
|
memory in turn reducing the number of TLB misses. With
|
||||||
|
virtualization and nested pagetables the TLB can be mapped of
|
||||||
|
larger size only if both KVM and the Linux guest are using
|
||||||
|
hugepages but a significant speedup already happens if only one of
|
||||||
|
the two is using hugepages just because of the fact the TLB miss is
|
||||||
|
going to run faster.
|
||||||
|
|
||||||
|
THP can be enabled system wide or restricted to certain tasks or even
|
||||||
|
memory ranges inside task's address space. Unless THP is completely
|
||||||
|
disabled, the ``khugepaged`` daemon scans memory and
|
||||||
|
collapses sequences of basic pages into huge pages.
|
||||||
|
|
||||||
|
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
|
||||||
|
interface and using madvise(2) and prctl(2) system calls.
|
||||||
|
|
||||||
|
Transparent Hugepage Support maximizes the usefulness of free memory
|
||||||
|
if compared to the reservation approach of hugetlbfs by allowing all
|
||||||
|
unused memory to be used as cache or other movable (or even unmovable
|
||||||
|
entities). It doesn't require reservation to prevent hugepage
|
||||||
|
allocation failures to be noticeable from userland. It allows paging
|
||||||
|
and all other advanced VM features to be available on the
|
||||||
|
hugepages. It requires no modifications for applications to take
|
||||||
|
advantage of it.
|
||||||
|
|
||||||
|
Applications however can be further optimized to take advantage of
|
||||||
|
this feature, like for example they've been optimized before to avoid
|
||||||
|
a flood of mmap system calls for every malloc(4k). Optimizing userland
|
||||||
|
is by far not mandatory and khugepaged already can take care of long
|
||||||
|
lived page allocations even for hugepage unaware applications that
|
||||||
|
deal with large amounts of memory.
|
||||||
|
|
||||||
|
In certain cases when hugepages are enabled system wide, an application
|
||||||
|
may end up allocating more memory resources. An application may mmap a
|
||||||
|
large region but only touch 1 byte of it, in that case a 2M page might
|
||||||
|
be allocated instead of a 4k page for no good. This is why it's
|
||||||
|
possible to disable hugepages system-wide and to only have them inside
|
||||||
|
MADV_HUGEPAGE madvise regions.
|
||||||
|
|
||||||
|
Embedded systems should enable hugepages only inside madvise regions
|
||||||
|
to eliminate any risk of wasting any precious byte of memory and to
|
||||||
|
only run faster.
|
||||||
|
|
||||||
|
Applications that get a lot of benefit from hugepages and that don't
|
||||||
|
risk to lose memory by using hugepages, should use
|
||||||
|
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
|
||||||
|
|
||||||
|
.. _thp_sysfs:
|
||||||
|
|
||||||
|
sysfs
|
||||||
|
=====
|
||||||
|
|
||||||
|
Global THP controls
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
Transparent Hugepage Support for anonymous memory can be entirely disabled
|
||||||
|
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
|
||||||
|
regions (to avoid the risk of consuming more memory resources) or enabled
|
||||||
|
system wide. This can be achieved with one of::
|
||||||
|
|
||||||
|
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
||||||
|
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
||||||
|
echo never >/sys/kernel/mm/transparent_hugepage/enabled
|
||||||
|
|
||||||
|
It's also possible to limit defrag efforts in the VM to generate
|
||||||
|
anonymous hugepages in case they're not immediately free to madvise
|
||||||
|
regions or to never try to defrag memory and simply fallback to regular
|
||||||
|
pages unless hugepages are immediately available. Clearly if we spend CPU
|
||||||
|
time to defrag memory, we would expect to gain even more by the fact we
|
||||||
|
use hugepages later instead of regular pages. This isn't always
|
||||||
|
guaranteed, but it may be more likely in case the allocation is for a
|
||||||
|
MADV_HUGEPAGE region.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
||||||
|
|
||||||
|
always
|
||||||
|
means that an application requesting THP will stall on
|
||||||
|
allocation failure and directly reclaim pages and compact
|
||||||
|
memory in an effort to allocate a THP immediately. This may be
|
||||||
|
desirable for virtual machines that benefit heavily from THP
|
||||||
|
use and are willing to delay the VM start to utilise them.
|
||||||
|
|
||||||
|
defer
|
||||||
|
means that an application will wake kswapd in the background
|
||||||
|
to reclaim pages and wake kcompactd to compact memory so that
|
||||||
|
THP is available in the near future. It's the responsibility
|
||||||
|
of khugepaged to then install the THP pages later.
|
||||||
|
|
||||||
|
defer+madvise
|
||||||
|
will enter direct reclaim and compaction like ``always``, but
|
||||||
|
only for regions that have used madvise(MADV_HUGEPAGE); all
|
||||||
|
other regions will wake kswapd in the background to reclaim
|
||||||
|
pages and wake kcompactd to compact memory so that THP is
|
||||||
|
available in the near future.
|
||||||
|
|
||||||
|
madvise
|
||||||
|
will enter direct reclaim like ``always`` but only for regions
|
||||||
|
that have used madvise(MADV_HUGEPAGE). This is the default
|
||||||
|
behaviour.
|
||||||
|
|
||||||
|
never
|
||||||
|
should be self-explanatory.
|
||||||
|
|
||||||
|
By default the kernel tries to use the huge zero page on a read page fault to
an anonymous mapping. It's possible to disable the huge zero page by writing 0
|
||||||
|
or enable it back by writing 1::
|
||||||
|
|
||||||
|
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||||
|
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||||
|
|
||||||
|
Some userspace (such as a test program, or an optimized memory allocation
|
||||||
|
library) may want to know the size (in bytes) of a transparent hugepage::
|
||||||
|
|
||||||
|
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
|
||||||
|
|
||||||
|
khugepaged will be automatically started when
|
||||||
|
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
|
||||||
|
be automatically shut down if it's set to "never".
|
||||||
|
|
||||||
|
Khugepaged controls
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
khugepaged usually runs at low frequency, so while one may not want to
|
||||||
|
invoke defrag algorithms synchronously during the page faults, it
|
||||||
|
should be worth invoking defrag at least in khugepaged. However it's
|
||||||
|
also possible to disable defrag in khugepaged by writing 0 or enable
|
||||||
|
defrag in khugepaged by writing 1::
|
||||||
|
|
||||||
|
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||||
|
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||||
|
|
||||||
|
You can also control how many pages khugepaged should scan at each
|
||||||
|
pass::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
||||||
|
|
||||||
|
and how many milliseconds to wait in khugepaged between each pass (you
|
||||||
|
can set this to 0 to run khugepaged at 100% utilization of one core)::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
||||||
|
|
||||||
|
and how many milliseconds to wait in khugepaged if there's a hugepage
|
||||||
|
allocation failure to throttle the next allocation attempt::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
||||||
|
|
||||||
|
The khugepaged progress can be seen in the number of pages collapsed::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
||||||
|
|
||||||
|
and in the number of full scans completed, one for each pass::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
||||||
|
|
||||||
|
``max_ptes_none`` specifies how many extra small pages (that are
|
||||||
|
not already mapped) can be allocated when collapsing a group
|
||||||
|
of small pages into one large page::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
|
||||||
|
|
||||||
|
A higher value causes programs to use additional memory.
A lower value yields less THP performance gain. The value of
max_ptes_none has very little effect on CPU time, so that cost can be
ignored.
|
||||||
|
|
||||||
|
``max_ptes_swap`` specifies how many pages can be brought in from
|
||||||
|
swap when collapsing a group of pages into a transparent huge page::
|
||||||
|
|
||||||
|
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
|
||||||
|
|
||||||
|
A higher value can cause excessive swap IO and waste
|
||||||
|
memory. A lower value can prevent THPs from being
|
||||||
|
collapsed, resulting in fewer pages being collapsed into
|
||||||
|
THPs, and lower memory access performance.
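The khugepaged files described in this section can be collected in one pass. The sketch below assumes each file holds a single integer; the function name ``read_khugepaged`` and its ``base`` parameter are illustrative, and files that are missing or unreadable are reported as ``None`` rather than raising.

```python
import os

# Directory and file names taken from the text above.
KHUGEPAGED_DIR = "/sys/kernel/mm/transparent_hugepage/khugepaged"
KNOBS = (
    "defrag",
    "pages_to_scan",
    "scan_sleep_millisecs",
    "alloc_sleep_millisecs",
    "pages_collapsed",
    "full_scans",
    "max_ptes_none",
    "max_ptes_swap",
)


def read_khugepaged(base=KHUGEPAGED_DIR, names=KNOBS):
    """Return {knob: int or None} for the khugepaged sysfs files."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing file, no permission, or unexpected content.
            values[name] = None
    return values
```

Passing a different ``base`` makes it possible to try the helper against a scratch directory before pointing it at sysfs.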
|
||||||
|
|
||||||
|
Boot parameter
|
||||||
|
==============
|
||||||
|
|
||||||
|
You can change the sysfs boot time defaults of Transparent Hugepage
|
||||||
|
Support by passing the parameter ``transparent_hugepage=always`` or
|
||||||
|
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
|
||||||
|
to the kernel command line.
|
||||||
|
|
||||||
|
Hugepages in tmpfs/shmem
|
||||||
|
========================
|
||||||
|
|
||||||
|
You can control hugepage allocation policy in tmpfs with mount option
|
||||||
|
``huge=``. It can have the following values:
|
||||||
|
|
||||||
|
always
|
||||||
|
Attempt to allocate huge pages every time we need a new page;
|
||||||
|
|
||||||
|
never
|
||||||
|
Do not allocate huge pages;
|
||||||
|
|
||||||
|
within_size
|
||||||
|
Only allocate huge page if it will be fully within i_size.
|
||||||
|
Also respect fadvise()/madvise() hints;
|
||||||
|
|
||||||
|
advise
|
||||||
|
Only allocate huge pages if requested with fadvise()/madvise();
|
||||||
|
|
||||||
|
The default policy is ``never``.
|
||||||
|
|
||||||
|
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
|
||||||
|
``huge=never`` will not attempt to break up huge pages at all, just stop more
|
||||||
|
from being allocated.
|
||||||
|
|
||||||
|
There's also a sysfs knob to control the hugepage allocation policy for the internal
|
||||||
|
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
|
||||||
|
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
|
||||||
|
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
|
||||||
|
|
||||||
|
In addition to the policies listed above, shmem_enabled allows two further
|
||||||
|
values:
|
||||||
|
|
||||||
|
deny
|
||||||
|
For use in emergencies, to force the huge option off from
|
||||||
|
all mounts;
|
||||||
|
force
|
||||||
|
Force the huge option on for all - very useful for testing;
|
||||||
|
|
||||||
|
Need of application restart
|
||||||
|
===========================
|
||||||
|
|
||||||
|
The transparent_hugepage/enabled values and tmpfs mount option only affect
|
||||||
|
future behavior. So to make them effective you need to restart any
|
||||||
|
application that could have been using hugepages. This also applies to the
|
||||||
|
regions registered in khugepaged.
|
||||||
|
|
||||||
|
Monitoring usage
|
||||||
|
================
|
||||||
|
|
||||||
|
The number of anonymous transparent huge pages currently used by the
|
||||||
|
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
|
||||||
|
To identify what applications are using anonymous transparent huge pages,
|
||||||
|
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
|
||||||
|
for each mapping.
|
||||||
|
|
||||||
|
The number of file transparent huge pages mapped to userspace is available
|
||||||
|
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
||||||
|
To identify what applications are mapping file transparent huge pages, it
|
||||||
|
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
|
||||||
|
for each mapping.
|
||||||
|
|
||||||
|
Note that reading the smaps file is expensive and reading it
|
||||||
|
frequently will incur overhead.
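Counting the AnonHugePages fields of one process can be sketched as follows. This assumes the usual smaps line layout of ``AnonHugePages: <number> kB``; the function names are illustrative. Because reading smaps is expensive, the parsing is kept separate from the file read so that a single read can be reused.

```python
def anon_huge_kb(smaps_text):
    """Sum the AnonHugePages fields (in kB) over every mapping in smaps text."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("AnonHugePages:"):
            # Line layout: "AnonHugePages:      2048 kB"
            total += int(line.split()[1])
    return total


def anon_huge_kb_for_pid(pid):
    """Read /proc/PID/smaps once and return the AnonHugePages total in kB."""
    with open(f"/proc/{pid}/smaps") as f:
        return anon_huge_kb(f.read())
```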
|
||||||
|
|
||||||
|
There are a number of counters in ``/proc/vmstat`` that may be used to
|
||||||
|
monitor how successfully the system is providing huge pages for use.
|
||||||
|
|
||||||
|
thp_fault_alloc
|
||||||
|
is incremented every time a huge page is successfully
|
||||||
|
allocated to handle a page fault. This applies to both the
|
||||||
|
first time a page is faulted and for COW faults.
|
||||||
|
|
||||||
|
thp_collapse_alloc
|
||||||
|
is incremented by khugepaged when it has found
|
||||||
|
a range of pages to collapse into one huge page and has
|
||||||
|
successfully allocated a new huge page to store the data.
|
||||||
|
|
||||||
|
thp_fault_fallback
|
||||||
|
is incremented if a page fault fails to allocate
|
||||||
|
a huge page and instead falls back to using small pages.
|
||||||
|
|
||||||
|
thp_collapse_alloc_failed
|
||||||
|
is incremented if khugepaged found a range
|
||||||
|
of pages that should be collapsed into one huge page but failed
|
||||||
|
the allocation.
|
||||||
|
|
||||||
|
thp_file_alloc
|
||||||
|
is incremented every time a file huge page is successfully
|
||||||
|
allocated.
|
||||||
|
|
||||||
|
thp_file_mapped
|
||||||
|
is incremented every time a file huge page is mapped into
|
||||||
|
user address space.
|
||||||
|
|
||||||
|
thp_split_page
|
||||||
|
is incremented every time a huge page is split into base
|
||||||
|
pages. This can happen for a variety of reasons but a common
|
||||||
|
reason is that a huge page is old and is being reclaimed.
|
||||||
|
This action implies splitting all PMDs the page is mapped with.
|
||||||
|
|
||||||
|
thp_split_page_failed
|
||||||
|
is incremented if the kernel fails to split a huge
|
||||||
|
page. This can happen if the page was pinned by somebody.
|
||||||
|
|
||||||
|
thp_deferred_split_page
|
||||||
|
is incremented when a huge page is put onto the split
|
||||||
|
queue. This happens when a huge page is partially unmapped and
|
||||||
|
splitting it would free up some memory. Pages on the split queue are
|
||||||
|
going to be split under memory pressure.
|
||||||
|
|
||||||
|
thp_split_pmd
|
||||||
|
is incremented every time a PMD is split into a table of PTEs.
This can happen, for instance, when an application calls mprotect() or
munmap() on part of a huge page. It doesn't split the huge page, only the
page table entry.
|
||||||
|
|
||||||
|
thp_zero_page_alloc
|
||||||
|
is incremented every time a huge zero page is
|
||||||
|
successfully allocated. It includes allocations which were
dropped due to a race with another allocation. Note, it doesn't count
|
||||||
|
every map of the huge zero page, only its allocation.
|
||||||
|
|
||||||
|
thp_zero_page_alloc_failed
|
||||||
|
is incremented if the kernel fails to allocate a
|
||||||
|
huge zero page and falls back to using small pages.
|
||||||
|
|
||||||
|
thp_swpout
|
||||||
|
is incremented every time a huge page is swapped out in one
|
||||||
|
piece without splitting.
|
||||||
|
|
||||||
|
thp_swpout_fallback
|
||||||
|
is incremented if a huge page has to be split before swapout.
|
||||||
|
Usually this is because the kernel failed to allocate contiguous swap space
|
||||||
|
for the huge page.
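One way to turn the counters above into a single health figure is to compare thp_fault_fallback with thp_fault_alloc. The sketch below assumes ``/proc/vmstat`` holds one ``name value`` pair per line; the "fallback ratio" is an illustrative metric, not something the kernel reports, and the function names are hypothetical.

```python
def parse_vmstat(text):
    """Parse the text of /proc/vmstat into a {name: int} dictionary."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def fault_fallback_ratio(counters):
    """Fraction of THP page faults that fell back to small pages, or None."""
    alloc = counters.get("thp_fault_alloc", 0)
    fallback = counters.get("thp_fault_fallback", 0)
    total = alloc + fallback
    return fallback / total if total else None
```

A ratio close to 1 would suggest that most faults are failing to get a huge page; what counts as acceptable depends on the workload.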
|
||||||
|
|
||||||
|
As the system ages, allocating huge pages may be expensive as the
|
||||||
|
system uses memory compaction to copy data around memory to free a
|
||||||
|
huge page for use. There are some counters in ``/proc/vmstat`` to help
|
||||||
|
monitor this overhead.
|
||||||
|
|
||||||
|
compact_stall
|
||||||
|
is incremented every time a process stalls to run
|
||||||
|
memory compaction so that a huge page is free for use.
|
||||||
|
|
||||||
|
compact_success
|
||||||
|
is incremented if the system compacted memory and
|
||||||
|
freed a huge page for use.
|
||||||
|
|
||||||
|
compact_fail
|
||||||
|
is incremented if the system tried to compact memory
|
||||||
|
but failed.
|
||||||
|
|
||||||
|
compact_pages_moved
|
||||||
|
is incremented each time a page is moved. If
|
||||||
|
this value is increasing rapidly, it implies that the system
|
||||||
|
is copying a lot of data to satisfy the huge page allocation.
|
||||||
|
It is possible that the cost of copying exceeds any savings
|
||||||
|
from reduced TLB misses.
|
||||||
|
|
||||||
|
compact_pagemigrate_failed
|
||||||
|
is incremented when the underlying mechanism
|
||||||
|
for moving a page failed.
|
||||||
|
|
||||||
|
compact_blocks_moved
|
||||||
|
is incremented each time memory compaction examines
|
||||||
|
a huge page aligned range of pages.
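Whether a counter such as compact_pages_moved is "increasing rapidly" only shows up when two samples are compared. A minimal sketch, assuming the ``name value`` line layout of ``/proc/vmstat`` and using hypothetical helper names; the set of ``compact_*`` counters actually present depends on the kernel version.

```python
def compaction_counters(vmstat_text):
    """Extract the compact_* counters from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name.startswith("compact_") and value.strip().isdigit():
            counters[name] = int(value)
    return counters


def deltas(before, after):
    """Per-counter increase between two snapshots taken some time apart."""
    return {name: after[name] - before.get(name, 0) for name in after}
```

Taking one snapshot, sleeping for a fixed interval, and taking a second gives a per-interval rate that can be compared across runs.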
|
||||||
|
|
||||||
|
It is possible to establish how long the stalls were using the function
|
||||||
|
tracer to record how long was spent in __alloc_pages_nodemask and
|
||||||
|
using the mm_page_alloc tracepoint to identify which allocations were
|
||||||
|
for huge pages.
|
||||||
|
|
||||||
|
Optimizing the applications
|
||||||
|
===========================
|
||||||
|
|
||||||
|
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||||
|
memory region, the mmap region has to be hugepage naturally
|
||||||
|
aligned. posix_memalign() can provide that guarantee.
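The alignment arithmetic behind "hugepage naturally aligned" can be sketched as follows. The hugepage size is assumed to be a power of two (2M in the text above, or whatever ``hpage_pmd_size`` reports); the function names are illustrative.

```python
def align_up(addr, hpage_size):
    """Round addr up to the next multiple of hpage_size (a power of two)."""
    return (addr + hpage_size - 1) & ~(hpage_size - 1)


def aligned_window(start, length, hpage_size):
    """Return (first, last) bounds of the naturally aligned part of
    [start, start + length), or None if no whole hugepage fits."""
    first = align_up(start, hpage_size)
    last = (start + length) & ~(hpage_size - 1)
    return (first, last) if last > first else None
```

A region that starts off-alignment can still hold hugepages in its interior, but only the window returned here is eligible; over-allocating by one hugepage and using the aligned window is one common way to get the same effect as an aligned allocator.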
|
||||||
|
|
||||||
|
Hugetlbfs
|
||||||
|
=========
|
||||||
|
|
||||||
|
You can use hugetlbfs on a kernel that has transparent hugepage
|
||||||
|
support enabled just fine as always. No difference can be noted in
|
||||||
|
hugetlbfs other than there will be less overall fragmentation. All
|
||||||
|
usual features belonging to hugetlbfs are preserved and
|
||||||
|
unaffected. libhugetlbfs will also work fine as usual.
|
||||||
@@ -1,6 +1,11 @@
|
|||||||
= Userfaultfd =
|
.. _userfaultfd:
|
||||||
|
|
||||||
== Objective ==
|
===========
|
||||||
|
Userfaultfd
|
||||||
|
===========
|
||||||
|
|
||||||
|
Objective
|
||||||
|
=========
|
||||||
|
|
||||||
Userfaults allow the implementation of on-demand paging from userland
|
Userfaults allow the implementation of on-demand paging from userland
|
||||||
and more generally they allow userland to take control of various
|
and more generally they allow userland to take control of various
|
||||||
@@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do.
|
|||||||
For example userfaults allows a proper and more optimal implementation
|
For example userfaults allows a proper and more optimal implementation
|
||||||
of the PROT_NONE+SIGSEGV trick.
|
of the PROT_NONE+SIGSEGV trick.
|
||||||
|
|
||||||
== Design ==
|
Design
|
||||||
|
======
|
||||||
|
|
||||||
Userfaults are delivered and resolved through the userfaultfd syscall.
|
Userfaults are delivered and resolved through the userfaultfd syscall.
|
||||||
|
|
||||||
@@ -41,7 +47,8 @@ different processes without them being aware about what is going on
|
|||||||
themselves on the same region the manager is already tracking, which
|
themselves on the same region the manager is already tracking, which
|
||||||
is a corner case that would currently return -EBUSY).
|
is a corner case that would currently return -EBUSY).
|
||||||
|
|
||||||
== API ==
|
API
|
||||||
|
===
|
||||||
|
|
||||||
When first opened the userfaultfd must be enabled invoking the
|
When first opened the userfaultfd must be enabled invoking the
|
||||||
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
|
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
|
||||||
@@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
|
|||||||
half copied page since it'll keep userfaulting until the copy has
|
half copied page since it'll keep userfaulting until the copy has
|
||||||
finished.
|
finished.
|
||||||
|
|
||||||
== QEMU/KVM ==
|
QEMU/KVM
|
||||||
|
========
|
||||||
|
|
||||||
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
|
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
|
||||||
migration. Postcopy live migration is one form of memory
|
migration. Postcopy live migration is one form of memory
|
||||||
@@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the
|
|||||||
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
|
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
|
||||||
thread).
|
thread).
|
||||||
|
|
||||||
== Non-cooperative userfaultfd ==
|
Non-cooperative userfaultfd
|
||||||
|
===========================
|
||||||
|
|
||||||
When the userfaultfd is monitored by an external manager, the manager
|
When the userfaultfd is monitored by an external manager, the manager
|
||||||
must be able to track changes in the process virtual memory
|
must be able to track changes in the process virtual memory
|
||||||
@@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. The
|
|||||||
manager has to explicitly enable these events by setting appropriate
|
manager has to explicitly enable these events by setting appropriate
|
||||||
bits in uffdio_api.features passed to UFFDIO_API ioctl:
|
bits in uffdio_api.features passed to UFFDIO_API ioctl:
|
||||||
|
|
||||||
UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When
|
UFFD_FEATURE_EVENT_FORK
|
||||||
this feature is enabled, the userfaultfd context of the parent process
|
enable userfaultfd hooks for fork(). When this feature is
|
||||||
is duplicated into the newly created process. The manager receives
|
enabled, the userfaultfd context of the parent process is
|
||||||
UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in
|
duplicated into the newly created process. The manager
|
||||||
the uffd_msg.fork.
|
receives UFFD_EVENT_FORK with file descriptor of the new
|
||||||
|
userfaultfd context in the uffd_msg.fork.
|
||||||
|
|
||||||
UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap()
|
UFFD_FEATURE_EVENT_REMAP
|
||||||
calls. When the non-cooperative process moves a virtual memory area to
|
enable notifications about mremap() calls. When the
|
||||||
a different location, the manager will receive UFFD_EVENT_REMAP. The
|
non-cooperative process moves a virtual memory area to a
|
||||||
uffd_msg.remap will contain the old and new addresses of the area and
|
different location, the manager will receive
|
||||||
its original length.
|
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
|
||||||
|
new addresses of the area and its original length.
|
||||||
|
|
||||||
UFFD_FEATURE_EVENT_REMOVE - enable notifications about
|
UFFD_FEATURE_EVENT_REMOVE
|
||||||
madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event
|
enable notifications about madvise(MADV_REMOVE) and
|
||||||
UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The
|
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
|
||||||
uffd_msg.remove will contain start and end addresses of the removed
|
be generated upon these calls to madvise. The uffd_msg.remove
|
||||||
area.
|
will contain start and end addresses of the removed area.
|
||||||
|
|
||||||
UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory
|
UFFD_FEATURE_EVENT_UNMAP
|
||||||
unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove
|
enable notifications about memory unmapping. The manager will
|
||||||
containing start and end addresses of the unmapped area.
|
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
|
||||||
|
end addresses of the unmapped area.
|
||||||
|
|
||||||
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
|
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
|
||||||
are pretty similar, they quite differ in the action expected from the
|
are pretty similar, they quite differ in the action expected from the
|
||||||
@@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners:
|
|||||||
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
|
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
|
||||||
|
|
||||||
B. Use Device Tree bindings, as described in
|
B. Use Device Tree bindings, as described in
|
||||||
``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``.
|
``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
|
||||||
For example::
|
For example::
|
||||||
|
|
||||||
reserved-memory {
|
reserved-memory {
|
||||||
|
|||||||
@@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions)
|
|||||||
88DE3010, Armada 1000 (no Linux support)
|
88DE3010, Armada 1000 (no Linux support)
|
||||||
Core: Marvell PJ1 (ARMv5TE), Dual-core
|
Core: Marvell PJ1 (ARMv5TE), Dual-core
|
||||||
Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
|
Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
|
||||||
88DE3005, Armada 1500-mini
|
|
||||||
88DE3005, Armada 1500 Mini
|
88DE3005, Armada 1500 Mini
|
||||||
Design name: BG2CD
|
Design name: BG2CD
|
||||||
Core: ARM Cortex-A9, PL310 L2CC
|
Core: ARM Cortex-A9, PL310 L2CC
|
||||||
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini/
|
|
||||||
88DE3006, Armada 1500 Mini Plus
|
88DE3006, Armada 1500 Mini Plus
|
||||||
Design name: BG2CDP
|
Design name: BG2CDP
|
||||||
Core: Dual Core ARM Cortex-A7
|
Core: Dual Core ARM Cortex-A7
|
||||||
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini-plus/
|
|
||||||
88DE3100, Armada 1500
|
88DE3100, Armada 1500
|
||||||
Design name: BG2
|
Design name: BG2
|
||||||
Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC
|
Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC
|
||||||
Product Brief: http://www.marvell.com/digital-entertainment/armada-1500/assets/Marvell-ARMADA-1500-Product-Brief.pdf
|
|
||||||
88DE3114, Armada 1500 Pro
|
88DE3114, Armada 1500 Pro
|
||||||
Design name: BG2Q
|
Design name: BG2Q
|
||||||
Core: Quad Core ARM Cortex-A9, PL310 L2CC
|
Core: Quad Core ARM Cortex-A9, PL310 L2CC
|
||||||
@@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions)
|
|||||||
88DE3218, ARMADA 1500 Ultra
|
88DE3218, ARMADA 1500 Ultra
|
||||||
Core: ARM Cortex-A53
|
Core: ARM Cortex-A53
|
||||||
|
|
||||||
Homepage: http://www.marvell.com/multimedia-solutions/
|
Homepage: https://www.synaptics.com/products/multimedia-solutions
|
||||||
Directory: arch/arm/mach-berlin
|
Directory: arch/arm/mach-berlin
|
||||||
|
|
||||||
Comments:
|
Comments:
|
||||||
|
|
||||||
* This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs
|
* This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs
|
||||||
with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...).
|
with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...).
|
||||||
|
|
||||||
|
* The Berlin family was acquired by Synaptics from Marvell in 2017.
|
||||||
|
|
||||||
CPU Cores
|
CPU Cores
|
||||||
---------
|
---------
|
||||||
|
|
||||||
|
|||||||
@@ -1,7 +1,9 @@
|
|||||||
Embedded device command line partition parsing
|
Embedded device command line partition parsing
|
||||||
=====================================================================
|
=====================================================================
|
||||||
|
|
||||||
Support for reading the block device partition table from the command line.
|
The "blkdevparts" command line option adds support for reading the
|
||||||
|
block device partition table from the kernel command line.
|
||||||
|
|
||||||
It is typically used for fixed block (eMMC) embedded devices.
|
It is typically used for fixed block (eMMC) embedded devices.
|
||||||
It has no MBR, so saves storage space. Bootloader can be easily accessed
|
It has no MBR, so saves storage space. Bootloader can be easily accessed
|
||||||
by absolute address of data on the block device.
|
by absolute address of data on the block device.
|
||||||
@@ -14,22 +16,27 @@ blkdevparts=<blkdev-def>[;<blkdev-def>]
|
|||||||
<partdef> := <size>[@<offset>](part-name)
|
<partdef> := <size>[@<offset>](part-name)
|
||||||
|
|
||||||
<blkdev-id>
|
<blkdev-id>
|
||||||
block device disk name, embedded device used fixed block device,
|
block device disk name. Embedded device uses fixed block device.
|
||||||
it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0.
|
Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.
|
||||||
|
|
||||||
<size>
|
<size>
|
||||||
partition size, in bytes, such as: 512, 1m, 1G.
|
partition size, in bytes, such as: 512, 1m, 1G.
|
||||||
|
size may contain an optional suffix of (upper or lower case):
|
||||||
|
K, M, G, T, P, E.
|
||||||
|
"-" is used to denote all remaining space.
|
||||||
|
|
||||||
<offset>
|
<offset>
|
||||||
partition start address, in bytes.
|
partition start address, in bytes.
|
||||||
|
offset may contain an optional suffix of (upper or lower case):
|
||||||
|
K, M, G, T, P, E.
|
||||||
|
|
||||||
(part-name)
|
(part-name)
|
||||||
partition name, kernel send uevent with "PARTNAME". application can create
|
partition name. Kernel sends uevent with "PARTNAME". Application can
|
||||||
a link to block device partition with the name "PARTNAME".
|
create a link to block device partition with the name "PARTNAME".
|
||||||
user space application can access partition by partition name.
|
User space application can access partition by partition name.
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
eMMC disk name is "mmcblk0" and "mmcblk0boot0"
|
eMMC disk names are "mmcblk0" and "mmcblk0boot0".
|
||||||
|
|
||||||
bootargs:
|
bootargs:
|
||||||
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
|
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
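To make the grammar above concrete, here is a small Python sketch that splits a ``blkdevparts`` value into its parts. It is an illustration only, not the kernel's parser: it accepts decimal numbers only, treats the suffixes as powers of 1024, takes the text after ``blkdevparts=``, and lets ``(part-name)`` be absent as in the ``-`` entry of the example. The function name ``parse_blkdevparts`` is hypothetical.

```python
import re

# Suffix multipliers, assumed to be powers of 1024.
_SUFFIX = {"": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30,
           "t": 1 << 40, "p": 1 << 50, "e": 1 << 60}
_NUM = r"\d+[kmgtpeKMGTPE]?"
# <partdef> := <size>[@<offset>](part-name), where size may be "-".
_PARTDEF = re.compile(rf"^(-|{_NUM})(?:@({_NUM}))?(?:\(([^)]*)\))?$")


def _to_bytes(text):
    """Convert '512', '1m' or '1G' to a byte count."""
    digits = text.rstrip("kmgtpeKMGTPE")
    return int(digits) * _SUFFIX[text[len(digits):].lower()]


def parse_blkdevparts(value):
    """Return {blkdev-id: [ {size, offset, name}, ... ]}.

    size is None for "-" (all remaining space); offset and name are
    None when omitted.
    """
    result = {}
    for blkdev_def in value.split(";"):
        blkdev_id, _, partdefs = blkdev_def.partition(":")
        parts = []
        for partdef in partdefs.split(","):
            m = _PARTDEF.match(partdef)
            if m is None:
                raise ValueError(f"bad partdef: {partdef!r}")
            size, offset, name = m.groups()
            parts.append({
                "size": None if size == "-" else _to_bytes(size),
                "offset": None if offset is None else _to_bytes(offset),
                "name": name,
            })
        result[blkdev_id] = parts
    return result
```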
|
||||||
|
|||||||
66
Documentation/core-api/gfp_mask-from-fs-io.rst
Normal file
@@ -0,0 +1,66 @@
|
|||||||
|
=================================
|
||||||
|
GFP masks used from FS/IO context
|
||||||
|
=================================
|
||||||
|
|
||||||
|
:Date: May, 2018
|
||||||
|
:Author: Michal Hocko <mhocko@kernel.org>
|
||||||
|
|
||||||
|
Introduction
|
||||||
|
============
|
||||||
|
|
||||||
|
Code paths in the filesystem and IO stacks must be careful when
|
||||||
|
allocating memory to prevent recursion deadlocks caused by direct
|
||||||
|
memory reclaim calling back into the FS or IO paths and blocking on
|
||||||
|
already held resources (e.g. locks - most commonly those used for the
|
||||||
|
transaction context).
|
||||||
|
|
||||||
|
The traditional way to avoid this deadlock problem is to clear __GFP_FS
|
||||||
|
respectively __GFP_IO (note the latter implies clearing the first as well) in
|
||||||
|
the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
|
||||||
|
used as a shortcut. It turned out though that the above approach has led to
|
||||||
|
abuses when the restricted gfp mask is used "just in case" without a
|
||||||
|
deeper consideration which leads to problems because an excessive use
|
||||||
|
of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
|
||||||
|
reclaim issues.
|
||||||
|
|
||||||
|
New API
|
||||||
|
========
|
||||||
|
|
||||||
|
Since 4.12 we do have a generic scope API for both NOFS and NOIO context
|
||||||
|
``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
|
||||||
|
``memalloc_noio_restore`` which allow marking a scope as a critical
|
||||||
|
section from a filesystem or I/O point of view. Any allocation from that
|
||||||
|
scope will inherently drop __GFP_FS respectively __GFP_IO from the given
|
||||||
|
mask so no memory allocation can recurse back in the FS/IO.
|
||||||
|
|
||||||
|
.. kernel-doc:: include/linux/sched/mm.h
|
||||||
|
:functions: memalloc_nofs_save memalloc_nofs_restore
|
||||||
|
.. kernel-doc:: include/linux/sched/mm.h
|
||||||
|
:functions: memalloc_noio_save memalloc_noio_restore
|
||||||
|
|
||||||
|
FS/IO code then simply calls the appropriate save function before
|
||||||
|
any critical section with respect to the reclaim is started - e.g.
|
||||||
|
lock shared with the reclaim context or when a transaction context
|
||||||
|
nesting would be possible via reclaim. The restore function should be
|
||||||
|
called when the critical section ends. All that ideally along with an
|
||||||
|
explanation of what the reclaim context is, for easier maintenance.
|
||||||
|
|
||||||
|
Please note that the proper pairing of save/restore functions
|
||||||
|
allows nesting so it is safe to call ``memalloc_noio_save`` or
|
||||||
|
``memalloc_nofs_save`` respectively from an existing NOIO or NOFS
|
||||||
|
scope.
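The nesting property can be illustrated with a toy model in Python. This is not kernel code: the ``Task`` class and the bit value are assumptions made for the illustration, and the sketch only mirrors the pattern where save returns the previous state of a per-task flag and restore puts exactly that state back.

```python
PF_MEMALLOC_NOFS = 0x1  # illustrative bit value, not the kernel's


class Task:
    """Stand-in for the current task; only the flags word is modelled."""

    def __init__(self):
        self.flags = 0


def memalloc_nofs_save(task):
    """Enter a NOFS scope and return the previous state of the flag."""
    old = task.flags & PF_MEMALLOC_NOFS
    task.flags |= PF_MEMALLOC_NOFS
    return old


def memalloc_nofs_restore(task, old):
    """Leave a NOFS scope by restoring the state returned by save."""
    task.flags = (task.flags & ~PF_MEMALLOC_NOFS) | old


def in_nofs_scope(task):
    """True while any NOFS scope is active on the task."""
    return bool(task.flags & PF_MEMALLOC_NOFS)
```

Because an inner save records that the flag was already set, the matching inner restore leaves it set, and only the outermost restore clears it. This is why properly paired calls can be nested safely.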
|
||||||
|
|
||||||
|
What about __vmalloc(GFP_NOFS)
|
||||||
|
==============================
|
||||||
|
|
||||||
|
vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
|
||||||
|
GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
|
||||||
|
to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
|
||||||
|
almost always a bug. The good news is that the NOFS/NOIO semantic can be
|
||||||
|
achieved by the scope API.
|
||||||
|
|
||||||
|
In the ideal world, upper layers should already mark dangerous contexts
|
||||||
|
and so no special care is required and vmalloc should be called without
|
||||||
|
any problems. Sometimes if the context is not really clear or there are
|
||||||
|
layering violations then the recommended way around that is to wrap ``vmalloc``
|
||||||
|
by the scope API with a comment explaining the problem.
|
||||||
@@ -14,6 +14,7 @@ Core utilities
|
|||||||
kernel-api
|
kernel-api
|
||||||
assoc_array
|
assoc_array
|
||||||
atomic_ops
|
atomic_ops
|
||||||
|
cachetlb
|
||||||
refcount-vs-atomic
|
refcount-vs-atomic
|
||||||
cpu_hotplug
|
cpu_hotplug
|
||||||
idr
|
idr
|
||||||
@@ -25,6 +26,8 @@ Core utilities
|
|||||||
genalloc
|
genalloc
|
||||||
errseq
|
errseq
|
||||||
printk-formats
|
printk-formats
|
||||||
|
circular-buffers
|
||||||
|
gfp_mask-from-fs-io
|
||||||
|
|
||||||
Interfaces for kernel debugging
|
Interfaces for kernel debugging
|
||||||
===============================
|
===============================
|
||||||
|
|||||||
@@ -39,17 +39,17 @@ String Manipulation
|
|||||||
.. kernel-doc:: lib/string.c
|
.. kernel-doc:: lib/string.c
|
||||||
:export:
|
:export:
|
||||||
|
|
||||||
|
Basic Kernel Library Functions
|
||||||
|
==============================
|
||||||
|
|
||||||
|
The Linux kernel provides more basic utility functions.
|
||||||
|
|
||||||
Bit Operations
|
Bit Operations
|
||||||
--------------
|
--------------
|
||||||
|
|
||||||
.. kernel-doc:: arch/x86/include/asm/bitops.h
|
.. kernel-doc:: arch/x86/include/asm/bitops.h
|
||||||
:internal:
|
:internal:
|
||||||
|
|
||||||
Basic Kernel Library Functions
|
|
||||||
==============================
|
|
||||||
|
|
||||||
The Linux kernel provides more basic utility functions.
|
|
||||||
|
|
||||||
Bitmap Operations
|
Bitmap Operations
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
@@ -80,6 +80,31 @@ Command-line Parsing
|
|||||||
.. kernel-doc:: lib/cmdline.c
|
.. kernel-doc:: lib/cmdline.c
|
||||||
:export:
|
:export:
|
||||||
|
|
||||||
|
Sorting
|
||||||
|
-------
|
||||||
|
|
||||||
|
.. kernel-doc:: lib/sort.c
|
||||||
|
:export:
|
||||||
|
|
||||||
|
.. kernel-doc:: lib/list_sort.c
|
||||||
|
:export:
|
||||||
|
|
||||||
|
Text Searching
|
||||||
|
--------------
|
||||||
|
|
||||||
|
.. kernel-doc:: lib/textsearch.c
|
||||||
|
:doc: ts_intro
|
||||||
|
|
||||||
|
.. kernel-doc:: lib/textsearch.c
|
||||||
|
:export:
|
||||||
|
|
||||||
|
.. kernel-doc:: include/linux/textsearch.h
|
||||||
|
:functions: textsearch_find textsearch_next \
|
||||||
|
textsearch_get_pattern textsearch_get_pattern_len
|
||||||
|
|
||||||
|
CRC and Math Functions in Linux
|
||||||
|
===============================
|
||||||
|
|
||||||
CRC Functions
|
CRC Functions
|
||||||
-------------
|
-------------
|
||||||
|
|
||||||
@@ -103,9 +128,6 @@ CRC Functions
|
|||||||
.. kernel-doc:: lib/crc-itu-t.c
|
.. kernel-doc:: lib/crc-itu-t.c
|
||||||
:export:
|
:export:
|
||||||
|
|
||||||
Math Functions in Linux
|
|
||||||
=======================
|
|
||||||
|
|
||||||
Base 2 log and power Functions
|
Base 2 log and power Functions
|
||||||
------------------------------
|
------------------------------
|
||||||
|
|
||||||
@@ -127,28 +149,6 @@ Division Functions
|
|||||||
.. kernel-doc:: lib/gcd.c
|
.. kernel-doc:: lib/gcd.c
|
||||||
:export:
|
:export:
|
||||||
|
|
||||||
Sorting
|
|
||||||
-------
|
|
||||||
|
|
||||||
.. kernel-doc:: lib/sort.c
|
|
||||||
:export:
|
|
||||||
|
|
||||||
.. kernel-doc:: lib/list_sort.c
|
|
||||||
:export:
|
|
||||||
|
|
||||||
Text Searching
|
|
||||||
--------------
|
|
||||||
|
|
||||||
.. kernel-doc:: lib/textsearch.c
|
|
||||||
:doc: ts_intro
|
|
||||||
|
|
||||||
.. kernel-doc:: lib/textsearch.c
|
|
||||||
:export:
|
|
||||||
|
|
||||||
.. kernel-doc:: include/linux/textsearch.h
|
|
||||||
:functions: textsearch_find textsearch_next \
|
|
||||||
textsearch_get_pattern textsearch_get_pattern_len
|
|
||||||
|
|
||||||
UUID/GUID
|
UUID/GUID
|
||||||
---------
|
---------
|
||||||
|
|
||||||
|
|||||||
@@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in
|
|||||||
these memory ordering guarantees.
|
these memory ordering guarantees.
|
||||||
|
|
||||||
The terms used through this document try to follow the formal LKMM defined in
|
The terms used through this document try to follow the formal LKMM defined in
|
||||||
github.com/aparri/memory-model/blob/master/Documentation/explanation.txt
|
tools/memory-model/Documentation/explanation.txt.
|
||||||
|
|
||||||
memory-barriers.txt and atomic_t.txt provide more background to the
|
memory-barriers.txt and atomic_t.txt provide more background to the
|
||||||
memory ordering in general and for atomic operations specifically.
|
memory ordering in general and for atomic operations specifically.
|
||||||
|
|||||||
@@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples.
|
|||||||
architecture
|
architecture
|
||||||
devel-algos
|
devel-algos
|
||||||
userspace-if
|
userspace-if
|
||||||
|
crypto_engine
|
||||||
api
|
api
|
||||||
api-samples
|
api-samples
|
||||||
|
|||||||
@@ -120,7 +120,7 @@ A typical out of bounds access report looks like this::
|
|||||||
|
|
||||||
The header of the report discribe what kind of bug happened and what kind of
|
The header of the report discribe what kind of bug happened and what kind of
|
||||||
access caused it. It's followed by the description of the accessed slub object
|
access caused it. It's followed by the description of the accessed slub object
|
||||||
(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
|
(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and
|
||||||
the description of the accessed memory page.
|
the description of the accessed memory page.
|
||||||
|
|
||||||
In the last section the report shows memory state around the accessed address.
|
In the last section the report shows memory state around the accessed address.
|
||||||
|
|||||||
@@ -151,6 +151,11 @@ Contributing new tests (details)
|
|||||||
TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
|
TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
|
||||||
test.
|
test.
|
||||||
|
|
||||||
|
* First use the headers inside the kernel source and/or git repo, and then the
|
||||||
|
system headers. Headers for the kernel release as opposed to headers
|
||||||
|
installed by the distro on the system should be the primary focus to be able
|
||||||
|
to find regressions.
|
||||||
|
|
||||||
Test Harness
|
Test Harness
|
||||||
============
|
============
|
||||||
|
|
||||||
|
|||||||
@@ -44,7 +44,7 @@ common to each controller of that type:
|
|||||||
|
|
||||||
- methods to establish GPIO line direction
|
- methods to establish GPIO line direction
|
||||||
- methods used to access GPIO line values
|
- methods used to access GPIO line values
|
||||||
- method to set electrical configuration to a a given GPIO line
|
- method to set electrical configuration for a given GPIO line
|
||||||
- method to return the IRQ number associated to a given GPIO line
|
- method to return the IRQ number associated to a given GPIO line
|
||||||
- flag saying whether calls to its methods may sleep
|
- flag saying whether calls to its methods may sleep
|
||||||
- optional line names array to identify lines
|
- optional line names array to identify lines
|
||||||
@@ -143,7 +143,7 @@ resistor will make the line tend to high level unless one of the transistors on
|
|||||||
the rail actively pulls it down.
|
the rail actively pulls it down.
|
||||||
|
|
||||||
The level on the line will go as high as the VDD on the pull-up resistor, which
|
The level on the line will go as high as the VDD on the pull-up resistor, which
|
||||||
may be higher than the level supported by the transistor, achieveing a
|
may be higher than the level supported by the transistor, achieving a
|
||||||
level-shift to the higher VDD.
|
level-shift to the higher VDD.
|
||||||
|
|
||||||
Integrated electronics often have an output driver stage in the form of a CMOS
|
Integrated electronics often have an output driver stage in the form of a CMOS
|
||||||
@@ -382,7 +382,7 @@ Real-Time compliance for GPIO IRQ chips
|
|||||||
|
|
||||||
Any provider of irqchips needs to be carefully tailored to support Real Time
|
Any provider of irqchips needs to be carefully tailored to support Real Time
|
||||||
preemption. It is desirable that all irqchips in the GPIO subsystem keep this
|
preemption. It is desirable that all irqchips in the GPIO subsystem keep this
|
||||||
in mind and does the proper testing to assure they are real time-enabled.
|
in mind and do the proper testing to assure they are real time-enabled.
|
||||||
So, pay attention on above " RT_FULL:" notes, please.
|
So, pay attention on above " RT_FULL:" notes, please.
|
||||||
The following is a checklist to follow when preparing a driver for real
|
The following is a checklist to follow when preparing a driver for real
|
||||||
time-compliance:
|
time-compliance:
|
||||||
|
|||||||
@@ -17,7 +17,9 @@ available subsections can be seen below.
|
|||||||
basics
|
basics
|
||||||
infrastructure
|
infrastructure
|
||||||
pm/index
|
pm/index
|
||||||
|
clk
|
||||||
device-io
|
device-io
|
||||||
|
device_connection
|
||||||
dma-buf
|
dma-buf
|
||||||
device_link
|
device_link
|
||||||
message-based
|
message-based
|
||||||
|
|||||||
@@ -711,7 +711,8 @@ The vmbus device regions are mapped into uio device resources:
|
|||||||
|
|
||||||
If a subchannel is created by a request to host, then the uio_hv_generic
|
If a subchannel is created by a request to host, then the uio_hv_generic
|
||||||
device driver will create a sysfs binary file for the per-channel ring buffer.
|
device driver will create a sysfs binary file for the per-channel ring buffer.
|
||||||
For example:
|
For example::
|
||||||
|
|
||||||
/sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring
|
/sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring
|
||||||
|
|
||||||
Further information
|
Further information
|
||||||
|
|||||||
@@ -1,7 +1,7 @@
|
|||||||
#
|
#
|
||||||
# Feature name: strncasecmp
|
# Feature name: cBPF-JIT
|
||||||
# Kconfig: __HAVE_ARCH_STRNCASECMP
|
# Kconfig: HAVE_CBPF_JIT
|
||||||
# description: arch provides an optimized strncasecmp() function
|
# description: arch supports cBPF JIT optimizations
|
||||||
#
|
#
|
||||||
-----------------------
|
-----------------------
|
||||||
| arch |status|
|
| arch |status|
|
||||||
@@ -16,14 +16,16 @@
|
|||||||
| ia64: | TODO |
|
| ia64: | TODO |
|
||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | ok |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | TODO |
|
| x86: | TODO |
|
||||||
@@ -1,7 +1,7 @@
|
|||||||
#
|
#
|
||||||
# Feature name: BPF-JIT
|
# Feature name: eBPF-JIT
|
||||||
# Kconfig: HAVE_BPF_JIT
|
# Kconfig: HAVE_EBPF_JIT
|
||||||
# description: arch supports BPF JIT optimizations
|
# description: arch supports eBPF JIT optimizations
|
||||||
#
|
#
|
||||||
-----------------------
|
-----------------------
|
||||||
| arch |status|
|
| arch |status|
|
||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | ok |
|
||||||
| parisc: | ok |
|
| parisc: | ok |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | ok |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | ok |
|
||||||
| nios2: | ok |
|
| nios2: | ok |
|
||||||
| openrisc: | ok |
|
| openrisc: | ok |
|
||||||
| parisc: | ok |
|
| parisc: | ok |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | ok |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,15 +17,17 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | ok | 64-bit only
|
| x86: | ok |
|
||||||
| xtensa: | ok |
|
| xtensa: | ok |
|
||||||
-----------------------
|
-----------------------
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | ok |
|
| microblaze: | ok |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -11,16 +11,18 @@
|
|||||||
| arm: | ok |
|
| arm: | ok |
|
||||||
| arm64: | ok |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | ok |
|
||||||
| hexagon: | ok |
|
| hexagon: | ok |
|
||||||
| ia64: | TODO |
|
| ia64: | TODO |
|
||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | ok |
|
| microblaze: | ok |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | ok |
|
| nios2: | ok |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | ok |
|
| arc: | ok |
|
||||||
| arm: | ok |
|
| arm: | ok |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | TODO |
|
| hexagon: | TODO |
|
||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | ok |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | ok |
|
| arc: | ok |
|
||||||
| arm: | ok |
|
| arm: | ok |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | TODO |
|
| hexagon: | TODO |
|
||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | TODO |
|
| arc: | TODO |
|
||||||
| arm: | ok |
|
| arm: | ok |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | TODO |
|
| hexagon: | TODO |
|
||||||
@@ -17,13 +17,15 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | ok |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | ok |
|
| x86: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,11 +17,13 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
| s390: | TODO |
|
| riscv: | ok |
|
||||||
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | TODO |
|
| arc: | TODO |
|
||||||
| arm: | TODO |
|
| arm: | TODO |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | TODO |
|
| hexagon: | TODO |
|
||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | ok |
|
| microblaze: | ok |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | ok |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | ok |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -9,21 +9,23 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | TODO |
|
| arc: | TODO |
|
||||||
| arm: | TODO |
|
| arm: | TODO |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | TODO |
|
| hexagon: | TODO |
|
||||||
| ia64: | TODO |
|
| ia64: | TODO |
|
||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | ok |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | ok |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | ok |
|
| x86: | ok |
|
||||||
|
|||||||
@@ -16,14 +16,16 @@
|
|||||||
| ia64: | TODO |
|
| ia64: | TODO |
|
||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | ok |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | ok |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | ok |
|
| x86: | ok |
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
#
|
#
|
||||||
# Feature name: rwsem-optimized
|
# Feature name: rwsem-optimized
|
||||||
# Kconfig: Optimized asm/rwsem.h
|
# Kconfig: !RWSEM_GENERIC_SPINLOCK
|
||||||
# description: arch provides optimized rwsem APIs
|
# description: arch provides optimized rwsem APIs
|
||||||
#
|
#
|
||||||
-----------------------
|
-----------------------
|
||||||
@@ -8,8 +8,8 @@
|
|||||||
-----------------------
|
-----------------------
|
||||||
| alpha: | ok |
|
| alpha: | ok |
|
||||||
| arc: | TODO |
|
| arc: | TODO |
|
||||||
| arm: | TODO |
|
| arm: | ok |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | TODO |
|
| hexagon: | TODO |
|
||||||
@@ -17,14 +17,16 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
| um: | TODO |
|
| um: | ok |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | ok |
|
| x86: | ok |
|
||||||
| xtensa: | ok |
|
| xtensa: | ok |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | TODO |
|
| arc: | TODO |
|
||||||
| arm: | ok |
|
| arm: | ok |
|
||||||
| arm64: | TODO |
|
| arm64: | ok |
|
||||||
| c6x: | TODO |
|
| c6x: | TODO |
|
||||||
| h8300: | TODO |
|
| h8300: | TODO |
|
||||||
| hexagon: | ok |
|
| hexagon: | ok |
|
||||||
@@ -17,13 +17,15 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | ok |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | TODO |
|
| sparc: | ok |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
| x86: | ok |
|
| x86: | ok |
|
||||||
|
|||||||
@@ -17,11 +17,13 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
| s390: | TODO |
|
| riscv: | TODO |
|
||||||
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
|
|||||||
@@ -17,11 +17,13 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
| s390: | TODO |
|
| riscv: | TODO |
|
||||||
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
|
|||||||
@@ -40,10 +40,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | .. |
|
| arc: | .. |
|
||||||
| arm: | .. |
|
| arm: | .. |
|
||||||
| arm64: | .. |
|
| arm64: | ok |
|
||||||
| c6x: | .. |
|
| c6x: | .. |
|
||||||
| h8300: | .. |
|
| h8300: | .. |
|
||||||
| hexagon: | .. |
|
| hexagon: | .. |
|
||||||
@@ -17,11 +17,13 @@
|
|||||||
| m68k: | .. |
|
| m68k: | .. |
|
||||||
| microblaze: | .. |
|
| microblaze: | .. |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | .. |
|
| nios2: | .. |
|
||||||
| openrisc: | .. |
|
| openrisc: | .. |
|
||||||
| parisc: | .. |
|
| parisc: | .. |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
| s390: | .. |
|
| riscv: | TODO |
|
||||||
|
| s390: | ok |
|
||||||
| sh: | .. |
|
| sh: | .. |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
| um: | .. |
|
| um: | .. |
|
||||||
|
|||||||
98
Documentation/features/scripts/features-refresh.sh
Executable file
@@ -0,0 +1,98 @@
|
|||||||
|
#
|
||||||
|
# Small script that refreshes the kernel feature support status in place.
|
||||||
|
#
|
||||||
|
|
||||||
|
for F_FILE in Documentation/features/*/*/arch-support.txt; do
|
||||||
|
F=$(grep "^# Kconfig:" "$F_FILE" | cut -c26-)
|
||||||
|
|
||||||
|
#
|
||||||
|
# Each feature F is identified by a pair (O, K), where 'O' can
|
||||||
|
# be either the empty string (for 'nop') or "not" (the logical
|
||||||
|
# negation operator '!'); other operators are not supported.
|
||||||
|
#
|
||||||
|
O=""
|
||||||
|
K=$F
|
||||||
|
if [[ "$F" == !* ]]; then
|
||||||
|
O="not"
|
||||||
|
K=$(echo $F | sed -e 's/^!//g')
|
||||||
|
fi
|
||||||
|
|
||||||
|
#
|
||||||
|
# F := (O, K) is 'valid' iff there is a Kconfig file (for some
|
||||||
|
# arch) which contains K.
|
||||||
|
#
|
||||||
|
# Notice that this definition entails an 'asymmetry' between
|
||||||
|
# the case 'O = ""' and the case 'O = "not"'. E.g., F may be
|
||||||
|
# _invalid_ if:
|
||||||
|
#
|
||||||
|
# [case 'O = ""']
|
||||||
|
# 1) no arch provides support for F,
|
||||||
|
# 2) K does not exist (e.g., it was renamed/mis-typed);
|
||||||
|
#
|
||||||
|
# [case 'O = "not"']
|
||||||
|
# 3) all archs provide support for F,
|
||||||
|
# 4) as in (2).
|
||||||
|
#
|
||||||
|
# The rationale for adopting this definition (and, thus, for
|
||||||
|
# keeping the asymmetry) is:
|
||||||
|
#
|
||||||
|
# We want to be able to 'detect' (2) (or (4)).
|
||||||
|
#
|
||||||
|
# (1) and (3) may further warn the developers about the fact
|
||||||
|
# that K can be removed.
|
||||||
|
#
|
||||||
|
F_VALID="false"
|
||||||
|
for ARCH_DIR in arch/*/; do
|
||||||
|
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
|
||||||
|
K_GREP=$(grep "$K" $K_FILES)
|
||||||
|
if [ ! -z "$K_GREP" ]; then
|
||||||
|
F_VALID="true"
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if [ "$F_VALID" = "false" ]; then
|
||||||
|
printf "WARNING: '%s' is not a valid Kconfig\n" "$F"
|
||||||
|
fi
|
||||||
|
|
||||||
|
T_FILE="$F_FILE.tmp"
|
||||||
|
grep "^#" $F_FILE > $T_FILE
|
||||||
|
echo " -----------------------" >> $T_FILE
|
||||||
|
echo " | arch |status|" >> $T_FILE
|
||||||
|
echo " -----------------------" >> $T_FILE
|
||||||
|
for ARCH_DIR in arch/*/; do
|
||||||
|
ARCH=$(echo $ARCH_DIR | sed -e 's/arch//g' | sed -e 's/\///g')
|
||||||
|
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
|
||||||
|
K_GREP=$(grep "$K" $K_FILES)
|
||||||
|
#
|
||||||
|
# Arch support status values for (O, K) are updated according
|
||||||
|
# to the following rules.
|
||||||
|
#
|
||||||
|
# - ("", K) is 'supported by a given arch', if there is a
|
||||||
|
# Kconfig file for that arch which contains K;
|
||||||
|
#
|
||||||
|
# - ("not", K) is 'supported by a given arch', if there is
|
||||||
|
# no Kconfig file for that arch which contains K;
|
||||||
|
#
|
||||||
|
# - otherwise: preserve the previous status value (if any),
|
||||||
|
# default to 'not yet supported'.
|
||||||
|
#
|
||||||
|
# Notice that, according these rules, invalid features may be
|
||||||
|
# updated/modified.
|
||||||
|
#
|
||||||
|
if [ "$O" = "" ] && [ ! -z "$K_GREP" ]; then
|
||||||
|
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
|
||||||
|
elif [ "$O" = "not" ] && [ -z "$K_GREP" ]; then
|
||||||
|
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
|
||||||
|
else
|
||||||
|
S=$(grep -v "^#" "$F_FILE" | grep " $ARCH:")
|
||||||
|
if [ ! -z "$S" ]; then
|
||||||
|
echo "$S" >> $T_FILE
|
||||||
|
else
|
||||||
|
printf " |%12s: | TODO |\n" "$ARCH" \
|
||||||
|
>> $T_FILE
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
echo " -----------------------" >> $T_FILE
|
||||||
|
mv $T_FILE $F_FILE
|
||||||
|
done
|
||||||
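The status rules described in the script's comments can be sketched as a small standalone function. This is only an illustration under stated assumptions: `feature_op`, `feature_key` and `feature_status` are hypothetical helper names that do not appear in features-refresh.sh, and the sketch returns just the status cell, whereas the script itself re-emits the whole previous table line when it preserves a value.

```shell
#!/bin/bash
# Hypothetical helpers, not part of features-refresh.sh: they restate the
# (O, K) rules from the script's comments in isolation.

# Operator O of a Kconfig field: "not" if the field starts with '!', else "".
feature_op() {
    case "$1" in
        '!'*) echo "not" ;;
        *)    echo "" ;;
    esac
}

# Key K of a Kconfig field: the field with one leading '!' removed.
feature_key() {
    echo "${1#\!}"
}

# Status cell for one architecture.
#   $1: operator O ("" or "not")
#   $2: grep output for K in that arch's Kconfig files (empty if no match)
#   $3: previous status cell for that arch, if any
feature_status() {
    local op="$1" k_grep="$2" prev="$3"

    if [ "$op" = "" ] && [ -n "$k_grep" ]; then
        echo "ok"        # ("", K): some Kconfig file of the arch contains K
    elif [ "$op" = "not" ] && [ -z "$k_grep" ]; then
        echo "ok"        # ("not", K): no Kconfig file of the arch contains K
    elif [ -n "$prev" ]; then
        echo "$prev"     # otherwise keep the previous value
    else
        echo "TODO"      # no previous value: default to 'not yet supported'
    fi
}
```

For a field such as `!RWSEM_GENERIC_SPINLOCK`, `feature_op` yields `not` and `feature_key` yields `RWSEM_GENERIC_SPINLOCK`, so an architecture whose Kconfig files never mention that symbol would be reported as `ok`.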
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | ok |
|
||||||
| powerpc: | TODO |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,12 +17,14 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | ok |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
| um: | TODO |
|
| um: | TODO |
|
||||||
| unicore32: | TODO |
|
| unicore32: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | ok |
|
| m68k: | ok |
|
||||||
| microblaze: | ok |
|
| microblaze: | ok |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | ok |
|
||||||
| nios2: | ok |
|
| nios2: | ok |
|
||||||
| openrisc: | ok |
|
| openrisc: | ok |
|
||||||
| parisc: | TODO |
|
| parisc: | ok |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | ok |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | .. |
|
| parisc: | .. |
|
||||||
| powerpc: | .. |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | .. |
|
| s390: | .. |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | .. |
|
| sparc: | .. |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | ok |
|
| microblaze: | ok |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | ok |
|
||||||
| nios2: | ok |
|
| nios2: | ok |
|
||||||
| openrisc: | ok |
|
| openrisc: | ok |
|
||||||
| parisc: | ok |
|
| parisc: | ok |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | ok |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | ok |
|
| parisc: | ok |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | ok |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | .. |
|
| m68k: | .. |
|
||||||
| microblaze: | .. |
|
| microblaze: | .. |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | .. |
|
| nios2: | .. |
|
||||||
| openrisc: | .. |
|
| openrisc: | .. |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | .. |
|
| sh: | .. |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | .. |
|
| m68k: | .. |
|
||||||
| microblaze: | .. |
|
| microblaze: | .. |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | .. |
|
| nios2: | .. |
|
||||||
| openrisc: | .. |
|
| openrisc: | .. |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | TODO |
|
| powerpc: | TODO |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | TODO |
|
| sh: | TODO |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | TODO |
|
| s390: | TODO |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | TODO |
|
| sparc: | TODO |
|
||||||
|
|||||||
@@ -9,7 +9,7 @@
|
|||||||
| alpha: | TODO |
|
| alpha: | TODO |
|
||||||
| arc: | .. |
|
| arc: | .. |
|
||||||
| arm: | .. |
|
| arm: | .. |
|
||||||
| arm64: | .. |
|
| arm64: | ok |
|
||||||
| c6x: | .. |
|
| c6x: | .. |
|
||||||
| h8300: | .. |
|
| h8300: | .. |
|
||||||
| hexagon: | .. |
|
| hexagon: | .. |
|
||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | .. |
|
| m68k: | .. |
|
||||||
| microblaze: | ok |
|
| microblaze: | ok |
|
||||||
| mips: | ok |
|
| mips: | ok |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | .. |
|
| nios2: | .. |
|
||||||
| openrisc: | .. |
|
| openrisc: | .. |
|
||||||
| parisc: | .. |
|
| parisc: | .. |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | ok |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -17,10 +17,12 @@
|
|||||||
| m68k: | TODO |
|
| m68k: | TODO |
|
||||||
| microblaze: | TODO |
|
| microblaze: | TODO |
|
||||||
| mips: | TODO |
|
| mips: | TODO |
|
||||||
|
| nds32: | TODO |
|
||||||
| nios2: | TODO |
|
| nios2: | TODO |
|
||||||
| openrisc: | TODO |
|
| openrisc: | TODO |
|
||||||
| parisc: | TODO |
|
| parisc: | TODO |
|
||||||
| powerpc: | ok |
|
| powerpc: | ok |
|
||||||
|
| riscv: | TODO |
|
||||||
| s390: | ok |
|
| s390: | ok |
|
||||||
| sh: | ok |
|
| sh: | ok |
|
||||||
| sparc: | ok |
|
| sparc: | ok |
|
||||||
|
|||||||
@@ -515,7 +515,8 @@ guarantees:
|
|||||||
|
|
||||||
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
|
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
|
||||||
bits on both physical and virtual pages associated with a process, and the
|
bits on both physical and virtual pages associated with a process, and the
|
||||||
soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
|
soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
|
||||||
|
for details).
|
||||||
To clear the bits for all the pages associated with the process
|
To clear the bits for all the pages associated with the process
|
||||||
> echo 1 > /proc/PID/clear_refs
|
> echo 1 > /proc/PID/clear_refs
|
||||||
|
|
||||||
@@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect.
|
|||||||
|
|
||||||
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
|
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
|
||||||
using /proc/kpageflags and number of times a page is mapped using
|
using /proc/kpageflags and number of times a page is mapped using
|
||||||
/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
|
/proc/kpagecount. For detailed explanation, see
|
||||||
|
Documentation/admin-guide/mm/pagemap.rst.
|
||||||
|
|
||||||
The /proc/pid/numa_maps is an extension based on maps, showing the memory
|
The /proc/pid/numa_maps is an extension based on maps, showing the memory
|
||||||
locality and binding policy, as well as the memory usage (in pages) of
|
locality and binding policy, as well as the memory usage (in pages) of
|
||||||
@@ -564,7 +566,7 @@ address policy mapping details
|
|||||||
|
|
||||||
Where:
|
Where:
|
||||||
"address" is the starting address for the mapping;
|
"address" is the starting address for the mapping;
|
||||||
"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt);
|
"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst);
|
||||||
"mapping details" summarizes mapping data such as mapping type, page usage counters,
|
"mapping details" summarizes mapping data such as mapping type, page usage counters,
|
||||||
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
|
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
|
||||||
size, in KB, that is backing the mapping up.
|
size, in KB, that is backing the mapping up.
|
||||||
|
|||||||
@@ -105,8 +105,9 @@ policy for the file will revert to "default" policy.
|
|||||||
NUMA memory allocation policies have optional flags that can be used in
|
NUMA memory allocation policies have optional flags that can be used in
|
||||||
conjunction with their modes. These optional flags can be specified
|
conjunction with their modes. These optional flags can be specified
|
||||||
when tmpfs is mounted by appending them to the mode before the NodeList.
|
when tmpfs is mounted by appending them to the mode before the NodeList.
|
||||||
See Documentation/vm/numa_memory_policy.txt for a list of all available
|
See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
|
||||||
memory allocation policy mode flags and their effect on memory policy.
|
all available memory allocation policy mode flags and their effect on
|
||||||
|
memory policy.
|
||||||
|
|
||||||
=static is equivalent to MPOL_F_STATIC_NODES
|
=static is equivalent to MPOL_F_STATIC_NODES
|
||||||
=relative is equivalent to MPOL_F_RELATIVE_NODES
|
=relative is equivalent to MPOL_F_RELATIVE_NODES
|
||||||
|
|||||||
@@ -89,6 +89,7 @@ needed).
|
|||||||
sound/index
|
sound/index
|
||||||
crypto/index
|
crypto/index
|
||||||
filesystems/index
|
filesystems/index
|
||||||
|
vm/index
|
||||||
|
|
||||||
Architecture-specific documentation
|
Architecture-specific documentation
|
||||||
-----------------------------------
|
-----------------------------------
|
||||||
|
|||||||
@@ -73,7 +73,9 @@ will have a second iteration or at least an extension for any given interface.
|
|||||||
future extensions is going right down the gutters since someone will submit
|
future extensions is going right down the gutters since someone will submit
|
||||||
an ioctl struct with random stack garbage in the yet unused parts. Which
|
an ioctl struct with random stack garbage in the yet unused parts. Which
|
||||||
then bakes in the ABI that those fields can never be used for anything else
|
then bakes in the ABI that those fields can never be used for anything else
|
||||||
but garbage.
|
but garbage. This is also the reason why you must explicitly pad all
|
||||||
|
structures, even if you never use them in an array - the padding the compiler
|
||||||
|
might insert could contain garbage.
|
||||||
|
|
||||||
* Have simple testcases for all of the above.
|
* Have simple testcases for all of the above.
|
||||||
|
|
||||||
|
|||||||
@@ -2903,7 +2903,7 @@ is discarded from the CPU's cache and reloaded. To deal with this, the
|
|||||||
appropriate part of the kernel must invalidate the overlapping bits of the
|
appropriate part of the kernel must invalidate the overlapping bits of the
|
||||||
cache on each CPU.
|
cache on each CPU.
|
||||||
|
|
||||||
See Documentation/cachetlb.txt for more information on cache management.
|
See Documentation/core-api/cachetlb.rst for more information on cache management.
|
||||||
|
|
||||||
|
|
||||||
CACHE COHERENCY VS MMIO
|
CACHE COHERENCY VS MMIO
|
||||||
@@ -3083,7 +3083,7 @@ CIRCULAR BUFFERS
|
|||||||
Memory barriers can be used to implement circular buffering without the need
|
Memory barriers can be used to implement circular buffering without the need
|
||||||
of a lock to serialise the producer with the consumer. See:
|
of a lock to serialise the producer with the consumer. See:
|
||||||
|
|
||||||
Documentation/circular-buffers.txt
|
Documentation/core-api/circular-buffers.rst
|
||||||
|
|
||||||
for details.
|
for details.
|
||||||
|
|
||||||
|
|||||||
@@ -18,17 +18,17 @@ major kernel release happening every two or three months. The recent
|
|||||||
release history looks like this:
|
release history looks like this:
|
||||||
|
|
||||||
====== =================
|
====== =================
|
||||||
2.6.38 March 14, 2011
|
4.11 April 30, 2017
|
||||||
2.6.37 January 4, 2011
|
4.12 July 2, 2017
|
||||||
2.6.36 October 20, 2010
|
4.13 September 3, 2017
|
||||||
2.6.35 August 1, 2010
|
4.14 November 12, 2017
|
||||||
2.6.34 May 15, 2010
|
4.15 January 28, 2018
|
||||||
2.6.33 February 24, 2010
|
4.16 April 1, 2018
|
||||||
====== =================
|
====== =================
|
||||||
|
|
||||||
Every 2.6.x release is a major kernel release with new features, internal
|
Every 4.x release is a major kernel release with new features, internal
|
||||||
API changes, and more. A typical 2.6 release can contain nearly 10,000
|
API changes, and more. A typical 4.x release contains about 13,000
|
||||||
changesets with changes to several hundred thousand lines of code. 2.6 is
|
changesets with changes to several hundred thousand lines of code. 4.x is
|
||||||
thus the leading edge of Linux kernel development; the kernel uses a
|
thus the leading edge of Linux kernel development; the kernel uses a
|
||||||
rolling development model which is continually integrating major changes.
|
rolling development model which is continually integrating major changes.
|
||||||
|
|
||||||
@@ -70,20 +70,19 @@ will get up to somewhere between -rc6 and -rc9 before the kernel is
|
|||||||
considered to be sufficiently stable and the final 2.6.x release is made.
|
considered to be sufficiently stable and the final 2.6.x release is made.
|
||||||
At that point the whole process starts over again.
|
At that point the whole process starts over again.
|
||||||
|
|
||||||
As an example, here is how the 2.6.38 development cycle went (all dates in
|
As an example, here is how the 4.16 development cycle went (all dates in
|
||||||
2011):
|
2018):
|
||||||
|
|
||||||
============== ===============================
|
============== ===============================
|
||||||
January 4 2.6.37 stable release
|
January 28 4.15 stable release
|
||||||
January 18 2.6.38-rc1, merge window closes
|
February 11 4.16-rc1, merge window closes
|
||||||
January 21 2.6.38-rc2
|
February 18 4.16-rc2
|
||||||
February 1 2.6.38-rc3
|
February 25 4.16-rc3
|
||||||
February 7 2.6.38-rc4
|
March 4 4.16-rc4
|
||||||
February 15 2.6.38-rc5
|
March 11 4.16-rc5
|
||||||
February 21 2.6.38-rc6
|
March 18 4.16-rc6
|
||||||
March 1 2.6.38-rc7
|
March 25 4.16-rc7
|
||||||
March 7 2.6.38-rc8
|
April 1 4.16 stable release
|
||||||
March 14 2.6.38 stable release
|
|
||||||
============== ===============================
|
============== ===============================
|
||||||
|
|
||||||
How do the developers decide when to close the development cycle and create
|
How do the developers decide when to close the development cycle and create
|
||||||
@@ -99,37 +98,42 @@ release is made. In the real world, this kind of perfection is hard to
|
|||||||
achieve; there are just too many variables in a project of this size.
|
achieve; there are just too many variables in a project of this size.
|
||||||
There comes a point where delaying the final release just makes the problem
|
There comes a point where delaying the final release just makes the problem
|
||||||
worse; the pile of changes waiting for the next merge window will grow
|
worse; the pile of changes waiting for the next merge window will grow
|
||||||
larger, creating even more regressions the next time around. So most 2.6.x
|
larger, creating even more regressions the next time around. So most 4.x
|
||||||
kernels go out with a handful of known regressions though, hopefully, none
|
kernels go out with a handful of known regressions though, hopefully, none
|
||||||
of them are serious.
|
of them are serious.
|
||||||
|
|
||||||
Once a stable release is made, its ongoing maintenance is passed off to the
|
Once a stable release is made, its ongoing maintenance is passed off to the
|
||||||
"stable team," currently consisting of Greg Kroah-Hartman. The stable team
|
"stable team," currently consisting of Greg Kroah-Hartman. The stable team
|
||||||
will release occasional updates to the stable release using the 2.6.x.y
|
will release occasional updates to the stable release using the 4.x.y
|
||||||
numbering scheme. To be considered for an update release, a patch must (1)
|
numbering scheme. To be considered for an update release, a patch must (1)
|
||||||
fix a significant bug, and (2) already be merged into the mainline for the
|
fix a significant bug, and (2) already be merged into the mainline for the
|
||||||
next development kernel. Kernels will typically receive stable updates for
|
next development kernel. Kernels will typically receive stable updates for
|
||||||
a little more than one development cycle past their initial release. So,
|
a little more than one development cycle past their initial release. So,
|
||||||
for example, the 2.6.36 kernel's history looked like:
|
for example, the 4.13 kernel's history looked like:
|
||||||
|
|
||||||
============== ===============================
|
============== ===============================
|
||||||
October 10 2.6.36 stable release
|
September 3 4.13 stable release
|
||||||
November 22 2.6.36.1
|
September 13 4.13.1
|
||||||
December 9 2.6.36.2
|
September 20 4.13.2
|
||||||
January 7 2.6.36.3
|
September 27 4.13.3
|
||||||
February 17 2.6.36.4
|
October 5 4.13.4
|
||||||
|
October 12 4.13.5
|
||||||
|
... ...
|
||||||
|
November 24 4.13.16
|
||||||
============== ===============================
|
============== ===============================
|
||||||
|
|
||||||
2.6.36.4 was the final stable update for the 2.6.36 release.
|
4.13.16 was the final stable update of the 4.13 release.
|
||||||
|
|
||||||
Some kernels are designated "long term" kernels; they will receive support
|
Some kernels are designated "long term" kernels; they will receive support
|
||||||
for a longer period. As of this writing, the current long term kernels
|
for a longer period. As of this writing, the current long term kernels
|
||||||
and their maintainers are:
|
and their maintainers are:
|
||||||
|
|
||||||
====== ====================== ===========================
|
====== ====================== ==============================
|
||||||
2.6.27 Willy Tarreau (Deep-frozen stable kernel)
|
3.16 Ben Hutchings (very long-term stable kernel)
|
||||||
2.6.32 Greg Kroah-Hartman
|
4.1 Sasha Levin
|
||||||
2.6.35 Andi Kleen (Embedded flag kernel)
|
4.4 Greg Kroah-Hartman (very long-term stable kernel)
|
||||||
|
4.9 Greg Kroah-Hartman
|
||||||
|
4.14 Greg Kroah-Hartman
|
||||||
====== ====================== ===========================
|
====== ====================== ===========================
|
||||||
|
|
||||||
The selection of a kernel for long-term support is purely a matter of a
|
The selection of a kernel for long-term support is purely a matter of a
|
||||||
|
|||||||
@@ -10,8 +10,8 @@ of conventions and procedures which are used in the posting of patches;
|
|||||||
following them will make life much easier for everybody involved. This
|
following them will make life much easier for everybody involved. This
|
||||||
document will attempt to cover these expectations in reasonable detail;
|
document will attempt to cover these expectations in reasonable detail;
|
||||||
more information can also be found in the files process/submitting-patches.rst,
|
more information can also be found in the files process/submitting-patches.rst,
|
||||||
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel documentation
|
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel
|
||||||
directory.
|
documentation directory.
|
||||||
|
|
||||||
|
|
||||||
When to post
|
When to post
|
||||||
@@ -198,8 +198,8 @@ pass it to diff with the "-X" option.
|
|||||||
|
|
||||||
The tags mentioned above are used to describe how various developers have
|
The tags mentioned above are used to describe how various developers have
|
||||||
been associated with the development of this patch. They are described in
|
been associated with the development of this patch. They are described in
|
||||||
detail in the process/submitting-patches.rst document; what follows here is a brief
|
detail in the process/submitting-patches.rst document; what follows here is a
|
||||||
summary. Each of these lines has the format:
|
brief summary. Each of these lines has the format:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
@@ -210,8 +210,8 @@ The tags in common use are:
|
|||||||
- Signed-off-by: this is a developer's certification that he or she has
|
- Signed-off-by: this is a developer's certification that he or she has
|
||||||
the right to submit the patch for inclusion into the kernel. It is an
|
the right to submit the patch for inclusion into the kernel. It is an
|
||||||
agreement to the Developer's Certificate of Origin, the full text of
|
agreement to the Developer's Certificate of Origin, the full text of
|
||||||
which can be found in Documentation/process/submitting-patches.rst. Code without a
|
which can be found in Documentation/process/submitting-patches.rst. Code
|
||||||
proper signoff cannot be merged into the mainline.
|
without a proper signoff cannot be merged into the mainline.
|
||||||
|
|
||||||
- Co-developed-by: states that the patch was also created by another developer
|
- Co-developed-by: states that the patch was also created by another developer
|
||||||
along with the original author. This is useful at times when multiple
|
along with the original author. This is useful at times when multiple
|
||||||
@@ -226,8 +226,8 @@ The tags in common use are:
|
|||||||
it to work.
|
it to work.
|
||||||
|
|
||||||
- Reviewed-by: the named developer has reviewed the patch for correctness;
|
- Reviewed-by: the named developer has reviewed the patch for correctness;
|
||||||
see the reviewer's statement in Documentation/process/submitting-patches.rst for more
|
see the reviewer's statement in Documentation/process/submitting-patches.rst
|
||||||
detail.
|
for more detail.
|
||||||
|
|
||||||
- Reported-by: names a user who reported a problem which is fixed by this
|
- Reported-by: names a user who reported a problem which is fixed by this
|
||||||
patch; this tag is used to give credit to the (often underappreciated)
|
patch; this tag is used to give credit to the (often underappreciated)
|
||||||
|
|||||||
@@ -52,6 +52,7 @@ lack of a better place.
|
|||||||
adding-syscalls
|
adding-syscalls
|
||||||
magic-number
|
magic-number
|
||||||
volatile-considered-harmful
|
volatile-considered-harmful
|
||||||
|
clang-format
|
||||||
|
|
||||||
.. only:: subproject and html
|
.. only:: subproject and html
|
||||||
|
|
||||||
|
|||||||
@@ -219,7 +219,7 @@ Our goal is to protect your master key by moving it to offline media, so
|
|||||||
if you only have a combined **[SC]** key, then you should create a separate
|
if you only have a combined **[SC]** key, then you should create a separate
|
||||||
signing subkey::
|
signing subkey::
|
||||||
|
|
||||||
$ gpg --quick-add-key [fpr] ed25519 sign
|
$ gpg --quick-addkey [fpr] ed25519 sign
|
||||||
|
|
||||||
Remember to tell the keyservers about this change, so others can pull down
|
Remember to tell the keyservers about this change, so others can pull down
|
||||||
your new subkey::
|
your new subkey::
|
||||||
@@ -450,11 +450,18 @@ functionality. There are several options available:
|
|||||||
others. If you want to use ECC keys, your best bet among commercially
|
others. If you want to use ECC keys, your best bet among commercially
|
||||||
available devices is the Nitrokey Start.
|
available devices is the Nitrokey Start.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
If you are listed in MAINTAINERS or have an account at kernel.org,
|
||||||
|
you `qualify for a free Nitrokey Start`_ courtesy of The Linux
|
||||||
|
Foundation.
|
||||||
|
|
||||||
.. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6
|
.. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6
|
||||||
.. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3
|
.. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3
|
||||||
.. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/
|
.. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/
|
||||||
.. _Gnuk: http://www.fsij.org/doc-gnuk/
|
.. _Gnuk: http://www.fsij.org/doc-gnuk/
|
||||||
.. _`LWN has a good review`: https://lwn.net/Articles/736231/
|
.. _`LWN has a good review`: https://lwn.net/Articles/736231/
|
||||||
|
.. _`qualify for a free Nitrokey Start`: https://www.kernel.org/nitrokey-digital-tokens-for-kernel-developers.html
|
||||||
|
|
||||||
Configure your smartcard device
|
Configure your smartcard device
|
||||||
-------------------------------
|
-------------------------------
|
||||||
@@ -494,6 +501,12 @@ additionally leak information about your smartcard should you lose it.
|
|||||||
Despite having the name "PIN", neither the user PIN nor the admin
|
Despite having the name "PIN", neither the user PIN nor the admin
|
||||||
PIN on the card need to be numbers.
|
PIN on the card need to be numbers.
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
|
||||||
|
Some devices may require that you move the subkeys onto the device
|
||||||
|
before you can change the passphrase. Please check the documentation
|
||||||
|
provided by the device manufacturer.
|
||||||
|
|
||||||
Move the subkeys to your smartcard
|
Move the subkeys to your smartcard
|
||||||
----------------------------------
|
----------------------------------
|
||||||
|
|
||||||
@@ -655,6 +668,20 @@ want to import these changes back into your regular working directory::
|
|||||||
$ gpg --export | gpg --homedir ~/.gnupg --import
|
$ gpg --export | gpg --homedir ~/.gnupg --import
|
||||||
$ unset GNUPGHOME
|
$ unset GNUPGHOME
|
||||||
|
|
||||||
|
Using gpg-agent over ssh
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
You can forward your gpg-agent over ssh if you need to sign tags or
|
||||||
|
commits on a remote system. Please refer to the instructions provided
|
||||||
|
on the GnuPG wiki:
|
||||||
|
|
||||||
|
- `Agent Forwarding over SSH`_
|
||||||
|
|
||||||
|
It works more smoothly if you can modify the sshd server settings on the
|
||||||
|
remote end.
|
||||||
|
|
||||||
|
.. _`Agent Forwarding over SSH`: https://wiki.gnupg.org/AgentForwarding
|
||||||
|
|
||||||
|
|
||||||
Using PGP with Git
|
Using PGP with Git
|
||||||
==================
|
==================
|
||||||
@@ -692,6 +719,7 @@ should be used (``[fpr]`` is the fingerprint of your key)::
|
|||||||
tell git to always use it instead of the legacy ``gpg`` from version 1::
|
tell git to always use it instead of the legacy ``gpg`` from version 1::
|
||||||
|
|
||||||
$ git config --global gpg.program gpg2
|
$ git config --global gpg.program gpg2
|
||||||
|
$ git config --global gpgv.program gpgv2
|
||||||
|
|
||||||
How to work with signed tags
|
How to work with signed tags
|
||||||
----------------------------
|
----------------------------
|
||||||
@@ -731,6 +759,13 @@ If you are verifying someone else's git tag, then you will need to
|
|||||||
import their PGP key. Please refer to the
|
import their PGP key. Please refer to the
|
||||||
":ref:`verify_identities`" section below.
|
":ref:`verify_identities`" section below.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
If you get a "``gpg: Can't check signature: unknown pubkey
|
||||||
|
algorithm``" error, you need to tell git to use gpgv2 for
|
||||||
|
verification, so it properly processes signatures made by ECC keys.
|
||||||
|
See instructions at the start of this section.
|
||||||
|
|
||||||
Configure git to always sign annotated tags
|
Configure git to always sign annotated tags
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
|||||||
@@ -761,7 +761,7 @@ requests, especially from new, unknown developers. If in doubt you can use
|
|||||||
the pull request as the cover letter for a normal posting of the patch
|
the pull request as the cover letter for a normal posting of the patch
|
||||||
series, giving the maintainer the option of using either.
|
series, giving the maintainer the option of using either.
|
||||||
|
|
||||||
A pull request should have [GIT] or [PULL] in the subject line. The
|
A pull request should have [GIT PULL] in the subject line. The
|
||||||
request itself should include the repository name and the branch of
|
request itself should include the repository name and the branch of
|
||||||
interest on a single line; it should look something like::
|
interest on a single line; it should look something like::
|
||||||
|
|
||||||
|
|||||||
@@ -9,5 +9,7 @@ Security Documentation
|
|||||||
IMA-templates
|
IMA-templates
|
||||||
keys/index
|
keys/index
|
||||||
LSM
|
LSM
|
||||||
|
LSM-sctp
|
||||||
|
SELinux-sctp
|
||||||
self-protection
|
self-protection
|
||||||
tpm/index
|
tpm/index
|
||||||
|
|||||||
@@ -1062,7 +1062,7 @@ output (with ``--no-upload`` option) to kernel bugzilla or alsa-devel
|
|||||||
ML (see the section `Links and Addresses`_).
|
ML (see the section `Links and Addresses`_).
|
||||||
|
|
||||||
``power_save`` and ``power_save_controller`` options are for power-saving
|
``power_save`` and ``power_save_controller`` options are for power-saving
|
||||||
mode. See powersave.txt for details.
|
mode. See powersave.rst for details.
|
||||||
|
|
||||||
Note 2: If you get click noises on output, try the module option
|
Note 2: If you get click noises on output, try the module option
|
||||||
``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB
|
``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB
|
||||||
@@ -1133,7 +1133,7 @@ line_outs_monitor
|
|||||||
enable_monitor
|
enable_monitor
|
||||||
Enable Analog Out on Channel 63/64 by default.
|
Enable Analog Out on Channel 63/64 by default.
|
||||||
|
|
||||||
See hdspm.txt for details.
|
See hdspm.rst for details.
|
||||||
|
|
||||||
Module snd-ice1712
|
Module snd-ice1712
|
||||||
------------------
|
------------------
|
||||||
|
|||||||
@@ -139,7 +139,7 @@ DAPM description
|
|||||||
----------------
|
----------------
|
||||||
The Dynamic Audio Power Management description describes the codec power
|
The Dynamic Audio Power Management description describes the codec power
|
||||||
components and their relationships and registers to the ASoC core.
|
components and their relationships and registers to the ASoC core.
|
||||||
Please read dapm.txt for details of building the description.
|
Please read dapm.rst for details of building the description.
|
||||||
|
|
||||||
Please also see the examples in other codec drivers.
|
Please also see the examples in other codec drivers.
|
||||||
|
|
||||||
|
|||||||
@@ -66,7 +66,7 @@ Each SoC DAI driver must provide the following features:-
|
|||||||
4. SYSCLK configuration
|
4. SYSCLK configuration
|
||||||
5. Suspend and resume (optional)
|
5. Suspend and resume (optional)
|
||||||
|
|
||||||
Please see codec.txt for a description of items 1 - 4.
|
Please see codec.rst for a description of items 1 - 4.
|
||||||
|
|
||||||
|
|
||||||
SoC DSP Drivers
|
SoC DSP Drivers
|
||||||
|
|||||||
@@ -515,7 +515,7 @@ nr_hugepages
|
|||||||
|
|
||||||
Change the minimum size of the hugepage pool.
|
Change the minimum size of the hugepage pool.
|
||||||
|
|
||||||
See Documentation/vm/hugetlbpage.txt
|
See Documentation/admin-guide/mm/hugetlbpage.rst
|
||||||
|
|
||||||
==============================================================
|
==============================================================
|
||||||
|
|
||||||
@@ -524,7 +524,7 @@ nr_overcommit_hugepages
|
|||||||
Change the maximum size of the hugepage pool. The maximum is
|
Change the maximum size of the hugepage pool. The maximum is
|
||||||
nr_hugepages + nr_overcommit_hugepages.
|
nr_hugepages + nr_overcommit_hugepages.
|
||||||
|
|
||||||
See Documentation/vm/hugetlbpage.txt
|
See Documentation/admin-guide/mm/hugetlbpage.rst
|
||||||
|
|
||||||
==============================================================
|
==============================================================
|
||||||
|
|
||||||
@@ -667,7 +667,7 @@ and don't use much of it.
|
|||||||
|
|
||||||
The default value is 0.
|
The default value is 0.
|
||||||
|
|
||||||
See Documentation/vm/overcommit-accounting and
|
See Documentation/vm/overcommit-accounting.rst and
|
||||||
mm/mmap.c::__vm_enough_memory() for more information.
|
mm/mmap.c::__vm_enough_memory() for more information.
|
||||||
|
|
||||||
==============================================================
|
==============================================================
|
||||||
|
|||||||
@@ -187,13 +187,19 @@ that can be performed on them (see "struct coresight_ops"). The
|
|||||||
specific to that component only. "Implementation defined" customisations are
|
specific to that component only. "Implementation defined" customisations are
|
||||||
expected to be accessed and controlled using those entries.
|
expected to be accessed and controlled using those entries.
|
||||||
|
|
||||||
Last but not least, "struct module *owner" is expected to be set to reflect
|
|
||||||
the information carried in "THIS_MODULE".
|
|
||||||
|
|
||||||
How to use the tracer modules
|
How to use the tracer modules
|
||||||
-----------------------------
|
-----------------------------
|
||||||
|
|
||||||
Before trace collection can start, a coresight sink needs to be identify.
|
There are two ways to use the Coresight framework: 1) using the perf cmd line
|
||||||
|
tools and 2) interacting directly with the Coresight devices using the sysFS
|
||||||
|
interface. Preference is given to the former as using the sysFS interface
|
||||||
|
requires a deep understanding of the Coresight HW. The following sections
|
||||||
|
provide details on using both methods.
|
||||||
|
|
||||||
|
1) Using the sysFS interface:
|
||||||
|
|
||||||
|
Before trace collection can start, a coresight sink needs to be identified.
|
||||||
There is no limit on the amount of sinks (nor sources) that can be enabled at
|
There is no limit on the amount of sinks (nor sources) that can be enabled at
|
||||||
any given moment. As a generic operation, all device pertaining to the sink
|
any given moment. As a generic operation, all device pertaining to the sink
|
||||||
class will have an "active" entry in sysfs:
|
class will have an "active" entry in sysfs:
|
||||||
@@ -298,42 +304,48 @@ Instruction 13570831 0x8026B584 E28DD00C false ADD
|
|||||||
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
|
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
|
||||||
Timestamp Timestamp: 17107041535
|
Timestamp Timestamp: 17107041535
|
||||||
|
|
||||||
How to use the STM module
|
2) Using perf framework:
|
||||||
-------------------------
|
|
||||||
|
|
||||||
Using the System Trace Macrocell module is the same as the tracers - the only
|
Coresight tracers are represented using the Perf framework's Performance
|
||||||
difference is that clients are driving the trace capture rather
|
Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of
|
||||||
than the program flow through the code.
|
controlling when tracing gets enabled based on when the process of interest is
|
||||||
|
scheduled. When configured in a system, Coresight PMUs will be listed when
|
||||||
|
queried by the perf command line tool:
|
||||||
|
|
||||||
As with any other CoreSight component, specifics about the STM tracer can be
|
linaro@linaro-nano:~$ ./perf list pmu
|
||||||
found in sysfs with more information on each entry being found in [1]:
|
|
||||||
|
|
||||||
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
|
List of pre-defined events (to be used in -e):
|
||||||
enable_source hwevent_select port_enable subsystem uevent
|
|
||||||
hwevent_enable mgmt port_select traceid
|
|
||||||
root@genericarmv8:~#
|
|
||||||
|
|
||||||
Like any other source a sink needs to be identified and the STM enabled before
|
cs_etm// [Kernel PMU event]
|
||||||
being used:
|
|
||||||
|
|
||||||
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
|
linaro@linaro-nano:~$
|
||||||
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
|
|
||||||
|
|
||||||
From there user space applications can request and use channels using the devfs
|
Regardless of the number of tracers available in a system (usually equal to the
|
||||||
interface provided for that purpose by the generic STM API:
|
amount of processor cores), the "cs_etm" PMU will be listed only once.
|
||||||
|
|
||||||
root@genericarmv8:~# ls -l /dev/20100000.stm
|
A Coresight PMU works the same way as any other PMU, i.e the name of the PMU is
|
||||||
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
|
listed along with configuration options within forward slashes '/'. Since a
|
||||||
root@genericarmv8:~#
|
Coresight system will typically have more than one sink, the name of the sink to
|
||||||
|
work with needs to be specified as an event option. Names for sink to choose
|
||||||
|
from are listed in sysFS under ($SYSFS)/bus/coresight/devices:
|
||||||
|
|
||||||
Details on how to use the generic STM API can be found here [2].
|
root@linaro-nano:~# ls /sys/bus/coresight/devices/
|
||||||
|
20010000.etf 20040000.funnel 20100000.stm 22040000.etm
|
||||||
|
22140000.etm 230c0000.funnel 23240000.etm 20030000.tpiu
|
||||||
|
20070000.etr 20120000.replicator 220c0000.funnel
|
||||||
|
23040000.etm 23140000.etm 23340000.etm
|
||||||
|
|
||||||
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
|
root@linaro-nano:~# perf record -e cs_etm/@20070000.etr/u --per-thread program
|
||||||
[2]. Documentation/trace/stm.txt
|
|
||||||
|
|
||||||
|
The syntax within the forward slashes '/' is important. The '@' character
|
||||||
|
tells the parser that a sink is about to be specified and that this is the sink
|
||||||
|
to use for the trace session.
|
||||||
|
|
||||||
Using perf tools
|
More information on the above and other example on how to use Coresight with
|
||||||
----------------
|
the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub
|
||||||
|
repository [3].
|
||||||
|
|
||||||
|
2.1) AutoFDO analysis using the perf tools:
|
||||||
|
|
||||||
perf can be used to record and analyze trace of programs.
|
perf can be used to record and analyze trace of programs.
|
||||||
|
|
||||||
@@ -381,3 +393,38 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
|
|||||||
$ taskset -c 2 ./sort_autofdo
|
$ taskset -c 2 ./sort_autofdo
|
||||||
Bubble sorting array of 30000 elements
|
Bubble sorting array of 30000 elements
|
||||||
5806 ms
|
5806 ms
|
||||||
|
|
||||||
|
|
||||||
|
How to use the STM module
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
Using the System Trace Macrocell module is the same as the tracers - the only
|
||||||
|
difference is that clients are driving the trace capture rather
|
||||||
|
than the program flow through the code.
|
||||||
|
|
||||||
|
As with any other CoreSight component, specifics about the STM tracer can be
|
||||||
|
found in sysfs with more information on each entry being found in [1]:
|
||||||
|
|
||||||
|
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
|
||||||
|
enable_source hwevent_select port_enable subsystem uevent
|
||||||
|
hwevent_enable mgmt port_select traceid
|
||||||
|
root@genericarmv8:~#
|
||||||
|
|
||||||
|
Like any other source a sink needs to be identified and the STM enabled before
|
||||||
|
being used:
|
||||||
|
|
||||||
|
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
|
||||||
|
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
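
The two echo commands above can also be issued from a program. The following is a minimal sketch in C, not part of the original document: the helper names write_attr() and enable_sink_then_source() are hypothetical, and the device directories are passed in by the caller because they differ from board to board. It only assumes that writing "1" to the enable_sink and enable_source entries has the same effect as the shell redirection shown above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a short string to an attribute file,
 * the way "echo 1 > file" would. Returns 0 on success, -1 on failure.
 */
int write_attr(const char *path, const char *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = (fputs(value, f) >= 0);
    /* Buffered data is flushed on close, so a store error may only show up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/*
 * Hypothetical helper: enable the sink first, then the source, mirroring
 * the order of the shell commands. Both arguments are device directories
 * such as the ones listed under /sys/bus/coresight/devices/.
 */
int enable_sink_then_source(const char *sink_dir, const char *source_dir)
{
    char path[256];
    int n;

    n = snprintf(path, sizeof(path), "%s/enable_sink", sink_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    if (write_attr(path, "1") != 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/enable_source", source_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    return write_attr(path, "1");
}
```

A caller would pass the same two directories used in the shell example, the 20010000.etf sink and the 20100000.stm source, and treat a return value of -1 as "device missing or not writable".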
|
||||||
|
|
||||||
|
From there user space applications can request and use channels using the devfs
|
||||||
|
interface provided for that purpose by the generic STM API:
|
||||||
|
|
||||||
|
root@genericarmv8:~# ls -l /dev/20100000.stm
|
||||||
|
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
|
||||||
|
root@genericarmv8:~#
|
||||||
|
|
||||||
|
Details on how to use the generic STM API can be found here [2].
|
||||||
|
|
||||||
|
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
|
||||||
|
[2]. Documentation/trace/stm.txt
|
||||||
|
[3]. https://github.com/Linaro/perf-opencsd
|
||||||
|
|||||||
@@ -12,7 +12,7 @@ Written for: 4.14
|
|||||||
Introduction
|
Introduction
|
||||||
============
|
============
|
||||||
|
|
||||||
The ftrace infrastructure was originially created to attach callbacks to the
|
The ftrace infrastructure was originally created to attach callbacks to the
|
||||||
beginning of functions in order to record and trace the flow of the kernel.
|
beginning of functions in order to record and trace the flow of the kernel.
|
||||||
But callbacks to the start of a function can have other use cases. Either
|
But callbacks to the start of a function can have other use cases. Either
|
||||||
for live kernel patching, or for security monitoring. This document describes
|
for live kernel patching, or for security monitoring. This document describes
|
||||||
@@ -30,7 +30,7 @@ The ftrace context
|
|||||||
This requires extra care to what can be done inside a callback. A callback
|
This requires extra care to what can be done inside a callback. A callback
|
||||||
can be called outside the protective scope of RCU.
|
can be called outside the protective scope of RCU.
|
||||||
|
|
||||||
The ftrace infrastructure has some protections agains recursions and RCU
|
The ftrace infrastructure has some protections against recursions and RCU
|
||||||
but one must still be very careful how they use the callbacks.
|
but one must still be very careful how they use the callbacks.
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -224,6 +224,8 @@ of ftrace. Here is a list of some of the key files:
|
|||||||
has a side effect of enabling or disabling specific functions
|
has a side effect of enabling or disabling specific functions
|
||||||
to be traced. Echoing names of functions into this file
|
to be traced. Echoing names of functions into this file
|
||||||
will limit the trace to only those functions.
|
will limit the trace to only those functions.
|
||||||
|
This influences the tracers "function" and "function_graph"
|
||||||
|
and thus also function profiling (see "function_profile_enabled").
|
||||||
|
|
||||||
The functions listed in "available_filter_functions" are what
|
The functions listed in "available_filter_functions" are what
|
||||||
can be written into this file.
|
can be written into this file.
|
||||||
@@ -265,6 +267,8 @@ of ftrace. Here is a list of some of the key files:
|
|||||||
Functions listed in this file will cause the function graph
|
Functions listed in this file will cause the function graph
|
||||||
tracer to only trace these functions and the functions that
|
tracer to only trace these functions and the functions that
|
||||||
they call. (See the section "dynamic ftrace" for more details).
|
they call. (See the section "dynamic ftrace" for more details).
|
||||||
|
Note, set_ftrace_filter and set_ftrace_notrace still affects
|
||||||
|
what functions are being traced.
|
||||||
|
|
||||||
set_graph_notrace:
|
set_graph_notrace:
|
||||||
|
|
||||||
@@ -277,7 +281,8 @@ of ftrace. Here is a list of some of the key files:
|
|||||||
|
|
||||||
This lists the functions that ftrace has processed and can trace.
|
This lists the functions that ftrace has processed and can trace.
|
||||||
These are the function names that you can pass to
|
These are the function names that you can pass to
|
||||||
"set_ftrace_filter" or "set_ftrace_notrace".
|
"set_ftrace_filter", "set_ftrace_notrace",
|
||||||
|
"set_graph_function", or "set_graph_notrace".
|
||||||
(See the section "dynamic ftrace" below for more details.)
|
(See the section "dynamic ftrace" below for more details.)
|
||||||
|
|
||||||
dyn_ftrace_total_info:
|
dyn_ftrace_total_info:
|
||||||
|
|||||||
@@ -2846,7 +2846,7 @@ CPU 의 캐시에서 RAM 으로 쓰여지는 더티 캐시 라인에 의해 덮
|
|||||||
문제를 해결하기 위해선, 커널의 적절한 부분에서 각 CPU 의 캐시 안의 문제가 되는
|
문제를 해결하기 위해선, 커널의 적절한 부분에서 각 CPU 의 캐시 안의 문제가 되는
|
||||||
비트들을 무효화 시켜야 합니다.
|
비트들을 무효화 시켜야 합니다.
|
||||||
|
|
||||||
캐시 관리에 대한 더 많은 정보를 위해선 Documentation/cachetlb.txt 를
|
캐시 관리에 대한 더 많은 정보를 위해선 Documentation/core-api/cachetlb.rst 를
|
||||||
참고하세요.
|
참고하세요.
|
||||||
|
|
||||||
|
|
||||||
@@ -3023,7 +3023,7 @@ smp_mb() 가 아니라 virt_mb() 를 사용해야 합니다.
|
|||||||
동기화에 락을 사용하지 않고 구현하는데에 사용될 수 있습니다. 더 자세한 내용을
|
동기화에 락을 사용하지 않고 구현하는데에 사용될 수 있습니다. 더 자세한 내용을
|
||||||
위해선 다음을 참고하세요:
|
위해선 다음을 참고하세요:
|
||||||
|
|
||||||
Documentation/circular-buffers.txt
|
Documentation/core-api/circular-buffers.rst
|
||||||
|
|
||||||
|
|
||||||
=========
|
=========
|
||||||
|
|||||||
@@ -252,15 +252,14 @@ into VFIO core. When devices are bound and unbound to the driver,
|
|||||||
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
|
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
|
||||||
respectively::
|
respectively::
|
||||||
|
|
||||||
extern int vfio_add_group_dev(struct iommu_group *iommu_group,
|
extern int vfio_add_group_dev(struct device *dev,
|
||||||
struct device *dev,
|
|
||||||
const struct vfio_device_ops *ops,
|
const struct vfio_device_ops *ops,
|
||||||
void *device_data);
|
void *device_data);
|
||||||
|
|
||||||
extern void *vfio_del_group_dev(struct device *dev);
|
extern void *vfio_del_group_dev(struct device *dev);
|
||||||
|
|
||||||
vfio_add_group_dev() indicates to the core to begin tracking the
|
vfio_add_group_dev() indicates to the core to begin tracking the
|
||||||
specified iommu_group and register the specified dev as owned by
|
iommu_group of the specified dev and register the dev as owned by
|
||||||
a VFIO bus driver. The driver provides an ops structure for callbacks
|
a VFIO bus driver. The driver provides an ops structure for callbacks
|
||||||
similar to a file operations structure::
|
similar to a file operations structure::
|
||||||
|
|
||||||
|
|||||||
@@ -1,62 +1,50 @@
|
|||||||
00-INDEX
|
00-INDEX
|
||||||
- this file.
|
- this file.
|
||||||
active_mm.txt
|
active_mm.rst
|
||||||
- An explanation from Linus about tsk->active_mm vs tsk->mm.
|
- An explanation from Linus about tsk->active_mm vs tsk->mm.
|
||||||
balance
|
balance.rst
|
||||||
- various information on memory balancing.
|
- various information on memory balancing.
|
||||||
cleancache.txt
|
cleancache.rst
|
||||||
- Intro to cleancache and page-granularity victim cache.
|
- Intro to cleancache and page-granularity victim cache.
|
||||||
frontswap.txt
|
frontswap.rst
|
||||||
- Outline frontswap, part of the transcendent memory frontend.
|
- Outline frontswap, part of the transcendent memory frontend.
|
||||||
highmem.txt
|
highmem.rst
|
||||||
- Outline of highmem and common issues.
|
- Outline of highmem and common issues.
|
||||||
hmm.txt
|
hmm.rst
|
||||||
- Documentation of heterogeneous memory management
|
- Documentation of heterogeneous memory management
|
||||||
hugetlbpage.txt
|
hugetlbfs_reserv.rst
|
||||||
- a brief summary of hugetlbpage support in the Linux kernel.
|
|
||||||
hugetlbfs_reserv.txt
|
|
||||||
- A brief overview of hugetlbfs reservation design/implementation.
|
- A brief overview of hugetlbfs reservation design/implementation.
|
||||||
hwpoison.txt
|
hwpoison.rst
|
||||||
- explains what hwpoison is
|
- explains what hwpoison is
|
||||||
idle_page_tracking.txt
|
ksm.rst
|
||||||
- description of the idle page tracking feature.
|
|
||||||
ksm.txt
|
|
||||||
- how to use the Kernel Samepage Merging feature.
|
- how to use the Kernel Samepage Merging feature.
|
||||||
mmu_notifier.txt
|
mmu_notifier.rst
|
||||||
- a note about clearing pte/pmd and mmu notifications
|
- a note about clearing pte/pmd and mmu notifications
|
||||||
numa
|
numa.rst
|
||||||
- information about NUMA specific code in the Linux vm.
|
- information about NUMA specific code in the Linux vm.
|
||||||
numa_memory_policy.txt
|
overcommit-accounting.rst
|
||||||
- documentation of concepts and APIs of the 2.6 memory policy support.
|
|
||||||
overcommit-accounting
|
|
||||||
- description of the Linux kernels overcommit handling modes.
|
- description of the Linux kernels overcommit handling modes.
|
||||||
page_frags
|
page_frags.rst
|
||||||
- description of page fragments allocator
|
- description of page fragments allocator
|
||||||
page_migration
|
page_migration.rst
|
||||||
- description of page migration in NUMA systems.
|
- description of page migration in NUMA systems.
|
||||||
pagemap.txt
|
page_owner.rst
|
||||||
- pagemap, from the userspace perspective
|
|
||||||
page_owner.txt
|
|
||||||
- tracking about who allocated each page
|
- tracking about who allocated each page
|
||||||
remap_file_pages.txt
|
remap_file_pages.rst
|
||||||
- a note about remap_file_pages() system call
|
- a note about remap_file_pages() system call
|
||||||
slub.txt
|
slub.rst
|
||||||
- a short users guide for SLUB.
|
- a short users guide for SLUB.
|
||||||
soft-dirty.txt
|
split_page_table_lock.rst
|
||||||
- short explanation for soft-dirty PTEs
|
|
||||||
split_page_table_lock
|
|
||||||
- Separate per-table lock to improve scalability of the old page_table_lock.
|
- Separate per-table lock to improve scalability of the old page_table_lock.
|
||||||
swap_numa.txt
|
swap_numa.rst
|
||||||
- automatic binding of swap device to numa node
|
- automatic binding of swap device to numa node
|
||||||
transhuge.txt
|
transhuge.rst
|
||||||
- Transparent Hugepage Support, alternative way of using hugepages.
|
- Transparent Hugepage Support, alternative way of using hugepages.
|
||||||
unevictable-lru.txt
|
unevictable-lru.rst
|
||||||
- Unevictable LRU infrastructure
|
- Unevictable LRU infrastructure
|
||||||
userfaultfd.txt
|
|
||||||
- description of userfaultfd system call
|
|
||||||
z3fold.txt
|
z3fold.txt
|
||||||
- outline of z3fold allocator for storing compressed pages
|
- outline of z3fold allocator for storing compressed pages
|
||||||
zsmalloc.txt
|
zsmalloc.rst
|
||||||
- outline of zsmalloc allocator for storing compressed pages
|
- outline of zsmalloc allocator for storing compressed pages
|
||||||
zswap.txt
|
zswap.rst
|
||||||
- Intro to compressed cache for swap pages
|
- Intro to compressed cache for swap pages
|
||||||
|
|||||||
91
Documentation/vm/active_mm.rst
Normal file
@@ -0,0 +1,91 @@
|
|||||||
|
.. _active_mm:
|
||||||
|
|
||||||
|
=========
|
||||||
|
Active MM
|
||||||
|
=========
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
List: linux-kernel
|
||||||
|
Subject: Re: active_mm
|
||||||
|
From: Linus Torvalds <torvalds () transmeta ! com>
|
||||||
|
Date: 1999-07-30 21:36:24
|
||||||
|
|
||||||
|
Cc'd to linux-kernel, because I don't write explanations all that often,
|
||||||
|
and when I do I feel better about more people reading them.
|
||||||
|
|
||||||
|
On Fri, 30 Jul 1999, David Mosberger wrote:
|
||||||
|
>
|
||||||
|
> Is there a brief description someplace on how "mm" vs. "active_mm" in
|
||||||
|
> the task_struct are supposed to be used? (My apologies if this was
|
||||||
|
> discussed on the mailing lists---I just returned from vacation and
|
||||||
|
> wasn't able to follow linux-kernel for a while).
|
||||||
|
|
||||||
|
Basically, the new setup is:
|
||||||
|
|
||||||
|
- we have "real address spaces" and "anonymous address spaces". The
|
||||||
|
difference is that an anonymous address space doesn't care about the
|
||||||
|
user-level page tables at all, so when we do a context switch into an
|
||||||
|
anonymous address space we just leave the previous address space
|
||||||
|
active.
|
||||||
|
|
||||||
|
The obvious use for a "anonymous address space" is any thread that
|
||||||
|
doesn't need any user mappings - all kernel threads basically fall into
|
||||||
|
this category, but even "real" threads can temporarily say that for
|
||||||
|
some amount of time they are not going to be interested in user space,
|
||||||
|
and that the scheduler might as well try to avoid wasting time on
|
||||||
|
switching the VM state around. Currently only the old-style bdflush
|
||||||
|
sync does that.
|
||||||
|
|
||||||
|
- "tsk->mm" points to the "real address space". For an anonymous process,
|
||||||
|
tsk->mm will be NULL, for the logical reason that an anonymous process
|
||||||
|
really doesn't _have_ a real address space at all.
|
||||||
|
|
||||||
|
- however, we obviously need to keep track of which address space we
|
||||||
|
"stole" for such an anonymous user. For that, we have "tsk->active_mm",
|
||||||
|
which shows what the currently active address space is.
|
||||||
|
|
||||||
|
The rule is that for a process with a real address space (ie tsk->mm is
|
||||||
|
non-NULL) the active_mm obviously always has to be the same as the real
|
||||||
|
one.
|
||||||
|
|
||||||
|
For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the
|
||||||
|
"borrowed" mm while the anonymous process is running. When the
|
||||||
|
anonymous process gets scheduled away, the borrowed address space is
|
||||||
|
returned and cleared.
|
||||||
|
|
||||||
|
To support all that, the "struct mm_struct" now has two counters: a
|
||||||
|
"mm_users" counter that is how many "real address space users" there are,
|
||||||
|
and a "mm_count" counter that is the number of "lazy" users (ie anonymous
|
||||||
|
users) plus one if there are any real users.
|
||||||
|
|
||||||
|
Usually there is at least one real user, but it could be that the real
|
||||||
|
user exited on another CPU while a lazy user was still active, so you do
|
||||||
|
actually get cases where you have a address space that is _only_ used by
|
||||||
|
lazy users. That is often a short-lived state, because once that thread
|
||||||
|
gets scheduled away in favour of a real thread, the "zombie" mm gets
|
||||||
|
released because "mm_users" becomes zero.
|
||||||
|
|
||||||
|
Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
|
||||||
|
more. "init_mm" should be considered just a "lazy context when no other
|
||||||
|
context is available", and in fact it is mainly used just at bootup when
|
||||||
|
no real VM has yet been created. So code that used to check
|
||||||
|
|
||||||
|
if (current->mm == &init_mm)
|
||||||
|
|
||||||
|
should generally just do
|
||||||
|
|
||||||
|
if (!current->mm)
|
||||||
|
|
||||||
|
instead (which makes more sense anyway - the test is basically one of "do
|
||||||
|
we have a user context", and is generally done by the page fault handler
|
||||||
|
and things like that).
|
||||||
|
|
||||||
|
Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
|
||||||
|
because it slightly changes the interfaces to accommodate the alpha (who
|
||||||
|
would have thought it, but the alpha actually ends up having one of the
|
||||||
|
ugliest context switch codes - unlike the other architectures where the MM
|
||||||
|
and register state is separate, the alpha PALcode joins the two, and you
|
||||||
|
need to switch both together).
|
||||||
|
|
||||||
|
(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
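
The two-counter rule described in the mail can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct toy_mm and the toy_mm_*() functions are hypothetical names, and the model assumes that the structure is released when "mm_count" reaches zero, with all real users together holding exactly one reference on "mm_count".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two counters described above (not the kernel struct). */
struct toy_mm {
    int mm_users;   /* number of real address space users */
    int mm_count;   /* lazy users, plus one while mm_users > 0 */
    bool released;  /* set when the last reference goes away */
};

/* A new address space starts with one real user, which also pins mm_count. */
void toy_mm_init(struct toy_mm *mm)
{
    mm->mm_users = 1;
    mm->mm_count = 1;
    mm->released = false;
}

/* A lazy (anonymous) user borrows the mm: only mm_count changes. */
void toy_mm_grab(struct toy_mm *mm)
{
    assert(mm->mm_count > 0);
    mm->mm_count++;
}

/* A lazy user returns the borrowed mm; the last reference releases it. */
void toy_mm_drop(struct toy_mm *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0)
        mm->released = true;
}

/* Another real user starts sharing the address space. */
void toy_mm_get(struct toy_mm *mm)
{
    assert(mm->mm_users > 0);
    mm->mm_users++;
}

/* A real user exits; the last one gives up the shared mm_count reference. */
void toy_mm_put(struct toy_mm *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0)
        toy_mm_drop(mm);
}
```

In this model, the case from the mail where the real user exits while a lazy user is still running looks like this: after toy_mm_init() and toy_mm_grab(), a call to toy_mm_put() leaves mm_users at zero and mm_count at one, so the structure survives until the lazy user calls toy_mm_drop().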
|
||||||
Some files were not shown because too many files have changed in this diff.