Linux 4.18-rc2
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlsvlI0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGueQH/jy7xefQ2i2Tpj6M
0uftRK25X+RqbuL3RYBU//IPHxGFcwEwiA8/h14pKQJJ90+OUzj/n9yJnYy24sLJ
Lu78t1rY0FUSGwCzB6nXM0lUA72EVE0ugmMLIborvcZhhzYnJq0oYCGiZpSfklvV
QlWUZnFoSJB+JbQ2E50J0Uj0AJsaUFh4Wd0TEZ3+nsm5DNJ3JDRskr9gObkwBDdN
ViVOeXtbZaiqXl3Ncj5vTBeqgYnGjztNdpluWqNhvhyc6HG/QCHq7mn4K7qPvoBE
BSiw8oQel8iIUauT+0CdNTxiCQX9Z5TfxAs6sAA91bIAcZyACsjW5KqDd9t3EDDi
EO+Ph9M=
=ajiq
-----END PGP SIGNATURE-----

Merge tag 'v4.18-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into fbdev-for-next

Linux 4.18-rc2
.mailmap (3 changed lines)
@ -186,6 +186,9 @@ Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>
Uwe Kleine-König <ukl@pengutronix.de>
Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Vinod Koul <vkoul@kernel.org> <vinod.koul@intel.com>
Vinod Koul <vkoul@kernel.org> <vinod.koul@linux.intel.com>
Vinod Koul <vkoul@kernel.org> <vkoul@infradead.org>
Viresh Kumar <vireshk@kernel.org> <viresh.kumar@st.com>
Viresh Kumar <vireshk@kernel.org> <viresh.linux@gmail.com>
Viresh Kumar <vireshk@kernel.org> <viresh.kumar2@arm.com>
@ -64,8 +64,6 @@ auxdisplay/
- misc. LCD driver documentation (cfag12864b, ks0108).
backlight/
- directory with info on controlling backlights in flat panel displays
bcache.txt
- Block-layer cache on fast SSDs to improve slow (raid) I/O performance.
block/
- info on the Block I/O (BIO) layer.
blockdev/
@ -78,18 +76,10 @@ bus-devices/
- directory with info on TI GPMC (General Purpose Memory Controller)
bus-virt-phys-mapping.txt
- how to access I/O mapped memory from within device drivers.
cachetlb.txt
- describes the cache/TLB flushing interfaces Linux uses.
cdrom/
- directory with information on the CD-ROM drivers that Linux has.
cgroup-v1/
- cgroups v1 features, including cpusets and memory controller.
cgroup-v2.txt
- cgroups v2 features, including cpusets and memory controller.
circular-buffers.txt
- how to make use of the existing circular buffer infrastructure
clk.txt
- info on the common clock framework
cma/
- Continuous Memory Area (CMA) debugfs interface.
conf.py
@ -11,7 +11,7 @@ Description:
Kernel code may export it for complete or partial access.

GPIOs are identified as they are inside the kernel, using integers in
the range 0..INT_MAX. See Documentation/gpio/gpio.txt for more information.
the range 0..INT_MAX. See Documentation/gpio for more information.

/sys/class/gpio
/export ... asks the kernel to export a GPIO to userspace
Documentation/ABI/removed/sysfs-bus-nfit (new file, 17 lines)
@ -0,0 +1,17 @@
What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
Date: Aug, 2017
KernelVersion: v4.14 (Removed v4.18)
Contact: linux-nvdimm@lists.01.org
Description:
(RO) Size of a write request to a DIMM that will not incur a
read-modify-write cycle at the memory controller.

When the nfit driver initializes it runs an ARS (Address Range
Scrub) operation across every pmem range. Part of that process
involves determining the ARS capabilities of a given address
range. One of the capabilities that is reported is the 'Clear
Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
This property indicates the boundary at which the NVDIMM may
need to perform read-modify-write cycles to maintain ECC (Error
Correcting Code) blocks.
@ -1,25 +1,25 @@
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/id
|
||||
What: /sys/bus/vmbus/devices/<UUID>/id
|
||||
Date: Jul 2009
|
||||
KernelVersion: 2.6.31
|
||||
Contact: K. Y. Srinivasan <kys@microsoft.com>
|
||||
Description: The VMBus child_relid of the device's primary channel
|
||||
Users: tools/hv/lsvmbus
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/class_id
|
||||
What: /sys/bus/vmbus/devices/<UUID>/class_id
|
||||
Date: Jul 2009
|
||||
KernelVersion: 2.6.31
|
||||
Contact: K. Y. Srinivasan <kys@microsoft.com>
|
||||
Description: The VMBus interface type GUID of the device
|
||||
Users: tools/hv/lsvmbus
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/device_id
|
||||
What: /sys/bus/vmbus/devices/<UUID>/device_id
|
||||
Date: Jul 2009
|
||||
KernelVersion: 2.6.31
|
||||
Contact: K. Y. Srinivasan <kys@microsoft.com>
|
||||
Description: The VMBus interface instance GUID of the device
|
||||
Users: tools/hv/lsvmbus
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channel_vp_mapping
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channel_vp_mapping
|
||||
Date: Jul 2015
|
||||
KernelVersion: 4.2.0
|
||||
Contact: K. Y. Srinivasan <kys@microsoft.com>
|
||||
@ -28,112 +28,112 @@ Description: The mapping of which primary/sub channels are bound to which
|
||||
Format: <channel's child_relid:the bound cpu's number>
|
||||
Users: tools/hv/lsvmbus
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/device
|
||||
What: /sys/bus/vmbus/devices/<UUID>/device
|
||||
Date: Dec. 2015
|
||||
KernelVersion: 4.5
|
||||
Contact: K. Y. Srinivasan <kys@microsoft.com>
|
||||
Description: The 16 bit device ID of the device
|
||||
Users: tools/hv/lsvmbus and user level RDMA libraries
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/vendor
|
||||
What: /sys/bus/vmbus/devices/<UUID>/vendor
|
||||
Date: Dec. 2015
|
||||
KernelVersion: 4.5
|
||||
Contact: K. Y. Srinivasan <kys@microsoft.com>
|
||||
Description: The 16 bit vendor ID of the device
|
||||
Users: tools/hv/lsvmbus and user level RDMA libraries
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Directory for per-channel information
|
||||
NN is the VMBUS relid associated with the channel.
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/cpu
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/cpu
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: VCPU (sub)channel is affinitized to
|
||||
Users: tools/hv/lsvmbus and other debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/cpu
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/cpu
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: VCPU (sub)channel is affinitized to
|
||||
Users: tools/hv/lsvmbus and other debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/in_mask
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/in_mask
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Host to guest channel interrupt mask
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/latency
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/latency
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Channel signaling latency
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/out_mask
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/out_mask
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Guest to host channel interrupt mask
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/pending
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/pending
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Channel interrupt pending state
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/read_avail
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/read_avail
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Bytes available to read
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/write_avail
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/write_avail
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Bytes available to write
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/events
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/events
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Number of times we have signaled the host
|
||||
Users: Debugging tools
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/interrupts
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/interrupts
|
||||
Date: September. 2017
|
||||
KernelVersion: 4.14
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Number of times we have taken an interrupt (incoming)
|
||||
Users: Debugging tools
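
A minimal user-space sketch (not part of this patch) that walks the per-channel directories documented above; the attribute names come from these entries, while the paths, globbing and error handling are illustrative assumptions:

    #!/usr/bin/env python3
    # Illustrative only: print a few per-channel attributes for every VMBus device.
    # Attribute names follow the sysfs entries documented above; anything else
    # (which devices exist, older kernels lacking some files) is an assumption.
    import glob
    import os

    for chan in sorted(glob.glob("/sys/bus/vmbus/devices/*/channels/*")):
        relid = os.path.basename(chan)      # the <N> (child_relid) directory name
        values = {}
        for attr in ("cpu", "in_mask", "out_mask", "read_avail", "write_avail"):
            try:
                with open(os.path.join(chan, attr)) as f:
                    values[attr] = f.read().strip()
            except OSError:
                values[attr] = "n/a"        # attribute may be absent on older kernels
        print(relid, values)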
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/subchannel_id
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/subchannel_id
|
||||
Date: January. 2018
|
||||
KernelVersion: 4.16
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Subchannel ID associated with VMBUS channel
|
||||
Users: Debugging tools and userspace drivers
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/monitor_id
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/monitor_id
|
||||
Date: January. 2018
|
||||
KernelVersion: 4.16
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
Description: Monitor bit associated with channel
|
||||
Users: Debugging tools and userspace drivers
|
||||
|
||||
What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/ring
|
||||
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/ring
|
||||
Date: January. 2018
|
||||
KernelVersion: 4.16
|
||||
Contact: Stephen Hemminger <sthemmin@microsoft.com>
|
||||
|
@ -90,4 +90,4 @@ Date: December 2009
|
||||
Contact: Lee Schermerhorn <lee.schermerhorn@hp.com>
|
||||
Description:
|
||||
The node's huge page size control/query attributes.
|
||||
See Documentation/vm/hugetlbpage.txt
|
||||
See Documentation/admin-guide/mm/hugetlbpage.rst
|
@ -57,3 +57,16 @@ Description:
|
||||
dracut (via 97masterkey and 98integrity) and systemd (via
|
||||
core/ima-setup) have support for loading keys at boot
|
||||
time.
|
||||
|
||||
What: security/integrity/evm/evm_xattrs
|
||||
Date: April 2018
|
||||
Contact: Matthew Garrett <mjg59@google.com>
|
||||
Description:
|
||||
Shows the set of extended attributes used to calculate or
|
||||
validate the EVM signature, and allows additional attributes
|
||||
to be added at runtime. Any signatures generated after
|
||||
additional attributes are added (and on files possessing those
|
||||
additional attributes) will only be valid if the same
|
||||
additional attributes are configured on system boot. Writing
|
||||
a single period (.) will lock the xattr list from any further
|
||||
modification.
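
A hedged sketch of how the interface described above might be driven from user space; it assumes securityfs is mounted at /sys/kernel/security, and the extra xattr name is only an example:

    #!/usr/bin/env python3
    # Illustrative only: show the current EVM xattr list, append one entry, then
    # lock the list with a single period, per the description above. The xattr
    # name written here is an example, not a recommendation.
    EVM_XATTRS = "/sys/kernel/security/integrity/evm/evm_xattrs"

    with open(EVM_XATTRS) as f:
        print("current EVM xattrs:", f.read().split())

    with open(EVM_XATTRS, "w") as f:
        f.write("security.apparmor\n")      # example additional xattr (assumption)

    with open(EVM_XATTRS, "w") as f:
        f.write(".")                         # a single period locks the list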
|
||||
|
@ -21,7 +21,7 @@ Description:
|
||||
audit | hash | dont_hash
|
||||
condition:= base | lsm [option]
|
||||
base: [[func=] [mask=] [fsmagic=] [fsuuid=] [uid=]
|
||||
[euid=] [fowner=]]
|
||||
[euid=] [fowner=] [fsname=]]
|
||||
lsm: [[subj_user=] [subj_role=] [subj_type=]
|
||||
[obj_user=] [obj_role=] [obj_type=]]
|
||||
option: [[appraise_type=]] [permit_directio]
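
As an illustration of the grammar above, a hedged sketch that loads one rule using the new fsname= condition; the concrete rule text and the securityfs policy path are assumptions, not part of this patch:

    #!/usr/bin/env python3
    # Illustrative only: append a single IMA rule that uses the fsname= base
    # condition added here. Rule text and policy path are example assumptions.
    RULE = "measure func=FILE_CHECK mask=MAY_READ fsname=tmpfs\n"

    with open("/sys/kernel/security/ima/policy", "w") as f:
        f.write(RULE)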
|
||||
|
@ -190,6 +190,13 @@ Description:
|
||||
but should match other such assignments on device).
|
||||
Units after application of scale and offset are m/s^2.
|
||||
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_angl_raw
|
||||
KernelVersion: 4.17
|
||||
Contact: linux-iio@vger.kernel.org
|
||||
Description:
|
||||
Angle of rotation. Units after application of scale and offset
|
||||
are radians.
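
A hedged sketch of how user space might turn in_angl_raw into radians using the usual IIO convention value = (raw + offset) * scale; the device index and the presence of the optional offset/scale files are assumptions:

    #!/usr/bin/env python3
    # Illustrative only: read in_angl_raw and apply the generic IIO scaling
    # convention. iio:device0 is an example device; offset/scale may be absent.
    import os

    DEV = "/sys/bus/iio/devices/iio:device0"

    def read_float(name, default):
        try:
            with open(os.path.join(DEV, name)) as f:
                return float(f.read())
        except OSError:
            return default                  # optional attribute not provided

    raw = read_float("in_angl_raw", 0.0)
    offset = read_float("in_angl_offset", 0.0)
    scale = read_float("in_angl_scale", 1.0)
    print("angle (radians):", (raw + offset) * scale)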
|
||||
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_x_raw
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_y_raw
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_z_raw
|
||||
@ -297,6 +304,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_pressure_offset
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_humidityrelative_offset
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_magn_offset
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_rot_offset
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_angl_offset
|
||||
KernelVersion: 2.6.35
|
||||
Contact: linux-iio@vger.kernel.org
|
||||
Description:
|
||||
@ -350,6 +358,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_humidityrelative_scale
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_velocity_sqrt(x^2+y^2+z^2)_scale
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_illuminance_scale
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_countY_scale
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_angl_scale
|
||||
KernelVersion: 2.6.35
|
||||
Contact: linux-iio@vger.kernel.org
|
||||
Description:
|
||||
|
@ -212,22 +212,3 @@ Description:
|
||||
range. Used by NVDIMM Region Mapping Structure to uniquely refer
|
||||
to this structure. Value of 0 is reserved and not used as an
|
||||
index.
|
||||
|
||||
|
||||
What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
|
||||
Date: Aug, 2017
|
||||
KernelVersion: v4.14
|
||||
Contact: linux-nvdimm@lists.01.org
|
||||
Description:
|
||||
(RO) Size of a write request to a DIMM that will not incur a
|
||||
read-modify-write cycle at the memory controller.
|
||||
|
||||
When the nfit driver initializes it runs an ARS (Address Range
|
||||
Scrub) operation across every pmem range. Part of that process
|
||||
involves determining the ARS capabilities of a given address
|
||||
range. One of the capabilities that is reported is the 'Clear
|
||||
Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
|
||||
section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
|
||||
This property indicates the boundary at which the NVDIMM may
|
||||
need to perform read-modify-write cycles to maintain ECC (Error
|
||||
Correcting Code) blocks.
|
||||
|
@ -73,3 +73,23 @@ Description:
|
||||
This sysfs entry tells us whether the channel is a local
|
||||
server channel that is announced (values are either
|
||||
true or false).
|
||||
|
||||
What: /sys/bus/rpmsg/devices/.../driver_override
|
||||
Date: April 2018
|
||||
KernelVersion: 4.18
|
||||
Contact: Bjorn Andersson <bjorn.andersson@linaro.org>
|
||||
Description:
|
||||
Every rpmsg device is a communication channel with a remote
|
||||
processor. Channels are identified by a textual name (see
|
||||
/sys/bus/rpmsg/devices/.../name above) and have a local
|
||||
("source") rpmsg address, and remote ("destination") rpmsg
|
||||
address.
|
||||
|
||||
The listening entity (or client) which communicates with a
|
||||
remote processor is referred to as the rpmsg driver. The rpmsg device
|
||||
and rpmsg driver are matched based on rpmsg device name and
|
||||
rpmsg driver ID table.
|
||||
|
||||
This sysfs entry allows the rpmsg driver for a rpmsg device
|
||||
to be specified which will override standard OF, ID table
|
||||
and name matching.
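
A hedged sketch of using driver_override from user space; the device name and driver name below are hypothetical and only illustrate the sequence (write the override, then re-probe):

    #!/usr/bin/env python3
    # Illustrative only: force a specific rpmsg driver for one channel. The
    # device and driver names are hypothetical; drivers_probe is the generic
    # driver-core re-probe hook available for every bus.
    DEV_NAME = "virtio0.rpmsg-client-sample.-1.1024"   # hypothetical rpmsg device
    DRIVER = "rpmsg_client_sample"                     # hypothetical rpmsg driver

    with open(f"/sys/bus/rpmsg/devices/{DEV_NAME}/driver_override", "w") as f:
        f.write(DRIVER + "\n")

    with open("/sys/bus/rpmsg/drivers_probe", "w") as f:
        f.write(DEV_NAME + "\n")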
|
||||
|
@ -189,6 +189,28 @@ Description:
|
||||
The file will read "hotplug", "wired" and "not used" if the
|
||||
information is available, and "unknown" otherwise.
|
||||
|
||||
What: /sys/bus/usb/devices/.../(hub interface)/portX/quirks
|
||||
Date: May 2018
|
||||
Contact: Nicolas Boichat <drinkcat@chromium.org>
|
||||
Description:
|
||||
In some cases, we care about time-to-active for devices
|
||||
connected on a specific port (e.g. non-standard USB port like
|
||||
pogo pins), where the device to be connected is known in
|
||||
advance, and behaves well according to the specification.
|
||||
This attribute is a bit-field that controls the behavior of
|
||||
a specific port:
|
||||
- Bit 0 of this field selects the "old" enumeration scheme,
|
||||
as it is considerably faster (it only causes one USB reset
|
||||
instead of 2).
|
||||
The old enumeration scheme can also be selected globally
|
||||
using /sys/module/usbcore/parameters/old_scheme_first, but
|
||||
it is often not desirable as the new scheme was introduced to
|
||||
increase compatibility with more devices.
|
||||
- Bit 1 reduces TRSTRCY to the 10 ms that are required by the
|
||||
USB 2.0 specification, instead of the 50 ms that are normally
|
||||
used to help make enumeration work better on some high speed
|
||||
devices.
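
A hedged sketch that sets both quirk bits described above on one hub port; the port path is hypothetical and the exact input format accepted by the attribute should be checked against your kernel:

    #!/usr/bin/env python3
    # Illustrative only: enable the "old" enumeration scheme (bit 0) and the
    # short 10 ms TRSTRCY (bit 1) for a single port. Path and input format are
    # assumptions; the attribute is documented above as a bit-field.
    QUIRK_OLD_ENUM_SCHEME = 0x01   # bit 0
    QUIRK_SHORT_TRSTRCY = 0x02     # bit 1

    PORT = "/sys/bus/usb/devices/usb1/1-0:1.0/port1"   # hypothetical hub port

    with open(PORT + "/quirks", "w") as f:
        f.write(format(QUIRK_OLD_ENUM_SCHEME | QUIRK_SHORT_TRSTRCY, "x") + "\n")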
|
||||
|
||||
What: /sys/bus/usb/devices/.../(hub interface)/portX/over_current_count
|
||||
Date: February 2018
|
||||
Contact: Richard Leitner <richard.leitner@skidata.com>
|
||||
@ -236,3 +258,21 @@ Description:
|
||||
Supported values are 0 - 15.
|
||||
More information on how besl values map to microseconds can be found in
|
||||
USB 2.0 ECN Errata for Link Power Management, section 4.10)
|
||||
|
||||
What: /sys/bus/usb/devices/.../rx_lanes
|
||||
Date: March 2018
|
||||
Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
|
||||
Description:
|
||||
Number of rx lanes the device is using.
|
||||
USB 3.2 adds Dual-lane support, 2 rx and 2 tx lanes over Type-C.
|
||||
Inter-Chip SSIC devices support asymmetric lanes up to 4 lanes per
|
||||
direction. Devices before USB 3.2 are single lane (rx_lanes = 1)
|
||||
|
||||
What: /sys/bus/usb/devices/.../tx_lanes
|
||||
Date: March 2018
|
||||
Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
|
||||
Description:
|
||||
Number of tx lanes the device is using.
|
||||
USB 3.2 adds Dual-lane support, 2 rx and 2 tx lanes over Type-C.
|
||||
Inter-Chip SSIC devices support asymmetric lanes up to 4 lanes per
|
||||
direction. Devices before USB 3.2 are single lane (tx_lanes = 1)
|
||||
|
@ -69,7 +69,9 @@ Date: September 2014
|
||||
Contact: linuxppc-dev@lists.ozlabs.org
|
||||
Description: read/write
|
||||
Set the mode for prefaulting in segments into the segment table
|
||||
when performing the START_WORK ioctl. Possible values:
|
||||
when performing the START_WORK ioctl. Only applicable when
|
||||
running under hashed page table mmu.
|
||||
Possible values:
|
||||
none: No prefaulting (default)
|
||||
work_element_descriptor: Treat the work element
|
||||
descriptor as an effective address and
|
||||
@ -244,3 +246,11 @@ Description: read only
|
||||
Returns 1 if the psl timebase register is synchronized
|
||||
with the core timebase register, 0 otherwise.
|
||||
Users: https://github.com/ibm-capi/libcxl
|
||||
|
||||
What: /sys/class/cxl/<card>/tunneled_ops_supported
|
||||
Date: May 2018
|
||||
Contact: linuxppc-dev@lists.ozlabs.org
|
||||
Description: read only
|
||||
Returns 1 if tunneled operations are supported in capi mode,
|
||||
0 otherwise.
|
||||
Users: https://github.com/ibm-capi/libcxl
|
||||
|
@ -232,3 +232,11 @@ Description:
|
||||
of the parent (another partition or a flash device) in bytes.
|
||||
This attribute is absent on flash devices, so it can be used
|
||||
to distinguish them from partitions.
|
||||
|
||||
What: /sys/class/mtd/mtdX/oobavail
|
||||
Date: April 2018
|
||||
KernelVersion: 4.16
|
||||
Contact: linux-mtd@lists.infradead.org
|
||||
Description:
|
||||
Number of bytes available for a client to place data into
|
||||
the out of band area.
|
||||
|
@ -1,3 +1,458 @@
|
||||
===== General Properties =====
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/manufacturer
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the name of the device manufacturer.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented as string
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/model_name
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the name of the device model.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented as string
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/serial_number
|
||||
Date: January 2008
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the serial number of the device.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented as string
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/type
|
||||
Date: May 2010
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Describes the main type of the supply.
|
||||
|
||||
Access: Read
|
||||
Valid values: "Battery", "UPS", "Mains", "USB"
|
||||
|
||||
===== Battery Properties =====
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/capacity
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Fine grain representation of battery capacity.
|
||||
Access: Read
|
||||
Valid values: 0 - 100 (percent)
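
A small sketch (not part of this file) that polls the battery properties listed here for every registered supply; attribute availability varies by driver, so missing files are simply skipped:

    #!/usr/bin/env python3
    # Illustrative only: dump a few of the documented battery properties for
    # each /sys/class/power_supply entry, skipping attributes a driver does
    # not expose.
    import glob
    import os

    for supply in sorted(glob.glob("/sys/class/power_supply/*")):
        info = {"name": os.path.basename(supply)}
        for attr in ("type", "status", "capacity", "capacity_level", "health"):
            try:
                with open(os.path.join(supply, attr)) as f:
                    info[attr] = f.read().strip()
            except OSError:
                pass
        print(info)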
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/capacity_alert_max
|
||||
Date: July 2012
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Maximum battery capacity trip-wire value where the supply will
|
||||
notify user-space of the event. This is normally used for the
|
||||
battery discharging scenario where user-space needs to know the
|
||||
battery has dropped to an upper level so it can take
|
||||
appropriate action (e.g. warning user that battery level is
|
||||
low).
|
||||
|
||||
Access: Read, Write
|
||||
Valid values: 0 - 100 (percent)
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/capacity_alert_min
|
||||
Date: July 2012
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Minimum battery capacity trip-wire value where the supply will
|
||||
notify user-space of the event. This is normally used for the
|
||||
battery discharging scenario where user-space needs to know the
|
||||
battery has dropped to a lower level so it can take
|
||||
appropriate action (e.g. warning user that battery level is
|
||||
critically low).
|
||||
|
||||
Access: Read, Write
|
||||
Valid values: 0 - 100 (percent)
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/capacity_level
|
||||
Date: June 2009
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Coarse representation of battery capacity.
|
||||
|
||||
Access: Read
|
||||
Valid values: "Unknown", "Critical", "Low", "Normal", "High",
|
||||
"Full"
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/current_avg
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports an average IBAT current reading for the battery, over a
|
||||
fixed period. Normally devices will provide a fixed interval in
|
||||
which they average readings to smooth out the reported value.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/current_max
|
||||
Date: October 2010
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the maximum IBAT current allowed into the battery.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/current_now
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports an instant, single IBAT current reading for the battery.
|
||||
This value is not averaged/smoothed.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/charge_type
|
||||
Date: July 2009
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Represents the type of charging currently being applied to the
|
||||
battery.
|
||||
|
||||
Access: Read
|
||||
Valid values: "Unknown", "N/A", "Trickle", "Fast"
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/charge_term_current
|
||||
Date: July 2014
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the charging current value which is used to determine
|
||||
when the battery is considered full and charging should end.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/health
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the health of the battery or battery side of charger
|
||||
functionality.
|
||||
|
||||
Access: Read
|
||||
Valid values: "Unknown", "Good", "Overheat", "Dead",
|
||||
"Over voltage", "Unspecified failure", "Cold",
|
||||
"Watchdog timer expire", "Safety timer expire"
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/precharge_current
|
||||
Date: June 2017
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the charging current applied during pre-charging phase
|
||||
for a battery charge cycle.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/present
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports whether a battery is present or not in the system.
|
||||
|
||||
Access: Read
|
||||
Valid values:
|
||||
0: Absent
|
||||
1: Present
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/status
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Represents the charging status of the battery. Normally this
|
||||
is read-only reporting although for some supplies this can be
|
||||
used to enable/disable charging to the battery.
|
||||
|
||||
Access: Read, Write
|
||||
Valid values: "Unknown", "Charging", "Discharging",
|
||||
"Not charging", "Full"
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/technology
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Describes the battery technology supported by the supply.
|
||||
|
||||
Access: Read
|
||||
Valid values: "Unknown", "NiMH", "Li-ion", "Li-poly", "LiFe",
|
||||
"NiCd", "LiMn"
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the current TBAT battery temperature reading.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_alert_max
|
||||
Date: July 2012
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Maximum TBAT temperature trip-wire value where the supply will
|
||||
notify user-space of the event. This is normally used for the
|
||||
battery charging scenario where user-space needs to know the
|
||||
battery temperature has crossed an upper threshold so it can
|
||||
take appropriate action (e.g. warning user that battery level is
|
||||
critically high, and charging has stopped).
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_alert_min
|
||||
Date: July 2012
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Minimum TBAT temperature trip-wire value where the supply will
|
||||
notify user-space of the event. This is normally used for the
|
||||
battery charging scenario where user-space needs to know the
|
||||
battery temperature has crossed a lower threshold so it can take
|
||||
appropriate action (e.g. warning user that battery level is
|
||||
high, and charging current has been reduced accordingly to
|
||||
remedy the situation).
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_max
|
||||
Date: July 2014
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the maximum allowed TBAT battery temperature for
|
||||
charging.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_min
|
||||
Date: July 2014
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the minimum allowed TBAT battery temperature for
|
||||
charging.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_avg,
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports an average VBAT voltage reading for the battery, over a
|
||||
fixed period. Normally devices will provide a fixed interval in
|
||||
which they average readings to smooth out the reported value.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_max,
|
||||
Date: January 2008
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the maximum safe VBAT voltage permitted for the battery,
|
||||
during charging.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_min,
|
||||
Date: January 2008
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the minimum safe VBAT voltage permitted for the battery,
|
||||
during discharging.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_now,
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports an instant, single VBAT voltage reading for the battery.
|
||||
This value is not averaged/smoothed.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
===== USB Properties =====
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/current_avg
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports an average IBUS current reading over a fixed period.
|
||||
Normally devices will provide a fixed interval in which they
|
||||
average readings to smooth out the reported value.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/current_max
|
||||
Date: October 2010
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the maximum IBUS current the supply can support.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/current_now
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the IBUS current supplied now. This value is generally
|
||||
read-only reporting, unless the 'online' state of the supply
|
||||
is set to be programmable, in which case this value can be set
|
||||
within the reported min/max range.
|
||||
|
||||
Access: Read, Write
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/input_current_limit
|
||||
Date: July 2014
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Details the incoming IBUS current limit currently set in the
|
||||
supply. Normally this is configured based on the type of
|
||||
connection made (e.g. A configured SDP should output a maximum
|
||||
of 500mA so the input current limit is set to the same value).
|
||||
|
||||
Access: Read, Write
|
||||
Valid values: Represented in microamps
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/online,
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Indicates if VBUS is present for the supply. When the supply is
|
||||
online, and the supply allows it, then it's possible to switch
|
||||
between online states (e.g. Fixed -> Programmable for a PD_PPS
|
||||
USB supply so voltage and current can be controlled).
|
||||
|
||||
Access: Read, Write
|
||||
Valid values:
|
||||
0: Offline
|
||||
1: Online Fixed - Fixed Voltage Supply
|
||||
2: Online Programmable - Programmable Voltage Supply
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the current supply temperature reading. This would
|
||||
normally be the internal temperature of the device itself (e.g
|
||||
TJUNC temperature of an IC)
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_alert_max
|
||||
Date: July 2012
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Maximum supply temperature trip-wire value where the supply will
|
||||
notify user-space of the event. This is normally used for the
|
||||
charging scenario where user-space needs to know the supply
|
||||
temperature has crossed an upper threshold so it can take
|
||||
appropriate action (e.g. warning user that the supply
|
||||
temperature is critically high, and charging has stopped to
|
||||
remedy the situation).
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_alert_min
|
||||
Date: July 2012
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Minimum supply temperature trip-wire value where the supply will
|
||||
notify user-space of the event. This is normally used for the
|
||||
charging scenario where user-space needs to know the supply
|
||||
temperature has crossed a lower threshold so it can take
|
||||
appropriate action (e.g. warning user that the supply
|
||||
temperature is high, and charging current has been reduced
|
||||
accordingly to remedy the situation).
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_max
|
||||
Date: July 2014
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the maximum allowed supply temperature for operation.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/temp_min
|
||||
Date: July 2014
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the minimum allowed supply temperature for operation.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in 1/10 Degrees Celsius
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/usb_type
|
||||
Date: March 2018
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports what type of USB connection is currently active for
|
||||
the supply, for example it can show if USB-PD capable source
|
||||
is attached.
|
||||
|
||||
Access: Read-Only
|
||||
Valid values: "Unknown", "SDP", "DCP", "CDP", "ACA", "C", "PD",
|
||||
"PD_DRP", "PD_PPS", "BrickID"
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_max
|
||||
Date: January 2008
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the maximum VBUS voltage the supply can support.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_min
|
||||
Date: January 2008
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the minimum VBUS voltage the supply can support.
|
||||
|
||||
Access: Read
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
What: /sys/class/power_supply/<supply_name>/voltage_now
|
||||
Date: May 2007
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description:
|
||||
Reports the VBUS voltage supplied now. This value is generally
|
||||
read-only reporting, unless the 'online' state of the supply
|
||||
is set to be programmable, in which case this value can be set
|
||||
within the reported min/max range.
|
||||
|
||||
Access: Read, Write
|
||||
Valid values: Represented in microvolts
|
||||
|
||||
===== Device Specific Properties =====
|
||||
|
||||
What: /sys/class/power/ds2760-battery.*/charge_now
|
||||
Date: May 2010
|
||||
KernelVersion: 2.6.35
|
||||
|
@ -1,7 +1,7 @@
|
||||
What: /sys/class/rc/
|
||||
Date: Apr 2010
|
||||
KernelVersion: 2.6.35
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
The rc/ class sub-directory belongs to the Remote Controller
|
||||
core and provides a sysfs interface for configuring infrared
|
||||
@ -10,7 +10,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/
|
||||
Date: Apr 2010
|
||||
KernelVersion: 2.6.35
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
A /sys/class/rc/rcN directory is created for each remote
|
||||
control receiver device where N is the number of the receiver.
|
||||
@ -18,7 +18,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/protocols
|
||||
Date: Jun 2010
|
||||
KernelVersion: 2.6.36
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Reading this file returns a list of available protocols,
|
||||
something like:
|
||||
@ -36,7 +36,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/filter
|
||||
Date: Jan 2014
|
||||
KernelVersion: 3.15
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Sets the scancode filter expected value.
|
||||
Use in combination with /sys/class/rc/rcN/filter_mask to set the
|
||||
@ -49,7 +49,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/filter_mask
|
||||
Date: Jan 2014
|
||||
KernelVersion: 3.15
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Sets the scancode filter mask of bits to compare.
|
||||
Use in combination with /sys/class/rc/rcN/filter to set the bits
|
||||
@ -64,7 +64,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/wakeup_protocols
|
||||
Date: Feb 2017
|
||||
KernelVersion: 4.11
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Reading this file returns a list of available protocols to use
|
||||
for the wakeup filter, something like:
|
||||
@ -83,7 +83,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/wakeup_filter
|
||||
Date: Jan 2014
|
||||
KernelVersion: 3.15
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Sets the scancode wakeup filter expected value.
|
||||
Use in combination with /sys/class/rc/rcN/wakeup_filter_mask to
|
||||
@ -98,7 +98,7 @@ Description:
|
||||
What: /sys/class/rc/rcN/wakeup_filter_mask
|
||||
Date: Jan 2014
|
||||
KernelVersion: 3.15
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Sets the scancode wakeup filter mask of bits to compare.
|
||||
Use in combination with /sys/class/rc/rcN/wakeup_filter to set
|
||||
|
@ -1,7 +1,7 @@
|
||||
What: /sys/class/rc/rcN/wakeup_data
|
||||
Date: Mar 2016
|
||||
KernelVersion: 4.6
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
Description:
|
||||
Reading this file returns the stored CIR wakeup sequence.
|
||||
It starts with a pulse, followed by a space, pulse etc.
|
||||
|
@ -77,7 +77,7 @@ Description: Read/Write attribute file that controls memory scrubbing.
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/max_location
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This attribute file displays the information about the last
|
||||
available memory slot in this memory controller. It is used by
|
||||
@ -85,7 +85,7 @@ Description: This attribute file displays the information about the last
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/size
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This attribute file will display the size of dimm or rank.
|
||||
For dimm*/size, this is the size, in MB of the DIMM memory
|
||||
@ -96,14 +96,14 @@ Description: This attribute file will display the size of dimm or rank.
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_dev_type
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This attribute file will display what type of DRAM device is
|
||||
being utilized on this DIMM (x1, x2, x4, x8, ...).
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_edac_mode
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This attribute file will display what type of Error detection
|
||||
and correction is being utilized. For example: S4ECD4ED would
|
||||
@ -111,7 +111,7 @@ Description: This attribute file will display what type of Error detection
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_label
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This control file allows this DIMM to have a label assigned
|
||||
to it. With this label in the module, when errors occur
|
||||
@ -126,14 +126,14 @@ Description: This control file allows this DIMM to have a label assigned
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_location
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This attribute file will display the location (csrow/channel,
|
||||
branch/channel/slot or channel/slot) of the dimm or rank.
|
||||
|
||||
What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_mem_type
|
||||
Date: April 2012
|
||||
Contact: Mauro Carvalho Chehab <m.chehab@samsung.com>
|
||||
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
||||
linux-edac@vger.kernel.org
|
||||
Description: This attribute file will display what type of memory is
|
||||
currently on this csrow. Normally, either buffered or
|
||||
|
@ -238,9 +238,6 @@ Description: Discover and change clock speed of CPUs
|
||||
|
||||
See files in Documentation/cpu-freq/ for more information.
|
||||
|
||||
In particular, read Documentation/cpu-freq/user-guide.txt
|
||||
to learn how to control the knobs.
|
||||
|
||||
|
||||
What: /sys/devices/system/cpu/cpu#/cpufreq/freqdomain_cpus
|
||||
Date: June 2013
|
||||
@ -478,6 +475,7 @@ What: /sys/devices/system/cpu/vulnerabilities
|
||||
/sys/devices/system/cpu/vulnerabilities/meltdown
|
||||
/sys/devices/system/cpu/vulnerabilities/spectre_v1
|
||||
/sys/devices/system/cpu/vulnerabilities/spectre_v2
|
||||
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
|
||||
Date: January 2018
|
||||
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||
Description: Information about CPU vulnerabilities
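
A minimal sketch (not part of this patch) that prints every vulnerability status file listed above, including the spec_store_bypass entry added here:

    #!/usr/bin/env python3
    # Illustrative only: print each CPU vulnerability mitigation status string.
    import glob
    import os

    for path in sorted(glob.glob("/sys/devices/system/cpu/vulnerabilities/*")):
        with open(path) as f:
            print(os.path.basename(path) + ":", f.read().strip())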
|
||||
|
@ -101,6 +101,7 @@ Date: February 2015
|
||||
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
|
||||
Description:
|
||||
Controls the trimming rate in batch mode.
|
||||
<deprecated>
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/cp_interval
|
||||
Date: October 2015
|
||||
@ -140,7 +141,7 @@ Contact: "Shuoran Liu" <liushuoran@huawei.com>
|
||||
Description:
|
||||
Shows total written kbytes issued to disk.
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/feature
|
||||
What: /sys/fs/f2fs/<disk>/features
|
||||
Date: July 2017
|
||||
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
|
||||
Description:
|
||||
|
@ -12,4 +12,4 @@ Description:
|
||||
free_hugepages
|
||||
surplus_hugepages
|
||||
resv_hugepages
|
||||
See Documentation/vm/hugetlbpage.txt for details.
|
||||
See Documentation/admin-guide/mm/hugetlbpage.rst for details.
|
||||
|
@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
|
||||
sleep_millisecs: how many milliseconds ksm should sleep between
|
||||
scans.
|
||||
|
||||
See Documentation/vm/ksm.txt for more information.
|
||||
See Documentation/vm/ksm.rst for more information.
|
||||
|
||||
What: /sys/kernel/mm/ksm/merge_across_nodes
|
||||
Date: January 2013
|
||||
|
@ -37,7 +37,7 @@ Description:
|
||||
The alloc_calls file is read-only and lists the kernel code
|
||||
locations from which allocations for this cache were performed.
|
||||
The alloc_calls file only contains information if debugging is
|
||||
enabled for that cache (see Documentation/vm/slub.txt).
|
||||
enabled for that cache (see Documentation/vm/slub.rst).
|
||||
|
||||
What: /sys/kernel/slab/cache/alloc_fastpath
|
||||
Date: February 2008
|
||||
@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
|
||||
Description:
|
||||
The free_calls file is read-only and lists the locations of
|
||||
object frees if slab debugging is enabled (see
|
||||
Documentation/vm/slub.txt).
|
||||
Documentation/vm/slub.rst).
|
||||
|
||||
What: /sys/kernel/slab/cache/free_fastpath
|
||||
Date: February 2008
|
||||
|
@ -25,3 +25,16 @@ Description:
|
||||
Control touchpad mode.
|
||||
* 1 -> Switched On
|
||||
* 0 -> Switched Off
|
||||
|
||||
What: /sys/bus/pci/devices/<bdf>/<device>/VPC2004:00/fn_lock
|
||||
Date: May 2018
|
||||
KernelVersion: 4.18
|
||||
Contact: "Oleg Keri <ezhi99@gmail.com>"
|
||||
Description:
|
||||
Control fn-lock mode.
|
||||
* 1 -> Switched On
|
||||
* 0 -> Switched Off
|
||||
|
||||
For example:
|
||||
# echo "0" > \
|
||||
/sys/bus/pci/devices/0000:00:1f.0/PNP0C09:00/VPC2004:00/fn_lock
|
||||
|
@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
|
||||
event will be platform-dependent, but will follow the general
|
||||
sequence described below.
|
||||
|
||||
STEP 0: Error Event
|
||||
STEP 0: Error Event: ERR_NONFATAL
|
||||
-------------------
|
||||
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
|
||||
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
|
||||
@ -228,13 +228,7 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
|
||||
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
|
||||
proceeds to STEP 4 (Slot Reset)
|
||||
|
||||
STEP 3: Link Reset
|
||||
------------------
|
||||
The platform resets the link. This is a PCI-Express specific step
|
||||
and is done whenever a fatal error has been detected that can be
|
||||
"solved" by resetting the link.
|
||||
|
||||
STEP 4: Slot Reset
|
||||
STEP 3: Slot Reset
|
||||
------------------
|
||||
|
||||
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
|
||||
@ -320,7 +314,7 @@ Failure).
|
||||
>>> However, it probably should.
|
||||
|
||||
|
||||
STEP 5: Resume Operations
|
||||
STEP 4: Resume Operations
|
||||
-------------------------
|
||||
The platform will call the resume() callback on all affected device
|
||||
drivers if all drivers on the segment have returned
|
||||
@ -332,7 +326,7 @@ a result code.
|
||||
At this point, if a new error happens, the platform will restart
|
||||
a new error recovery sequence.
|
||||
|
||||
STEP 6: Permanent Failure
|
||||
STEP 5: Permanent Failure
|
||||
-------------------------
|
||||
A "permanent failure" has occurred, and the platform cannot recover
|
||||
the device. The platform will call error_detected() with a
|
||||
@ -355,6 +349,27 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
|
||||
for additional detail on real-life experience of the causes of
|
||||
software errors.
|
||||
|
||||
STEP 0: Error Event: ERR_FATAL
|
||||
-------------------
|
||||
PCI bus error is detected by the PCI hardware. On powerpc, the slot is
|
||||
isolated, in that all I/O is blocked: all reads return 0xffffffff, all
|
||||
writes are ignored.
|
||||
|
||||
STEP 1: Remove devices
|
||||
--------------------
|
||||
The platform removes the devices. Depending on the error agent, this may be
this port with all its subordinates, or an upstream component (likely a
downstream port).
|
||||
|
||||
STEP 2: Reset link
|
||||
--------------------
|
||||
The platform resets the link. This is a PCI-Express specific step and is
|
||||
done whenever a fatal error has been detected that can be "solved" by
|
||||
resetting the link.
|
||||
|
||||
STEP 3: Re-enumerate the devices
|
||||
--------------------
|
||||
Initiates the re-enumeration.
|
||||
|
||||
Conclusion; General Remarks
|
||||
---------------------------
|
||||
|
@ -1,3 +1,5 @@
|
||||
What is RCU? -- "Read, Copy, Update"
|
||||
|
||||
Please note that the "What is RCU?" LWN series is an excellent place
|
||||
to start learning about RCU:
|
||||
|
||||
|
@ -157,6 +157,17 @@ OCXL_IOCTL_GET_METADATA:
|
||||
Obtains configuration information from the card, such as the size of
|
||||
MMIO areas, the AFU version, and the PASID for the current context.
|
||||
|
||||
OCXL_IOCTL_ENABLE_P9_WAIT:
|
||||
|
||||
Allows the AFU to wake a userspace thread executing 'wait'. Returns
|
||||
information to userspace to allow it to configure the AFU. Note that
|
||||
this is only available on POWER9.
|
||||
|
||||
OCXL_IOCTL_GET_FEATURES:
|
||||
|
||||
Reports on which CPU features that affect OpenCAPI are usable from
|
||||
userspace.
|
||||
|
||||
|
||||
mmap
|
||||
----
|
||||
|
Documentation/acpi/cppc_sysfs.txt (new file, 69 lines)
@ -0,0 +1,69 @@
|
||||
|
||||
Collaborative Processor Performance Control (CPPC)
|
||||
|
||||
CPPC defined in the ACPI spec describes a mechanism for the OS to manage the
|
||||
performance of a logical processor on a contiguous and abstract performance
|
||||
scale. CPPC exposes a set of registers to describe abstract performance scale,
|
||||
to request performance levels and to measure per-cpu delivered performance.
|
||||
|
||||
For more details on CPPC please refer to the ACPI specification at:
|
||||
|
||||
http://uefi.org/specifications
|
||||
|
||||
Some of the CPPC registers are exposed via sysfs under:
|
||||
|
||||
/sys/devices/system/cpu/cpuX/acpi_cppc/
|
||||
|
||||
for each cpu X
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
$ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/
|
||||
/sys/devices/system/cpu/cpu0/acpi_cppc/:
|
||||
total 0
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf
|
||||
-r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* highest_perf : Highest performance of this processor (abstract scale).
|
||||
* nominal_perf : Highest sustained performance of this processor (abstract scale).
|
||||
* lowest_nonlinear_perf : Lowest performance of this processor with nonlinear
|
||||
power savings (abstract scale).
|
||||
* lowest_perf : Lowest performance of this processor (abstract scale).
|
||||
|
||||
* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz).
|
||||
* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz).
|
||||
The above frequencies should only be used to report processor performance in
|
||||
frequency instead of abstract scale. These values should not be used for any
|
||||
functional decisions.
|
||||
|
||||
* feedback_ctrs : Includes both Reference and delivered performance counter.
|
||||
Reference counter ticks up proportional to processor's reference performance.
|
||||
Delivered counter ticks up proportional to processor's delivered performance.
|
||||
* wraparound_time: Minimum time for the feedback counters to wraparound (seconds).
|
||||
* reference_perf : Performance level at which reference performance counter
|
||||
accumulates (abstract scale).
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Computing Average Delivered Performance
|
||||
|
||||
Below describes the steps to compute the average performance delivered by taking
|
||||
two different snapshots of feedback counters at time T1 and T2.
|
||||
|
||||
T1: Read feedback_ctrs as fbc_t1
|
||||
Wait or run some workload
|
||||
T2: Read feedback_ctrs as fbc_t2
|
||||
|
||||
delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
|
||||
reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
|
||||
|
||||
delivered_perf = (reference_perf x delivered_counter_delta) / reference_counter_delta
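
A hedged sketch of the computation just described; it assumes feedback_ctrs prints "ref:<N> del:<N>" pairs and that cpu0 exposes acpi_cppc, so adjust the parsing and path to your system:

    #!/usr/bin/env python3
    # Illustrative only: sample feedback_ctrs twice and compute the average
    # delivered performance as described above. The "ref:/del:" parsing and
    # the cpu0 path are assumptions to verify on your kernel.
    import re
    import time

    PATH = "/sys/devices/system/cpu/cpu0/acpi_cppc"

    def read_ctrs():
        with open(PATH + "/feedback_ctrs") as f:
            text = f.read()
        ref = int(re.search(r"ref:(\d+)", text).group(1))
        delivered = int(re.search(r"del:(\d+)", text).group(1))
        return ref, delivered

    with open(PATH + "/reference_perf") as f:
        reference_perf = int(f.read())

    ref_t1, del_t1 = read_ctrs()        # T1 snapshot
    time.sleep(1)                       # wait or run some workload
    ref_t2, del_t2 = read_ctrs()        # T2 snapshot

    delivered_perf = reference_perf * (del_t2 - del_t1) / (ref_t2 - ref_t1)
    print("average delivered performance:", delivered_perf)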
|
@ -16,7 +16,8 @@ control method rather than override the entire DSDT, because kernel
|
||||
rebuild/reboot is not needed and test result can be got in minutes.
|
||||
|
||||
Note: Only ACPI METHOD can be overridden, any other object types like
|
||||
"Device", "OperationRegion", are not recognized.
|
||||
"Device", "OperationRegion", are not recognized. Methods
|
||||
declared inside scope operators are also not supported.
|
||||
Note: The same ACPI control method can be overridden for many times,
|
||||
and it's always the latest one that used by Linux/kernel.
|
||||
Note: To get the ACPI debug object output (Store (AAAA, Debug)),
|
||||
@ -32,8 +33,6 @@ Note: To get the ACPI debug object output (Store (AAAA, Debug)),
|
||||
|
||||
DefinitionBlock ("", "SSDT", 1, "", "", 0x20080715)
|
||||
{
|
||||
External (ACON)
|
||||
|
||||
Method (\_SB_.AC._PSR, 0, NotSerialized)
|
||||
{
|
||||
Store ("In AC _PSR", Debug)
|
||||
@ -42,9 +41,10 @@ Note: To get the ACPI debug object output (Store (AAAA, Debug)),
|
||||
}
|
||||
Note that the full pathname of the method in ACPI namespace
|
||||
should be used.
|
||||
And remember to use "External" to declare external objects.
|
||||
e) assemble the file to generate the AML code of the method.
|
||||
e.g. "iasl psr.asl" (psr.aml is generated as a result)
|
||||
e.g. "iasl -vw 6084 psr.asl" (psr.aml is generated as a result)
|
||||
If parameter "-vw 6084" is not supported by your iASL compiler,
|
||||
please try a newer version.
|
||||
f) mount debugfs by "mount -t debugfs none /sys/kernel/debug"
|
||||
g) override the old method via the debugfs by running
|
||||
"cat /tmp/psr.aml > /sys/kernel/debug/acpi/custom_method"
|
||||
|
@ -44,8 +44,8 @@ Links
|
||||
|
||||
Mailing List - apparmor@lists.ubuntu.com
|
||||
|
||||
Wiki - http://apparmor.wiki.kernel.org/
|
||||
Wiki - http://wiki.apparmor.net
|
||||
|
||||
User space tools - https://launchpad.net/apparmor
|
||||
User space tools - https://gitlab.com/apparmor
|
||||
|
||||
Kernel module - git://git.kernel.org/pub/scm/linux/kernel/git/jj/apparmor-dev.git
|
||||
Kernel module - git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
|
||||
|
Documentation/admin-guide/cgroup-v2.rst (new file, 2040 lines; diff suppressed because it is too large)
@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking.
|
||||
:maxdepth: 1
|
||||
|
||||
initrd
|
||||
cgroup-v2
|
||||
serial-console
|
||||
braille-console
|
||||
parport
|
||||
@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking.
|
||||
mono
|
||||
java
|
||||
ras
|
||||
bcache
|
||||
pm/index
|
||||
thunderbolt
|
||||
LSM/index
|
||||
mm/index
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
|
@ -106,11 +106,11 @@
|
||||
use by PCI
|
||||
Format: <irq>,<irq>...
|
||||
|
||||
acpi_mask_gpe= [HW,ACPI]
|
||||
acpi_mask_gpe= [HW,ACPI]
|
||||
Due to the existence of _Lxx/_Exx, some GPEs triggered
|
||||
by unsupported hardware/firmware features can result in
|
||||
GPE floodings that cannot be automatically disabled by
|
||||
the GPE dispatcher.
|
||||
GPE floodings that cannot be automatically disabled by
|
||||
the GPE dispatcher.
|
||||
This facility can be used to prevent such uncontrolled
|
||||
GPE floodings.
|
||||
Format: <int>
|
||||
@ -256,7 +256,7 @@
|
||||
(may crash computer or cause data corruption)
|
||||
|
||||
ALSA [HW,ALSA]
|
||||
See Documentation/sound/alsa/alsa-parameters.txt
|
||||
See Documentation/sound/alsa-configuration.rst
|
||||
|
||||
alignment= [KNL,ARM]
|
||||
Allow the default userspace alignment fault handler
|
||||
@ -472,10 +472,10 @@
|
||||
for platform specific values (SB1, Loongson3 and
|
||||
others).
|
||||
|
||||
ccw_timeout_log [S390]
|
||||
ccw_timeout_log [S390]
|
||||
See Documentation/s390/CommonIO for details.
|
||||
|
||||
cgroup_disable= [KNL] Disable a particular controller
|
||||
cgroup_disable= [KNL] Disable a particular controller
|
||||
Format: {name of the controller(s) to disable}
|
||||
The effects of cgroup_disable=foo are:
|
||||
- foo isn't auto-mounted if you mount all cgroups in
|
||||
@ -518,7 +518,7 @@
|
||||
those clocks in any way. This parameter is useful for
|
||||
debug and development, but should not be needed on a
|
||||
platform with proper driver support. For more
|
||||
information, see Documentation/clk.txt.
|
||||
information, see Documentation/driver-api/clk.rst.
|
||||
|
||||
clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
|
||||
[Deprecated]
|
||||
@ -587,11 +587,6 @@
|
||||
Sets the size of memory pool for coherent, atomic dma
|
||||
allocations, by default set to 256K.
|
||||
|
||||
code_bytes [X86] How many bytes of object code to print
|
||||
in an oops report.
|
||||
Range: 0 - 8192
|
||||
Default: 64
|
||||
|
||||
com20020= [HW,NET] ARCnet - COM20020 chipset
|
||||
Format:
|
||||
<io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]]
|
||||
@ -641,8 +636,8 @@
|
||||
hvc<n> Use the hypervisor console device <n>. This is for
|
||||
both Xen and PowerPC hypervisors.
|
||||
|
||||
If the device connected to the port is not a TTY but a braille
|
||||
device, prepend "brl," before the device type, for instance
|
||||
If the device connected to the port is not a TTY but a braille
|
||||
device, prepend "brl," before the device type, for instance
|
||||
console=brl,ttyS0
|
||||
For now, only VisioBraille is supported.
|
||||
|
||||
@ -662,7 +657,7 @@
|
||||
|
||||
consoleblank= [KNL] The console blank (screen saver) timeout in
|
||||
seconds. A value of 0 disables the blank timer.
|
||||
Defaults to 0.
|
||||
Defaults to 0.
|
||||
|
||||
coredump_filter=
|
||||
[KNL] Change the default value for
|
||||
@ -730,7 +725,7 @@
|
||||
or memory reserved is below 4G.
|
||||
|
||||
cryptomgr.notests
|
||||
[KNL] Disable crypto self-tests
|
||||
[KNL] Disable crypto self-tests
|
||||
|
||||
cs89x0_dma= [HW,NET]
|
||||
Format: <dma>
|
||||
@ -746,7 +741,7 @@
|
||||
Format: <port#>,<type>
|
||||
See also Documentation/input/devices/joystick-parport.rst
|
||||
|
||||
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
|
||||
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
|
||||
time. See
|
||||
Documentation/admin-guide/dynamic-debug-howto.rst for
|
||||
details. Deprecated, see dyndbg.
|
||||
@ -833,7 +828,7 @@
|
||||
causing system reset or hang due to sending
|
||||
INIT from AP to BSP.
|
||||
|
||||
disable_ddw [PPC/PSERIES]
|
||||
disable_ddw [PPC/PSERIES]
|
||||
Disable Dynamic DMA Window support. Use this
to work around buggy firmware.
|
||||
|
||||
@ -1025,6 +1020,12 @@
|
||||
address. The serial port must already be setup
|
||||
and configured. Options are not yet supported.
|
||||
|
||||
qcom_geni,<addr>
|
||||
Start an early, polled-mode console on a Qualcomm
|
||||
Generic Interface (GENI) based serial port at the
|
||||
specified address. The serial port must already be
|
||||
setup and configured. Options are not yet supported.
|
||||
|
||||
earlyprintk= [X86,SH,ARM,M68k,S390]
|
||||
earlyprintk=vga
|
||||
earlyprintk=efi
|
||||
@ -1188,7 +1189,7 @@
|
||||
parameter will force ia64_sal_cache_flush to call
|
||||
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
|
||||
|
||||
forcepae [X86-32]
|
||||
forcepae [X86-32]
|
||||
Forcefully enable Physical Address Extension (PAE).
|
||||
Many Pentium M systems disable PAE but may have a
|
||||
functionally usable PAE implementation.
|
||||
@ -1247,7 +1248,7 @@
|
||||
|
||||
gamma= [HW,DRM]
|
||||
|
||||
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
|
||||
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
|
||||
Format: off | on
|
||||
default: on
|
||||
|
||||
@ -1341,23 +1342,32 @@
|
||||
x86-64 are 2M (when the CPU supports "pse") and 1G
|
||||
(when the CPU supports the "pdpe1gb" cpuinfo flag).
|
||||
|
||||
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
|
||||
terminal devices. Valid values: 0..8
|
||||
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
|
||||
If specified, z/VM IUCV HVC accepts connections
|
||||
from listed z/VM user IDs only.
|
||||
hung_task_panic=
|
||||
[KNL] Should the hung task detector generate panics.
|
||||
Format: <integer>
|
||||
|
||||
A nonzero value instructs the kernel to panic when a
|
||||
hung task is detected. The default value is controlled
|
||||
by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
|
||||
option. The value selected by this boot parameter can
|
||||
be changed later by the kernel.hung_task_panic sysctl.
|
||||
|
||||
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
|
||||
terminal devices. Valid values: 0..8
|
||||
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
|
||||
If specified, z/VM IUCV HVC accepts connections
|
||||
from listed z/VM user IDs only.
|
||||
keep_bootcon [KNL]
|
||||
Do not unregister boot console at start. This is only
|
||||
useful for debugging when something happens in the window
|
||||
between unregistering the boot console and initializing
|
||||
the real console.
|
||||
|
||||
i2c_bus= [HW] Override the default board specific I2C bus speed
|
||||
or register an additional I2C bus that is not
|
||||
registered from board initialization code.
|
||||
Format:
|
||||
<bus_id>,<clkrate>
|
||||
i2c_bus= [HW] Override the default board specific I2C bus speed
|
||||
or register an additional I2C bus that is not
|
||||
registered from board initialization code.
|
||||
Format:
|
||||
<bus_id>,<clkrate>
|
||||
|
||||
i8042.debug [HW] Toggle i8042 debug mode
|
||||
i8042.unmask_kbd_data
|
||||
@ -1386,7 +1396,7 @@
|
||||
Default: only on s2r transitions on x86; most other
|
||||
architectures force reset to be always executed
|
||||
i8042.unlock [HW] Unlock (ignore) the keylock
|
||||
i8042.kbdreset [HW] Reset device connected to KBD port
|
||||
i8042.kbdreset [HW] Reset device connected to KBD port
|
||||
|
||||
i810= [HW,DRM]
|
||||
|
||||
@ -1548,13 +1558,13 @@
|
||||
programs exec'd, files mmap'd for exec, and all files
|
||||
opened for read by uid=0.
|
||||
|
||||
ima_template= [IMA]
|
||||
ima_template= [IMA]
|
||||
Select one of defined IMA measurements template formats.
|
||||
Formats: { "ima" | "ima-ng" | "ima-sig" }
|
||||
Default: "ima-ng"
|
||||
|
||||
ima_template_fmt=
|
||||
[IMA] Define a custom template format.
|
||||
[IMA] Define a custom template format.
|
||||
Format: { "field1|...|fieldN" }
|
||||
|
||||
ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
|
||||
@ -1597,7 +1607,7 @@
|
||||
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
|
||||
Format: <irq>
|
||||
|
||||
int_pln_enable [x86] Enable power limit notification interrupt
|
||||
int_pln_enable [x86] Enable power limit notification interrupt
|
||||
|
||||
integrity_audit=[IMA]
|
||||
Format: { "0" | "1" }
|
||||
@ -1650,39 +1660,39 @@
|
||||
0 disables intel_idle and fall back on acpi_idle.
|
||||
1 to 9 specify maximum depth of C-state.
|
||||
|
||||
intel_pstate= [X86]
|
||||
disable
|
||||
Do not enable intel_pstate as the default
|
||||
scaling driver for the supported processors
|
||||
passive
|
||||
Use intel_pstate as a scaling driver, but configure it
|
||||
to work with generic cpufreq governors (instead of
|
||||
enabling its internal governor). This mode cannot be
|
||||
used along with the hardware-managed P-states (HWP)
|
||||
feature.
|
||||
force
|
||||
Enable intel_pstate on systems that prohibit it by default
|
||||
in favor of acpi-cpufreq. Forcing the intel_pstate driver
|
||||
instead of acpi-cpufreq may disable platform features, such
|
||||
as thermal controls and power capping, that rely on ACPI
|
||||
P-States information being indicated to OSPM and therefore
|
||||
should be used with caution. This option does not work with
|
||||
processors that aren't supported by the intel_pstate driver
|
||||
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
|
||||
no_hwp
|
||||
Do not enable hardware P state control (HWP)
|
||||
if available.
|
||||
hwp_only
|
||||
Only load intel_pstate on systems which support
|
||||
hardware P state control (HWP) if available.
|
||||
support_acpi_ppc
|
||||
Enforce ACPI _PPC performance limits. If the Fixed ACPI
|
||||
Description Table, specifies preferred power management
|
||||
profile as "Enterprise Server" or "Performance Server",
|
||||
then this feature is turned on by default.
|
||||
per_cpu_perf_limits
|
||||
Allow per-logical-CPU P-State performance control limits using
|
||||
cpufreq sysfs interface
|
||||
intel_pstate= [X86]
|
||||
disable
|
||||
Do not enable intel_pstate as the default
|
||||
scaling driver for the supported processors
|
||||
passive
|
||||
Use intel_pstate as a scaling driver, but configure it
|
||||
to work with generic cpufreq governors (instead of
|
||||
enabling its internal governor). This mode cannot be
|
||||
used along with the hardware-managed P-states (HWP)
|
||||
feature.
|
||||
force
|
||||
Enable intel_pstate on systems that prohibit it by default
|
||||
in favor of acpi-cpufreq. Forcing the intel_pstate driver
|
||||
instead of acpi-cpufreq may disable platform features, such
|
||||
as thermal controls and power capping, that rely on ACPI
|
||||
P-States information being indicated to OSPM and therefore
|
||||
should be used with caution. This option does not work with
|
||||
processors that aren't supported by the intel_pstate driver
|
||||
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
|
||||
no_hwp
|
||||
Do not enable hardware P state control (HWP)
|
||||
if available.
|
||||
hwp_only
|
||||
Only load intel_pstate on systems which support
|
||||
hardware P state control (HWP) if available.
|
||||
support_acpi_ppc
|
||||
Enforce ACPI _PPC performance limits. If the Fixed ACPI
|
||||
Description Table, specifies preferred power management
|
||||
profile as "Enterprise Server" or "Performance Server",
|
||||
then this feature is turned on by default.
|
||||
per_cpu_perf_limits
|
||||
Allow per-logical-CPU P-State performance control limits using
|
||||
cpufreq sysfs interface
|
||||
|
||||
intremap= [X86-64, Intel-IOMMU]
|
||||
on enable Interrupt Remapping (default)
|
||||
@ -1705,7 +1715,6 @@
|
||||
nopanic
|
||||
merge
|
||||
nomerge
|
||||
forcesac
|
||||
soft
|
||||
pt [x86, IA-64]
|
||||
nobypass [PPC/POWERNV]
|
||||
@ -2027,7 +2036,7 @@
|
||||
* [no]ncqtrim: Turn off queued DSM TRIM.
|
||||
|
||||
* nohrst, nosrst, norst: suppress hard, soft
|
||||
and both resets.
|
||||
and both resets.
|
||||
|
||||
* rstonce: only attempt one reset during
|
||||
hot-unplug link recovery
|
||||
@ -2215,7 +2224,7 @@
|
||||
[KNL,SH] Allow user to override the default size for
|
||||
per-device physically contiguous DMA buffers.
|
||||
|
||||
memhp_default_state=online/offline
|
||||
memhp_default_state=online/offline
|
||||
[KNL] Set the initial state for the memory hotplug
|
||||
onlining policy. If not specified, the default value is
|
||||
set according to the
|
||||
@ -2600,6 +2609,9 @@
|
||||
emulation library even if a 387 maths coprocessor
|
||||
is present.
|
||||
|
||||
no5lvl [X86-64] Disable 5-level paging mode. Forces
|
||||
kernel to use 4-level paging instead.
|
||||
|
||||
no_console_suspend
|
||||
[HW] Never suspend the console
|
||||
Disable suspending of consoles during suspend and
|
||||
@ -2680,6 +2692,9 @@
|
||||
allow data leaks with this option, which is equivalent
|
||||
to spectre_v2=off.
|
||||
|
||||
nospec_store_bypass_disable
|
||||
[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
|
||||
|
||||
noxsave [BUGS=X86] Disables x86 extended register state save
|
||||
and restore using xsave. The kernel will fallback to
|
||||
enabling legacy floating-point and sse state.
|
||||
@ -2762,7 +2777,7 @@
|
||||
[X86,PV_OPS] Disable paravirtualized VMware scheduler
|
||||
clock and use the default one.
|
||||
|
||||
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
|
||||
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
|
||||
steal time is computed, but won't influence scheduler
|
||||
behaviour
|
||||
|
||||
@ -2823,7 +2838,7 @@
|
||||
notsc [BUGS=X86-32] Disable Time Stamp Counter
|
||||
|
||||
nowatchdog [KNL] Disable both lockup detectors, i.e.
|
||||
soft-lockup and NMI watchdog (hard-lockup).
|
||||
soft-lockup and NMI watchdog (hard-lockup).
|
||||
|
||||
nowb [ARM]
|
||||
|
||||
@ -2843,7 +2858,7 @@
|
||||
If the dependencies are under your control, you can
|
||||
turn on cpu0_hotplug.
|
||||
|
||||
nps_mtm_hs_ctr= [KNL,ARC]
|
||||
nps_mtm_hs_ctr= [KNL,ARC]
|
||||
This parameter sets the maximum duration, in
|
||||
cycles, each HW thread of the CTOP can run
|
||||
without interruptions, before HW switches it.
|
||||
@ -2911,9 +2926,6 @@
|
||||
This will also cause panics on machine check exceptions.
|
||||
Useful together with panic=30 to trigger a reboot.
|
||||
|
||||
OSS [HW,OSS]
|
||||
See Documentation/sound/oss/oss-parameters.txt
|
||||
|
||||
page_owner= [KNL] Boot-time page_owner enabling option.
|
||||
Storage of the information about who allocated
|
||||
each page is disabled in default. With this switch,
|
||||
@ -2984,7 +2996,7 @@
|
||||
|
||||
pci=option[,option...] [PCI] various PCI subsystem options:
|
||||
earlydump [X86] dump PCI config space before the kernel
|
||||
changes anything
|
||||
changes anything
|
||||
off [X86] don't probe for the PCI bus
|
||||
bios [X86-32] force use of PCI BIOS, don't access
|
||||
the hardware directly. Use this if your machine
|
||||
@ -3072,7 +3084,7 @@
|
||||
is enabled by default. If you need to use this,
|
||||
please report a bug.
|
||||
nocrs [X86] Ignore PCI host bridge windows from ACPI.
|
||||
If you need to use this, please report a bug.
|
||||
If you need to use this, please report a bug.
|
||||
routeirq Do IRQ routing for all PCI devices.
|
||||
This is normally done in pci_enable_device(),
|
||||
so this option is a temporary workaround
|
||||
@ -3147,6 +3159,8 @@
|
||||
on: Turn realloc on
|
||||
realloc same as realloc=on
|
||||
noari do not use PCIe ARI.
|
||||
noats [PCIE, Intel-IOMMU, AMD-IOMMU]
|
||||
do not use PCIe ATS (and IOMMU device IOTLB).
|
||||
pcie_scan_all Scan all possible PCIe devices. Otherwise we
|
||||
only look for one device below a PCIe downstream
|
||||
port.
|
||||
@ -3915,7 +3929,7 @@
|
||||
cache (risks via metadata attacks are mostly
|
||||
unchanged). Debug options disable merging on their
|
||||
own.
|
||||
For more information see Documentation/vm/slub.txt.
|
||||
For more information see Documentation/vm/slub.rst.
|
||||
|
||||
slab_max_order= [MM, SLAB]
|
||||
Determines the maximum allowed order for slabs.
|
||||
@ -3929,7 +3943,7 @@
|
||||
slub_debug can create guard zones around objects and
|
||||
may poison objects when not in use. Also tracks the
|
||||
last alloc / free. For more information see
|
||||
Documentation/vm/slub.txt.
|
||||
Documentation/vm/slub.rst.
|
||||
|
||||
slub_memcg_sysfs= [MM, SLUB]
|
||||
Determines whether to enable sysfs directories for
|
||||
@ -3943,7 +3957,7 @@
|
||||
Determines the maximum allowed order for slabs.
|
||||
A high setting may cause OOMs due to memory
|
||||
fragmentation. For more information see
|
||||
Documentation/vm/slub.txt.
|
||||
Documentation/vm/slub.rst.
|
||||
|
||||
slub_min_objects= [MM, SLUB]
|
||||
The minimum number of objects per slab. SLUB will
|
||||
@ -3952,12 +3966,12 @@
|
||||
the number of objects indicated. The higher the number
|
||||
of objects the smaller the overhead of tracking slabs
|
||||
and the less frequently locks need to be acquired.
|
||||
For more information see Documentation/vm/slub.txt.
|
||||
For more information see Documentation/vm/slub.rst.
|
||||
|
||||
slub_min_order= [MM, SLUB]
|
||||
Determines the minimum page order for slabs. Must be
|
||||
lower than slub_max_order.
|
||||
For more information see Documentation/vm/slub.txt.
|
||||
For more information see Documentation/vm/slub.rst.
|
||||
|
||||
slub_nomerge [MM, SLUB]
|
||||
Same with slab_nomerge. This is supported for legacy.
|
||||
@ -4025,6 +4039,48 @@
|
||||
Not specifying this option is equivalent to
|
||||
spectre_v2=auto.
|
||||
|
||||
spec_store_bypass_disable=
|
||||
[HW] Control Speculative Store Bypass (SSB) Disable mitigation
|
||||
(Speculative Store Bypass vulnerability)
|
||||
|
||||
Certain CPUs are vulnerable to an exploit against
a common industry-wide performance optimization known
|
||||
as "Speculative Store Bypass" in which recent stores
|
||||
to the same memory location may not be observed by
|
||||
later loads during speculative execution. The idea
|
||||
is that such stores are unlikely and that they can
|
||||
be detected prior to instruction retirement at the
|
||||
end of a particular speculation execution window.
|
||||
|
||||
In vulnerable processors, the speculatively forwarded
|
||||
store can be used in a cache side channel attack, for
|
||||
example to read memory to which the attacker does not
|
||||
directly have access (e.g. inside sandboxed code).
|
||||
|
||||
This parameter controls whether the Speculative Store
|
||||
Bypass optimization is used.
|
||||
|
||||
on - Unconditionally disable Speculative Store Bypass
|
||||
off - Unconditionally enable Speculative Store Bypass
|
||||
auto - Kernel detects whether the CPU model contains an
|
||||
implementation of Speculative Store Bypass and
|
||||
picks the most appropriate mitigation. If the
|
||||
CPU is not vulnerable, "off" is selected. If the
|
||||
CPU is vulnerable the default mitigation is
|
||||
architecture and Kconfig dependent. See below.
|
||||
prctl - Control Speculative Store Bypass per thread
|
||||
via prctl. Speculative Store Bypass is enabled
|
||||
for a process by default. The state of the control
|
||||
is inherited on fork.
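A minimal prctl(2) usage sketch for this mode is included
after this parameter description.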
|
||||
seccomp - Same as "prctl" above, but all seccomp threads
|
||||
will disable SSB unless they explicitly opt out.
|
||||
|
||||
Not specifying this option is equivalent to
|
||||
spec_store_bypass_disable=auto.
|
||||
|
||||
Default mitigations:
|
||||
X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl"
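As an illustration of the per-task "prctl" mode listed above, the sketch
below opts the calling task out of Speculative Store Bypass. It is only a
sketch and assumes kernel headers new enough (v4.17 or later) to define
PR_SET_SPECULATION_CTRL and the PR_SPEC_* constants:

  /*
   * Sketch: disable Speculative Store Bypass for the current task via
   * the per-task speculation control prctl ("prctl" mode above).
   * Requires <sys/prctl.h> backed by v4.17+ kernel headers.
   */
  #include <stdio.h>
  #include <sys/prctl.h>

  int main(void)
  {
          /* Query the current state; a negative return means unsupported. */
          int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);

          if (state < 0) {
                  perror("PR_GET_SPECULATION_CTRL");
                  return 1;
          }

          /* Request that speculative store bypass be disabled for this task. */
          if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                    PR_SPEC_DISABLE, 0, 0) != 0) {
                  perror("PR_SET_SPECULATION_CTRL");
                  return 1;
          }

          printf("SSB disabled for this task (previous state: %#x)\n",
                 (unsigned int)state);
          return 0;
  }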
|
||||
|
||||
spia_io_base= [HW,MTD]
|
||||
spia_fio_base=
|
||||
spia_pedr=
|
||||
@ -4047,6 +4103,23 @@
|
||||
expediting. Set to zero to disable automatic
|
||||
expediting.
|
||||
|
||||
ssbd= [ARM64,HW]
|
||||
Speculative Store Bypass Disable control
|
||||
|
||||
On CPUs that are vulnerable to the Speculative
|
||||
Store Bypass vulnerability and offer a
|
||||
firmware based mitigation, this parameter
|
||||
indicates how the mitigation should be used:
|
||||
|
||||
force-on:  Unconditionally enable mitigation
           for both kernel and userspace
force-off: Unconditionally disable mitigation
           for both kernel and userspace
|
||||
kernel: Always enable mitigation in the
|
||||
kernel, and offer a prctl interface
|
||||
to allow userspace to register its
|
||||
interest in being mitigated too.
|
||||
|
||||
stack_guard_gap= [MM]
|
||||
override the default stack gap protection. The value
|
||||
is in page units and it defines how many pages prior
|
||||
@ -4259,7 +4332,7 @@
|
||||
[FTRACE] Set and start specified trace events in order
|
||||
to facilitate early boot debugging. The event-list is a
|
||||
comma separated list of trace events to enable. See
|
||||
also Documentation/trace/events.txt
|
||||
also Documentation/trace/events.rst
|
||||
|
||||
trace_options=[option-list]
|
||||
[FTRACE] Enable or disable tracer options at boot.
|
||||
@ -4274,7 +4347,7 @@
|
||||
|
||||
trace_options=stacktrace
|
||||
|
||||
See also Documentation/trace/ftrace.txt "trace options"
|
||||
See also Documentation/trace/ftrace.rst "trace options"
|
||||
section.
|
||||
|
||||
tp_printk[FTRACE]
|
||||
@ -4313,7 +4386,8 @@
|
||||
Format: [always|madvise|never]
|
||||
Can be used to control the default behavior of the system
|
||||
with respect to transparent hugepages.
|
||||
See Documentation/vm/transhuge.txt for more details.
|
||||
See Documentation/admin-guide/mm/transhuge.rst
|
||||
for more details.
|
||||
|
||||
tsc= Disable clocksource stability checks for TSC.
|
||||
Format: <string>
|
||||
@ -4391,7 +4465,7 @@
|
||||
|
||||
usbcore.initial_descriptor_timeout=
|
||||
[USB] Specifies timeout for the initial 64-byte
|
||||
USB_REQ_GET_DESCRIPTOR request in milliseconds
|
||||
USB_REQ_GET_DESCRIPTOR request in milliseconds
|
||||
(default 5000 = 5.0 seconds).
|
||||
|
||||
usbcore.nousb [USB] Disable the USB subsystem
|
||||
|
Documentation/admin-guide/mm/concepts.rst (new file, 222 lines)
@@ -0,0 +1,222 @@
|
||||
.. _mm_concepts:
|
||||
|
||||
=================
|
||||
Concepts overview
|
||||
=================
|
||||
|
||||
Memory management in Linux is a complex system that has evolved over the
years to include more and more functionality in order to support a variety
of systems, from MMU-less microcontrollers to supercomputers. Memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will
eventually be written. Yet, although some of the concepts are the same,
here we assume that an MMU is available and that the CPU can translate a
virtual address to a physical address.
|
||||
|
||||
.. contents:: :local:
|
||||
|
||||
Virtual Memory Primer
|
||||
=====================
|
||||
|
||||
The physical memory in a computer system is a limited resource, and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.
|
||||
|
||||
All this makes dealing directly with physical memory quite complex; to
avoid this complexity the concept of virtual memory was developed.

Virtual memory abstracts the details of physical memory from the
application software, allows only the needed information to be kept in
physical memory (demand paging), and provides a mechanism for the
protection and controlled sharing of data between processes.
|
||||
|
||||
With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.
|
||||
|
||||
The physical system memory is divided into page frames, or pages. The
|
||||
size of each page is architecture specific. Some architectures allow
|
||||
selection of the page size from several supported values; this
|
||||
selection is performed at the kernel build time by setting an
|
||||
appropriate kernel configuration option.
|
||||
|
||||
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from the virtual addresses used by programs to the real
addresses in physical memory. The page tables are organized hierarchically.
|
||||
|
||||
The tables at the lowest level of the hierarchy contain physical
|
||||
addresses of actual pages used by the software. The tables at higher
|
||||
levels contain physical addresses of the pages belonging to the lower
|
||||
levels. The pointer to the top level page table resides in a
|
||||
register. When the CPU performs the address translation, it uses this
|
||||
register to access the top level page table. The high bits of the
|
||||
virtual address are used to index an entry in the top level page
|
||||
table. That entry is then used to access the next level in the
|
||||
hierarchy with the next bits of the virtual address as the index to
|
||||
that level page table. The lowest bits in the virtual address define
|
||||
the offset inside the actual page.
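To make the translation scheme above concrete, the following is a small
illustrative sketch (user-space, not kernel code) that splits a virtual
address into the per-level table indices and the page offset. It assumes
the common x86-64 layout of four page-table levels with 4 KiB pages and
9 index bits per level; other architectures use different splits::

  /* Illustration only: x86-64 style 4-level split (9+9+9+9 index bits,
   * 12 offset bits) of a virtual address. The address is hypothetical. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t vaddr = 0x00007f1234567abcULL;         /* hypothetical */

          unsigned int top    = (vaddr >> 39) & 0x1ff;    /* top level table */
          unsigned int level3 = (vaddr >> 30) & 0x1ff;    /* next level      */
          unsigned int level2 = (vaddr >> 21) & 0x1ff;    /* next level      */
          unsigned int level1 = (vaddr >> 12) & 0x1ff;    /* lowest level    */
          unsigned int offset = vaddr & 0xfff;            /* offset in page  */

          printf("indices: %u %u %u %u, page offset: 0x%x\n",
                 top, level3, level2, level1, offset);
          return 0;
  }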
|
||||
|
||||
Huge Pages
|
||||
==========
|
||||
|
||||
The address translation requires several memory accesses, and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). Usually the TLB is a scarce resource, and applications with a
large memory working set will experience a performance hit because of
TLB misses.
|
||||
|
||||
Many modern CPU architectures allow mapping of the memory pages
|
||||
directly by the higher levels in the page table. For instance, on x86,
|
||||
it is possible to map 2M and even 1G pages using entries in the second
|
||||
and the third level page tables. In Linux such pages are called
|
||||
`huge`. Usage of huge pages significantly reduces pressure on TLB,
|
||||
improves TLB hit-rate and thus improves overall system performance.
|
||||
|
||||
There are two mechanisms in Linux that enable mapping of the physical
|
||||
memory with the huge pages. The first one is `HugeTLB filesystem`, or
|
||||
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
|
||||
store. For the files created in this filesystem the data resides in
|
||||
the memory and mapped using huge pages. The hugetlbfs is described at
|
||||
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
|
||||
|
||||
Another, more recent, mechanism that enables use of the huge pages is
|
||||
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
|
||||
requires users and/or system administrators to configure what parts of
|
||||
the system memory should and can be mapped by the huge pages, THP
|
||||
manages such mappings transparently to the user and hence the
|
||||
name. See
|
||||
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
|
||||
for more details about THP.
|
||||
|
||||
Zones
|
||||
=====
|
||||
|
||||
Often hardware poses restrictions on how different physical memory
|
||||
ranges can be accessed. In some cases, devices cannot perform DMA to
|
||||
all the addressable memory. In other cases, the size of the physical
|
||||
memory exceeds the maximal addressable size of virtual memory and
|
||||
special actions are required to access portions of the memory. Linux
|
||||
groups memory pages into `zones` according to their possible
|
||||
usage. For example, ZONE_DMA will contain memory that can be used by
|
||||
devices for DMA, ZONE_HIGHMEM will contain memory that is not
|
||||
permanently mapped into kernel's address space and ZONE_NORMAL will
|
||||
contain normally addressed pages.
|
||||
|
||||
The actual layout of the memory zones is hardware dependent as not all
|
||||
architectures define all zones, and requirements for DMA are different
|
||||
for different platforms.
|
||||
|
||||
Nodes
|
||||
=====
|
||||
|
||||
Many multi-processor machines are NUMA - Non-Uniform Memory Access -
|
||||
systems. In such systems the memory is arranged into banks that have
|
||||
different access latency depending on the "distance" from the
|
||||
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
|
||||
counters. You can find more details about NUMA in
|
||||
:ref:`Documentation/vm/numa.rst <numa>` and in
|
||||
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
|
||||
|
||||
Page cache
|
||||
==========
|
||||
|
||||
The physical memory is volatile and the common case for getting data
|
||||
into the memory is to read it from files. Whenever a file is read, the
|
||||
data is put into the `page cache` to avoid expensive disk access on
|
||||
the subsequent reads. Similarly, when one writes to a file, the data
|
||||
is placed in the page cache and eventually gets into the backing
|
||||
storage device. The written pages are marked as `dirty` and when Linux
|
||||
decides to reuse them for other purposes, it makes sure to synchronize
|
||||
the file contents on the device with the updated data.
|
||||
|
||||
Anonymous Memory
|
||||
================
|
||||
|
||||
The `anonymous memory` or `anonymous mappings` represent memory that
|
||||
is not backed by a filesystem. Such mappings are implicitly created
|
||||
for the program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, anonymous mappings only define virtual memory areas
that the program is allowed to access. Read accesses result
in the creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page is allocated to hold the written data. The page
is marked dirty and, if the kernel decides to repurpose it,
the dirty page is swapped out.
|
||||
|
||||
Reclaim
|
||||
=======
|
||||
|
||||
Throughout the system lifetime, a physical page can be used for storing
|
||||
different types of data. It can be kernel internal data structures,
|
||||
DMA'able buffers for device drivers use, data read from a filesystem,
|
||||
memory allocated by user space processes etc.
|
||||
|
||||
Depending on the page usage it is treated differently by the Linux
|
||||
memory management. The pages that can be freed at any time, either
|
||||
because they cache the data available elsewhere, for instance, on a
|
||||
hard disk, or because they can be swapped out, again, to the hard
|
||||
disk, are called `reclaimable`. The most notable categories of the
|
||||
reclaimable pages are page cache and anonymous memory.
|
||||
|
||||
In most cases, the pages holding internal kernel data and used as DMA
|
||||
buffers cannot be repurposed, and they remain pinned until freed by
|
||||
their user. Such pages are called `unreclaimable`. However, in certain
|
||||
circumstances, even pages occupied with kernel data structures can be
|
||||
reclaimed. For instance, in-memory caches of filesystem metadata can
|
||||
be re-read from the storage device and therefore it is possible to
|
||||
discard them from the main memory when system is under memory
|
||||
pressure.
|
||||
|
||||
The process of freeing the reclaimable physical memory pages and
|
||||
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
|
||||
pages either asynchronously or synchronously, depending on the state
|
||||
of the system. When system is not loaded, most of the memory is free
|
||||
and allocation request will be satisfied immediately from the free
|
||||
pages supply. As the load increases, the amount of the free pages goes
|
||||
down and when it reaches a certain threshold (high watermark), an
|
||||
allocation request will awaken the ``kswapd`` daemon. It will
|
||||
asynchronously scan memory pages and either just free them if the data
|
||||
they contain is available elsewhere, or evict to the backing storage
|
||||
device (remember those dirty pages?). As memory usage increases even
|
||||
more and reaches another threshold - min watermark - an allocation
|
||||
will trigger the `direct reclaim`. In this case allocation is stalled
|
||||
until enough memory pages are reclaimed to satisfy the request.
|
||||
|
||||
Compaction
|
||||
==========
|
||||
|
||||
As the system runs, tasks allocate and free the memory and it becomes
|
||||
fragmented. Although with virtual memory it is possible to present
|
||||
scattered physical pages as virtually contiguous range, sometimes it is
|
||||
necessary to allocate large physically contiguous memory areas. Such
|
||||
need may arise, for instance, when a device driver requires large
|
||||
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
|
||||
addresses the fragmentation issue. This mechanism moves occupied pages
|
||||
from the lower part of a memory zone to free pages in the upper part
|
||||
of the zone. When a compaction scan is finished free pages are grouped
|
||||
together at the beginning of the zone and allocations of large
|
||||
physically contiguous areas become possible.
|
||||
|
||||
Like reclaim, the compaction may happen asynchronously in ``kcompactd``
|
||||
daemon or synchronously as a result of memory allocation request.
|
||||
|
||||
OOM killer
|
||||
==========
|
||||
|
||||
It may happen, that on a loaded machine memory will be exhausted. When
|
||||
the kernel detects that the system runs out of memory (OOM) it invokes
|
||||
`OOM killer`. Its mission is simple: all it has to do is to select a
|
||||
task to sacrifice for the sake of the overall system health. The
|
||||
selected task is killed in a hope that after it exits enough memory
|
||||
will be freed to continue normal operation.
|
Documentation/admin-guide/mm/hugetlbpage.rst (new file, 382 lines)
@@ -0,0 +1,382 @@
|
||||
.. _hugetlbpage:
|
||||
|
||||
=============
|
||||
HugeTLB Pages
|
||||
=============
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
The intent of this file is to give a brief summary of hugetlbpage support in
|
||||
the Linux kernel. This support is built on top of multiple page size support
|
||||
that is provided by most modern architectures. For example, x86 CPUs normally
|
||||
support 4K and 2M (1G if architecturally supported) page sizes, ia64
|
||||
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
|
||||
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
|
||||
translations. Typically this is a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB resources.
|
||||
This optimization is more critical now as bigger and bigger physical memories
|
||||
(several GBs) are more readily available.
|
||||
|
||||
Users can use the huge page support in Linux kernel by either using the mmap
|
||||
system call or standard SYSV shared memory system calls (shmget, shmat).
|
||||
|
||||
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
|
||||
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
|
||||
automatically when CONFIG_HUGETLBFS is selected) configuration
|
||||
options.
|
||||
|
||||
The ``/proc/meminfo`` file provides information about the total number of
|
||||
persistent hugetlb pages in the kernel's huge page pool. It also displays
|
||||
default huge page size and information about the number of free, reserved
|
||||
and surplus huge pages in the pool of huge pages of default size.
|
||||
The huge page size is needed for generating the proper alignment and
|
||||
size of the arguments to system calls that map huge page regions.
|
||||
|
||||
The output of ``cat /proc/meminfo`` will include lines like::
|
||||
|
||||
HugePages_Total: uuu
|
||||
HugePages_Free: vvv
|
||||
HugePages_Rsvd: www
|
||||
HugePages_Surp: xxx
|
||||
Hugepagesize: yyy kB
|
||||
Hugetlb: zzz kB
|
||||
|
||||
where:
|
||||
|
||||
HugePages_Total
|
||||
is the size of the pool of huge pages.
|
||||
HugePages_Free
|
||||
is the number of huge pages in the pool that are not yet
|
||||
allocated.
|
||||
HugePages_Rsvd
|
||||
is short for "reserved," and is the number of huge pages for
|
||||
which a commitment to allocate from the pool has been made,
|
||||
but no allocation has yet been made. Reserved huge pages
|
||||
guarantee that an application will be able to allocate a
|
||||
huge page from the pool of huge pages at fault time.
|
||||
HugePages_Surp
|
||||
is short for "surplus," and is the number of huge pages in
|
||||
the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
|
||||
maximum number of surplus huge pages is controlled by
|
||||
``/proc/sys/vm/nr_overcommit_hugepages``.
|
||||
Hugepagesize
|
||||
is the default hugepage size (in Kb).
|
||||
Hugetlb
|
||||
is the total amount of memory (in kB), consumed by huge
|
||||
pages of all sizes.
|
||||
If huge pages of different sizes are in use, this number
|
||||
will exceed HugePages_Total \* Hugepagesize. To get more
|
||||
detailed information, please, refer to
|
||||
``/sys/kernel/mm/hugepages`` (described below).
|
||||
|
||||
|
||||
``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
|
||||
configured in the kernel.
|
||||
|
||||
``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
|
||||
pages in the kernel's huge page pool. "Persistent" huge pages will be
|
||||
returned to the huge page pool when freed by a task. A user with root
|
||||
privileges can dynamically allocate more or free some persistent huge pages
|
||||
by increasing or decreasing the value of ``nr_hugepages``.
|
||||
|
||||
Pages that are used as huge pages are reserved inside the kernel and cannot
|
||||
be used for other purposes. Huge pages cannot be swapped out under
|
||||
memory pressure.
|
||||
|
||||
Once a number of huge pages have been pre-allocated to the kernel huge page
|
||||
pool, a user with appropriate privilege can use either the mmap system call
|
||||
or shared memory system calls to use the huge pages. See the discussion of
|
||||
:ref:`Using Huge Pages <using_huge_pages>`, below.
|
||||
|
||||
The administrator can allocate persistent huge pages on the kernel boot
|
||||
command line by specifying the "hugepages=N" parameter, where 'N' = the
|
||||
number of huge pages requested. This is the most reliable method of
|
||||
allocating huge pages as memory has not yet become fragmented.
|
||||
|
||||
Some platforms support multiple huge page sizes. To allocate huge pages
|
||||
of a specific size, one must precede the huge pages boot command parameters
|
||||
with a huge page size selection parameter "hugepagesz=<size>". <size> must
|
||||
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
|
||||
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
|
||||
|
||||
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
|
||||
indicates the current number of pre-allocated huge pages of the default size.
|
||||
Thus, one can use the following command to dynamically allocate/deallocate
|
||||
default sized persistent huge pages::
|
||||
|
||||
echo 20 > /proc/sys/vm/nr_hugepages
|
||||
|
||||
This command will try to adjust the number of default sized huge pages in the
|
||||
huge page pool to 20, allocating or freeing huge pages, as required.
|
||||
|
||||
On a NUMA platform, the kernel will attempt to distribute the huge page pool
|
||||
over all the set of allowed nodes specified by the NUMA memory policy of the
|
||||
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
|
||||
task has default memory policy--is all on-line nodes with memory. Allowed
|
||||
nodes with insufficient available, contiguous memory for a huge page will be
|
||||
silently skipped when allocating persistent huge pages. See the
|
||||
:ref:`discussion below <mem_policy_and_hp_alloc>`
|
||||
of the interaction of task memory policy, cpusets and per node attributes
|
||||
with the allocation and freeing of persistent huge pages.
|
||||
|
||||
The success or failure of huge page allocation depends on the amount of
|
||||
physically contiguous memory that is present in system at the time of the
|
||||
allocation attempt. If the kernel is unable to allocate huge pages from
|
||||
some nodes in a NUMA system, it will attempt to make up the difference by
|
||||
allocating extra pages on other nodes with sufficient available contiguous
|
||||
memory, if any.
|
||||
|
||||
System administrators may want to put this command in one of the local rc
|
||||
init files. This will enable the kernel to allocate huge pages early in
|
||||
the boot process when the possibility of getting physical contiguous pages
|
||||
is still very high. Administrators can verify the number of huge pages
|
||||
actually allocated by checking the sysctl or meminfo. To check the per node
|
||||
distribution of huge pages in a NUMA system, use::
|
||||
|
||||
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
|
||||
|
||||
``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
|
||||
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
|
||||
requested by applications. Writing any non-zero value into this file
|
||||
indicates that the hugetlb subsystem is allowed to try to obtain that
|
||||
number of "surplus" huge pages from the kernel's normal page pool, when the
|
||||
persistent huge page pool is exhausted. As these surplus huge pages become
|
||||
unused, they are freed back to the kernel's normal page pool.
|
||||
|
||||
When increasing the huge page pool size via ``nr_hugepages``, any existing
|
||||
surplus pages will first be promoted to persistent huge pages. Then, additional
|
||||
huge pages will be allocated, if necessary and if possible, to fulfill
|
||||
the new persistent huge page pool size.
|
||||
|
||||
The administrator may shrink the pool of persistent huge pages for
|
||||
the default huge page size by setting the ``nr_hugepages`` sysctl to a
|
||||
smaller value. The kernel will attempt to balance the freeing of huge pages
|
||||
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
|
||||
Any free huge pages on the selected nodes will be freed back to the kernel's
|
||||
normal page pool.
|
||||
|
||||
Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
|
||||
it becomes less than the number of huge pages in use will convert the balance
|
||||
of the in-use huge pages to surplus huge pages. This will occur even if
|
||||
the number of surplus pages would exceed the overcommit value. As long as
|
||||
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
|
||||
increased sufficiently, or the surplus huge pages go out of use and are freed--
|
||||
no more surplus huge pages will be allowed to be allocated.
|
||||
|
||||
With support for multiple huge page pools at run-time available, much of
|
||||
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
|
||||
sysfs.
|
||||
The ``/proc`` interfaces discussed above have been retained for backwards
|
||||
compatibility. The root huge page control directory in sysfs is::
|
||||
|
||||
/sys/kernel/mm/hugepages
|
||||
|
||||
For each huge page size supported by the running kernel, a subdirectory
|
||||
will exist, of the form::
|
||||
|
||||
hugepages-${size}kB
|
||||
|
||||
Inside each of these directories, the same set of files will exist::
|
||||
|
||||
nr_hugepages
|
||||
nr_hugepages_mempolicy
|
||||
nr_overcommit_hugepages
|
||||
free_hugepages
|
||||
resv_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
which function as described above for the default huge page-sized case.
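As a small illustration, the sketch below walks this sysfs hierarchy and
prints the pool size for each supported huge page size; only the directory
layout documented above is assumed::

  /* Sketch: print nr_hugepages for every supported huge page size by
   * walking /sys/kernel/mm/hugepages/hugepages-<size>kB/. */
  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *root = "/sys/kernel/mm/hugepages";
          DIR *dir = opendir(root);
          struct dirent *d;

          if (!dir)
                  return 1;

          while ((d = readdir(dir)) != NULL) {
                  char path[512];
                  unsigned long nr;
                  FILE *f;

                  /* Only the hugepages-<size>kB subdirectories are of interest. */
                  if (strncmp(d->d_name, "hugepages-", 10) != 0)
                          continue;

                  snprintf(path, sizeof(path), "%s/%s/nr_hugepages", root, d->d_name);
                  f = fopen(path, "r");
                  if (!f)
                          continue;
                  if (fscanf(f, "%lu", &nr) == 1)
                          printf("%s: nr_hugepages=%lu\n", d->d_name, nr);
                  fclose(f);
          }

          closedir(dir);
          return 0;
  }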
|
||||
|
||||
.. _mem_policy_and_hp_alloc:
|
||||
|
||||
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
|
||||
===================================================================
|
||||
|
||||
Whether huge pages are allocated and freed via the ``/proc`` interface or
|
||||
the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
|
||||
NUMA nodes from which huge pages are allocated or freed are controlled by the
|
||||
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
|
||||
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
|
||||
is ignored.
|
||||
|
||||
The recommended method to allocate or free huge pages to/from the kernel
|
||||
huge page pool, using the ``nr_hugepages`` example above, is::
|
||||
|
||||
numactl --interleave <node-list> echo 20 \
|
||||
>/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
or, more succinctly::
|
||||
|
||||
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
|
||||
specified in <node-list>, depending on whether number of persistent huge pages
|
||||
is initially less than or greater than 20, respectively. No huge pages will be
|
||||
allocated nor freed on any node not included in the specified <node-list>.
|
||||
|
||||
When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
|
||||
memory policy mode--bind, preferred, local or interleave--may be used. The
|
||||
resulting effect on persistent huge page allocation is as follows:
|
||||
|
||||
#. Regardless of mempolicy mode [see
|
||||
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
|
||||
persistent huge pages will be distributed across the node or nodes
|
||||
specified in the mempolicy as if "interleave" had been specified.
|
||||
However, if a node in the policy does not contain sufficient contiguous
|
||||
memory for a huge page, the allocation will not "fallback" to the nearest
|
||||
neighbor node with sufficient contiguous memory. To do this would cause
|
||||
undesirable imbalance in the distribution of the huge page pool, or
|
||||
possibly, allocation of persistent huge pages on nodes not allowed by
|
||||
the task's memory policy.
|
||||
|
||||
#. One or more nodes may be specified with the bind or interleave policy.
|
||||
If more than one node is specified with the preferred policy, only the
|
||||
lowest numeric id will be used. Local policy will select the node where
|
||||
the task is running at the time the nodes_allowed mask is constructed.
|
||||
For local policy to be deterministic, the task must be bound to a cpu or
|
||||
cpus in a single node. Otherwise, the task could be migrated to some
|
||||
other node at any time after launch and the resulting node will be
|
||||
indeterminate. Thus, local policy is not very useful for this purpose.
|
||||
Any of the other mempolicy modes may be used to specify a single node.
|
||||
|
||||
#. The nodes allowed mask will be derived from any non-default task mempolicy,
|
||||
whether this policy was set explicitly by the task itself or one of its
|
||||
ancestors, such as numactl. This means that if the task is invoked from a
|
||||
shell with non-default policy, that policy will be used. One can specify a
|
||||
node list of "all" with numactl --interleave or --membind [-m] to achieve
|
||||
interleaving over all nodes in the system or cpuset.
|
||||
|
||||
#. Any task mempolicy specified--e.g., using numactl--will be constrained by
|
||||
the resource limits of any cpuset in which the task runs. Thus, there will
|
||||
be no way for a task with non-default policy running in a cpuset with a
|
||||
subset of the system nodes to allocate huge pages outside the cpuset
|
||||
without first moving to a cpuset that contains all of the desired nodes.
|
||||
|
||||
#. Boot-time huge page allocation attempts to distribute the requested number
|
||||
of huge pages over all on-line nodes with memory.
|
||||
|
||||
Per Node Hugepages Attributes
|
||||
=============================
|
||||
|
||||
A subset of the contents of the root huge page control directory in sysfs,
|
||||
described above, will be replicated under the system device of each
|
||||
NUMA node with memory in::
|
||||
|
||||
/sys/devices/system/node/node[0-9]*/hugepages/
|
||||
|
||||
Under this directory, the subdirectory for each supported huge page size
|
||||
contains the following attribute files::
|
||||
|
||||
nr_hugepages
|
||||
free_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
The ``free_hugepages`` and ``surplus_hugepages`` attribute files are read-only. They return the number
|
||||
of free and surplus [overcommitted] huge pages, respectively, on the parent
|
||||
node.
|
||||
|
||||
The ``nr_hugepages`` attribute returns the total number of huge pages on the
|
||||
specified node. When this attribute is written, the number of persistent huge
|
||||
pages on the parent node will be adjusted to the specified value, if sufficient
|
||||
resources exist, regardless of the task's mempolicy or cpuset constraints.
|
||||
|
||||
Note that the number of overcommit and reserve pages remain global quantities,
|
||||
as we don't know until fault time, when the faulting task's mempolicy is
|
||||
applied, from which node the huge page allocation will be attempted.
|
||||
|
||||
.. _using_huge_pages:
|
||||
|
||||
Using Huge Pages
|
||||
================
|
||||
|
||||
If the user applications are going to request huge pages using the mmap system
call, then it is required that the system administrator mount a file system of
type hugetlbfs::
|
||||
|
||||
mount -t hugetlbfs \
|
||||
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
||||
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
||||
|
||||
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
|
||||
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.
|
||||
|
||||
The ``uid`` and ``gid`` options set the owner and group of the root of the
|
||||
file system. By default the ``uid`` and ``gid`` of the current process
|
||||
are taken.
|
||||
|
||||
The ``mode`` option sets the mode of the root of the file system to value & 01777.
|
||||
This value is given in octal. By default the value 0755 is picked.
|
||||
|
||||
If the platform supports multiple huge page sizes, the ``pagesize`` option can
|
||||
be used to specify the huge page size and associated pool. ``pagesize``
|
||||
is specified in bytes. If ``pagesize`` is not specified the platform's
|
||||
default huge page size and associated pool will be used.
|
||||
|
||||
The ``size`` option sets the maximum value of memory (huge pages) allowed
|
||||
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
|
||||
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
|
||||
The size is rounded down to HPAGE_SIZE boundary.
|
||||
|
||||
The ``min_size`` option sets the minimum value of memory (huge pages) allowed
|
||||
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
|
||||
either bytes or a percentage of the huge page pool.
|
||||
At mount time, the number of huge pages specified by ``min_size`` are reserved
|
||||
for use by the filesystem.
|
||||
If there are not enough free huge pages available, the mount will fail.
|
||||
As huge pages are allocated to the filesystem and freed, the reserve count
|
||||
is adjusted so that the sum of allocated and reserved huge pages is always
|
||||
at least ``min_size``.
|
||||
|
||||
The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
|
||||
can use.
|
||||
|
||||
If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
|
||||
command line then no limits are set.
|
||||
|
||||
For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
|
||||
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
|
||||
For example, size=2K has the same meaning as size=2048.
|
||||
|
||||
While read system calls are supported on files that reside on hugetlb
|
||||
file systems, write system calls are not.
|
||||
|
||||
Regular chown, chgrp, and chmod commands (with right permissions) could be
|
||||
used to change the file attributes on hugetlbfs.
|
||||
|
||||
Also, it is important to note that no such mount command is required if
|
||||
applications are going to use only shmat/shmget system calls or mmap with
|
||||
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
|
||||
:ref:`map_hugetlb <map_hugetlb>` below.
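A minimal sketch of such a MAP_HUGETLB mapping is shown here for convenience.
It assumes a 2 MB default huge page size for the mapping length, and
tools/testing/selftests/vm/map_hugetlb.c remains the authoritative example::

  /* Sketch: anonymous mapping backed by huge pages via MAP_HUGETLB.
   * LENGTH assumes a 2 MB default huge page size; adjust as needed.
   * Fails (ENOMEM) if no huge pages are available in the pool. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #define LENGTH (2UL * 1024 * 1024)

  int main(void)
  {
          void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

          if (addr == MAP_FAILED) {
                  perror("mmap(MAP_HUGETLB)");
                  return 1;
          }

          memset(addr, 0, LENGTH);        /* touch the huge page */
          munmap(addr, LENGTH);
          return 0;
  }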
|
||||
|
||||
Users who wish to use hugetlb memory via shared memory segment should be
|
||||
members of a supplementary group and system admin needs to configure that gid
|
||||
into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different
|
||||
applications to use any combination of mmaps and shm* calls, though the mount of
|
||||
filesystem will be required for using mmap calls without MAP_HUGETLB.
|
||||
|
||||
Syscalls that operate on memory backed by hugetlb pages only have their lengths
|
||||
aligned to the native page size of the processor; they will normally fail with
|
||||
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
|
||||
not hugepage aligned. For example, munmap(2) will fail if memory is backed by
|
||||
a hugetlb page and the length is smaller than the hugepage size.
|
||||
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
.. _map_hugetlb:
|
||||
|
||||
``map_hugetlb``
|
||||
see tools/testing/selftests/vm/map_hugetlb.c
|
||||
|
||||
``hugepage-shm``
|
||||
see tools/testing/selftests/vm/hugepage-shm.c
|
||||
|
||||
``hugepage-mmap``
|
||||
see tools/testing/selftests/vm/hugepage-mmap.c
|
||||
|
||||
The `libhugetlbfs`_ library provides a wide range of userspace tools
|
||||
to help with huge page usability, environment setup, and control.
|
||||
|
||||
.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
|
Documentation/admin-guide/mm/idle_page_tracking.rst (new file, 116 lines)
@@ -0,0 +1,116 @@
|
||||
.. _idle_page_tracking:
|
||||
|
||||
==================
|
||||
Idle Page Tracking
|
||||
==================
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
The idle page tracking feature allows tracking of which memory pages are being
|
||||
accessed by a workload and which are idle. This information can be useful for
|
||||
estimating the workload's working set size, which, in turn, can be taken into
|
||||
account when configuring the workload parameters, setting memory cgroup limits,
|
||||
or deciding where to place the workload within a compute cluster.
|
||||
|
||||
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
|
||||
|
||||
.. _user_api:
|
||||
|
||||
User API
|
||||
========
|
||||
|
||||
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
|
||||
Currently, it consists of a single read-write file,
|
||||
``/sys/kernel/mm/page_idle/bitmap``.
|
||||
|
||||
The file implements a bitmap where each bit corresponds to a memory page. The
|
||||
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
|
||||
mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
|
||||
set, the corresponding page is idle.
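For illustration, the sketch below applies the bit arithmetic described
above: it marks one page frame idle and later reads its bit back. The PFN
value is hypothetical; a real tool would obtain PFNs from
``/proc/pid/pagemap`` as explained later in this document::

  /* Sketch: mark one PFN idle and re-read its state, using the layout
   * above (bit i%64 of the 8-byte element i/64). The PFN is hypothetical;
   * real tools derive PFNs from /proc/<pid>/pagemap. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          uint64_t pfn = 0x12345;                 /* hypothetical page frame */
          off_t offset = (pfn / 64) * 8;          /* byte offset of the element */
          uint64_t mask = 1ULL << (pfn % 64);     /* bit within that element */
          uint64_t word = mask;
          int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);

          if (fd < 0)
                  return 1;

          /* Setting a bit marks the page idle; the written value is OR-ed in. */
          if (pwrite(fd, &word, sizeof(word), offset) != (ssize_t)sizeof(word))
                  return 1;

          /* ... let the workload run, then check whether the page stayed idle. */
          if (pread(fd, &word, sizeof(word), offset) != (ssize_t)sizeof(word))
                  return 1;
          printf("pfn %#llx is %s\n", (unsigned long long)pfn,
                 (word & mask) ? "idle" : "accessed");

          close(fd);
          return 0;
  }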
|
||||
|
||||
A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, swap cache pages. For other
page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.

That said, in order to estimate the number of pages that are not used by a
workload one should:

1. Mark all the workload's pages as idle by setting corresponding bits in
   ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
   ``/proc/pid/pagemap`` if the workload is represented by a process, or by
   filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
   is placed in a memory cgroup.

2. Wait until the workload accesses its working set.

3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
   If one wants to ignore certain types of pages, e.g. mlocked pages since they
   are not reclaimable, he or she can filter them out using
   ``/proc/kpageflags``. (A sketch of this counting step is shown below.)

See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
``/proc/kpagecgroup``.

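
As a rough sketch of step 3 above, counting idle pages over a fixed PFN range
(the range boundaries are assumptions; a real tool would take the PFNs of
interest from ``/proc/pid/pagemap`` instead)::

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const unsigned long start_pfn = 0, end_pfn = 1UL << 20;  /* assumed range */
        unsigned long idle = 0;

        int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn += 64) {
            uint64_t word;
            /* One aligned 8-byte read covers 64 PFNs. */
            if (pread(fd, &word, sizeof(word), (pfn / 64) * 8) != sizeof(word))
                break;                              /* e.g. -ENXIO past the max PFN */
            idle += __builtin_popcountll(word);     /* GCC/Clang builtin popcount */
        }
        close(fd);

        printf("idle pages in range: %lu\n", idle);
        return 0;
    }
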
.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space, in which case one or more PTEs it is mapped to will have the Accessed bit
set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
latter happens when:

- a userspace process reads or writes a page using a system call (e.g. read(2)
  or write(2))

- a page that is used for storing filesystem buffers is read or written,
  because a process needs filesystem metadata stored in it (e.g. lists a
  directory tree)

- a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.

Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.

Documentation/admin-guide/mm/index.rst (new file, 36 lines)
@@ -0,0 +1,36 @@

=================
Memory Management
=================

The Linux memory management subsystem is responsible, as the name implies,
for managing the memory in the system. This includes implementation of
virtual memory and demand paging, memory allocation both for kernel
internal structures and user space programs, mapping of files into
processes' address space and many other cool things.

Linux memory management is a complex system with many configurable
settings. Most of these settings are available via the ``/proc``
filesystem and can be queried and adjusted using ``sysctl``. These APIs
are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.

.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html

Linux memory management has its own jargon, and if you are not yet
familiar with it, consider reading
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.

Here we document in detail how to interact with the various mechanisms in
Linux memory management.

.. toctree::
   :maxdepth: 1

   concepts
   hugetlbpage
   idle_page_tracking
   ksm
   numa_memory_policy
   pagemap
   soft-dirty
   transhuge
   userfaultfd

Documentation/admin-guide/mm/ksm.rst (new file, 189 lines)
@@ -0,0 +1,189 @@

.. _admin_guide_ksm:

=======================
Kernel Samepage Merging
=======================

Overview
========

KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/

KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
by sharing the data common between them. But it can be useful to any
application which generates many instances of the same data.

The KSM daemon ksmd periodically scans those areas of user memory
which have been registered with it, looking for pages of identical
content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The number of pages that the KSM daemon scans in a single pass
and the time between the passes are configured using the :ref:`sysfs
interface <ksm_sysfs>`.

KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
be swapped out just like other user pages (but sharing is broken when they
are swapped back in: ksmd must rediscover their identity and merge again).

Controlling KSM with madvise
============================

KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
system call::

    int madvise(addr, length, MADV_MERGEABLE)

The app may call

::

    int madvise(addr, length, MADV_UNMERGEABLE)

to cancel that advice and restore unshared pages: whereupon KSM
unmerges whatever it merged in that range. Note: this unmerging call
may suddenly require more memory than is available - possibly failing
with EAGAIN, but more probably arousing the Out-Of-Memory killer.

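
As a sketch, registering an anonymous region with KSM and later cancelling the
advice could look like the following (the region size and fill pattern are
illustrative assumptions)::

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (64UL * 1024 * 1024)       /* illustrative: 64 MB of anonymous memory */

    int main(void)
    {
        void *addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Fill with identical data so ksmd has something to merge. */
        memset(addr, 0x5a, LEN);

        /* Advise the kernel that this range is a merge candidate. */
        if (madvise(addr, LEN, MADV_MERGEABLE)) {
            perror("madvise");             /* EINVAL if CONFIG_KSM is not set */
            return 1;
        }

        /* ... later, to undo the advice and unshare the pages: */
        if (madvise(addr, LEN, MADV_UNMERGEABLE))
            perror("madvise");             /* may fail with EAGAIN, see above */

        return 0;
    }
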
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.

If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).

Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
and might fail with EAGAIN if there is not enough memory for internal
structures.

Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit. KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason.

.. _ksm_sysfs:

KSM daemon sysfs interface
==========================

The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
readable by all but writable only by root:

pages_to_scan
        how many pages to scan before ksmd goes to sleep
        e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.

        Default: 100 (chosen for demonstration purposes)

sleep_millisecs
        how many milliseconds ksmd should sleep before the next scan
        e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``

        Default: 20 (chosen for demonstration purposes)

merge_across_nodes
        specifies if pages from different NUMA nodes can be merged.
        When set to 0, ksm merges only pages which physically reside
        in the memory area of the same NUMA node. That brings lower
        latency to access of shared pages. Systems with more nodes, at
        significant NUMA distances, are likely to benefit from the
        lower latency of setting 0. Smaller systems, which need to
        minimize memory usage, are likely to benefit from the greater
        sharing of setting 1 (default). You may wish to compare how
        your system performs under each setting, before deciding on
        which to use. The ``merge_across_nodes`` setting can be changed
        only when there are no ksm shared pages in the system: set run 2
        to unmerge pages first, then to 1 after changing
        ``merge_across_nodes``, to remerge according to the new setting.

        Default: 1 (merging across nodes as in earlier releases)

run
        * set to 0 to stop ksmd from running but keep merged pages,
        * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
        * set to 2 to stop ksmd and unmerge all pages currently merged, but
          leave mergeable areas registered for next run.

        Default: 0 (must be changed to 1 to activate KSM, except if
        CONFIG_SYSFS is disabled)

use_zero_pages
        specifies whether empty pages (i.e. allocated pages that only
        contain zeroes) should be treated specially. When set to 1,
        empty pages are merged with the kernel zero page(s) instead of
        with each other as it would happen normally. This can improve
        the performance on architectures with coloured zero pages,
        depending on the workload. Care should be taken when enabling
        this setting, as it can potentially degrade the performance of
        KSM for some workloads, for example if the checksums of pages
        candidate for merging match the checksum of an empty
        page. This setting can be changed at any time, it is only
        effective for pages merged after the change.

        Default: 0 (normal KSM behaviour as in earlier releases)

max_page_sharing
        Maximum sharing allowed for each KSM page. This enforces a
        deduplication limit to avoid high latency for virtual memory
        operations that involve traversal of the virtual mappings that
        share the KSM page. The minimum value is 2 as a newly created
        KSM page will have at least two sharers. The higher this value
        the faster KSM will merge the memory and the higher the
        deduplication factor will be, but the slower the worst case
        virtual mappings traversal could be for any given KSM
        page. Slowing down this traversal means there will be higher
        latency for certain virtual memory operations happening during
        swapping, compaction, NUMA balancing and page migration, in
        turn decreasing responsiveness for the caller of those virtual
        memory operations. The scheduler latency of other tasks not
        involved with the VM operations doing the virtual mappings
        traversal is not affected by this parameter as these
        traversals are always schedule friendly themselves.

stable_node_chains_prune_millisecs
        specifies how frequently KSM checks the metadata of the pages
        that hit the deduplication limit for stale information.
        Smaller millisecs values will free up the KSM metadata with
        lower latency, but they will make ksmd use more CPU during the
        scan. It is a no-op if no KSM page has hit the
        ``max_page_sharing`` limit yet.

The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

pages_shared
        how many shared pages are being used
pages_sharing
        how many more sites are sharing them i.e. how much saved
pages_unshared
        how many pages unique but repeatedly checked for merging
pages_volatile
        how many pages changing too fast to be placed in a tree
full_scans
        how many times all mergeable areas have been scanned
stable_node_chains
        the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
        number of duplicated KSM pages

A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
indicates wasted effort. ``pages_volatile`` embraces several
different kinds of activity, but a high proportion there would also
indicate poor use of madvise MADV_MERGEABLE.

The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.

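
For instance, a small sketch that reads the counters above and reports the
sharing ratio (file paths are the sysfs entries just described; error handling
is deliberately minimal)::

    #include <stdio.h>

    /* Read a single unsigned long from a sysfs file; returns 0 on failure. */
    static unsigned long read_counter(const char *path)
    {
        unsigned long val = 0;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%lu", &val) != 1)
                val = 0;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        unsigned long shared  = read_counter("/sys/kernel/mm/ksm/pages_shared");
        unsigned long sharing = read_counter("/sys/kernel/mm/ksm/pages_sharing");

        if (shared)
            printf("pages_sharing/pages_shared = %.2f\n",
                   (double)sharing / (double)shared);
        else
            printf("no KSM pages shared\n");
        return 0;
    }
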
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009

Documentation/admin-guide/mm/numa_memory_policy.rst (new file, 495 lines)
@@ -0,0 +1,495 @@

.. _numa_memory_policy:

==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/cgroup-v1/cpusets.txt``),
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
        this policy is "hard coded" into the kernel. It is the policy
        that governs all page allocations that aren't controlled by
        one of the more specific policy scopes discussed below. When
        the system is "up and running", the system default policy will
        use "local allocation" described below. However, during boot
        up, the system default policy will be set to interleave
        allocations across all nodes with "sufficient" memory, so as
        not to overload the initial boot node with boot-time
        allocations.

Task/Process Policy
        this is an optional, per-task policy. When defined for a
        specific task, this policy controls all page allocations made
        by or on behalf of the task that aren't controlled by a more
        specific scope. If a task does not define a task policy, then
        all page allocations that would have been controlled by the
        task policy "fall back" to the System Default Policy.

        The task policy applies to the entire address space of a task. Thus,
        it is inheritable, and indeed is inherited, across both fork()
        [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
        to establish the task policy for a child task exec()'d from an
        executable image that has no awareness of memory policy. See the
        :ref:`Memory Policy APIs <memory_policy_apis>` section,
        below, for an overview of the system call
        that a task may use to set/change its task/process policy.

        In a multi-threaded task, task policies apply only to the thread
        [Linux kernel task] that installs the policy and any threads
        subsequently created by that thread. Any sibling threads existing
        at the time a new task policy is installed retain their current
        policy.

        A task policy applies only to pages allocated after the policy is
        installed. Any pages already faulted in by the task when the task
        changes its task policy remain where they were allocated based on
        the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
        A "VMA" or "Virtual Memory Area" refers to a range of a task's
        virtual address space. A task may define a specific policy for a range
        of its virtual address space. See the
        :ref:`Memory Policy APIs <memory_policy_apis>` section,
        below, for an overview of the mbind() system call used to set a VMA
        policy.

        A VMA policy will govern the allocation of pages that back
        this region of the address space. Any regions of the task's
        address space that don't have an explicit VMA policy will fall
        back to the task policy, which may itself fall back to the
        System Default Policy.

        VMA policies have a few complicating details:

        * VMA policy applies ONLY to anonymous pages. These include
          pages allocated for anonymous segments, such as the task
          stack and heap, and any regions of the address space
          mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
          applied to a file mapping, it will be ignored if the mapping
          used the MAP_SHARED flag. If the file mapping used the
          MAP_PRIVATE flag, the VMA policy will only be applied when
          an anonymous page is allocated on an attempt to write to the
          mapping -- i.e., at Copy-On-Write.

        * VMA policies are shared between all tasks that share a
          virtual address space--a.k.a. threads--independent of when
          the policy is installed; and they are inherited across
          fork(). However, because VMA policies refer to a specific
          region of a task's address space, and because the address
          space is discarded and recreated on exec*(), VMA policies
          are NOT inheritable across exec(). Thus, only NUMA-aware
          applications may use VMA policies.

        * A task may install a new VMA policy on a sub-range of a
          previously mmap()ed region. When this happens, Linux splits
          the existing virtual memory area into 2 or 3 VMAs, each with
          its own policy.

        * By default, VMA policy applies only to pages allocated after
          the policy is installed. Any pages already faulted into the
          VMA range remain where they were allocated based on the
          policy at the time they were allocated. However, since
          2.6.16, Linux supports page migration via the mbind() system
          call, so that page contents can be moved to match a newly
          installed policy.

Shared Policy
        Conceptually, shared policies apply to "memory objects" mapped
        shared into one or more tasks' distinct address spaces. An
        application installs shared policies the same way as VMA
        policies--using the mbind() system call specifying a range of
        virtual addresses that map the shared object. However, unlike
        VMA policies, which can be considered to be an attribute of a
        range of a task's address space, shared policies apply
        directly to the shared object. Thus, all tasks that attach to
        the object share the policy, and all pages allocated for the
        shared object, by any task, will obey the shared policy.

        As of 2.6.22, only shared memory segments, created by shmget() or
        mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
        policy support was added to Linux, the associated data structures were
        added to hugetlbfs shmem segments. At the time, hugetlbfs did not
        support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
        shmem segments were never "hooked up" to the shared policy support.
        Although hugetlbfs segments now support lazy allocation, their support
        for shared policy has not been completed.

        As mentioned above in the :ref:`VMA policies <vma_policy>` section,
        allocations of page cache pages for regular files mmap()ed
        with MAP_SHARED ignore any VMA policy installed on the virtual
        address range backed by the shared file mapping. Rather,
        shared page cache pages, including pages backing private
        mappings that have not yet been written by the task, follow
        task policy, if any, else System Default Policy.

        The shared policy infrastructure supports different policies on subset
        ranges of the shared object. However, Linux still splits the VMA of
        the task that installs the policy for each range of distinct policy.
        Thus, different tasks that attach to a shared memory segment can have
        different VMA configurations mapping that one shared object. This
        can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
        a shared memory region, when one task has installed shared policy on
        one or more ranges of the region.

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes. The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
        This mode is only used in the memory policy APIs. Internally,
        MPOL_DEFAULT is converted to the NULL memory policy in all
        policy scopes. Any existing non-default policy will simply be
        removed when MPOL_DEFAULT is specified. As a result,
        MPOL_DEFAULT means "fall back to the next most specific policy
        scope."

        For example, a NULL or default task policy will fall back to the
        system default policy. A NULL or default vma policy will fall
        back to the task policy.

        When specified in one of the memory policy APIs, the Default mode
        does not use the optional set of nodes.

        It is an error for the set of nodes specified for this policy to
        be non-empty.

MPOL_BIND
        This mode specifies that memory must come from the set of
        nodes specified by the policy. Memory will be allocated from
        the node in the set with sufficient free memory that is
        closest to the node where the allocation takes place.

MPOL_PREFERRED
        This mode specifies that the allocation should be attempted
        from the single node specified in the policy. If that
        allocation fails, the kernel will search other nodes, in order
        of increasing distance from the preferred node based on
        information provided by the platform firmware.

        Internally, the Preferred policy uses a single node--the
        preferred_node member of struct mempolicy. When the internal
        mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
        and the policy is interpreted as local allocation. "Local"
        allocation policy can be viewed as a Preferred policy that
        starts at the node containing the cpu where the allocation
        takes place.

        It is possible for the user to specify that local allocation
        is always preferred by passing an empty nodemask with this
        mode. If an empty nodemask is passed, the policy cannot use
        the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
        described below.

MPOL_INTERLEAVED
        This mode specifies that page allocations be interleaved, on a
        page granularity, across the nodes specified in the policy.
        This mode also behaves slightly differently, based on the
        context where it is used:

        For allocation of anonymous pages and shared memory pages,
        Interleave mode indexes the set of nodes specified by the
        policy using the page offset of the faulting address into the
        segment [VMA] containing the address modulo the number of
        nodes specified by the policy. It then attempts to allocate a
        page, starting at the selected node, as if the node had been
        specified by a Preferred policy or had been selected by a
        local allocation. That is, allocation will follow the per
        node zonelist.

        For allocation of page cache pages, Interleave mode indexes
        the set of nodes specified by the policy using a node counter
        maintained per task. This counter wraps around to the lowest
        specified node after it reaches the highest specified node.
        This will tend to spread the pages out over the nodes
        specified by the policy based on the order in which they are
        allocated, rather than based on any page offset into an
        address range or file. During system boot up, the temporary
        interleaved system default policy works in this mode.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
        This flag specifies that the nodemask passed by
        the user should not be remapped if the task or VMA's set of allowed
        nodes changes after the memory policy has been defined.

        Without this flag, any time a mempolicy is rebound because of a
        change in the set of allowed nodes, the node (Preferred) or
        nodemask (Bind, Interleave) is remapped to the new set of
        allowed nodes. This may result in nodes being used that were
        previously undesired.

        With this flag, if the user-specified nodes overlap with the
        nodes allowed by the task's cpuset, then the memory policy is
        applied to their intersection. If the two sets of nodes do not
        overlap, the Default policy is used.

        For example, consider a task that is attached to a cpuset with
        mems 1-3 that sets an Interleave policy over the same set. If
        the cpuset's mems change to 3-5, the Interleave will now occur
        over nodes 3, 4, and 5. With this flag, however, since only node
        3 is allowed from the user's nodemask, the "interleave" only
        occurs over that node. If no nodes from the user's nodemask are
        now allowed, the Default behavior is used.

        MPOL_F_STATIC_NODES cannot be combined with the
        MPOL_F_RELATIVE_NODES flag. It also cannot be used for
        MPOL_PREFERRED policies that were created with an empty nodemask
        (local allocation).

MPOL_F_RELATIVE_NODES
        This flag specifies that the nodemask passed
        by the user will be mapped relative to the task or VMA's
        set of allowed nodes. The kernel stores the user-passed nodemask,
        and if the allowed nodes change, then that original nodemask will
        be remapped relative to the new set of allowed nodes.

        Without this flag (and without MPOL_F_STATIC_NODES), anytime a
        mempolicy is rebound because of a change in the set of allowed
        nodes, the node (Preferred) or nodemask (Bind, Interleave) is
        remapped to the new set of allowed nodes. That remap may not
        preserve the relative nature of the user's passed nodemask to its
        set of allowed nodes upon successive rebinds: a nodemask of
        1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
        allowed nodes is restored to its original state.

        With this flag, the remap is done so that the node numbers from
        the user's passed nodemask are relative to the set of allowed
        nodes. In other words, if nodes 0, 2, and 4 are set in the user's
        nodemask, the policy will be effected over the first (and in the
        Bind or Interleave case, the third and fifth) nodes in the set of
        allowed nodes. The nodemask passed by the user represents nodes
        relative to the task or VMA's set of allowed nodes.

        If the user's nodemask includes nodes that are outside the range
        of the new set of allowed nodes (for example, node 5 is set in
        the user's nodemask when the set of allowed nodes is only 0-3),
        then the remap wraps around to the beginning of the nodemask and,
        if not already set, sets the node in the mempolicy nodemask.

        For example, consider a task that is attached to a cpuset with
        mems 2-5 that sets an Interleave policy over the same set with
        MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
        interleave now occurs over nodes 3,5-7. If the cpuset's mems
        then change to 0,2-3,5, then the interleave occurs over nodes
        0,2-3,5.

        Thanks to the consistent remapping, applications preparing
        nodemasks to specify memory policies using this flag should
        disregard their current, actual cpuset imposed memory placement
        and prepare the nodemask as if they were always located on
        memory nodes 0 to N-1, where N is the number of memory nodes the
        policy is intended to manage. Let the kernel then remap to the
        set of memory nodes allowed by the task's cpuset, as that may
        change over time.

        MPOL_F_RELATIVE_NODES cannot be combined with the
        MPOL_F_STATIC_NODES flag. It also cannot be used for
        MPOL_PREFERRED policies that were created with an empty nodemask
        (local allocation).

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_sem for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_sem for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_sem for read. Again, because replacing the task or vma
   policy requires that the mmap_sem be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_sem, is
   querying or allocating a page based on the policy. To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies. For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must lookup
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path. This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes. This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down. However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel. The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

    long set_mempolicy(int mode, const unsigned long *nmask,
                       unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'. 'nmask' points to a bit mask of node ids containing at least
'maxnode' ids. Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.


Get [Task] Memory Policy or Related Information::

    long get_mempolicy(int *mode,
                       const unsigned long *nmask, unsigned long maxnode,
                       void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.


Install VMA/Shared Policy for a Range of Task's Address Space::

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnodes) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments. Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

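
A sketch of using two of these calls through the wrappers shipped in the
numactl package's ``<numaif.h>`` header (the node numbers and the mapping
length are illustrative assumptions; link with ``-lnuma``)::

    #define _GNU_SOURCE
    #include <numaif.h>        /* set_mempolicy(), mbind(), MPOL_* constants */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Nodemask with nodes 0 and 1 set; maxnode is a count of bits. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        /* Task policy: interleave this task's future allocations over nodes 0-1. */
        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8))
            perror("set_mempolicy");

        /* VMA policy: bind one anonymous mapping to node 0 only. */
        unsigned long bindmask = 1UL << 0;
        size_t len = 16UL * 1024 * 1024;   /* illustrative 16 MB */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        if (mbind(addr, len, MPOL_BIND, &bindmask, sizeof(bindmask) * 8, 0))
            perror("mbind");

        return 0;
    }
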
Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used. If the
result is the empty set, the policy is considered invalid and cannot be
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If
any of the tasks installs shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies. Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in what cpusets other tasks might
be attaching to the shared region. Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.

Documentation/admin-guide/mm/pagemap.rst (new file, 201 lines)
@@ -0,0 +1,201 @@

.. _pagemap:

=============================
Examining Process Page Tables
=============================

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in ``/proc``.

There are four components to pagemap (a sketch decoding the pagemap entry
format is shown after this list):

* ``/proc/pid/pagemap``. This file lets a userspace process find out which
  physical frame each virtual page is mapped to. It contains one 64-bit
  value for each virtual page, containing the following data (from
  ``fs/proc/task_mmu.c``, above pagemap_read):

   * Bits 0-54  page frame number (PFN) if present
   * Bits 0-4   swap type if swapped
   * Bits 5-54  swap offset if swapped
   * Bit  55    pte is soft-dirty (see
     :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
   * Bit  56    page exclusively mapped (since 4.2)
   * Bits 57-60 zero
   * Bit  61    page is file-page or shared-anon (since 3.5)
   * Bit  62    page swapped
   * Bit  63    page present

  Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
  In 4.0 and 4.1 opens by unprivileged users fail with -EPERM. Starting from
  4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
  Reason: information about PFNs helps in exploiting the Rowhammer
  vulnerability.

  If the page is not present but in swap, then the PFN contains an
  encoding of the swap file number and the page's offset into the
  swap. Unmapped pages return a null PFN. This allows determining
  precisely which pages are mapped (or in swap) and comparing mapped
  pages between processes.

  Efficient users of this interface will use ``/proc/pid/maps`` to
  determine which areas of memory are actually mapped and llseek to
  skip over unmapped regions.

* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
  times each page is mapped, indexed by PFN.

* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
  page, indexed by PFN.

  The flags are (from ``fs/proc/page.c``, above kpageflags_read):

   0. LOCKED
   1. ERROR
   2. REFERENCED
   3. UPTODATE
   4. DIRTY
   5. LRU
   6. ACTIVE
   7. SLAB
   8. WRITEBACK
   9. RECLAIM
   10. BUDDY
   11. MMAP
   12. ANON
   13. SWAPCACHE
   14. SWAPBACKED
   15. COMPOUND_HEAD
   16. COMPOUND_TAIL
   17. HUGE
   18. UNEVICTABLE
   19. HWPOISON
   20. NOPAGE
   21. KSM
   22. THP
   23. BALLOON
   24. ZERO_PAGE
   25. IDLE

* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
  memory cgroup each page is charged to, indexed by PFN. Only available when
  CONFIG_MEMCG is set.

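
As a sketch, decoding one such 64-bit pagemap entry according to the bit
layout listed above (the helper and structure names are illustrative, not an
existing kernel or library interface)::

    #include <stdint.h>
    #include <stdio.h>

    struct pagemap_entry {
        uint64_t pfn;           /* bits 0-54, valid when present and not swapped */
        unsigned swap_type;     /* bits 0-4 when swapped */
        uint64_t swap_offset;   /* bits 5-54 when swapped */
        int soft_dirty;         /* bit 55 */
        int exclusive;          /* bit 56 */
        int file_or_shanon;     /* bit 61 */
        int swapped;            /* bit 62 */
        int present;            /* bit 63 */
    };

    static struct pagemap_entry decode_pagemap(uint64_t e)
    {
        struct pagemap_entry d = {
            .pfn            = e & ((1ULL << 55) - 1),
            .swap_type      = e & 0x1f,
            .swap_offset    = (e >> 5) & ((1ULL << 50) - 1),
            .soft_dirty     = !!(e & (1ULL << 55)),
            .exclusive      = !!(e & (1ULL << 56)),
            .file_or_shanon = !!(e & (1ULL << 61)),
            .swapped        = !!(e & (1ULL << 62)),
            .present        = !!(e & (1ULL << 63)),
        };
        return d;
    }

    int main(void)
    {
        /* Example value, assumed: a present page at PFN 0x1234. */
        struct pagemap_entry d = decode_pagemap(0x8000000000001234ULL);

        printf("present=%d pfn=0x%llx\n", d.present, (unsigned long long)d.pfn);
        return 0;
    }
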
Short descriptions of the page flags
====================================

0 - LOCKED
    page is being locked for exclusive access, e.g. by undergoing read/write IO
7 - SLAB
    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
    When compound page is used, SLUB/SLQB will only set this flag on the head
    page; SLOB will not flag it at all.
10 - BUDDY
    a free memory block managed by the buddy system allocator
    The buddy system organizes free memory in blocks of various orders.
    An order N block has 2^N physically contiguous pages, with the BUDDY flag
    set for and _only_ for the first page.
15 - COMPOUND_HEAD
    A compound page with order N consists of 2^N physically contiguous pages.
    A compound page with order 2 takes the form of "HTTT", where H denotes its
    head page and T denotes its tail page(s). The major consumers of compound
    pages are hugeTLB pages
    (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
    the SLUB etc. memory allocators and various device drivers.
    However in this interface, only huge/giga pages are made visible
    to end users.
16 - COMPOUND_TAIL
    A compound page tail (see description above).
17 - HUGE
    this is an integral part of a HugeTLB page
19 - HWPOISON
    hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
    no page frame exists at the requested address
21 - KSM
    identical memory pages dynamically shared between one or more processes
22 - THP
    contiguous pages which construct transparent hugepages
23 - BALLOON
    balloon compaction page
24 - ZERO_PAGE
    zero page for pfn_zero or huge_zero page
25 - IDLE
    page has not been accessed since it was marked idle (see
    :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
    Note that this flag may be stale in case the page was accessed via
    a PTE. To make sure the flag is up-to-date one has to read
    ``/sys/kernel/mm/page_idle/bitmap`` first.

IO related page flags
---------------------

1 - ERROR
    IO error occurred
3 - UPTODATE
    page has up-to-date data
    i.e. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
    page has been written to, hence contains new data
    i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
    page is being synced to disk

LRU related page flags
----------------------

5 - LRU
    page is in one of the LRU lists
6 - ACTIVE
    page is in the active LRU list
18 - UNEVICTABLE
    page is in the unevictable (non-)LRU list. It is somehow pinned and
    not a candidate for LRU page reclaims, e.g. ramfs pages,
    shmctl(SHM_LOCK) and mlock() memory segments
2 - REFERENCED
    page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
    page will be reclaimed soon after its pageout IO completed
11 - MMAP
    a memory mapped page
12 - ANON
    a memory mapped page that is not part of a file
13 - SWAPCACHE
    page is mapped to swap space, i.e. has an associated swap entry
14 - SWAPBACKED
    page is backed by swap/RAM

The page-types tool in the tools/vm directory can be used to query the
above flags.

Using pagemap to do something useful
====================================

The general procedure for using pagemap to find out about a process' memory
usage goes like this:

1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
   mapped to what.
2. Select the maps you are interested in -- all of them, or a particular
   library, or the stack or the heap, etc.
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap.
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
   just read, seek to that entry in the file, and read the data you want.

For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once (steps 3-5 are sketched below for a single address).

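
A sketch of steps 3-5 for a single virtual address of the calling process
(the address is an assumption; without CAP_SYS_ADMIN the PFN field reads as
zero, and a real USS tool would loop over every page of every map)::

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uintptr_t vaddr = 0x400000;                  /* assumed address inside a mapping */
        long pagesize = sysconf(_SC_PAGESIZE);

        int pm = open("/proc/self/pagemap", O_RDONLY);
        if (pm < 0) {
            perror("pagemap");
            return 1;
        }

        uint64_t entry;
        off_t off = (vaddr / pagesize) * sizeof(entry);  /* one u64 per virtual page */
        if (pread(pm, &entry, sizeof(entry), off) != sizeof(entry)) {
            perror("pread");
            return 1;
        }
        close(pm);

        if (!(entry & (1ULL << 63))) {               /* bit 63: page present */
            printf("page not present\n");
            return 0;
        }

        uint64_t pfn = entry & ((1ULL << 55) - 1);   /* bits 0-54 */
        uint64_t mapcount = 0;
        int kc = open("/proc/kpagecount", O_RDONLY);
        if (kc >= 0) {
            pread(kc, &mapcount, sizeof(mapcount), pfn * sizeof(mapcount));
            close(kc);
        }
        printf("pfn=0x%llx mapped %llu times\n",
               (unsigned long long)pfn, (unsigned long long)mapcount);
        return 0;
    }
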
Other notes
===========

Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.

Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 on most architectures). Since Linux 3.11 their meaning changes
after the first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.

Documentation/admin-guide/mm/soft-dirty.rst (new file, 47 lines)
@@ -0,0 +1,47 @@

.. _soft_dirty:

===============
Soft-Dirty PTEs
===============

Soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

1. Clear soft-dirty bits from the task's PTEs.

   This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
   task in question.

2. Wait some time.

3. Read soft-dirty bits from the PTEs.

   This is done by reading ``/proc/PID/pagemap``. Bit 55 of the
   64-bit qword is the soft-dirty one. If set, the respective PTE was
   written to since step 1.

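
A sketch of the three steps applied to the calling process itself (the page
under test is a local buffer; the attribute-based alignment is just for the
example)::

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096] __attribute__((aligned(4096)));

        /* Step 1: clear the soft-dirty bits for this task. */
        int cr = open("/proc/self/clear_refs", O_WRONLY);
        if (cr < 0 || write(cr, "4", 1) != 1) {
            perror("clear_refs");
            return 1;
        }
        close(cr);

        /* Step 2: do some work; here, write to the page under test. */
        memset(buf, 1, sizeof(buf));

        /* Step 3: read bit 55 of the pagemap entry for that page. */
        int pm = open("/proc/self/pagemap", O_RDONLY);
        if (pm < 0) {
            perror("pagemap");
            return 1;
        }
        uint64_t entry;
        off_t off = ((uintptr_t)buf / sysconf(_SC_PAGESIZE)) * sizeof(entry);
        if (pread(pm, &entry, sizeof(entry), off) != sizeof(entry)) {
            perror("pread");
            return 1;
        }
        close(pm);

        printf("soft-dirty: %d\n", !!(entry & (1ULL << 55)));
        return 0;
    }
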
Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE.

Note that although all the task's address space is marked as r/o after the
soft-dirty bits are cleared, the #PF-s that occur after that are processed
fast. This is so since the pages are still mapped to physical memory, and thus
all the kernel does is find this out and put both the writable and soft-dirty
bits on the PTE.

While in most cases tracking memory changes by #PF-s is more than enough,
there is still a scenario where we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
including soft dirty bits. To notify user space applications about such
memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty.

This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org


-- Pavel Emelyanov, Apr 9, 2013

Documentation/admin-guide/mm/transhuge.rst (new file, 418 lines)
@@ -0,0 +1,418 @@

.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
using huge pages for the backing of virtual memory, one that supports the
automatic promotion and demotion of page sizes and without the shortcomings
of hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem.
But in the future it can expand to other filesystems.

.. note::
   in the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

The reason applications run faster is because of two factors. The first
factor is almost completely irrelevant and it's not of significant interest
because it'll also have the downside of requiring larger clear-page and
copy-page operations in page faults, which is a potentially negative effect.
The first factor consists in taking a single page fault for each 2M virtual
region touched by userland (so reducing the enter/exit kernel frequency by a
factor of 512). This only matters the first time the memory is accessed for
the lifetime of a memory mapping. The second, long lasting and much more
important, factor will affect all subsequent accesses to the memory for the
whole runtime of the application. The second factor consists of two
components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses. With
   virtualization and nested pagetables, TLB entries of larger size
   can only be used if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages just because of the fact the TLB miss is
   going to run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by no means mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, an application
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

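
A sketch of doing exactly that for one region (the 2 MB huge page size from
the note above and the 64 MB length are illustrative assumptions; the sysfs
section below shows how to query the real THP size)::

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)   /* assumed huge page size */
    #define LEN        (64UL * 1024 * 1024)  /* illustrative region length */

    int main(void)
    {
        /* Over-allocate so the region can be aligned to the huge page size. */
        void *addr = mmap(NULL, LEN + HPAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        void *aligned = (void *)(((unsigned long)addr + HPAGE_SIZE - 1)
                                 & ~(HPAGE_SIZE - 1));

        /* Ask for THP on the critical region; EINVAL if THP is not configured. */
        if (madvise(aligned, LEN, MADV_HUGEPAGE))
            perror("madvise");

        return 0;
    }
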
.. _thp_sysfs:
|
||||
|
||||
sysfs
|
||||
=====
|
||||
|
||||
Global THP controls
|
||||
-------------------
|
||||
|
||||
Transparent Hugepage Support for anonymous memory can be entirely disabled
|
||||
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
|
||||
regions (to avoid the risk of consuming more memory resources) or enabled
|
||||
system wide. This can be achieved with one of::
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
|
||||
It's also possible to limit defrag efforts in the VM to generate
|
||||
anonymous hugepages in case they're not immediately free to madvise
|
||||
regions, or to never try to defrag memory and simply fall back to regular
|
||||
pages unless hugepages are immediately available. Clearly if we spend CPU
|
||||
time to defrag memory, we would expect to gain even more by the fact we
|
||||
use hugepages later instead of regular pages. This isn't always
|
||||
guaranteed, but it may be more likely in case the allocation is for a
|
||||
MADV_HUGEPAGE region.
|
||||
|
||||
::
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
|
||||
always
|
||||
means that an application requesting THP will stall on
|
||||
allocation failure and directly reclaim pages and compact
|
||||
memory in an effort to allocate a THP immediately. This may be
|
||||
desirable for virtual machines that benefit heavily from THP
|
||||
use and are willing to delay the VM start to utilise them.
|
||||
|
||||
defer
|
||||
means that an application will wake kswapd in the background
|
||||
to reclaim pages and wake kcompactd to compact memory so that
|
||||
THP is available in the near future. It's the responsibility
|
||||
of khugepaged to then install the THP pages later.
|
||||
|
||||
defer+madvise
|
||||
will enter direct reclaim and compaction like ``always``, but
|
||||
only for regions that have used madvise(MADV_HUGEPAGE); all
|
||||
other regions will wake kswapd in the background to reclaim
|
||||
pages and wake kcompactd to compact memory so that THP is
|
||||
available in the near future.
|
||||
|
||||
madvise
|
||||
will enter direct reclaim like ``always`` but only for regions
|
||||
that have used madvise(MADV_HUGEPAGE). This is the default
|
||||
behaviour.
|
||||
|
||||
never
|
||||
should be self-explanatory.
|
||||
|
||||
By default the kernel tries to use the huge zero page on a read page fault to
|
||||
an anonymous mapping. It's possible to disable the huge zero page by writing 0
|
||||
or enable it back by writing 1::
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
|
||||
Some userspace (such as a test program, or an optimized memory allocation
|
||||
library) may want to know the size (in bytes) of a transparent hugepage::
|
||||
|
||||
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
|
||||
|
||||
khugepaged will be automatically started when
|
||||
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
|
||||
be automatically shut down if it's set to "never".
|
||||
|
||||
Khugepaged controls
|
||||
-------------------
|
||||
|
||||
khugepaged usually runs at low frequency, so while one may not want to
|
||||
invoke defrag algorithms synchronously during page faults, it
|
||||
should be worth invoking defrag at least in khugepaged. However it's
|
||||
also possible to disable defrag in khugepaged by writing 0 or enable
|
||||
defrag in khugepaged by writing 1::
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
|
||||
You can also control how many pages khugepaged should scan at each
|
||||
pass::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
||||
|
||||
and how many milliseconds to wait in khugepaged between each pass (you
|
||||
can set this to 0 to run khugepaged at 100% utilization of one core)::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
||||
|
||||
and how many milliseconds to wait in khugepaged if there's a hugepage
|
||||
allocation failure to throttle the next allocation attempt::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
||||
|
||||
The khugepaged progress can be seen in the number of pages collapsed::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
||||
|
||||
for each pass::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
||||
|
||||
``max_ptes_none`` specifies how many extra small pages (that are
|
||||
not already mapped) can be allocated when collapsing a group
|
||||
of small pages into one large page::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
|
||||
|
||||
A higher value means that collapsing may use additional memory for
|
||||
programs. A lower value means that less of the THP performance gain is
|
||||
realized. The CPU time that can be wasted because of max_ptes_none is
|
||||
very small, so it can usually be ignored.
|
||||
|
||||
``max_ptes_swap`` specifies how many pages can be brought in from
|
||||
swap when collapsing a group of pages into a transparent huge page::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
|
||||
|
||||
A higher value can cause excessive swap IO and waste
|
||||
memory. A lower value can prevent THPs from being
|
||||
collapsed, resulting in fewer pages being collapsed into
|
||||
THPs, and lower memory access performance.
|
||||
|
||||
Boot parameter
|
||||
==============
|
||||
|
||||
You can change the sysfs boot time defaults of Transparent Hugepage
|
||||
Support by passing the parameter ``transparent_hugepage=always`` or
|
||||
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
|
||||
to the kernel command line.
|
||||
|
||||
Hugepages in tmpfs/shmem
|
||||
========================
|
||||
|
||||
You can control the hugepage allocation policy in tmpfs with the mount option
|
||||
``huge=``. It can have the following values:
|
||||
|
||||
always
|
||||
Attempt to allocate huge pages every time we need a new page;
|
||||
|
||||
never
|
||||
Do not allocate huge pages;
|
||||
|
||||
within_size
|
||||
Only allocate a huge page if it will be fully within i_size.
|
||||
Also respect fadvise()/madvise() hints;
|
||||
|
||||
advise
|
||||
Only allocate huge pages if requested with fadvise()/madvise();
|
||||
|
||||
The default policy is ``never``.
|
||||
|
||||
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
|
||||
``huge=never`` will not attempt to break up huge pages at all, just stop more
|
||||
from being allocated.
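
As a hedged illustration only, the same policies can be requested from C
through mount(2); the ``/mnt/huge`` target, the ``within_size`` policy and
the ``size=1G`` limit below are arbitrary choices for the example::

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* Equivalent to:
             *   mount -t tmpfs -o huge=within_size,size=1G tmpfs /mnt/huge */
            if (mount("tmpfs", "/mnt/huge", "tmpfs", 0,
                      "huge=within_size,size=1G") != 0) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }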
|
||||
|
||||
There's also a sysfs knob to control hugepage allocation policy for internal
|
||||
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
|
||||
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
|
||||
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
|
||||
|
||||
In addition to policies listed above, shmem_enabled allows two further
|
||||
values:
|
||||
|
||||
deny
|
||||
For use in emergencies, to force the huge option off from
|
||||
all mounts;
|
||||
force
|
||||
Force the huge option on for all - very useful for testing;
|
||||
|
||||
Need of application restart
|
||||
===========================
|
||||
|
||||
The transparent_hugepage/enabled values and tmpfs mount option only affect
|
||||
future behavior. So to make them effective you need to restart any
|
||||
application that could have been using hugepages. This also applies to the
|
||||
regions registered in khugepaged.
|
||||
|
||||
Monitoring usage
|
||||
================
|
||||
|
||||
The number of anonymous transparent huge pages currently used by the
|
||||
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
|
||||
To identify what applications are using anonymous transparent huge pages,
|
||||
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
|
||||
for each mapping.
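
A small, hedged C sketch of such counting (the helper name is made up for
this example; it simply sums every AnonHugePages field of one process)::

    #include <stdio.h>

    /* Sum the AnonHugePages fields (in kB) from /proc/<pid>/smaps. */
    static long anon_huge_kb(const char *pid)
    {
            char path[64], line[256];
            long total = 0, kb;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                            total += kb;
            fclose(f);
            return total;
    }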
|
||||
|
||||
The number of file transparent huge pages mapped to userspace is available
|
||||
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
||||
To identify what applications are mapping file transparent huge pages, it
|
||||
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
|
||||
for each mapping.
|
||||
|
||||
Note that reading the smaps file is expensive and reading it
|
||||
frequently will incur overhead.
|
||||
|
||||
There are a number of counters in ``/proc/vmstat`` that may be used to
|
||||
monitor how successfully the system is providing huge pages for use.
|
||||
|
||||
thp_fault_alloc
|
||||
is incremented every time a huge page is successfully
|
||||
allocated to handle a page fault. This applies to both the
|
||||
first time a page is faulted and for COW faults.
|
||||
|
||||
thp_collapse_alloc
|
||||
is incremented by khugepaged when it has found
|
||||
a range of pages to collapse into one huge page and has
|
||||
successfully allocated a new huge page to store the data.
|
||||
|
||||
thp_fault_fallback
|
||||
is incremented if a page fault fails to allocate
|
||||
a huge page and instead falls back to using small pages.
|
||||
|
||||
thp_collapse_alloc_failed
|
||||
is incremented if khugepaged found a range
|
||||
of pages that should be collapsed into one huge page but failed
|
||||
the allocation.
|
||||
|
||||
thp_file_alloc
|
||||
is incremented every time a file huge page is successfully
|
||||
allocated.
|
||||
|
||||
thp_file_mapped
|
||||
is incremented every time a file huge page is mapped into
|
||||
user address space.
|
||||
|
||||
thp_split_page
|
||||
is incremented every time a huge page is split into base
|
||||
pages. This can happen for a variety of reasons but a common
|
||||
reason is that a huge page is old and is being reclaimed.
|
||||
This action implies splitting all PMDs the page is mapped with.
|
||||
|
||||
thp_split_page_failed
|
||||
is incremented if the kernel fails to split a huge
|
||||
page. This can happen if the page was pinned by somebody.
|
||||
|
||||
thp_deferred_split_page
|
||||
is incremented when a huge page is put onto split
|
||||
queue. This happens when a huge page is partially unmapped and
|
||||
splitting it would free up some memory. Pages on split queue are
|
||||
going to be split under memory pressure.
|
||||
|
||||
thp_split_pmd
|
||||
is incremented every time a PMD is split into a table of PTEs.
|
||||
This can happen, for instance, when application calls mprotect() or
|
||||
munmap() on part of a huge page. It doesn't split the huge page, only the
|
||||
page table entry.
|
||||
|
||||
thp_zero_page_alloc
|
||||
is incremented every time a huge zero page is
|
||||
successfully allocated. It includes allocations which were
|
||||
dropped due to a race with another allocation. Note, it doesn't count
|
||||
every map of the huge zero page, only its allocation.
|
||||
|
||||
thp_zero_page_alloc_failed
|
||||
is incremented if kernel fails to allocate
|
||||
huge zero page and falls back to using small pages.
|
||||
|
||||
thp_swpout
|
||||
is incremented every time a huge page is swapped out in one
|
||||
piece without splitting.
|
||||
|
||||
thp_swpout_fallback
|
||||
is incremented if a huge page has to be split before swapout.
|
||||
Usually this is because the kernel failed to allocate some contiguous swap space
|
||||
for the huge page.
|
||||
|
||||
As the system ages, allocating huge pages may be expensive as the
|
||||
system uses memory compaction to copy data around memory to free a
|
||||
huge page for use. There are some counters in ``/proc/vmstat`` to help
|
||||
monitor this overhead.
|
||||
|
||||
compact_stall
|
||||
is incremented every time a process stalls to run
|
||||
memory compaction so that a huge page is free for use.
|
||||
|
||||
compact_success
|
||||
is incremented if the system compacted memory and
|
||||
freed a huge page for use.
|
||||
|
||||
compact_fail
|
||||
is incremented if the system tries to compact memory
|
||||
but failed.
|
||||
|
||||
compact_pages_moved
|
||||
is incremented each time a page is moved. If
|
||||
this value is increasing rapidly, it implies that the system
|
||||
is copying a lot of data to satisfy the huge page allocation.
|
||||
It is possible that the cost of copying exceeds any savings
|
||||
from reduced TLB misses.
|
||||
|
||||
compact_pagemigrate_failed
|
||||
is incremented when the underlying mechanism
|
||||
for moving a page failed.
|
||||
|
||||
compact_blocks_moved
|
||||
is incremented each time memory compaction examines
|
||||
a huge page aligned range of pages.
|
||||
|
||||
It is possible to establish how long the stalls were using the function
|
||||
tracer to record how much time was spent in __alloc_pages_nodemask and
|
||||
using the mm_page_alloc tracepoint to identify which allocations were
|
||||
for huge pages.
|
||||
|
||||
Optimizing the applications
|
||||
===========================
|
||||
|
||||
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||
memory region, the mmap region has to be hugepage naturally
|
||||
aligned. posix_memalign() can provide that guarantee.
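
A minimal, hedged C sketch of that, assuming the common 2M PMD size (the
actual value should be read from ``hpage_pmd_size``)::

    #include <stdlib.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)   /* assumed PMD size */

    /* Return a buffer whose start is naturally aligned to the huge page
     * size so that the kernel can back the very first faults with THP. */
    static void *alloc_hugepage_aligned(size_t len)
    {
            void *p = NULL;

            if (posix_memalign(&p, HPAGE_SIZE, len))
                    return NULL;
            (void)madvise(p, len, MADV_HUGEPAGE);   /* optional extra hint */
            return p;
    }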
|
||||
|
||||
Hugetlbfs
|
||||
=========
|
||||
|
||||
You can use hugetlbfs on a kernel that has transparent hugepage
|
||||
support enabled just fine as always. No difference can be noted in
|
||||
hugetlbfs other than there will be less overall fragmentation. All
|
||||
usual features belonging to hugetlbfs are preserved and
|
||||
unaffected. libhugetlbfs will also work fine as usual.
|
241
Documentation/admin-guide/mm/userfaultfd.rst
Normal file
241
Documentation/admin-guide/mm/userfaultfd.rst
Normal file
@ -0,0 +1,241 @@
|
||||
.. _userfaultfd:
|
||||
|
||||
===========
|
||||
Userfaultfd
|
||||
===========
|
||||
|
||||
Objective
|
||||
=========
|
||||
|
||||
Userfaults allow the implementation of on-demand paging from userland
|
||||
and more generally they allow userland to take control of various
|
||||
memory page faults, something otherwise only the kernel code could do.
|
||||
|
||||
For example, userfaults allow a proper and more optimal implementation
|
||||
of the PROT_NONE+SIGSEGV trick.
|
||||
|
||||
Design
|
||||
======
|
||||
|
||||
Userfaults are delivered and resolved through the userfaultfd syscall.
|
||||
|
||||
The userfaultfd (aside from registering and unregistering virtual
|
||||
memory ranges) provides two primary functionalities:
|
||||
|
||||
1) read/POLLIN protocol to notify a userland thread of the faults
|
||||
happening
|
||||
|
||||
2) various UFFDIO_* ioctls that can manage the virtual memory regions
|
||||
registered in the userfaultfd that allows userland to efficiently
|
||||
resolve the userfaults it receives via 1) or to manage the virtual
|
||||
memory in the background
|
||||
|
||||
The real advantage of userfaults if compared to regular virtual memory
|
||||
management of mremap/mprotect is that the userfaults in all their
|
||||
operations never involve heavyweight structures like vmas (in fact the
|
||||
userfaultfd runtime load never takes the mmap_sem for writing).
|
||||
|
||||
Vmas are not suitable for page- (or hugepage) granular fault tracking
|
||||
when dealing with virtual address spaces that could span
|
||||
Terabytes. Too many vmas would be needed for that.
|
||||
|
||||
The userfaultfd, once opened by invoking the syscall, can also be
|
||||
passed using unix domain sockets to a manager process, so the same
|
||||
manager process could handle the userfaults of a multitude of
|
||||
different processes without them being aware about what is going on
|
||||
(well of course unless they later try to use the userfaultfd
|
||||
themselves on the same region the manager is already tracking, which
|
||||
is a corner case that would currently return -EBUSY).
|
||||
|
||||
API
|
||||
===
|
||||
|
||||
When first opened the userfaultfd must be enabled invoking the
|
||||
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
|
||||
a later API version) which will specify the read/POLLIN protocol
|
||||
userland intends to speak on the UFFD and the uffdio_api.features
|
||||
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
|
||||
requested uffdio_api.api is spoken also by the running kernel and the
|
||||
requested features are going to be enabled) will return into
|
||||
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
|
||||
respectively all the available features of the read(2) protocol and
|
||||
the generic ioctl available.
|
||||
|
||||
The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
|
||||
defines what memory types are supported by the userfaultfd and what
|
||||
events, except page fault notifications, may be generated.
|
||||
|
||||
If the kernel supports registering userfaultfd ranges on hugetlbfs
|
||||
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
|
||||
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
|
||||
set if the kernel supports registering userfaultfd ranges on shared
|
||||
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
|
||||
MAP_SHARED, memfd_create, etc).
|
||||
|
||||
The userland application that wants to use userfaultfd with hugetlbfs
|
||||
or shared memory needs to set the corresponding flag in
|
||||
uffdio_api.features to enable those features.
|
||||
|
||||
If the userland desires to receive notifications for events other than
|
||||
page faults, it has to verify that uffdio_api.features has appropriate
|
||||
UFFD_FEATURE_EVENT_* bits set. These events are described in more
|
||||
detail below in "Non-cooperative userfaultfd" section.
|
||||
|
||||
Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
|
||||
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
|
||||
register a memory range in the userfaultfd by setting the
|
||||
uffdio_register structure accordingly. The uffdio_register.mode
|
||||
bitmask will specify to the kernel which kind of faults to track for
|
||||
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
|
||||
pages). The UFFDIO_REGISTER ioctl will return the
|
||||
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
|
||||
userfaults on the range registered. Not all ioctls will necessarily be
|
||||
supported for all memory types depending on the underlying virtual
|
||||
memory backend (anonymous memory vs tmpfs vs real filebacked
|
||||
mappings).
|
||||
|
||||
Userland can use the uffdio_register.ioctls to manage the virtual
|
||||
address space in the background (to add or potentially also remove
|
||||
memory from the userfaultfd registered range). This means a userfault
|
||||
could trigger just before userland maps in the background the
|
||||
user-faulted page.
|
||||
|
||||
The primary ioctl to resolve userfaults is UFFDIO_COPY. That
|
||||
atomically copies a page into the userfault registered range and wakes
|
||||
up the blocked userfaults (unless uffdio_copy.mode &
|
||||
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
|
||||
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see a
|
||||
half copied page since it'll keep userfaulting until the copy has
|
||||
finished.
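
To make the sequence concrete, below is a minimal, hedged C sketch of the
UFFDIO_API/UFFDIO_REGISTER/read/UFFDIO_COPY flow. Error handling is mostly
omitted, the 0x42 fill value is arbitrary, and a real manager would loop
over many faults instead of resolving a single one::

    /* build: cc -pthread uffd_sketch.c */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long page;

    static void *handler(void *arg)            /* resolves a single userfault */
    {
            int uffd = *(int *)arg;
            struct pollfd pfd = { .fd = uffd, .events = POLLIN };
            struct uffd_msg msg;
            char *src = malloc(page);

            memset(src, 0x42, page);           /* arbitrary page content */
            poll(&pfd, 1, -1);                 /* wait for a fault notification */
            read(uffd, &msg, sizeof(msg));

            struct uffdio_copy copy = {
                    .dst = msg.arg.pagefault.address & ~(unsigned long)(page - 1),
                    .src = (unsigned long)src,
                    .len = page,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);   /* install the page, wake the faulter */
            return NULL;
    }

    int main(void)
    {
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = { .api = UFFD_API };
            pthread_t thr;
            char *area;

            page = sysconf(_SC_PAGESIZE);
            if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
                    return 1;

            area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)area, .len = page },
                    .mode = UFFDIO_REGISTER_MODE_MISSING,
            };
            if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                    return 1;

            pthread_create(&thr, NULL, handler, &uffd);
            printf("first byte: 0x%x\n", (unsigned char)area[0]); /* faults here */
            pthread_join(thr, NULL);
            return 0;
    }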
|
||||
|
||||
QEMU/KVM
|
||||
========
|
||||
|
||||
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
|
||||
migration. Postcopy live migration is one form of memory
|
||||
externalization consisting of a virtual machine running with part or
|
||||
all of its memory residing on a different node in the cloud. The
|
||||
userfaultfd abstraction is generic enough that not a single line of
|
||||
KVM kernel code had to be modified in order to add postcopy live
|
||||
migration to QEMU.
|
||||
|
||||
Guest async page faults, FOLL_NOWAIT and all other GUP features work
|
||||
just fine in combination with userfaults. Userfaults trigger async
|
||||
page faults in the guest scheduler so those guest processes that
|
||||
aren't waiting for userfaults (i.e. network bound) can keep running in
|
||||
the guest vcpus.
|
||||
|
||||
It is generally beneficial to run one pass of precopy live migration
|
||||
just before starting postcopy live migration, in order to avoid
|
||||
generating userfaults for readonly guest regions.
|
||||
|
||||
The implementation of postcopy live migration currently uses one
|
||||
single bidirectional socket but in the future two different sockets
|
||||
will be used (to reduce the latency of the userfaults to the minimum
|
||||
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
|
||||
|
||||
The QEMU in the source node writes all pages that it knows are missing
|
||||
in the destination node, into the socket, and the migration thread of
|
||||
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
|
||||
ioctls on the userfaultfd in order to map the received pages into the
|
||||
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
|
||||
|
||||
A different postcopy thread in the destination node listens with
|
||||
poll() to the userfaultfd in parallel. When a POLLIN event is
|
||||
generated after a userfault triggers, the postcopy thread read() from
|
||||
the userfaultfd and receives the fault address (or -EAGAIN in case the
|
||||
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
|
||||
by the parallel QEMU migration thread).
|
||||
|
||||
After the QEMU postcopy thread (running in the destination node) gets
|
||||
the userfault address it writes the information about the missing page
|
||||
into the socket. The QEMU source node receives the information and
|
||||
roughly "seeks" to that page address and continues sending all
|
||||
remaining missing pages from that new page offset. Soon after that
|
||||
(just the time to flush the tcp_wmem queue through the network) the
|
||||
migration thread in the QEMU running in the destination node will
|
||||
receive the page that triggered the userfault and it'll map it as
|
||||
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
|
||||
was spontaneously sent by the source or if it was an urgent page
|
||||
requested through a userfault).
|
||||
|
||||
By the time the userfaults start, the QEMU in the destination node
|
||||
doesn't need to keep any per-page state bitmap relative to the live
|
||||
migration around and a single per-page bitmap has to be maintained in
|
||||
the QEMU running in the source node to know which pages are still
|
||||
missing in the destination node. The bitmap in the source node is
|
||||
checked to find which missing pages to send in round robin and we seek
|
||||
over it when receiving incoming userfaults. After sending each page of
|
||||
course the bitmap is updated accordingly. It's also useful to avoid
|
||||
sending the same page twice (in case the userfault is read by the
|
||||
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
|
||||
thread).
|
||||
|
||||
Non-cooperative userfaultfd
|
||||
===========================
|
||||
|
||||
When the userfaultfd is monitored by an external manager, the manager
|
||||
must be able to track changes in the process virtual memory
|
||||
layout. Userfaultfd can notify the manager about such changes using
|
||||
the same read(2) protocol as for the page fault notifications. The
|
||||
manager has to explicitly enable these events by setting appropriate
|
||||
bits in uffdio_api.features passed to UFFDIO_API ioctl:
|
||||
|
||||
UFFD_FEATURE_EVENT_FORK
|
||||
enable userfaultfd hooks for fork(). When this feature is
|
||||
enabled, the userfaultfd context of the parent process is
|
||||
duplicated into the newly created process. The manager
|
||||
receives UFFD_EVENT_FORK with file descriptor of the new
|
||||
userfaultfd context in the uffd_msg.fork.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMAP
|
||||
enable notifications about mremap() calls. When the
|
||||
non-cooperative process moves a virtual memory area to a
|
||||
different location, the manager will receive
|
||||
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
|
||||
new addresses of the area and its original length.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMOVE
|
||||
enable notifications about madvise(MADV_REMOVE) and
|
||||
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
|
||||
be generated upon these calls to madvise. The uffd_msg.remove
|
||||
will contain start and end addresses of the removed area.
|
||||
|
||||
UFFD_FEATURE_EVENT_UNMAP
|
||||
enable notifications about memory unmapping. The manager will
|
||||
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
|
||||
end addresses of the unmapped area.
|
||||
|
||||
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
|
||||
are pretty similar, they differ quite a bit in the action expected from the
|
||||
userfaultfd manager. In the former case, the virtual memory is
|
||||
removed, but the area is not; the area remains monitored by the
|
||||
userfaultfd, and if a page fault occurs in that area it will be
|
||||
delivered to the manager. The proper resolution for such page fault is
|
||||
to zeromap the faulting address. However, in the latter case, when an
|
||||
area is unmapped, either explicitly (with munmap() system call), or
|
||||
implicitly (e.g. during mremap()), the area is removed and in turn the
|
||||
userfaultfd context for such area disappears too and the manager will
|
||||
not get further userland page faults from the removed area. Still, the
|
||||
notification is required in order to prevent the manager from using
|
||||
UFFDIO_COPY on the unmapped area.
|
||||
|
||||
Unlike userland page faults which have to be synchronous and require
|
||||
explicit or implicit wakeup, all the events are delivered
|
||||
asynchronously and the non-cooperative process resumes execution as
|
||||
soon as the manager executes read(). The userfaultfd manager should
|
||||
carefully synchronize calls to UFFDIO_COPY with the events
|
||||
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
|
||||
return -ENOSPC when the monitored process exits at the time of
|
||||
UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
|
||||
its virtual memory layout simultaneously with outstanding UFFDIO_COPY
|
||||
operation.
|
||||
|
||||
The current asynchronous model of the event delivery is optimal for
|
||||
single threaded non-cooperative userfaultfd manager implementations. A
|
||||
synchronous event delivery model can be added later as a new
|
||||
userfaultfd feature to facilitate multithreading enhancements of the
|
||||
non cooperative manager, for example to allow UFFDIO_COPY ioctls to
|
||||
run in parallel to the event reception. Single threaded
|
||||
implementations should continue to use the current async event
|
||||
delivery model instead.
|
@ -145,7 +145,7 @@ feature enabled.]
|
||||
|
||||
In this mode ``intel_pstate`` registers utilization update callbacks with the
|
||||
CPU scheduler in order to run a P-state selection algorithm, either
|
||||
``powersave`` or ``performance``, depending on the ``scaling_cur_freq`` policy
|
||||
``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
|
||||
setting in ``sysfs``. The current CPU frequency information to be made
|
||||
available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
|
||||
periodically updated by those utilization update callbacks too.
|
||||
@ -410,7 +410,7 @@ argument is passed to the kernel in the command line.
|
||||
That only is supported in some configurations, though (for example, if
|
||||
the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
|
||||
the operation mode of the driver cannot be changed), and if it is not
|
||||
supported in the current configuration, writes to this attribute with
|
||||
supported in the current configuration, writes to this attribute will
|
||||
fail with an appropriate error.
|
||||
|
||||
Interpretation of Policy Attributes
|
||||
|
@ -15,7 +15,7 @@ Sleep States That Can Be Supported
|
||||
==================================
|
||||
|
||||
Depending on its configuration and the capabilities of the platform it runs on,
|
||||
the Linux kernel can support up to four system sleep states, includig
|
||||
the Linux kernel can support up to four system sleep states, including
|
||||
hibernation and up to three variants of system suspend. The sleep states that
|
||||
can be supported by the kernel are listed below.
|
||||
|
||||
|
@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners:
|
||||
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
|
||||
|
||||
B. Use Device Tree bindings, as described in
|
||||
``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``.
|
||||
``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
|
||||
For example::
|
||||
|
||||
reserved-memory {
|
||||
|
@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions)
|
||||
88DE3010, Armada 1000 (no Linux support)
|
||||
Core: Marvell PJ1 (ARMv5TE), Dual-core
|
||||
Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
|
||||
88DE3005, Armada 1500-mini
|
||||
88DE3005, Armada 1500 Mini
|
||||
Design name: BG2CD
|
||||
Core: ARM Cortex-A9, PL310 L2CC
|
||||
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini/
|
||||
88DE3006, Armada 1500 Mini Plus
|
||||
Design name: BG2CDP
|
||||
Core: Dual Core ARM Cortex-A7
|
||||
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini-plus/
|
||||
88DE3006, Armada 1500 Mini Plus
|
||||
Design name: BG2CDP
|
||||
Core: Dual Core ARM Cortex-A7
|
||||
88DE3100, Armada 1500
|
||||
Design name: BG2
|
||||
Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC
|
||||
Product Brief: http://www.marvell.com/digital-entertainment/armada-1500/assets/Marvell-ARMADA-1500-Product-Brief.pdf
|
||||
88DE3114, Armada 1500 Pro
|
||||
Design name: BG2Q
|
||||
Core: Quad Core ARM Cortex-A9, PL310 L2CC
|
||||
@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions)
|
||||
88DE3218, ARMADA 1500 Ultra
|
||||
Core: ARM Cortex-A53
|
||||
|
||||
Homepage: http://www.marvell.com/multimedia-solutions/
|
||||
Homepage: https://www.synaptics.com/products/multimedia-solutions
|
||||
Directory: arch/arm/mach-berlin
|
||||
|
||||
Comments:
|
||||
|
||||
* This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs
|
||||
with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...).
|
||||
|
||||
* The Berlin family was acquired by Synaptics from Marvell in 2017.
|
||||
|
||||
CPU Cores
|
||||
---------
|
||||
|
||||
|
@ -5,3 +5,7 @@ KERNEL NEW DEPENDENCIES
|
||||
v4.3+ Update is needed for custom .config files to make sure
|
||||
CONFIG_REGULATOR_PBIAS is enabled for MMC1 to work
|
||||
properly.
|
||||
|
||||
v4.18+ Update is needed for custom .config files to make sure
|
||||
CONFIG_MMC_SDHCI_OMAP is enabled for all MMC instances
|
||||
to work in DRA7 and K2G based boards.
|
||||
|
@ -752,18 +752,6 @@ completion of the request to the block layer. This means ending tag
|
||||
operations before calling end_that_request_last()! For an example of a user
|
||||
of these helpers, see the IDE tagged command queueing support.
|
||||
|
||||
Certain hardware conditions may dictate a need to invalidate the block tag
|
||||
queue. For instance, on IDE any tagged request error needs to clear both
|
||||
the hardware and software block queue and enable the driver to sanely restart
|
||||
all the outstanding requests. There's a third helper to do that:
|
||||
|
||||
blk_queue_invalidate_tags(struct request_queue *q)
|
||||
|
||||
Clear the internal block tag queue and re-add all the pending requests
|
||||
to the request queue. The driver will receive them again on the
|
||||
next request_fn run, just like it did the first time it encountered
|
||||
them.
|
||||
|
||||
3.2.5.2 Tag info
|
||||
|
||||
Some block functions exist to query current tag status or to go from a
|
||||
@ -805,8 +793,7 @@ Internally, block manages tags in the blk_queue_tag structure:
|
||||
Most of the above is simple and straight forward, however busy_list may need
|
||||
a bit of explaining. Normally we don't care too much about request ordering,
|
||||
but in the event of any barrier requests in the tag queue we need to ensure
|
||||
that requests are restarted in the order they were queue. This may happen
|
||||
if the driver needs to use blk_queue_invalidate_tags().
|
||||
that requests are restarted in the order they were queued.
|
||||
|
||||
3.3 I/O Submission
|
||||
|
||||
|
@ -1,7 +1,9 @@
|
||||
Embedded device command line partition parsing
|
||||
=====================================================================
|
||||
|
||||
Support for reading the block device partition table from the command line.
|
||||
The "blkdevparts" command line option adds support for reading the
|
||||
block device partition table from the kernel command line.
|
||||
|
||||
It is typically used for fixed block (eMMC) embedded devices.
|
||||
It has no MBR, which saves storage space. The bootloader can easily access
|
||||
data by its absolute address on the block device.
|
||||
@ -14,22 +16,27 @@ blkdevparts=<blkdev-def>[;<blkdev-def>]
|
||||
<partdef> := <size>[@<offset>](part-name)
|
||||
|
||||
<blkdev-id>
|
||||
block device disk name, embedded device used fixed block device,
|
||||
it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0.
|
||||
block device disk name. Embedded device uses fixed block device.
|
||||
Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.
|
||||
|
||||
<size>
|
||||
partition size, in bytes, such as: 512, 1m, 1G.
|
||||
size may contain an optional suffix of (upper or lower case):
|
||||
K, M, G, T, P, E.
|
||||
"-" is used to denote all remaining space.
|
||||
|
||||
<offset>
|
||||
partition start address, in bytes.
|
||||
offset may contain an optional suffix of (upper or lower case):
|
||||
K, M, G, T, P, E.
|
||||
|
||||
(part-name)
|
||||
partition name, kernel send uevent with "PARTNAME". application can create
|
||||
a link to block device partition with the name "PARTNAME".
|
||||
user space application can access partition by partition name.
|
||||
partition name. Kernel sends uevent with "PARTNAME". Application can
|
||||
create a link to block device partition with the name "PARTNAME".
|
||||
User space application can access partition by partition name.
|
||||
|
||||
Example:
|
||||
eMMC disk name is "mmcblk0" and "mmcblk0boot0"
|
||||
eMMC disk names are "mmcblk0" and "mmcblk0boot0".
|
||||
|
||||
bootargs:
|
||||
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
|
||||
|
@ -71,13 +71,16 @@ use_per_node_hctx=[0/1]: Default: 0
|
||||
1: The multi-queue block layer is instantiated with a hardware dispatch
|
||||
queue for each CPU node in the system.
|
||||
|
||||
use_lightnvm=[0/1]: Default: 0
|
||||
Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
|
||||
|
||||
no_sched=[0/1]: Default: 0
|
||||
0: nullb* use default blk-mq io scheduler.
|
||||
1: nullb* doesn't use io scheduler.
|
||||
|
||||
blocking=[0/1]: Default: 0
|
||||
0: Register as a non-blocking blk-mq driver device.
|
||||
1: Register as a blocking blk-mq driver device, null_blk will set
|
||||
the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always
|
||||
needs to block in its ->queue_rq() function.
|
||||
|
||||
shared_tags=[0/1]: Default: 0
|
||||
0: Tag set is not shared.
|
||||
1: Tag set shared between devices for blk-mq. Only makes sense with
|
||||
|
@ -218,6 +218,7 @@ line of text and contains the following stats separated by whitespace:
|
||||
same_pages the number of same element filled pages written to this disk.
|
||||
No memory is allocated for such pages.
|
||||
pages_compacted the number of pages freed during compaction
|
||||
huge_pages the number of incompressible pages
|
||||
|
||||
9) Deactivate:
|
||||
swapoff /dev/zram0
|
||||
@ -242,5 +243,29 @@ to backing storage rather than keeping it in memory.
|
||||
User should set up backing device via /sys/block/zramX/backing_dev
|
||||
before disksize setting.
|
||||
|
||||
= memory tracking
|
||||
|
||||
With CONFIG_ZRAM_MEMORY_TRACKING, the user can get information about the
|
||||
zram block. It could be useful to catch cold or incompressible
|
||||
pages of the process with pagemap.
|
||||
If you enable the feature, you could see block state via
|
||||
/sys/kernel/debug/zram/zram0/block_state. The output is as follows,
|
||||
|
||||
300 75.033841 .wh
|
||||
301 63.806904 s..
|
||||
302 63.806919 ..h
|
||||
|
||||
First column is zram's block index.
|
||||
Second column is access time since the system was booted.
|
||||
Third column is state of the block.
|
||||
(s: same page
|
||||
w: written page to backing store
|
||||
h: huge page)
|
||||
|
||||
The first line of the above example says the 300th block was accessed at 75.033841 sec
|
||||
and the block's state is huge so it is written back to the backing
|
||||
storage. It's a debugging feature, so no one should rely on it to work
|
||||
properly.
|
||||
|
||||
Nitin Gupta
|
||||
ngupta@vflare.org
|
||||
|
36
Documentation/bpf/README.rst
Normal file
36
Documentation/bpf/README.rst
Normal file
@ -0,0 +1,36 @@
|
||||
=================
|
||||
BPF documentation
|
||||
=================
|
||||
|
||||
This directory contains documentation for the BPF (Berkeley Packet
|
||||
Filter) facility, with a focus on the extended BPF version (eBPF).
|
||||
|
||||
This kernel side documentation is still work in progress. The main
|
||||
textual documentation is (for historical reasons) described in
|
||||
`Documentation/networking/filter.txt`_, which describes both classical
|
||||
and extended BPF instruction-set.
|
||||
The Cilium project also maintains a `BPF and XDP Reference Guide`_
|
||||
that goes into great technical depth about the BPF Architecture.
|
||||
|
||||
The primary info for the bpf syscall is available in the `man-pages`_
|
||||
for `bpf(2)`_.
|
||||
|
||||
|
||||
|
||||
Frequently asked questions (FAQ)
|
||||
================================
|
||||
|
||||
Two sets of Questions and Answers (Q&A) are maintained.
|
||||
|
||||
* QA for common questions about BPF see: bpf_design_QA_
|
||||
|
||||
* QA for developers interacting with BPF subsystem: bpf_devel_QA_
|
||||
|
||||
|
||||
.. Links:
|
||||
.. _bpf_design_QA: bpf_design_QA.rst
|
||||
.. _bpf_devel_QA: bpf_devel_QA.rst
|
||||
.. _Documentation/networking/filter.txt: ../networking/filter.txt
|
||||
.. _man-pages: https://www.kernel.org/doc/man-pages/
|
||||
.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html
|
||||
.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/
|
221
Documentation/bpf/bpf_design_QA.rst
Normal file
221
Documentation/bpf/bpf_design_QA.rst
Normal file
@ -0,0 +1,221 @@
|
||||
==============
|
||||
BPF Design Q&A
|
||||
==============
|
||||
|
||||
BPF extensibility and applicability to networking, tracing, security
|
||||
in the linux kernel and several user space implementations of BPF
|
||||
virtual machine led to a number of misunderstandings about what BPF actually is.
|
||||
This short QA is an attempt to address that and outline a direction
|
||||
of where BPF is heading long term.
|
||||
|
||||
.. contents::
|
||||
:local:
|
||||
:depth: 3
|
||||
|
||||
Questions and Answers
|
||||
=====================
|
||||
|
||||
Q: Is BPF a generic instruction set similar to x64 and arm64?
|
||||
-------------------------------------------------------------
|
||||
A: NO.
|
||||
|
||||
Q: Is BPF a generic virtual machine ?
|
||||
-------------------------------------
|
||||
A: NO.
|
||||
|
||||
BPF is generic instruction set *with* C calling convention.
|
||||
-----------------------------------------------------------
|
||||
|
||||
Q: Why C calling convention was chosen?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A: Because BPF programs are designed to run in the linux kernel
|
||||
which is written in C, hence BPF defines instruction set compatible
|
||||
with two most used architectures x64 and arm64 (and takes into
|
||||
consideration important quirks of other architectures) and
|
||||
defines calling convention that is compatible with C calling
|
||||
convention of the linux kernel on those architectures.
|
||||
|
||||
Q: can multiple return values be supported in the future?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A: NO. BPF allows only register R0 to be used as return value.
|
||||
|
||||
Q: can more than 5 function arguments be supported in the future?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A: NO. BPF calling convention only allows registers R1-R5 to be used
|
||||
as arguments. BPF is not a standalone instruction set.
|
||||
(unlike x64 ISA that allows msft, cdecl and other conventions)
|
||||
|
||||
Q: can BPF programs access instruction pointer or return address?
|
||||
-----------------------------------------------------------------
|
||||
A: NO.
|
||||
|
||||
Q: can BPF programs access stack pointer ?
|
||||
------------------------------------------
|
||||
A: NO.
|
||||
|
||||
Only frame pointer (register R10) is accessible.
|
||||
From compiler point of view it's necessary to have stack pointer.
|
||||
For example LLVM defines register R11 as stack pointer in its
|
||||
BPF backend, but it makes sure that generated code never uses it.
|
||||
|
||||
Q: Does C-calling convention diminishes possible use cases?
|
||||
-----------------------------------------------------------
|
||||
A: YES.
|
||||
|
||||
BPF design forces addition of major functionality in the form
|
||||
of kernel helper functions and kernel objects like BPF maps with
|
||||
seamless interoperability between them. It lets kernel call into
|
||||
BPF programs and programs call kernel helpers with zero overhead.
|
||||
As all of them were native C code. That is particularly the case
|
||||
for JITed BPF programs that are indistinguishable from
|
||||
native kernel C code.
|
||||
|
||||
Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
|
||||
------------------------------------------------------------------------
|
||||
A: Soft yes.
|
||||
|
||||
At least for now until BPF core has support for
|
||||
bpf-to-bpf calls, indirect calls, loops, global variables,
|
||||
jump tables, read only sections and all other normal constructs
|
||||
that C code can produce.
|
||||
|
||||
Q: Can loops be supported in a safe way?
|
||||
----------------------------------------
|
||||
A: It's not clear yet.
|
||||
|
||||
BPF developers are trying to find a way to
|
||||
support bounded loops where the verifier can guarantee that
|
||||
the program terminates in less than 4096 instructions.
|
||||
|
||||
Instruction level questions
|
||||
---------------------------
|
||||
|
||||
Q: LD_ABS and LD_IND instructions vs C code
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Q: How come LD_ABS and LD_IND instruction are present in BPF whereas
|
||||
C code cannot express them and has to use builtin intrinsics?
|
||||
|
||||
A: This is an artifact of compatibility with classic BPF. Modern
|
||||
networking code in BPF performs better without them.
|
||||
See 'direct packet access'.
|
||||
|
||||
Q: BPF instructions mapping not one-to-one to native CPU
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Q: It seems not all BPF instructions are one-to-one to native CPU.
|
||||
For example why BPF_JNE and other compare and jumps are not cpu-like?
|
||||
|
||||
A: This was necessary to avoid introducing flags into ISA which are
|
||||
impossible to make generic and efficient across CPU architectures.
|
||||
|
||||
Q: why BPF_DIV instruction doesn't map to x64 div?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A: Because if we picked one-to-one relationship to x64 it would have made
|
||||
it more complicated to support on arm64 and other archs. Also it
|
||||
needs div-by-zero runtime check.
|
||||
|
||||
Q: why there is no BPF_SDIV for signed divide operation?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A: Because it would be rarely used. llvm errors in such case and
|
||||
prints a suggestion to use unsigned divide instead
|
||||
|
||||
Q: Why BPF has implicit prologue and epilogue?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A: Because architectures like sparc have register windows and in general
|
||||
there are enough subtle differences between architectures, so naive
|
||||
store return address into stack won't work. Another reason is BPF has
|
||||
to be safe from division by zero (and legacy exception path
|
||||
of LD_ABS insn). Those instructions need to invoke epilogue and
|
||||
return implicitly.
|
||||
|
||||
Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A: Because classic BPF didn't have them and BPF authors felt that compiler
|
||||
workaround would be acceptable. Turned out that programs lose performance
|
||||
due to lack of these compare instructions and they were added.
|
||||
These two instructions are a perfect example of what kind of new BPF
|
||||
instructions are acceptable and can be added in the future.
|
||||
These two already had equivalent instructions in native CPUs.
|
||||
New instructions that don't have one-to-one mapping to HW instructions
|
||||
will not be accepted.
|
||||
|
||||
Q: BPF 32-bit subregister requirements
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF
|
||||
registers which makes BPF inefficient virtual machine for 32-bit
|
||||
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
|
||||
be added to BPF in the future?
|
||||
|
||||
A: NO. The first thing to improve performance on 32-bit archs is to teach
|
||||
LLVM to generate code that uses 32-bit subregisters. Then second step
|
||||
is to teach verifier to mark operations where zero-ing upper bits
|
||||
is unnecessary. Then JITs can take advantage of those markings and
|
||||
drastically reduce size of generated code and improve performance.
|
||||
|
||||
Q: Does BPF have a stable ABI?
|
||||
------------------------------
|
||||
A: YES. BPF instructions, arguments to BPF programs, set of helper
|
||||
functions and their arguments, recognized return codes are all part
|
||||
of ABI. However when tracing programs are using bpf_probe_read() helper
|
||||
to walk kernel internal datastructures and compile with kernel
|
||||
internal headers these accesses can and will break with newer
|
||||
kernels. The union bpf_attr -> kern_version is checked at load time
|
||||
to prevent accidentally loading kprobe-based bpf programs written
|
||||
for a different kernel. Networking programs don't do kern_version check.
|
||||
|
||||
Q: How much stack space a BPF program uses?
|
||||
-------------------------------------------
|
||||
A: Currently all program types are limited to 512 bytes of stack
|
||||
space, but the verifier computes the actual amount of stack used
|
||||
and both interpreter and most JITed code consume necessary amount.
|
||||
|
||||
Q: Can BPF be offloaded to HW?
|
||||
------------------------------
|
||||
A: YES. BPF HW offload is supported by NFP driver.
|
||||
|
||||
Q: Does classic BPF interpreter still exist?
|
||||
--------------------------------------------
|
||||
A: NO. Classic BPF programs are converted into extended BPF instructions.
|
||||
|
||||
Q: Can BPF call arbitrary kernel functions?
|
||||
-------------------------------------------
|
||||
A: NO. BPF programs can only call a set of helper functions which
|
||||
is defined for every program type.
|
||||
|
||||
Q: Can BPF overwrite arbitrary kernel memory?
|
||||
---------------------------------------------
|
||||
A: NO.
|
||||
|
||||
Tracing bpf programs can *read* arbitrary memory with bpf_probe_read()
|
||||
and bpf_probe_read_str() helpers. Networking programs cannot read
|
||||
arbitrary memory, since they don't have access to these helpers.
|
||||
Programs can never read or write arbitrary memory directly.
|
||||
|
||||
Q: Can BPF overwrite arbitrary user memory?
|
||||
-------------------------------------------
|
||||
A: Sort-of.
|
||||
|
||||
Tracing BPF programs can overwrite the user memory
|
||||
of the current task with bpf_probe_write_user(). Every time such
|
||||
program is loaded the kernel will print warning message, so
|
||||
this helper is only useful for experiments and prototypes.
|
||||
Tracing BPF programs are root only.
|
||||
|
||||
Q: bpf_trace_printk() helper warning
|
||||
------------------------------------
|
||||
Q: When bpf_trace_printk() helper is used the kernel prints nasty
|
||||
warning message. Why is that?
|
||||
|
||||
A: This is done to nudge program authors into better interfaces when
|
||||
programs need to pass data to user space. Like bpf_perf_event_output()
|
||||
can be used to efficiently stream data via perf ring buffer.
|
||||
BPF maps can be used for asynchronous data sharing between kernel
|
||||
and user space. bpf_trace_printk() should only be used for debugging.
|
||||
|
||||
Q: New functionality via kernel modules?
|
||||
----------------------------------------
|
||||
Q: Can BPF functionality such as new program or map types, new
|
||||
helpers, etc be added out of kernel module code?
|
||||
|
||||
A: NO.
|
@ -1,156 +0,0 @@
|
||||
BPF extensibility and applicability to networking, tracing, security
|
||||
in the linux kernel and several user space implementations of BPF
|
||||
virtual machine led to a number of misunderstanding on what BPF actually is.
|
||||
This short QA is an attempt to address that and outline a direction
|
||||
of where BPF is heading long term.
|
||||
|
||||
Q: Is BPF a generic instruction set similar to x64 and arm64?
|
||||
A: NO.
|
||||
|
||||
Q: Is BPF a generic virtual machine ?
|
||||
A: NO.
|
||||
|
||||
BPF is generic instruction set _with_ C calling convention.
|
||||
|
||||
Q: Why C calling convention was chosen?
|
||||
A: Because BPF programs are designed to run in the linux kernel
|
||||
which is written in C, hence BPF defines instruction set compatible
|
||||
with two most used architectures x64 and arm64 (and takes into
|
||||
consideration important quirks of other architectures) and
|
||||
defines calling convention that is compatible with C calling
|
||||
convention of the linux kernel on those architectures.
|
||||
|
||||
Q: can multiple return values be supported in the future?
|
||||
A: NO. BPF allows only register R0 to be used as return value.
|
||||
|
||||
Q: can more than 5 function arguments be supported in the future?
|
||||
A: NO. BPF calling convention only allows registers R1-R5 to be used
|
||||
as arguments. BPF is not a standalone instruction set.
|
||||
(unlike x64 ISA that allows msft, cdecl and other conventions)
|
||||
|
||||
Q: can BPF programs access instruction pointer or return address?
|
||||
A: NO.
|
||||
|
||||
Q: can BPF programs access stack pointer ?
|
||||
A: NO. Only frame pointer (register R10) is accessible.
|
||||
From compiler point of view it's necessary to have stack pointer.
|
||||
For example LLVM defines register R11 as stack pointer in its
|
||||
BPF backend, but it makes sure that generated code never uses it.
|
||||
|
||||
Q: Does C-calling convention diminishes possible use cases?
|
||||
A: YES. BPF design forces addition of major functionality in the form
|
||||
of kernel helper functions and kernel objects like BPF maps with
|
||||
seamless interoperability between them. It lets kernel call into
|
||||
BPF programs and programs call kernel helpers with zero overhead.
|
||||
As all of them were native C code. That is particularly the case
|
||||
for JITed BPF programs that are indistinguishable from
|
||||
native kernel C code.
|
||||
|
||||
Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
|
||||
A: Soft yes. At least for now until BPF core has support for
|
||||
bpf-to-bpf calls, indirect calls, loops, global variables,
|
||||
jump tables, read only sections and all other normal constructs
|
||||
that C code can produce.
|
||||
|
||||
Q: Can loops be supported in a safe way?
|
||||
A: It's not clear yet. BPF developers are trying to find a way to
|
||||
support bounded loops where the verifier can guarantee that
|
||||
the program terminates in less than 4096 instructions.
|
||||
|
||||
Q: How come LD_ABS and LD_IND instruction are present in BPF whereas
|
||||
C code cannot express them and has to use builtin intrinsics?
|
||||
A: This is artifact of compatibility with classic BPF. Modern
|
||||
networking code in BPF performs better without them.
|
||||
See 'direct packet access'.
|
||||
|
||||
Q: It seems not all BPF instructions are one-to-one to native CPU.
|
||||
For example why BPF_JNE and other compare and jumps are not cpu-like?
|
||||
A: This was necessary to avoid introducing flags into ISA which are
|
||||
impossible to make generic and efficient across CPU architectures.
|
||||
|
||||
Q: why BPF_DIV instruction doesn't map to x64 div?
|
||||
A: Because if we picked one-to-one relationship to x64 it would have made
|
||||
it more complicated to support on arm64 and other archs. Also it
|
||||
needs div-by-zero runtime check.
|
||||
|
||||
Q: why there is no BPF_SDIV for signed divide operation?
|
||||
A: Because it would be rarely used. llvm errors in such case and
|
||||
prints a suggestion to use unsigned divide instead
|
||||
|
||||
Q: Why BPF has implicit prologue and epilogue?
|
||||
A: Because architectures like sparc have register windows and in general
|
||||
there are enough subtle differences between architectures, so naive
|
||||
store return address into stack won't work. Another reason is BPF has
|
||||
to be safe from division by zero (and legacy exception path
|
||||
of LD_ABS insn). Those instructions need to invoke epilogue and
|
||||
return implicitly.
|
||||
|
||||
Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
|
||||
A: Because classic BPF didn't have them and BPF authors felt that compiler
|
||||
workaround would be acceptable. Turned out that programs lose performance
|
||||
due to lack of these compare instructions and they were added.
|
||||
These two instructions is a perfect example what kind of new BPF
|
||||
instructions are acceptable and can be added in the future.
|
||||
These two already had equivalent instructions in native CPUs.
|
||||
New instructions that don't have one-to-one mapping to HW instructions
|
||||
will not be accepted.
|
||||
|
||||
Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF
|
||||
registers which makes BPF inefficient virtual machine for 32-bit
|
||||
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
|
||||
be added to BPF in the future?
|
||||
A: NO. The first thing to improve performance on 32-bit archs is to teach
|
||||
LLVM to generate code that uses 32-bit subregisters. Then second step
|
||||
is to teach verifier to mark operations where zero-ing upper bits
|
||||
is unnecessary. Then JITs can take advantage of those markings and
|
||||
drastically reduce size of generated code and improve performance.
|
||||
|
||||
Q: Does BPF have a stable ABI?
|
||||
A: YES. BPF instructions, arguments to BPF programs, set of helper
|
||||
functions and their arguments, recognized return codes are all part
|
||||
of ABI. However when tracing programs are using bpf_probe_read() helper
|
||||
to walk kernel internal datastructures and compile with kernel
|
||||
internal headers these accesses can and will break with newer
|
||||
kernels. The union bpf_attr -> kern_version is checked at load time
|
||||
to prevent accidentally loading kprobe-based bpf programs written
|
||||
for a different kernel. Networking programs don't do kern_version check.
|
||||
|
||||
Q: How much stack space does a BPF program use?
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used,
and both the interpreter and most JITed code consume only the
necessary amount.

Q: Can BPF be offloaded to HW?
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does the classic BPF interpreter still exist?
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
A: NO. Tracing bpf programs can _read_ arbitrary memory with the
bpf_probe_read() and bpf_probe_read_str() helpers. Networking programs
cannot read arbitrary memory, since they don't have access to these
helpers. Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
A: Sort-of. Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such
a program is loaded, the kernel will print a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: When the bpf_trace_printk() helper is used, the kernel prints a nasty
warning message. Why is that?
A: This is done to nudge program authors into better interfaces when
programs need to pass data to user space. For example, bpf_perf_event_output()
can be used to efficiently stream data via the perf ring buffer.
BPF maps can be used for asynchronous data sharing between kernel
and user space. bpf_trace_printk() should only be used for debugging.
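
The bpf_trace_printk() output referred to here ends up in the trace pipe
and can be inspected like this (assuming the usual tracefs mount point)::

  $ sudo cat /sys/kernel/debug/tracing/trace_pipe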

Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added out of kernel module code?
A: NO.
640
Documentation/bpf/bpf_devel_QA.rst
Normal file
@ -0,0 +1,640 @@
|
||||
=================================
|
||||
HOWTO interact with BPF subsystem
|
||||
=================================
|
||||
|
||||
This document provides information for the BPF subsystem about various
|
||||
workflows related to reporting bugs, submitting patches, and queueing
|
||||
patches for stable kernels.
|
||||
|
||||
For general information about submitting patches, please refer to
|
||||
`Documentation/process/`_. This document only describes additional specifics
|
||||
related to BPF.
|
||||
|
||||
.. contents::
|
||||
:local:
|
||||
:depth: 2
|
||||
|
||||
Reporting bugs
|
||||
==============
|
||||
|
||||
Q: How do I report bugs for BPF kernel code?
|
||||
--------------------------------------------
|
||||
A: Since all BPF kernel development as well as bpftool and iproute2 BPF
|
||||
loader development happens through the netdev kernel mailing list,
|
||||
please report any found issues around BPF to the following mailing
|
||||
list:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
This may also include issues related to XDP, BPF tracing, etc.
|
||||
|
||||
Given netdev has a high volume of traffic, please also add the BPF
|
||||
maintainers to Cc (from kernel MAINTAINERS_ file):
|
||||
|
||||
* Alexei Starovoitov <ast@kernel.org>
|
||||
* Daniel Borkmann <daniel@iogearbox.net>
|
||||
|
||||
In case a buggy commit has already been identified, make sure to keep
|
||||
the actual commit authors in Cc as well for the report. They can
|
||||
typically be identified through the kernel's git tree.
|
||||
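As an illustration of how to find those addresses (the file path below is
just an example), the kernel's own helper script can be used from a local
tree::

  # show author and tags of the suspected bad commit
  $ git log -1 <commit-id>

  # suggest people to Cc for the files touched by the fix
  $ ./scripts/get_maintainer.pl -f kernel/bpf/verifier.c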
|
||||
**Please do NOT report BPF issues to bugzilla.kernel.org, since it is
guaranteed that the reported issue will be overlooked.**
|
||||
|
||||
Submitting patches
|
||||
==================
|
||||
|
||||
Q: To which mailing list do I need to submit my BPF patches?
|
||||
------------------------------------------------------------
|
||||
A: Please submit your BPF patches to the netdev kernel mailing list:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
Historically, BPF came out of networking and has always been maintained
|
||||
by the kernel networking community. Although these days BPF touches
|
||||
many other subsystems as well, the patches are still routed mainly
|
||||
through the networking community.
|
||||
|
||||
In case your patch has changes in various different subsystems (e.g.
|
||||
tracing, security, etc), make sure to Cc the related kernel mailing
|
||||
lists and maintainers from there as well, so they are able to review
|
||||
the changes and provide their Acked-by's to the patches.
|
||||
|
||||
Q: Where can I find patches currently under discussion for BPF subsystem?
|
||||
-------------------------------------------------------------------------
|
||||
A: All patches that are Cc'ed to netdev are queued for review under netdev
|
||||
patchwork project:
|
||||
|
||||
http://patchwork.ozlabs.org/project/netdev/list/
|
||||
|
||||
Those patches which target BPF, are assigned to a 'bpf' delegate for
|
||||
further processing from BPF maintainers. The current queue with
|
||||
patches under review can be found at:
|
||||
|
||||
https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
|
||||
|
||||
Once the patches have been reviewed by the BPF community as a whole
|
||||
and approved by the BPF maintainers, their status in patchwork will be
|
||||
changed to 'Accepted' and the submitter will be notified by mail. This
|
||||
means that the patches look good from a BPF perspective and have been
|
||||
applied to one of the two BPF kernel trees.
|
||||
|
||||
In case feedback from the community requires a respin of the patches,
|
||||
their status in patchwork will be set to 'Changes Requested', and purged
|
||||
from the current review queue. Likewise for cases where patches would
|
||||
get rejected or are not applicable to the BPF trees (but assigned to
|
||||
the 'bpf' delegate).
|
||||
|
||||
Q: How do the changes make their way into Linux?
|
||||
------------------------------------------------
|
||||
A: There are two BPF kernel trees (git repositories). Once patches have
|
||||
been accepted by the BPF maintainers, they will be applied to one
|
||||
of the two BPF trees:
|
||||
|
||||
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/
|
||||
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
|
||||
|
||||
The bpf tree itself is for fixes only, whereas bpf-next for features,
|
||||
cleanups or other kind of improvements ("next-like" content). This is
|
||||
analogous to net and net-next trees for networking. Both bpf and
|
||||
bpf-next will only have a master branch in order to simplify against
|
||||
which branch patches should get rebased to.
|
||||
|
||||
Accumulated BPF patches in the bpf tree will regularly get pulled
|
||||
into the net kernel tree. Likewise, accumulated BPF patches accepted
|
||||
into the bpf-next tree will make their way into net-next tree. net and
|
||||
net-next are both run by David S. Miller. From there, they will go
|
||||
into the kernel mainline tree run by Linus Torvalds. To read up on the
|
||||
process of net and net-next being merged into the mainline tree, see
|
||||
the `netdev FAQ`_ under:
|
||||
|
||||
`Documentation/networking/netdev-FAQ.txt`_
|
||||
|
||||
Occasionally, to prevent merge conflicts, we might send pull requests
|
||||
to other trees (e.g. tracing) with a small subset of the patches, but
|
||||
net and net-next are always the main trees targeted for integration.
|
||||
|
||||
The pull requests will contain a high-level summary of the accumulated
|
||||
patches and can be searched on netdev kernel mailing list through the
|
||||
following subject lines (``yyyy-mm-dd`` is the date of the pull
|
||||
request)::
|
||||
|
||||
pull-request: bpf yyyy-mm-dd
|
||||
pull-request: bpf-next yyyy-mm-dd
|
||||
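For instance, once merged, these pull requests can be located again in a
local net or net-next checkout with a plain git search (purely
illustrative)::

  # list BPF pull-request merge commits
  $ git log --merges --oneline --grep='pull-request: bpf'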
|
||||
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to?
|
||||
---------------------------------------------------------------------------------
|
||||
|
||||
A: The process is the very same as described in the `netdev FAQ`_, so
|
||||
please read up on it. The subject line must indicate whether the
|
||||
patch is a fix or rather "next-like" content in order to let the
|
||||
maintainers know whether it is targeted at bpf or bpf-next.
|
||||
|
||||
For fixes eventually landing in bpf -> net tree, the subject must
|
||||
look like::
|
||||
|
||||
git format-patch --subject-prefix='PATCH bpf' start..finish
|
||||
|
||||
For features/improvements/etc that should eventually land in
|
||||
bpf-next -> net-next, the subject must look like::
|
||||
|
||||
git format-patch --subject-prefix='PATCH bpf-next' start..finish
|
||||
|
||||
If unsure whether the patch or patch series should go into bpf
|
||||
or net directly, or bpf-next or net-next directly, it is not a
|
||||
problem either if the subject line says net or net-next as target.
|
||||
It is eventually up to the maintainers to do the delegation of
|
||||
the patches.
|
||||
|
||||
If it is clear that patches should go into bpf or bpf-next tree,
|
||||
please make sure to rebase the patches against those trees in
|
||||
order to reduce potential conflicts.
|
||||
|
||||
In case the patch or patch series has to be reworked and sent out
|
||||
again in a second or later revision, it is also required to add a
|
||||
version number (``v2``, ``v3``, ...) into the subject prefix::
|
||||
|
||||
git format-patch --subject-prefix='PATCH net-next v2' start..finish
|
||||
|
||||
When changes have been requested to the patch series, always send the
|
||||
whole patch series again with the feedback incorporated (never send
|
||||
individual diffs on top of the old series).
|
||||
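A minimal sketch of such a resend, assuming ``git send-email`` is already
configured and using placeholder file and range names::

  # regenerate the full series with a bumped revision and a cover letter
  $ git format-patch --cover-letter --subject-prefix='PATCH bpf-next v2' start..finish

  # send the whole series again, not individual deltas
  $ git send-email --to=netdev@vger.kernel.org *.patch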
|
||||
Q: What does it mean when a patch gets applied to bpf or bpf-next tree?
|
||||
-----------------------------------------------------------------------
|
||||
A: It means that the patch looks good for mainline inclusion from
|
||||
a BPF point of view.
|
||||
|
||||
Be aware that this is not a final verdict that the patch will
|
||||
automatically get accepted into net or net-next trees eventually:
|
||||
|
||||
On the netdev kernel mailing list reviews can come in at any point
|
||||
in time. If discussions around a patch conclude that they cannot
|
||||
get included as-is, we will either apply a follow-up fix or drop
|
||||
them from the trees entirely. Therefore, we also reserve the right to rebase
|
||||
the trees when deemed necessary. After all, the purpose of the tree
|
||||
is to:
|
||||
|
||||
i) accumulate and stage BPF patches for integration into trees
|
||||
like net and net-next, and
|
||||
|
||||
ii) run extensive BPF test suite and
|
||||
workloads on the patches before they make their way any further.
|
||||
|
||||
Once the BPF pull request has been accepted by David S. Miller,
|
||||
the patches end up in net or net-next tree, respectively, and
|
||||
make their way from there further into mainline. Again, see the
|
||||
`netdev FAQ`_ for additional information e.g. on how often they are
|
||||
merged to mainline.
|
||||
|
||||
Q: How long do I need to wait for feedback on my BPF patches?
|
||||
-------------------------------------------------------------
|
||||
A: We try to keep the latency low. The usual time to feedback will
|
||||
be around 2 or 3 business days. It may vary depending on the
|
||||
complexity of changes and current patch load.
|
||||
|
||||
Q: How often do you send pull requests to major kernel trees like net or net-next?
|
||||
----------------------------------------------------------------------------------
|
||||
|
||||
A: Pull requests will be sent out rather often in order to not
|
||||
accumulate too many patches in bpf or bpf-next.
|
||||
|
||||
As a rule of thumb, expect pull requests for each tree regularly
|
||||
at the end of the week. In some cases pull requests could additionally
|
||||
come also in the middle of the week depending on the current patch
|
||||
load or urgency.
|
||||
|
||||
Q: Are patches applied to bpf-next when the merge window is open?
|
||||
-----------------------------------------------------------------
|
||||
A: For the time when the merge window is open, bpf-next will not be
|
||||
processed. This is roughly analogous to net-next patch processing,
|
||||
so feel free to read up on the `netdev FAQ`_ about further details.
|
||||
|
||||
During those two weeks of merge window, we might ask you to resend
|
||||
your patch series once bpf-next is open again. Once Linus released
|
||||
a ``v*-rc1`` after the merge window, we continue processing of bpf-next.
|
||||
|
||||
For non-subscribers to kernel mailing lists, there is also a status
|
||||
page run by David S. Miller on net-next that provides guidance:
|
||||
|
||||
http://vger.kernel.org/~davem/net-next.html
|
||||
|
||||
Q: Verifier changes and test cases
|
||||
----------------------------------
|
||||
Q: I made a BPF verifier change, do I need to add test cases for
|
||||
BPF kernel selftests_?
|
||||
|
||||
A: If the patch has changes to the behavior of the verifier, then yes,
|
||||
it is absolutely necessary to add test cases to the BPF kernel
|
||||
selftests_ suite. If they are not present and we think they are
|
||||
needed, then we might ask for them before accepting any changes.
|
||||
|
||||
In particular, test_verifier.c is tracking a high number of BPF test
|
||||
cases, including a lot of corner cases that LLVM BPF back end may
|
||||
generate out of the restricted C code. Thus, adding test cases is
|
||||
absolutely crucial to make sure future changes do not accidentally
|
||||
affect prior use-cases. Thus, treat those test cases as: verifier
|
||||
behavior that is not tracked in test_verifier.c could potentially
|
||||
be subject to change.
|
||||
|
||||
Q: samples/bpf preference vs selftests?
|
||||
---------------------------------------
|
||||
Q: When should I add code to `samples/bpf/`_ and when to BPF kernel
|
||||
selftests_ ?
|
||||
|
||||
A: In general, we prefer additions to BPF kernel selftests_ rather than
|
||||
`samples/bpf/`_. The rationale is very simple: kernel selftests are
|
||||
regularly run by various bots to test for kernel regressions.
|
||||
|
||||
The more test cases we add to BPF selftests, the better the coverage
|
||||
and the less likely it is that those could accidentally break. This is
not to say that BPF kernel selftests cannot also demonstrate how a
specific feature can be used.
|
||||
|
||||
That said, `samples/bpf/`_ may be a good place for people to get started,
|
||||
so it might be advisable that simple demos of features could go into
|
||||
`samples/bpf/`_, but advanced functional and corner-case testing rather
|
||||
into kernel selftests.
|
||||
|
||||
If your sample looks like a test case, then go for BPF kernel selftests
|
||||
instead!
|
||||
|
||||
Q: When should I add code to the bpftool?
|
||||
-----------------------------------------
|
||||
A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
|
||||
a central user space tool for debugging and introspection of BPF programs
|
||||
and maps that are active in the kernel. If UAPI changes related to BPF
|
||||
enable for dumping additional information of programs or maps, then
|
||||
bpftool should be extended as well to support dumping them.
|
||||
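For instance, the kind of introspection meant here looks roughly as follows
(the program ID is illustrative)::

  # list BPF programs and maps currently loaded in the kernel
  $ sudo bpftool prog show
  $ sudo bpftool map show

  # dump the translated instructions of one program
  $ sudo bpftool prog dump xlated id 42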
|
||||
Q: When should I add code to iproute2's BPF loader?
|
||||
---------------------------------------------------
|
||||
A: For UAPI changes related to the XDP or tc layer (e.g. ``cls_bpf``),
|
||||
the convention is that those control-path related changes are added to
|
||||
iproute2's BPF loader as well from user space side. This is not only
|
||||
useful to have UAPI changes properly designed to be usable, but also
|
||||
to make those changes available to a wider user base of major
|
||||
downstream distributions.
|
||||
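As a sketch of what such a control path looks like on the tc side (device,
object and section names are placeholders)::

  # attach a BPF classifier via iproute2's BPF loader
  $ tc qdisc add dev eth0 clsact
  $ tc filter add dev eth0 ingress bpf da obj prog.o sec ingress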
|
||||
Q: Do you accept patches as well for iproute2's BPF loader?
|
||||
-----------------------------------------------------------
|
||||
A: Patches for the iproute2's BPF loader have to be sent to:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
While those patches are not processed by the BPF kernel maintainers,
|
||||
please keep them in Cc as well, so they can be reviewed.
|
||||
|
||||
The official git repository for iproute2 is run by Stephen Hemminger
|
||||
and can be found at:
|
||||
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/
|
||||
|
||||
The patches need to have a subject prefix of '``[PATCH iproute2
|
||||
master]``' or '``[PATCH iproute2 net-next]``'. '``master``' or
|
||||
'``net-next``' describes the target branch where the patch should be
|
||||
applied to. Meaning, if kernel changes went into the net-next kernel
|
||||
tree, then the related iproute2 changes need to go into the iproute2
|
||||
net-next branch, otherwise they can be targeted at master branch. The
|
||||
iproute2 net-next branch will get merged into the master branch after
|
||||
the current iproute2 version from master has been released.
|
||||
|
||||
Like BPF, the patches end up in patchwork under the netdev project and
|
||||
are delegated to 'shemminger' for further processing:
|
||||
|
||||
http://patchwork.ozlabs.org/project/netdev/list/?delegate=389
|
||||
|
||||
Q: What is the minimum requirement before I submit my BPF patches?
|
||||
------------------------------------------------------------------
|
||||
A: When submitting patches, always take the time and properly test your
|
||||
patches *prior* to submission. Never rush them! If maintainers find
|
||||
that your patches have not been properly tested, it is a good way to
|
||||
get them grumpy. Testing patch submissions is a hard requirement!
|
||||
|
||||
Note, fixes that go to bpf tree *must* have a ``Fixes:`` tag included.
|
||||
The same applies to fixes that target bpf-next, where the affected
|
||||
commit is in net-next (or in some cases bpf-next). The ``Fixes:`` tag is
|
||||
crucial in order to identify follow-up commits and tremendously helps
|
||||
for people having to do backporting, so it is a must have!
|
||||
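For reference, the expected form of the tag (hash shortened, subject copied
from the offending commit, both illustrative here) and a convenient way to
generate it are::

  Fixes: 1234567890ab ("bpf: one-line subject of the offending commit")

  # generate the tag from a local tree
  $ git log -1 --abbrev=12 --format='Fixes: %h ("%s")' <commit-id>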
|
||||
We also don't accept patches with an empty commit message. Take your
|
||||
time and properly write up a high quality commit message, it is
|
||||
essential!
|
||||
|
||||
Think about it this way: other developers looking at your code a month
|
||||
from now need to understand *why* a certain change has been done that
|
||||
way, and whether there have been flaws in the analysis or assumptions
|
||||
that the original author did. Thus providing a proper rationale and
|
||||
describing the use-case for the changes is a must.
|
||||
|
||||
Patch submissions with >1 patch must have a cover letter which includes
|
||||
a high level description of the series. This high level summary will
|
||||
then be placed into the merge commit by the BPF maintainers such that
|
||||
it is also accessible from the git log for future reference.
|
||||
|
||||
Q: Features changing BPF JIT and/or LLVM
|
||||
----------------------------------------
|
||||
Q: What do I need to consider when adding a new instruction or feature
|
||||
that would require BPF JIT and/or LLVM integration as well?
|
||||
|
||||
A: We try hard to keep all BPF JITs up to date such that the same user
|
||||
experience can be guaranteed when running BPF programs on different
|
||||
architectures without having the program punt to the less efficient
|
||||
interpreter in case the in-kernel BPF JIT is enabled.
|
||||
|
||||
If you are unable to implement or test the required JIT changes for
|
||||
certain architectures, please work together with the related BPF JIT
|
||||
developers in order to get the feature implemented in a timely manner.
|
||||
Please refer to the git log (``arch/*/net/``) to locate the necessary
|
||||
people for helping out.
|
||||
|
||||
Also always make sure to add BPF test cases (e.g. test_bpf.c and
|
||||
test_verifier.c) for new instructions, so that they can receive
|
||||
broad test coverage and help run-time testing the various BPF JITs.
|
||||
|
||||
In case of new BPF instructions, once the changes have been accepted
|
||||
into the Linux kernel, please implement support into LLVM's BPF back
|
||||
end. See LLVM_ section below for further information.
|
||||
|
||||
Stable submission
|
||||
=================
|
||||
|
||||
Q: I need a specific BPF commit in stable kernels. What should I do?
|
||||
--------------------------------------------------------------------
|
||||
A: In case you need a specific fix in stable kernels, first check whether
|
||||
the commit has already been applied in the related ``linux-*.y`` branches:
|
||||
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/
|
||||
|
||||
If not the case, then drop an email to the BPF maintainers with the
|
||||
netdev kernel mailing list in Cc and ask for the fix to be queued up:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
The process in general is the same as on netdev itself, see also the
|
||||
`netdev FAQ`_ document.
|
||||
|
||||
Q: Do you also backport to kernels not currently maintained as stable?
|
||||
----------------------------------------------------------------------
|
||||
A: No. If you need a specific BPF commit in kernels that are currently not
|
||||
maintained by the stable maintainers, then you are on your own.
|
||||
|
||||
The current stable and longterm stable kernels are all listed here:
|
||||
|
||||
https://www.kernel.org/
|
||||
|
||||
Q: The BPF patch I am about to submit needs to go to stable as well
|
||||
-------------------------------------------------------------------
|
||||
What should I do?
|
||||
|
||||
A: The same rules apply as with netdev patch submissions in general, see
|
||||
`netdev FAQ`_ under:
|
||||
|
||||
`Documentation/networking/netdev-FAQ.txt`_
|
||||
|
||||
Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
|
||||
ask the BPF maintainers to queue the patches instead. This can be done
|
||||
with a note, for example, under the ``---`` part of the patch which does
|
||||
not go into the git log. Alternatively, this can be done as a simple
|
||||
request by mail instead.
|
||||
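For illustration, such a note could look like this below the ``---``
separator of the patch::

  ---
  (Please also consider queueing this fix for stable once it hits mainline.)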
|
||||
Q: Queue stable patches
|
||||
-----------------------
|
||||
Q: Where do I find currently queued BPF patches that will be submitted
|
||||
to stable?
|
||||
|
||||
A: Once patches that fix critical bugs got applied into the bpf tree, they
|
||||
are queued up for stable submission under:
|
||||
|
||||
http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*
|
||||
|
||||
They will be on hold there at minimum until the related commit made its
|
||||
way into the mainline kernel tree.
|
||||
|
||||
After having been under broader exposure, the queued patches will be
|
||||
submitted by the BPF maintainers to the stable maintainers.
|
||||
|
||||
Testing patches
|
||||
===============
|
||||
|
||||
Q: How to run BPF selftests
|
||||
---------------------------
|
||||
A: After you have booted into the newly compiled kernel, navigate to
|
||||
the BPF selftests_ suite in order to test BPF functionality (current
|
||||
working directory points to the root of the cloned git tree)::
|
||||
|
||||
$ cd tools/testing/selftests/bpf/
|
||||
$ make
|
||||
|
||||
To run the verifier tests::
|
||||
|
||||
$ sudo ./test_verifier
|
||||
|
||||
The verifier tests print out all the current checks being
|
||||
performed. The summary at the end of running all tests will dump
|
||||
information of test successes and failures::
|
||||
|
||||
Summary: 418 PASSED, 0 FAILED
|
||||
|
||||
In order to run through all BPF selftests, the following command is
|
||||
needed::
|
||||
|
||||
$ sudo make run_tests
|
||||
|
||||
See the kernel's selftest `Documentation/dev-tools/kselftest.rst`_
document for further documentation.
|
||||
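If only a subset is of interest, the individual test binaries from that
directory can also be invoked directly, e.g. (test_maps is just one of
them)::

  $ sudo ./test_maps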
|
||||
Q: Which BPF kernel selftests version should I run my kernel against?
|
||||
---------------------------------------------------------------------
|
||||
A: If you run a kernel ``xyz``, then always run the BPF kernel selftests
|
||||
from that kernel ``xyz`` as well. Do not expect that the BPF selftests
|
||||
from the latest mainline tree will pass all the time.
|
||||
|
||||
In particular, test_bpf.c and test_verifier.c have a large number of
|
||||
test cases and are constantly updated with new BPF test sequences, or
|
||||
existing ones are adapted to verifier changes e.g. due to verifier
|
||||
becoming smarter and being able to better track certain things.
|
||||
|
||||
LLVM
|
||||
====
|
||||
|
||||
Q: Where do I find LLVM with BPF support?
|
||||
-----------------------------------------
|
||||
A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.
|
||||
|
||||
All major distributions these days ship LLVM with BPF back end enabled,
|
||||
so for the majority of use-cases it is not required to compile LLVM by
|
||||
hand anymore, just install the distribution provided package.
|
||||
|
||||
LLVM's static compiler lists the supported targets through
|
||||
``llc --version``, make sure BPF targets are listed. Example::
|
||||
|
||||
$ llc --version
|
||||
LLVM (http://llvm.org/):
|
||||
LLVM version 6.0.0svn
|
||||
Optimized build.
|
||||
Default target: x86_64-unknown-linux-gnu
|
||||
Host CPU: skylake
|
||||
|
||||
Registered Targets:
|
||||
bpf - BPF (host endian)
|
||||
bpfeb - BPF (big endian)
|
||||
bpfel - BPF (little endian)
|
||||
x86 - 32-bit X86: Pentium-Pro and above
|
||||
x86-64 - 64-bit X86: EM64T and AMD64
|
||||
|
||||
Developers who want to utilize the latest features added to LLVM's
BPF back end are advised to run the latest LLVM releases. Support
for new BPF kernel features such as additions to the BPF instruction
set is often developed together with the kernel changes.
|
||||
|
||||
All LLVM releases can be found at: http://releases.llvm.org/
|
||||
|
||||
Q: Got it, so how do I build LLVM manually anyway?
|
||||
--------------------------------------------------
|
||||
A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have
|
||||
that set up, proceed with building the latest LLVM and clang version
|
||||
from the git repositories::
|
||||
|
||||
$ git clone http://llvm.org/git/llvm.git
|
||||
$ cd llvm/tools
|
||||
$ git clone --depth 1 http://llvm.org/git/clang.git
|
||||
$ cd ..; mkdir build; cd build
|
||||
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
|
||||
-DBUILD_SHARED_LIBS=OFF \
|
||||
-DCMAKE_BUILD_TYPE=Release \
|
||||
-DLLVM_BUILD_RUNTIME=OFF
|
||||
$ make -j $(getconf _NPROCESSORS_ONLN)
|
||||
|
||||
The built binaries can then be found in the build/bin/ directory, to which
you can point your PATH variable.
|
||||
|
||||
Q: Reporting LLVM BPF issues
|
||||
----------------------------
|
||||
Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
|
||||
generation back end or about LLVM generated code that the verifier
|
||||
refuses to accept?
|
||||
|
||||
A: Yes, please do!
|
||||
|
||||
LLVM's BPF back end is a key piece of the whole BPF
|
||||
infrastructure and it ties deeply into verification of programs from the
|
||||
kernel side. Therefore, any issues on either side need to be investigated
|
||||
and fixed whenever necessary.
|
||||
|
||||
Therefore, please make sure to bring them up at netdev kernel mailing
|
||||
list and Cc BPF maintainers for LLVM and kernel bits:
|
||||
|
||||
* Yonghong Song <yhs@fb.com>
|
||||
* Alexei Starovoitov <ast@kernel.org>
|
||||
* Daniel Borkmann <daniel@iogearbox.net>
|
||||
|
||||
LLVM also has an issue tracker where BPF related bugs can be found:
|
||||
|
||||
https://bugs.llvm.org/buglist.cgi?quicksearch=bpf
|
||||
|
||||
However, it is better to reach out through the mailing lists with the
maintainers in Cc.
|
||||
|
||||
Q: New BPF instruction for kernel and LLVM
|
||||
------------------------------------------
|
||||
Q: I have added a new BPF instruction to the kernel, how can I integrate
|
||||
it into LLVM?
|
||||
|
||||
A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow
|
||||
the selection of BPF instruction set extensions. By default the
|
||||
``generic`` processor target is used, which is the base instruction set
|
||||
(v1) of BPF.
|
||||
|
||||
LLVM has an option to select ``-mcpu=probe`` where it will probe the host
|
||||
kernel for supported BPF instruction set extensions and selects the
|
||||
optimal set automatically.
|
||||
|
||||
For cross-compilation, a specific version can also be selected manually::
|
||||
|
||||
$ llc -march bpf -mcpu=help
|
||||
Available CPUs for this target:
|
||||
|
||||
generic - Select the generic processor.
|
||||
probe - Select the probe processor.
|
||||
v1 - Select the v1 processor.
|
||||
v2 - Select the v2 processor.
|
||||
[...]
|
||||
|
||||
Newly added BPF instructions to the Linux kernel need to follow the same
|
||||
scheme, bump the instruction set version and implement probing for the
|
||||
extensions such that ``-mcpu=probe`` users can benefit from the
|
||||
optimization transparently when upgrading their kernels.
|
||||
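A hedged sketch of pinning the instruction set during cross-compilation
(prog.c is a placeholder name)::

  $ clang -O2 -emit-llvm -c prog.c -o - | \
        llc -march=bpf -mcpu=v2 -filetype=obj -o prog.o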
|
||||
If you are unable to implement support for the newly added BPF instruction
|
||||
please reach out to BPF developers for help.
|
||||
|
||||
By the way, the BPF kernel selftests run with ``-mcpu=probe`` for better
|
||||
test coverage.
|
||||
|
||||
Q: clang flag for target bpf?
|
||||
-----------------------------
|
||||
Q: In some cases clang flag ``-target bpf`` is used but in other cases the
|
||||
default clang target, which matches the underlying architecture, is used.
|
||||
What is the difference and when should I use which?
|
||||
|
||||
A: Although LLVM IR generation and optimization try to stay architecture
|
||||
independent, ``-target <arch>`` still has some impact on generated code:
|
||||
|
||||
- A BPF program may recursively include header file(s) with file scope
inline assembly code. The default target can handle this well,
while the ``bpf`` target may fail if the BPF back end assembler does not
understand this assembly code, which is true in most cases.
|
||||
|
||||
- When compiled without ``-g``, additional elf sections, e.g.,
|
||||
.eh_frame and .rela.eh_frame, may be present in the object file
|
||||
with default target, but not with ``bpf`` target.
|
||||
|
||||
- The default target may turn a C switch statement into a switch table
|
||||
lookup and jump operation. Since the switch table is placed
|
||||
in the global readonly section, the bpf program will fail to load.
|
||||
The bpf target does not support switch table optimization.
|
||||
The clang option ``-fno-jump-tables`` can be used to disable
|
||||
switch table generation.
|
||||
|
||||
- For clang ``-target bpf``, it is guaranteed that pointer or long /
|
||||
unsigned long types will always have a width of 64 bit, no matter
|
||||
whether underlying clang binary or default target (or kernel) is
|
||||
32 bit. However, when native clang target is used, then it will
|
||||
compile these types based on the underlying architecture's conventions,
|
||||
meaning in case of 32 bit architecture, pointer or long / unsigned
|
||||
long types e.g. in BPF context structure will have width of 32 bit
|
||||
while the BPF LLVM back end still operates in 64 bit. The native
|
||||
target is mostly needed in tracing for the case of walking ``pt_regs``
|
||||
or other kernel structures where CPU's register width matters.
|
||||
Otherwise, ``clang -target bpf`` is generally recommended.
|
||||
|
||||
You should use default target when:
|
||||
|
||||
- Your program includes a header file, e.g., ptrace.h, which eventually
|
||||
pulls in some header files containing file scope host assembly codes.
|
||||
|
||||
- You can add ``-fno-jump-tables`` to work around the switch table issue.
|
||||
|
||||
Otherwise, you can use ``bpf`` target. Additionally, you *must* use bpf target
|
||||
when:
|
||||
|
||||
- Your program uses data structures with pointer or long / unsigned long
|
||||
types that interface with BPF helpers or context data structures. Access
|
||||
into these structures is verified by the BPF verifier and may result
|
||||
in verification failures if the native architecture is not aligned with
|
||||
the BPF architecture, e.g. 64-bit. An example of this is
BPF_PROG_TYPE_SK_MSG, which requires ``-target bpf`` (see the example
below).
|
||||
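For completeness, a typical invocation of the ``bpf`` target looks like
this (prog.c is a placeholder name)::

  $ clang -O2 -target bpf -c prog.c -o prog.o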
|
||||
|
||||
.. Links
|
||||
.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
|
||||
.. _MAINTAINERS: ../../MAINTAINERS
|
||||
.. _Documentation/networking/netdev-FAQ.txt: ../networking/netdev-FAQ.txt
|
||||
.. _netdev FAQ: ../networking/netdev-FAQ.txt
|
||||
.. _samples/bpf/: ../../samples/bpf/
|
||||
.. _selftests: ../../tools/testing/selftests/bpf/
|
||||
.. _Documentation/dev-tools/kselftest.rst:
|
||||
https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
|
||||
|
||||
Happy BPF hacking!
|
@ -1,562 +0,0 @@
|
||||
This document provides information for the BPF subsystem about various
|
||||
workflows related to reporting bugs, submitting patches, and queueing
|
||||
patches for stable kernels.
|
||||
|
||||
For general information about submitting patches, please refer to
|
||||
Documentation/process/. This document only describes additional specifics
|
||||
related to BPF.
|
||||
|
||||
Reporting bugs:
|
||||
---------------
|
||||
|
||||
Q: How do I report bugs for BPF kernel code?
|
||||
|
||||
A: Since all BPF kernel development as well as bpftool and iproute2 BPF
|
||||
loader development happens through the netdev kernel mailing list,
|
||||
please report any found issues around BPF to the following mailing
|
||||
list:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
This may also include issues related to XDP, BPF tracing, etc.
|
||||
|
||||
Given netdev has a high volume of traffic, please also add the BPF
|
||||
maintainers to Cc (from kernel MAINTAINERS file):
|
||||
|
||||
Alexei Starovoitov <ast@kernel.org>
|
||||
Daniel Borkmann <daniel@iogearbox.net>
|
||||
|
||||
In case a buggy commit has already been identified, make sure to keep
|
||||
the actual commit authors in Cc as well for the report. They can
|
||||
typically be identified through the kernel's git tree.
|
||||
|
||||
Please do *not* report BPF issues to bugzilla.kernel.org since it
|
||||
is a guarantee that the reported issue will be overlooked.
|
||||
|
||||
Submitting patches:
|
||||
-------------------
|
||||
|
||||
Q: To which mailing list do I need to submit my BPF patches?
|
||||
|
||||
A: Please submit your BPF patches to the netdev kernel mailing list:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
Historically, BPF came out of networking and has always been maintained
|
||||
by the kernel networking community. Although these days BPF touches
|
||||
many other subsystems as well, the patches are still routed mainly
|
||||
through the networking community.
|
||||
|
||||
In case your patch has changes in various different subsystems (e.g.
|
||||
tracing, security, etc), make sure to Cc the related kernel mailing
|
||||
lists and maintainers from there as well, so they are able to review
|
||||
the changes and provide their Acked-by's to the patches.
|
||||
|
||||
Q: Where can I find patches currently under discussion for BPF subsystem?
|
||||
|
||||
A: All patches that are Cc'ed to netdev are queued for review under netdev
|
||||
patchwork project:
|
||||
|
||||
http://patchwork.ozlabs.org/project/netdev/list/
|
||||
|
||||
Those patches which target BPF, are assigned to a 'bpf' delegate for
|
||||
further processing from BPF maintainers. The current queue with
|
||||
patches under review can be found at:
|
||||
|
||||
https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
|
||||
|
||||
Once the patches have been reviewed by the BPF community as a whole
|
||||
and approved by the BPF maintainers, their status in patchwork will be
|
||||
changed to 'Accepted' and the submitter will be notified by mail. This
|
||||
means that the patches look good from a BPF perspective and have been
|
||||
applied to one of the two BPF kernel trees.
|
||||
|
||||
In case feedback from the community requires a respin of the patches,
|
||||
their status in patchwork will be set to 'Changes Requested', and purged
|
||||
from the current review queue. Likewise for cases where patches would
|
||||
get rejected or are not applicable to the BPF trees (but assigned to
|
||||
the 'bpf' delegate).
|
||||
|
||||
Q: How do the changes make their way into Linux?
|
||||
|
||||
A: There are two BPF kernel trees (git repositories). Once patches have
|
||||
been accepted by the BPF maintainers, they will be applied to one
|
||||
of the two BPF trees:
|
||||
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
|
||||
|
||||
The bpf tree itself is for fixes only, whereas bpf-next for features,
|
||||
cleanups or other kind of improvements ("next-like" content). This is
|
||||
analogous to net and net-next trees for networking. Both bpf and
|
||||
bpf-next will only have a master branch in order to simplify against
|
||||
which branch patches should get rebased to.
|
||||
|
||||
Accumulated BPF patches in the bpf tree will regularly get pulled
|
||||
into the net kernel tree. Likewise, accumulated BPF patches accepted
|
||||
into the bpf-next tree will make their way into net-next tree. net and
|
||||
net-next are both run by David S. Miller. From there, they will go
|
||||
into the kernel mainline tree run by Linus Torvalds. To read up on the
|
||||
process of net and net-next being merged into the mainline tree, see
|
||||
the netdev FAQ under:
|
||||
|
||||
Documentation/networking/netdev-FAQ.txt
|
||||
|
||||
Occasionally, to prevent merge conflicts, we might send pull requests
|
||||
to other trees (e.g. tracing) with a small subset of the patches, but
|
||||
net and net-next are always the main trees targeted for integration.
|
||||
|
||||
The pull requests will contain a high-level summary of the accumulated
|
||||
patches and can be searched on netdev kernel mailing list through the
|
||||
following subject lines (yyyy-mm-dd is the date of the pull request):
|
||||
|
||||
pull-request: bpf yyyy-mm-dd
|
||||
pull-request: bpf-next yyyy-mm-dd
|
||||
|
||||
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be
|
||||
applied to?
|
||||
|
||||
A: The process is the very same as described in the netdev FAQ, so
|
||||
please read up on it. The subject line must indicate whether the
|
||||
patch is a fix or rather "next-like" content in order to let the
|
||||
maintainers know whether it is targeted at bpf or bpf-next.
|
||||
|
||||
For fixes eventually landing in bpf -> net tree, the subject must
|
||||
look like:
|
||||
|
||||
git format-patch --subject-prefix='PATCH bpf' start..finish
|
||||
|
||||
For features/improvements/etc that should eventually land in
|
||||
bpf-next -> net-next, the subject must look like:
|
||||
|
||||
git format-patch --subject-prefix='PATCH bpf-next' start..finish
|
||||
|
||||
If unsure whether the patch or patch series should go into bpf
|
||||
or net directly, or bpf-next or net-next directly, it is not a
|
||||
problem either if the subject line says net or net-next as target.
|
||||
It is eventually up to the maintainers to do the delegation of
|
||||
the patches.
|
||||
|
||||
If it is clear that patches should go into bpf or bpf-next tree,
|
||||
please make sure to rebase the patches against those trees in
|
||||
order to reduce potential conflicts.
|
||||
|
||||
In case the patch or patch series has to be reworked and sent out
|
||||
again in a second or later revision, it is also required to add a
|
||||
version number (v2, v3, ...) into the subject prefix:
|
||||
|
||||
git format-patch --subject-prefix='PATCH net-next v2' start..finish
|
||||
|
||||
When changes have been requested to the patch series, always send the
|
||||
whole patch series again with the feedback incorporated (never send
|
||||
individual diffs on top of the old series).
|
||||
|
||||
Q: What does it mean when a patch gets applied to bpf or bpf-next tree?
|
||||
|
||||
A: It means that the patch looks good for mainline inclusion from
|
||||
a BPF point of view.
|
||||
|
||||
Be aware that this is not a final verdict that the patch will
|
||||
automatically get accepted into net or net-next trees eventually:
|
||||
|
||||
On the netdev kernel mailing list reviews can come in at any point
|
||||
in time. If discussions around a patch conclude that they cannot
|
||||
get included as-is, we will either apply a follow-up fix or drop
|
||||
them from the trees entirely. Therefore, we also reserve to rebase
|
||||
the trees when deemed necessary. After all, the purpose of the tree
|
||||
is to i) accumulate and stage BPF patches for integration into trees
|
||||
like net and net-next, and ii) run extensive BPF test suite and
|
||||
workloads on the patches before they make their way any further.
|
||||
|
||||
Once the BPF pull request was accepted by David S. Miller, then
|
||||
the patches end up in net or net-next tree, respectively, and
|
||||
make their way from there further into mainline. Again, see the
|
||||
netdev FAQ for additional information e.g. on how often they are
|
||||
merged to mainline.
|
||||
|
||||
Q: How long do I need to wait for feedback on my BPF patches?
|
||||
|
||||
A: We try to keep the latency low. The usual time to feedback will
|
||||
be around 2 or 3 business days. It may vary depending on the
|
||||
complexity of changes and current patch load.
|
||||
|
||||
Q: How often do you send pull requests to major kernel trees like
|
||||
net or net-next?
|
||||
|
||||
A: Pull requests will be sent out rather often in order to not
|
||||
accumulate too many patches in bpf or bpf-next.
|
||||
|
||||
As a rule of thumb, expect pull requests for each tree regularly
|
||||
at the end of the week. In some cases pull requests could additionally
|
||||
come also in the middle of the week depending on the current patch
|
||||
load or urgency.
|
||||
|
||||
Q: Are patches applied to bpf-next when the merge window is open?
|
||||
|
||||
A: For the time when the merge window is open, bpf-next will not be
|
||||
processed. This is roughly analogous to net-next patch processing,
|
||||
so feel free to read up on the netdev FAQ about further details.
|
||||
|
||||
During those two weeks of merge window, we might ask you to resend
|
||||
your patch series once bpf-next is open again. Once Linus released
|
||||
a v*-rc1 after the merge window, we continue processing of bpf-next.
|
||||
|
||||
For non-subscribers to kernel mailing lists, there is also a status
|
||||
page run by David S. Miller on net-next that provides guidance:
|
||||
|
||||
http://vger.kernel.org/~davem/net-next.html
|
||||
|
||||
Q: I made a BPF verifier change, do I need to add test cases for
|
||||
BPF kernel selftests?
|
||||
|
||||
A: If the patch has changes to the behavior of the verifier, then yes,
|
||||
it is absolutely necessary to add test cases to the BPF kernel
|
||||
selftests suite. If they are not present and we think they are
|
||||
needed, then we might ask for them before accepting any changes.
|
||||
|
||||
In particular, test_verifier.c is tracking a high number of BPF test
|
||||
cases, including a lot of corner cases that LLVM BPF back end may
|
||||
generate out of the restricted C code. Thus, adding test cases is
|
||||
absolutely crucial to make sure future changes do not accidentally
|
||||
affect prior use-cases. Thus, treat those test cases as: verifier
|
||||
behavior that is not tracked in test_verifier.c could potentially
|
||||
be subject to change.
|
||||
|
||||
Q: When should I add code to samples/bpf/ and when to BPF kernel
|
||||
selftests?
|
||||
|
||||
A: In general, we prefer additions to BPF kernel selftests rather than
|
||||
samples/bpf/. The rationale is very simple: kernel selftests are
|
||||
regularly run by various bots to test for kernel regressions.
|
||||
|
||||
The more test cases we add to BPF selftests, the better the coverage
|
||||
and the less likely it is that those could accidentally break. It is
|
||||
not that BPF kernel selftests cannot demo how a specific feature can
|
||||
be used.
|
||||
|
||||
That said, samples/bpf/ may be a good place for people to get started,
|
||||
so it might be advisable that simple demos of features could go into
|
||||
samples/bpf/, but advanced functional and corner-case testing rather
|
||||
into kernel selftests.
|
||||
|
||||
If your sample looks like a test case, then go for BPF kernel selftests
|
||||
instead!
|
||||
|
||||
Q: When should I add code to the bpftool?
|
||||
|
||||
A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
|
||||
a central user space tool for debugging and introspection of BPF programs
|
||||
and maps that are active in the kernel. If UAPI changes related to BPF
|
||||
enable for dumping additional information of programs or maps, then
|
||||
bpftool should be extended as well to support dumping them.
|
||||
|
||||
Q: When should I add code to iproute2's BPF loader?
|
||||
|
||||
A: For UAPI changes related to the XDP or tc layer (e.g. cls_bpf), the
|
||||
convention is that those control-path related changes are added to
|
||||
iproute2's BPF loader as well from user space side. This is not only
|
||||
useful to have UAPI changes properly designed to be usable, but also
|
||||
to make those changes available to a wider user base of major
|
||||
downstream distributions.
|
||||
|
||||
Q: Do you accept patches as well for iproute2's BPF loader?
|
||||
|
||||
A: Patches for the iproute2's BPF loader have to be sent to:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
While those patches are not processed by the BPF kernel maintainers,
|
||||
please keep them in Cc as well, so they can be reviewed.
|
||||
|
||||
The official git repository for iproute2 is run by Stephen Hemminger
|
||||
and can be found at:
|
||||
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/
|
||||
|
||||
The patches need to have a subject prefix of '[PATCH iproute2 master]'
|
||||
or '[PATCH iproute2 net-next]'. 'master' or 'net-next' describes the
|
||||
target branch where the patch should be applied to. Meaning, if kernel
|
||||
changes went into the net-next kernel tree, then the related iproute2
|
||||
changes need to go into the iproute2 net-next branch, otherwise they
|
||||
can be targeted at master branch. The iproute2 net-next branch will get
|
||||
merged into the master branch after the current iproute2 version from
|
||||
master has been released.
|
||||
|
||||
Like BPF, the patches end up in patchwork under the netdev project and
|
||||
are delegated to 'shemminger' for further processing:
|
||||
|
||||
http://patchwork.ozlabs.org/project/netdev/list/?delegate=389
|
||||
|
||||
Q: What is the minimum requirement before I submit my BPF patches?
|
||||
|
||||
A: When submitting patches, always take the time and properly test your
|
||||
patches *prior* to submission. Never rush them! If maintainers find
|
||||
that your patches have not been properly tested, it is a good way to
|
||||
get them grumpy. Testing patch submissions is a hard requirement!
|
||||
|
||||
Note, fixes that go to bpf tree *must* have a Fixes: tag included. The
|
||||
same applies to fixes that target bpf-next, where the affected commit
|
||||
is in net-next (or in some cases bpf-next). The Fixes: tag is crucial
|
||||
in order to identify follow-up commits and tremendously helps for people
|
||||
having to do backporting, so it is a must have!
|
||||
|
||||
We also don't accept patches with an empty commit message. Take your
|
||||
time and properly write up a high quality commit message, it is
|
||||
essential!
|
||||
|
||||
Think about it this way: other developers looking at your code a month
|
||||
from now need to understand *why* a certain change has been done that
|
||||
way, and whether there have been flaws in the analysis or assumptions
|
||||
that the original author did. Thus providing a proper rationale and
|
||||
describing the use-case for the changes is a must.
|
||||
|
||||
Patch submissions with >1 patch must have a cover letter which includes
|
||||
a high level description of the series. This high level summary will
|
||||
then be placed into the merge commit by the BPF maintainers such that
|
||||
it is also accessible from the git log for future reference.
|
||||
|
||||
Q: What do I need to consider when adding a new instruction or feature
|
||||
that would require BPF JIT and/or LLVM integration as well?
|
||||
|
||||
A: We try hard to keep all BPF JITs up to date such that the same user
|
||||
experience can be guaranteed when running BPF programs on different
|
||||
architectures without having the program punt to the less efficient
|
||||
interpreter in case the in-kernel BPF JIT is enabled.
|
||||
|
||||
If you are unable to implement or test the required JIT changes for
|
||||
certain architectures, please work together with the related BPF JIT
|
||||
developers in order to get the feature implemented in a timely manner.
|
||||
Please refer to the git log (arch/*/net/) to locate the necessary
|
||||
people for helping out.
|
||||
|
||||
Also always make sure to add BPF test cases (e.g. test_bpf.c and
|
||||
test_verifier.c) for new instructions, so that they can receive
|
||||
broad test coverage and help run-time testing the various BPF JITs.
|
||||
|
||||
In case of new BPF instructions, once the changes have been accepted
|
||||
into the Linux kernel, please implement support into LLVM's BPF back
|
||||
end. See LLVM section below for further information.
|
||||
|
||||
Stable submission:
|
||||
------------------
|
||||
|
||||
Q: I need a specific BPF commit in stable kernels. What should I do?
|
||||
|
||||
A: In case you need a specific fix in stable kernels, first check whether
|
||||
the commit has already been applied in the related linux-*.y branches:
|
||||
|
||||
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/
|
||||
|
||||
If not the case, then drop an email to the BPF maintainers with the
|
||||
netdev kernel mailing list in Cc and ask for the fix to be queued up:
|
||||
|
||||
netdev@vger.kernel.org
|
||||
|
||||
The process in general is the same as on netdev itself, see also the
|
||||
netdev FAQ document.
|
||||
|
||||
Q: Do you also backport to kernels not currently maintained as stable?
|
||||
|
||||
A: No. If you need a specific BPF commit in kernels that are currently not
|
||||
maintained by the stable maintainers, then you are on your own.
|
||||
|
||||
The current stable and longterm stable kernels are all listed here:
|
||||
|
||||
https://www.kernel.org/
|
||||
|
||||
Q: The BPF patch I am about to submit needs to go to stable as well. What
|
||||
should I do?
|
||||
|
||||
A: The same rules apply as with netdev patch submissions in general, see
|
||||
netdev FAQ under:
|
||||
|
||||
Documentation/networking/netdev-FAQ.txt
|
||||
|
||||
Never add "Cc: stable@vger.kernel.org" to the patch description, but
|
||||
ask the BPF maintainers to queue the patches instead. This can be done
|
||||
with a note, for example, under the "---" part of the patch which does
|
||||
not go into the git log. Alternatively, this can be done as a simple
|
||||
request by mail instead.
|
||||
|
||||
Q: Where do I find currently queued BPF patches that will be submitted
|
||||
to stable?
|
||||
|
||||
A: Once patches that fix critical bugs got applied into the bpf tree, they
|
||||
are queued up for stable submission under:
|
||||
|
||||
http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*
|
||||
|
||||
They will be on hold there at minimum until the related commit made its
|
||||
way into the mainline kernel tree.
|
||||
|
||||
After having been under broader exposure, the queued patches will be
|
||||
submitted by the BPF maintainers to the stable maintainers.
|
||||
|
||||
Testing patches:
|
||||
----------------
|
||||
|
||||
Q: Which BPF kernel selftests version should I run my kernel against?
|
||||
|
||||
A: If you run a kernel xyz, then always run the BPF kernel selftests from
|
||||
that kernel xyz as well. Do not expect that the BPF selftest from the
|
||||
latest mainline tree will pass all the time.
|
||||
|
||||
In particular, test_bpf.c and test_verifier.c have a large number of
|
||||
test cases and are constantly updated with new BPF test sequences, or
|
||||
existing ones are adapted to verifier changes e.g. due to verifier
|
||||
becoming smarter and being able to better track certain things.
|
||||
|
||||
LLVM:
|
||||
-----
|
||||
|
||||
Q: Where do I find LLVM with BPF support?
|
||||
|
||||
A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.
|
||||
|
||||
All major distributions these days ship LLVM with BPF back end enabled,
|
||||
so for the majority of use-cases it is not required to compile LLVM by
|
||||
hand anymore, just install the distribution provided package.
|
||||
|
||||
LLVM's static compiler lists the supported targets through 'llc --version',
|
||||
make sure BPF targets are listed. Example:
|
||||
|
||||
$ llc --version
|
||||
LLVM (http://llvm.org/):
|
||||
LLVM version 6.0.0svn
|
||||
Optimized build.
|
||||
Default target: x86_64-unknown-linux-gnu
|
||||
Host CPU: skylake
|
||||
|
||||
Registered Targets:
|
||||
bpf - BPF (host endian)
|
||||
bpfeb - BPF (big endian)
|
||||
bpfel - BPF (little endian)
|
||||
x86 - 32-bit X86: Pentium-Pro and above
|
||||
x86-64 - 64-bit X86: EM64T and AMD64
|
||||
|
||||
For developers in order to utilize the latest features added to LLVM's
|
||||
BPF back end, it is advisable to run the latest LLVM releases. Support
|
||||
for new BPF kernel features such as additions to the BPF instruction
|
||||
set are often developed together.
|
||||
|
||||
All LLVM releases can be found at: http://releases.llvm.org/
|
||||
|
||||
Q: Got it, so how do I build LLVM manually anyway?
|
||||
|
||||
A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have
|
||||
that set up, proceed with building the latest LLVM and clang version
|
||||
from the git repositories:
|
||||
|
||||
$ git clone http://llvm.org/git/llvm.git
|
||||
$ cd llvm/tools
|
||||
$ git clone --depth 1 http://llvm.org/git/clang.git
|
||||
$ cd ..; mkdir build; cd build
|
||||
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
|
||||
-DBUILD_SHARED_LIBS=OFF \
|
||||
-DCMAKE_BUILD_TYPE=Release \
|
||||
-DLLVM_BUILD_RUNTIME=OFF
|
||||
$ make -j $(getconf _NPROCESSORS_ONLN)
|
||||
|
||||
The built binaries can then be found in the build/bin/ directory, where
|
||||
you can point the PATH variable to.
|
||||
|
||||
Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
|
||||
generation back end or about LLVM generated code that the verifier
|
||||
refuses to accept?
|
||||
|
||||
A: Yes, please do! LLVM's BPF back end is a key piece of the whole BPF
|
||||
infrastructure and it ties deeply into verification of programs from the
|
||||
kernel side. Therefore, any issues on either side need to be investigated
|
||||
and fixed whenever necessary.
|
||||
|
||||
Therefore, please make sure to bring them up at netdev kernel mailing
|
||||
list and Cc BPF maintainers for LLVM and kernel bits:
|
||||
|
||||
Yonghong Song <yhs@fb.com>
|
||||
Alexei Starovoitov <ast@kernel.org>
|
||||
Daniel Borkmann <daniel@iogearbox.net>
|
||||
|
||||
LLVM also has an issue tracker where BPF related bugs can be found:
|
||||
|
||||
https://bugs.llvm.org/buglist.cgi?quicksearch=bpf
|
||||
|
||||
However, it is better to reach out through mailing lists with having
|
||||
maintainers in Cc.
|
||||
|
||||
Q: I have added a new BPF instruction to the kernel, how can I integrate
|
||||
it into LLVM?
|
||||
|
||||
A: LLVM has a -mcpu selector for the BPF back end in order to allow the
|
||||
selection of BPF instruction set extensions. By default the 'generic'
|
||||
processor target is used, which is the base instruction set (v1) of BPF.
|
||||
|
||||
LLVM has an option to select -mcpu=probe where it will probe the host
|
||||
kernel for supported BPF instruction set extensions and selects the
|
||||
optimal set automatically.
|
||||
|
||||
For cross-compilation, a specific version can be select manually as well.
|
||||
|
||||
$ llc -march bpf -mcpu=help
|
||||
Available CPUs for this target:
|
||||
|
||||
generic - Select the generic processor.
|
||||
probe - Select the probe processor.
|
||||
v1 - Select the v1 processor.
|
||||
v2 - Select the v2 processor.
|
||||
[...]
|
||||
|
||||
Newly added BPF instructions to the Linux kernel need to follow the same
|
||||
scheme, bump the instruction set version and implement probing for the
|
||||
extensions such that -mcpu=probe users can benefit from the optimization
|
||||
transparently when upgrading their kernels.
|
||||
|
||||
If you are unable to implement support for the newly added BPF instruction,
please reach out to the BPF developers for help.
|
||||
|
||||
By the way, the BPF kernel selftests run with -mcpu=probe for better
|
||||
test coverage.
|
||||
|
||||
Q: In some cases the clang flag "-target bpf" is used, while in other cases
the default clang target, which matches the underlying architecture, is used.
What is the difference, and when should I use which?
|
||||
|
||||
A: Although LLVM IR generation and optimization try to stay architecture
|
||||
independent, "-target <arch>" still has some impact on generated code:
|
||||
|
||||
- A BPF program may recursively include header file(s) with file scope
  inline assembly code. The default target can handle this well, while
  the bpf target may fail if the bpf back end assembler does not
  understand this assembly code, which is true in most cases.
|
||||
|
||||
- When compiled without -g, additional elf sections, e.g.,
|
||||
.eh_frame and .rela.eh_frame, may be present in the object file
|
||||
with default target, but not with bpf target.
|
||||
|
||||
- The default target may turn a C switch statement into a switch table
|
||||
lookup and jump operation. Since the switch table is placed
|
||||
in the global readonly section, the bpf program will fail to load.
|
||||
The bpf target does not support switch table optimization.
|
||||
The clang option "-fno-jump-tables" can be used to disable
|
||||
switch table generation.
|
||||
|
||||
- For clang -target bpf, it is guaranteed that pointer or long /
  unsigned long types will always have a width of 64 bit, no matter
  whether the underlying clang binary or the default target (or kernel)
  is 32 bit. However, when the native clang target is used, these types
  are compiled according to the underlying architecture's conventions:
  on a 32 bit architecture, pointer or long / unsigned long types, e.g.
  in a BPF context structure, will have a width of 32 bit while the BPF
  LLVM back end still operates in 64 bit. The native target is mostly
  needed in tracing for the case of walking pt_regs or other kernel
  structures where the CPU's register width matters. Otherwise, clang
  -target bpf is generally recommended (see the sketch below).
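
  A small sketch of that width difference (struct example_ctx and its
  members are made up purely for illustration):

    struct example_ctx {
            void *data;           /* always 8 bytes with clang -target bpf */
            unsigned long cookie; /* 8 bytes with -target bpf, but only
                                   * 4 bytes when compiled with a 32-bit
                                   * native target
                                   */
    };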
|
||||
|
||||
You should use default target when:
|
||||
|
||||
- Your program includes a header file, e.g., ptrace.h, which eventually
|
||||
pulls in some header files containing file scope host assembly codes.
|
||||
- You can add "-fno-jump-tables" to work around the switch table issue.
|
||||
|
||||
Otherwise, you can use bpf target.
|
||||
|
||||
Happy BPF hacking!
File diff suppressed because it is too large
@ -1,307 +0,0 @@
|
||||
========================
|
||||
The Common Clk Framework
|
||||
========================
|
||||
|
||||
:Author: Mike Turquette <mturquette@ti.com>
|
||||
|
||||
This document endeavours to explain the common clk framework details,
|
||||
and how to port a platform over to this framework. It is not yet a
|
||||
detailed explanation of the clock api in include/linux/clk.h, but
|
||||
perhaps someday it will include that information.
|
||||
|
||||
Introduction and interface split
|
||||
================================
|
||||
|
||||
The common clk framework is an interface to control the clock nodes
|
||||
available on various devices today. This may come in the form of clock
|
||||
gating, rate adjustment, muxing or other operations. This framework is
|
||||
enabled with the CONFIG_COMMON_CLK option.
|
||||
|
||||
The interface itself is divided into two halves, each shielded from the
|
||||
details of its counterpart. First is the common definition of struct
|
||||
clk which unifies the framework-level accounting and infrastructure that
|
||||
has traditionally been duplicated across a variety of platforms. Second
|
||||
is a common implementation of the clk.h api, defined in
|
||||
drivers/clk/clk.c. Finally there is struct clk_ops, whose operations
|
||||
are invoked by the clk api implementation.
|
||||
|
||||
The second half of the interface is comprised of the hardware-specific
|
||||
callbacks registered with struct clk_ops and the corresponding
|
||||
hardware-specific structures needed to model a particular clock. For
|
||||
the remainder of this document any reference to a callback in struct
|
||||
clk_ops, such as .enable or .set_rate, implies the hardware-specific
|
||||
implementation of that code. Likewise, references to struct clk_foo
|
||||
serve as a convenient shorthand for the implementation of the
|
||||
hardware-specific bits for the hypothetical "foo" hardware.
|
||||
|
||||
Tying the two halves of this interface together is struct clk_hw, which
|
||||
is defined in struct clk_foo and pointed to within struct clk_core. This
|
||||
allows for easy navigation between the two discrete halves of the common
|
||||
clock interface.
|
||||
|
||||
Common data structures and api
|
||||
==============================
|
||||
|
||||
Below is the common struct clk_core definition from
|
||||
drivers/clk/clk.c, modified for brevity::
|
||||
|
||||
struct clk_core {
|
||||
const char *name;
|
||||
const struct clk_ops *ops;
|
||||
struct clk_hw *hw;
|
||||
struct module *owner;
|
||||
struct clk_core *parent;
|
||||
const char **parent_names;
|
||||
struct clk_core **parents;
|
||||
u8 num_parents;
|
||||
u8 new_parent_index;
|
||||
...
|
||||
};
|
||||
|
||||
The members above make up the core of the clk tree topology. The clk
|
||||
api itself defines several driver-facing functions which operate on
|
||||
struct clk. That api is documented in include/linux/clk.h.
|
||||
|
||||
Platforms and devices utilizing the common struct clk_core use the struct
|
||||
clk_ops pointer in struct clk_core to perform the hardware-specific parts of
|
||||
the operations defined in clk-provider.h::
|
||||
|
||||
struct clk_ops {
|
||||
int (*prepare)(struct clk_hw *hw);
|
||||
void (*unprepare)(struct clk_hw *hw);
|
||||
int (*is_prepared)(struct clk_hw *hw);
|
||||
void (*unprepare_unused)(struct clk_hw *hw);
|
||||
int (*enable)(struct clk_hw *hw);
|
||||
void (*disable)(struct clk_hw *hw);
|
||||
int (*is_enabled)(struct clk_hw *hw);
|
||||
void (*disable_unused)(struct clk_hw *hw);
|
||||
unsigned long (*recalc_rate)(struct clk_hw *hw,
|
||||
unsigned long parent_rate);
|
||||
long (*round_rate)(struct clk_hw *hw,
|
||||
unsigned long rate,
|
||||
unsigned long *parent_rate);
|
||||
int (*determine_rate)(struct clk_hw *hw,
|
||||
struct clk_rate_request *req);
|
||||
int (*set_parent)(struct clk_hw *hw, u8 index);
|
||||
u8 (*get_parent)(struct clk_hw *hw);
|
||||
int (*set_rate)(struct clk_hw *hw,
|
||||
unsigned long rate,
|
||||
unsigned long parent_rate);
|
||||
int (*set_rate_and_parent)(struct clk_hw *hw,
|
||||
unsigned long rate,
|
||||
unsigned long parent_rate,
|
||||
u8 index);
|
||||
unsigned long (*recalc_accuracy)(struct clk_hw *hw,
|
||||
unsigned long parent_accuracy);
|
||||
int (*get_phase)(struct clk_hw *hw);
|
||||
int (*set_phase)(struct clk_hw *hw, int degrees);
|
||||
void (*init)(struct clk_hw *hw);
|
||||
int (*debug_init)(struct clk_hw *hw,
|
||||
struct dentry *dentry);
|
||||
};
|
||||
|
||||
Hardware clk implementations
|
||||
============================
|
||||
|
||||
The strength of the common struct clk_core comes from its .ops and .hw pointers
|
||||
which abstract the details of struct clk from the hardware-specific bits, and
|
||||
vice versa. To illustrate, consider the simple gateable clk implementation in
|
||||
drivers/clk/clk-gate.c::
|
||||
|
||||
struct clk_gate {
|
||||
struct clk_hw hw;
|
||||
void __iomem *reg;
|
||||
u8 bit_idx;
|
||||
...
|
||||
};
|
||||
|
||||
struct clk_gate contains struct clk_hw hw as well as hardware-specific
|
||||
knowledge about which register and bit controls this clk's gating.
|
||||
Nothing about clock topology or accounting, such as enable_count or
|
||||
notifier_count, is needed here. That is all handled by the common
|
||||
framework code and struct clk_core.
|
||||
|
||||
Let's walk through enabling this clk from driver code::
|
||||
|
||||
struct clk *clk;
|
||||
clk = clk_get(NULL, "my_gateable_clk");
|
||||
|
||||
clk_prepare(clk);
|
||||
clk_enable(clk);
|
||||
|
||||
The call graph for clk_enable is very simple::
|
||||
|
||||
clk_enable(clk);
|
||||
clk->ops->enable(clk->hw);
|
||||
[resolves to...]
|
||||
clk_gate_enable(hw);
|
||||
[resolves struct clk_gate with to_clk_gate(hw)]
|
||||
clk_gate_set_bit(gate);
|
||||
|
||||
And the definition of clk_gate_set_bit::
|
||||
|
||||
static void clk_gate_set_bit(struct clk_gate *gate)
|
||||
{
|
||||
u32 reg;
|
||||
|
||||
reg = __raw_readl(gate->reg);
|
||||
reg |= BIT(gate->bit_idx);
|
||||
writel(reg, gate->reg);
|
||||
}
|
||||
|
||||
Note that to_clk_gate is defined as::
|
||||
|
||||
#define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw)
|
||||
|
||||
This pattern of abstraction is used for every clock hardware
|
||||
representation.
|
||||
|
||||
Supporting your own clk hardware
|
||||
================================
|
||||
|
||||
When implementing support for a new type of clock it is only necessary to
|
||||
include the following header::
|
||||
|
||||
#include <linux/clk-provider.h>
|
||||
|
||||
To construct a clk hardware structure for your platform you must define
|
||||
the following::
|
||||
|
||||
struct clk_foo {
|
||||
struct clk_hw hw;
|
||||
... hardware specific data goes here ...
|
||||
};
|
||||
|
||||
To take advantage of your data you'll need to support valid operations
|
||||
for your clk::
|
||||
|
||||
struct clk_ops clk_foo_ops = {
	.enable = &clk_foo_enable,
	.disable = &clk_foo_disable,
};
|
||||
|
||||
Implement the above functions using container_of::
|
||||
|
||||
#define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw)
|
||||
|
||||
int clk_foo_enable(struct clk_hw *hw)
|
||||
{
|
||||
struct clk_foo *foo;
|
||||
|
||||
foo = to_clk_foo(hw);
|
||||
|
||||
... perform magic on foo ...
|
||||
|
||||
return 0;
|
||||
};
|
||||
|
||||
Below is a matrix detailing which clk_ops are mandatory based upon the
|
||||
hardware capabilities of that clock. A cell marked as "y" means
|
||||
mandatory, a cell marked as "n" implies that either including that
|
||||
callback is invalid or otherwise unnecessary. Empty cells are either
|
||||
optional or must be evaluated on a case-by-case basis.
|
||||
|
||||
.. table:: clock hardware characteristics
|
||||
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
| | gate | change rate | single parent | multiplexer | root |
|
||||
+================+======+=============+===============+=============+======+
|
||||
|.prepare | | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.unprepare | | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.enable | y | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.disable | y | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.is_enabled | y | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.recalc_rate | | y | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.round_rate | | y [1]_ | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.determine_rate | | y [1]_ | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.set_rate | | y | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.set_parent | | | n | y | n |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.get_parent | | | n | y | n |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.recalc_accuracy| | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|.init | | | | | |
|
||||
+----------------+------+-------------+---------------+-------------+------+
|
||||
|
||||
.. [1] either one of round_rate or determine_rate is required.
|
||||
|
||||
Finally, register your clock at run-time with a hardware-specific
|
||||
registration function. This function simply populates struct clk_foo's
|
||||
data and then passes the common struct clk parameters to the framework
|
||||
with a call to::
|
||||
|
||||
clk_register(...)
|
||||
|
||||
See the basic clock types in ``drivers/clk/clk-*.c`` for examples.
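
A rough sketch of such a registration helper is shown below. clk_foo_register()
and the way clk_foo stores its hardware details are hypothetical; the
struct clk_init_data / clk_register() usage follows include/linux/clk-provider.h::

	struct clk *clk_foo_register(const char *name)
	{
		struct clk_foo *foo;
		struct clk_init_data init = { };

		foo = kzalloc(sizeof(*foo), GFP_KERNEL);
		if (!foo)
			return ERR_PTR(-ENOMEM);

		init.name = name;
		init.ops = &clk_foo_ops;
		init.num_parents = 0;	/* a root clock, for simplicity */

		/* ... fill in the hardware-specific members of *foo here ... */

		foo->hw.init = &init;

		/* hand the common parts over to the framework */
		return clk_register(NULL, &foo->hw);
	}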
|
||||
|
||||
Disabling clock gating of unused clocks
|
||||
=======================================
|
||||
|
||||
Sometimes during development it can be useful to be able to bypass the
|
||||
default disabling of unused clocks. For example, if drivers aren't enabling
|
||||
clocks properly but rely on them being on from the bootloader, bypassing
|
||||
the disabling means that the driver will remain functional while the issues
|
||||
are sorted out.
|
||||
|
||||
To bypass this disabling, include "clk_ignore_unused" in the bootargs to the
|
||||
kernel.
|
||||
|
||||
Locking
|
||||
=======
|
||||
|
||||
The common clock framework uses two global locks, the prepare lock and the
|
||||
enable lock.
|
||||
|
||||
The enable lock is a spinlock and is held across calls to the .enable,
|
||||
.disable operations. Those operations are thus not allowed to sleep,
|
||||
and calls to the clk_enable(), clk_disable() API functions are allowed in
|
||||
atomic context.
|
||||
|
||||
The clk_is_enabled() API is also designed to be usable in atomic context.
However, it doesn't really make sense to hold the enable lock in the core,
unless the caller wants to do something else with the enable-state
information while that lock is held. Otherwise, checking whether a clk is
enabled is a one-shot read of the enabled state, which could just as easily
change after the function returns because the lock is released. Thus the
user of this API needs to synchronize the read of the state with whatever
they are using it for, to make sure that the enable state doesn't change
during that time.
|
||||
|
||||
The prepare lock is a mutex and is held across calls to all other operations.
|
||||
All those operations are allowed to sleep, and calls to the corresponding API
|
||||
functions are not allowed in atomic context.
|
||||
|
||||
This effectively divides operations in two groups from a locking perspective.
|
||||
|
||||
Drivers don't need to manually protect resources shared between the operations
|
||||
of one group, regardless of whether those resources are shared by multiple
|
||||
clocks or not. However, access to resources that are shared between operations
|
||||
of the two groups needs to be protected by the drivers. An example of such a
|
||||
resource would be a register that controls both the clock rate and the clock
|
||||
enable/disable state.
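
A minimal sketch of such driver-side protection (clk_foo, its lock member and
the FOO_GATE_BIT macro are hypothetical; only the locking pattern matters)::

	struct clk_foo {
		struct clk_hw hw;
		void __iomem *reg;	/* holds both rate-select and gate bits */
		spinlock_t lock;	/* protects *reg across both op groups */
	};

	static int clk_foo_enable(struct clk_hw *hw)
	{
		struct clk_foo *foo = to_clk_foo(hw);
		unsigned long flags;
		u32 val;

		spin_lock_irqsave(&foo->lock, flags);
		val = readl(foo->reg);
		writel(val | FOO_GATE_BIT, foo->reg);
		spin_unlock_irqrestore(&foo->lock, flags);

		return 0;
	}

	/* .set_rate would take the same foo->lock before touching the rate bits */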
|
||||
|
||||
The clock framework is reentrant, in that a driver is allowed to call clock
|
||||
framework functions from within its implementation of clock operations. This
|
||||
can for instance cause a .set_rate operation of one clock being called from
|
||||
within the .set_rate operation of another clock. This case must be considered
|
||||
in the driver implementations, but the code flow is usually controlled by the
|
||||
driver in that case.
|
||||
|
||||
Note that locking must also be considered when code outside of the common
|
||||
clock framework needs to access resources used by the clock operations. This
|
||||
is considered out of scope of this document.
|
@ -111,7 +111,6 @@ If the compiler can prove that do_something() does not store to the
|
||||
variable a, then the compiler is within its rights transforming this to
|
||||
the following::
|
||||
|
||||
tmp = a;
|
||||
if (a > 0)
|
||||
for (;;)
|
||||
do_something();
|
||||
@ -119,7 +118,7 @@ the following::
|
||||
If you don't want the compiler to do this (and you probably don't), then
|
||||
you should use something like the following::
|
||||
|
||||
while (READ_ONCE(a) < 0)
|
||||
while (READ_ONCE(a) > 0)
|
||||
do_something();
|
||||
|
||||
Alternatively, you could place a barrier() call in the loop.
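
A minimal sketch of that alternative, assuming the same 'a' and
do_something() as above::

	while (a > 0) {
		do_something();
		barrier();	/* forces the compiler to re-read 'a' */
	}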
|
||||
@ -467,10 +466,12 @@ Like the above, except that these routines return a boolean which
|
||||
indicates whether the changed bit was set _BEFORE_ the atomic bit
|
||||
operation.
|
||||
|
||||
WARNING! It is incredibly important that the value be a boolean,
|
||||
ie. "0" or "1". Do not try to be fancy and save a few instructions by
|
||||
declaring the above to return "long" and just returning something like
|
||||
"old_val & mask" because that will not work.
|
||||
|
||||
.. warning::
|
||||
It is incredibly important that the value be a boolean, ie. "0" or "1".
|
||||
Do not try to be fancy and save a few instructions by declaring the
|
||||
above to return "long" and just returning something like "old_val &
|
||||
mask" because that will not work.
|
||||
|
||||
For one thing, this return value gets truncated to int in many code
|
||||
paths using these interfaces, so on 64-bit if the bit is set in the
|
||||
|
66
Documentation/core-api/gfp_mask-from-fs-io.rst
Normal file
@ -0,0 +1,66 @@
|
||||
=================================
|
||||
GFP masks used from FS/IO context
|
||||
=================================
|
||||
|
||||
:Date: May, 2018
|
||||
:Author: Michal Hocko <mhocko@kernel.org>
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
Code paths in the filesystem and IO stacks must be careful when
|
||||
allocating memory to prevent recursion deadlocks caused by direct
|
||||
memory reclaim calling back into the FS or IO paths and blocking on
|
||||
already held resources (e.g. locks - most commonly those used for the
|
||||
transaction context).
|
||||
|
||||
The traditional way to avoid this deadlock problem is to clear __GFP_FS
|
||||
respectively __GFP_IO (note the latter implies clearing the first as well) in
|
||||
the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
|
||||
used as shortcuts. It turned out, though, that the above approach has led to
|
||||
abuses when the restricted gfp mask is used "just in case" without a
|
||||
deeper consideration which leads to problems because an excessive use
|
||||
of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
|
||||
reclaim issues.
|
||||
|
||||
New API
|
||||
========
|
||||
|
||||
Since 4.12 we do have a generic scope API for both NOFS and NOIO context
|
||||
``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
|
||||
``memalloc_noio_restore`` which allow marking a scope as a critical
|
||||
section from a filesystem or I/O point of view. Any allocation from that
|
||||
scope will inherently drop __GFP_FS respectively __GFP_IO from the given
|
||||
mask so no memory allocation can recurse back into the FS/IO.
|
||||
|
||||
.. kernel-doc:: include/linux/sched/mm.h
|
||||
:functions: memalloc_nofs_save memalloc_nofs_restore
|
||||
.. kernel-doc:: include/linux/sched/mm.h
|
||||
:functions: memalloc_noio_save memalloc_noio_restore
|
||||
|
||||
FS/IO code then simply calls the appropriate save function before
|
||||
any critical section with respect to the reclaim is started - e.g. before
taking a lock shared with the reclaim context, or when a transaction context
|
||||
nesting would be possible via reclaim. The restore function should be
|
||||
called when the critical section ends. Ideally, all of this comes along
with an explanation of what the reclaim context is, for easier maintenance.
|
||||
|
||||
Please note that the proper pairing of save/restore functions
|
||||
allows nesting so it is safe to call ``memalloc_noio_save`` or
|
||||
``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
|
||||
scope.
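
A minimal usage sketch (buf and len are placeholders from a hypothetical
caller; the save/restore calls are the documented API)::

	unsigned int nofs_flags;
	void *buf;

	nofs_flags = memalloc_nofs_save();
	/* any allocation in this scope implicitly behaves as GFP_NOFS */
	buf = kmalloc(len, GFP_KERNEL);
	/* ... */
	memalloc_nofs_restore(nofs_flags);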
|
||||
|
||||
What about __vmalloc(GFP_NOFS)
|
||||
==============================
|
||||
|
||||
vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
|
||||
GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
|
||||
to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
|
||||
almost always a bug. The good news is that the NOFS/NOIO semantic can be
|
||||
achieved by the scope API.
|
||||
|
||||
In the ideal world, upper layers should already mark dangerous contexts
|
||||
and so no special care is required and vmalloc should be called without
|
||||
any problems. Sometimes if the context is not really clear or there are
|
||||
layering violations then the recommended way around that is to wrap ``vmalloc``
|
||||
by the scope API with a comment explaining the problem.
|
@ -14,6 +14,7 @@ Core utilities
|
||||
kernel-api
|
||||
assoc_array
|
||||
atomic_ops
|
||||
cachetlb
|
||||
refcount-vs-atomic
|
||||
cpu_hotplug
|
||||
idr
|
||||
@ -25,6 +26,8 @@ Core utilities
|
||||
genalloc
|
||||
errseq
|
||||
printk-formats
|
||||
circular-buffers
|
||||
gfp_mask-from-fs-io
|
||||
|
||||
Interfaces for kernel debugging
|
||||
===============================
|
||||
|
@ -39,17 +39,17 @@ String Manipulation
|
||||
.. kernel-doc:: lib/string.c
|
||||
:export:
|
||||
|
||||
Basic Kernel Library Functions
|
||||
==============================
|
||||
|
||||
The Linux kernel provides more basic utility functions.
|
||||
|
||||
Bit Operations
|
||||
--------------
|
||||
|
||||
.. kernel-doc:: arch/x86/include/asm/bitops.h
|
||||
:internal:
|
||||
|
||||
Basic Kernel Library Functions
|
||||
==============================
|
||||
|
||||
The Linux kernel provides more basic utility functions.
|
||||
|
||||
Bitmap Operations
|
||||
-----------------
|
||||
|
||||
@ -80,6 +80,31 @@ Command-line Parsing
|
||||
.. kernel-doc:: lib/cmdline.c
|
||||
:export:
|
||||
|
||||
Sorting
|
||||
-------
|
||||
|
||||
.. kernel-doc:: lib/sort.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: lib/list_sort.c
|
||||
:export:
|
||||
|
||||
Text Searching
|
||||
--------------
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:doc: ts_intro
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: include/linux/textsearch.h
|
||||
:functions: textsearch_find textsearch_next \
|
||||
textsearch_get_pattern textsearch_get_pattern_len
|
||||
|
||||
CRC and Math Functions in Linux
|
||||
===============================
|
||||
|
||||
CRC Functions
|
||||
-------------
|
||||
|
||||
@ -103,9 +128,6 @@ CRC Functions
|
||||
.. kernel-doc:: lib/crc-itu-t.c
|
||||
:export:
|
||||
|
||||
Math Functions in Linux
|
||||
=======================
|
||||
|
||||
Base 2 log and power Functions
|
||||
------------------------------
|
||||
|
||||
@ -127,28 +149,6 @@ Division Functions
|
||||
.. kernel-doc:: lib/gcd.c
|
||||
:export:
|
||||
|
||||
Sorting
|
||||
-------
|
||||
|
||||
.. kernel-doc:: lib/sort.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: lib/list_sort.c
|
||||
:export:
|
||||
|
||||
Text Searching
|
||||
--------------
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:doc: ts_intro
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: include/linux/textsearch.h
|
||||
:functions: textsearch_find textsearch_next \
|
||||
textsearch_get_pattern textsearch_get_pattern_len
|
||||
|
||||
UUID/GUID
|
||||
---------
|
||||
|
||||
@ -284,7 +284,7 @@ Resources Management
|
||||
MTRR Handling
|
||||
-------------
|
||||
|
||||
.. kernel-doc:: arch/x86/kernel/cpu/mtrr/main.c
|
||||
.. kernel-doc:: arch/x86/kernel/cpu/mtrr/mtrr.c
|
||||
:export:
|
||||
|
||||
Security Framework
|
||||
|
@ -419,11 +419,10 @@ struct clk
|
||||
|
||||
%pC pll1
|
||||
%pCn pll1
|
||||
%pCr 1560000000
|
||||
|
||||
For printing struct clk structures. %pC and %pCn print the name
|
||||
(Common Clock Framework) or address (legacy clock framework) of the
|
||||
structure; %pCr prints the current clock rate.
|
||||
structure.
|
||||
|
||||
Passed by reference.
|
||||
|
||||
|
@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in
|
||||
these memory ordering guarantees.
|
||||
|
||||
The terms used through this document try to follow the formal LKMM defined in
|
||||
github.com/aparri/memory-model/blob/master/Documentation/explanation.txt
|
||||
tools/memory-model/Documentation/explanation.txt.
|
||||
|
||||
memory-barriers.txt and atomic_t.txt provide more background to the
|
||||
memory ordering in general and for atomic operations specifically.
|
||||
|
@ -8,11 +8,13 @@ The crypto engine API (CE), is a crypto queue manager.
|
||||
|
||||
Requirement
|
||||
-----------
|
||||
You have to put at start of your tfm_ctx the struct crypto_engine_ctx
|
||||
struct your_tfm_ctx {
|
||||
You have to put at start of your tfm_ctx the struct crypto_engine_ctx::
|
||||
|
||||
struct your_tfm_ctx {
|
||||
struct crypto_engine_ctx enginectx;
|
||||
...
|
||||
};
|
||||
};
|
||||
|
||||
Why: Since CE manages only crypto_async_request, it cannot know the underlying
request type and so has access only to the TFM.
So using container_of for accessing __ctx is impossible.
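
A rough sketch of the driver side, reusing the your_tfm_ctx layout shown above
(your_do_one_request and the skcipher use case are hypothetical; the point is
that the driver knows its own request type, so it can recover its context even
though the engine only sees a crypto_async_request)::

	static int your_do_one_request(struct crypto_engine *engine, void *areq)
	{
		struct skcipher_request *req =
			container_of(areq, struct skcipher_request, base);
		struct your_tfm_ctx *ctx =
			crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));

		/* ... program the hardware using the data in *ctx ... */
		return 0;
	}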
|
||||
|
@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples.
|
||||
architecture
|
||||
devel-algos
|
||||
userspace-if
|
||||
crypto_engine
|
||||
api
|
||||
api-samples
|
||||
|
@ -121,10 +121,7 @@ read back the image downloaded.
|
||||
|
||||
.. note::
|
||||
|
||||
This driver requires a patch for firmware_class.c which has the modified
|
||||
request_firmware_nowait function.
|
||||
|
||||
Also after updating the BIOS image a user mode application needs to execute
|
||||
After updating the BIOS image a user mode application needs to execute
|
||||
code which sends the BIOS update request to the BIOS. So on the next reboot
|
||||
the BIOS knows about the new image downloaded and it updates itself.
|
||||
Also don't unload the rbu driver if the image has to be updated.
|
||||
|
@ -120,7 +120,7 @@ A typical out of bounds access report looks like this::
|
||||
|
||||
The header of the report describes what kind of bug happened and what kind of
|
||||
access caused it. It's followed by the description of the accessed slub object
|
||||
(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
|
||||
(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and
|
||||
the description of the accessed memory page.
|
||||
|
||||
In the last section the report shows memory state around the accessed address.
|
||||
|
@ -151,6 +151,11 @@ Contributing new tests (details)
|
||||
TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
|
||||
test.
|
||||
|
||||
* First use the headers inside the kernel source and/or git repo, and then the
|
||||
system headers. Headers for the kernel release as opposed to headers
|
||||
installed by the distro on the system should be the primary focus to be able
|
||||
to find regressions.
|
||||
|
||||
Test Harness
|
||||
============
|
||||
|
||||
|
@ -264,7 +264,10 @@ i) Constructor
|
||||
data device, but just remove the mapping.
|
||||
|
||||
read_only: Don't allow any changes to be made to the pool
|
||||
metadata.
|
||||
metadata. This mode is only available after the
|
||||
thin-pool has been created and first used in full
|
||||
read/write mode. It cannot be specified on initial
|
||||
thin-pool creation.
|
||||
|
||||
error_if_no_space: Error IOs, instead of queueing, if no space.
|
||||
|
||||
|
68
Documentation/device-mapper/writecache.txt
Normal file
@ -0,0 +1,68 @@
|
||||
The writecache target caches writes on persistent memory or on SSD. It
|
||||
doesn't cache reads because reads are supposed to be cached in page cache
|
||||
in normal RAM.
|
||||
|
||||
When the device is constructed, the first sector should be zeroed or the
|
||||
first sector should contain valid superblock from previous invocation.
|
||||
|
||||
Constructor parameters:
|
||||
1. type of the cache device - "p" or "s"
|
||||
p - persistent memory
|
||||
s - SSD
|
||||
2. the underlying device that will be cached
|
||||
3. the cache device
|
||||
4. block size (4096 is recommended; the maximum block size is the page
|
||||
size)
|
||||
5. the number of optional parameters (the parameters with an argument
|
||||
count as two)
|
||||
high_watermark n (default: 50)
|
||||
start writeback when the number of used blocks reaches this
|
||||
watermark
|
||||
low_watermark x (default: 45)
|
||||
stop writeback when the number of used blocks drops below
|
||||
this watermark
|
||||
writeback_jobs n (default: unlimited)
|
||||
limit the number of blocks that are in flight during
|
||||
writeback. Setting this value reduces writeback
|
||||
throughput, but it may improve latency of read requests
|
||||
autocommit_blocks n (default: 64 for pmem, 65536 for ssd)
|
||||
when the application writes this amount of blocks without
|
||||
issuing the FLUSH request, the blocks are automatically
|
||||
committed
|
||||
autocommit_time ms (default: 1000)
|
||||
autocommit time in milliseconds. The data is automatically
|
||||
committed if this time passes and no FLUSH request is
|
||||
received
|
||||
fua (by default on)
|
||||
applicable only to persistent memory - use the FUA flag
|
||||
when writing data from persistent memory back to the
|
||||
underlying device
|
||||
nofua
|
||||
applicable only to persistent memory - don't use the FUA
|
||||
flag when writing back data and send the FLUSH request
|
||||
afterwards
|
||||
- some underlying devices perform better with fua, some
|
||||
with nofua. The user should test it
|
||||
|
||||
Status:
|
||||
1. error indicator - 0 if there was no error, otherwise error number
|
||||
2. the number of blocks
|
||||
3. the number of free blocks
|
||||
4. the number of blocks under writeback
|
||||
|
||||
Messages:
|
||||
flush
|
||||
flush the cache device. The message returns successfully
|
||||
if the cache device was flushed without an error
|
||||
flush_on_suspend
|
||||
flush the cache device on next suspend. Use this message
|
||||
when you are going to remove the cache device. The proper
|
||||
sequence for removing the cache device is:
|
||||
1. send the "flush_on_suspend" message
|
||||
2. load an inactive table with a linear target that maps
|
||||
to the underlying device
|
||||
3. suspend the device
|
||||
4. ask for status and verify that there are no errors
|
||||
5. resume the device, so that it will use the linear
|
||||
target
|
||||
6. the cache device is now inactive and it can be deleted
|
@ -1,233 +0,0 @@
|
||||
Altera SoCFPGA ECC Manager
|
||||
This driver uses the EDAC framework to implement the SOCFPGA ECC Manager.
|
||||
The ECC Manager counts and corrects single bit errors and counts/handles
|
||||
double bit errors which are uncorrectable.
|
||||
|
||||
Cyclone5 and Arria5 ECC Manager
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-ecc-manager"
|
||||
- #address-cells: must be 1
|
||||
- #size-cells: must be 1
|
||||
- ranges : standard definition, should translate from local addresses
|
||||
|
||||
Subcomponents:
|
||||
|
||||
L2 Cache ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-l2-ecc"
|
||||
- reg : Address and size for ECC error interrupt clear registers.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt. Note the rising edge type.
|
||||
|
||||
On Chip RAM ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-ocram-ecc"
|
||||
- reg : Address and size for ECC error interrupt clear registers.
|
||||
- iram : phandle to On-Chip RAM definition.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt. Note the rising edge type.
|
||||
|
||||
Example:
|
||||
|
||||
eccmgr: eccmgr@ffd08140 {
|
||||
compatible = "altr,socfpga-ecc-manager";
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
ranges;
|
||||
|
||||
l2-ecc@ffd08140 {
|
||||
compatible = "altr,socfpga-l2-ecc";
|
||||
reg = <0xffd08140 0x4>;
|
||||
interrupts = <0 36 1>, <0 37 1>;
|
||||
};
|
||||
|
||||
ocram-ecc@ffd08144 {
|
||||
compatible = "altr,socfpga-ocram-ecc";
|
||||
reg = <0xffd08144 0x4>;
|
||||
iram = <&ocram>;
|
||||
interrupts = <0 178 1>, <0 179 1>;
|
||||
};
|
||||
};
|
||||
|
||||
Arria10 SoCFPGA ECC Manager
|
||||
The Arria10 SoC ECC Manager handles the IRQs for each peripheral
|
||||
in a shared register instead of individual IRQs like the Cyclone5
|
||||
and Arria5. Therefore the device tree is different as well.
|
||||
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-a10-ecc-manager"
|
||||
- altr,sysgr-syscon : phandle to Arria10 System Manager Block
|
||||
containing the ECC manager registers.
|
||||
- #address-cells: must be 1
|
||||
- #size-cells: must be 1
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt.
|
||||
- interrupt-controller : boolean indicator that ECC Manager is an interrupt controller
|
||||
- #interrupt-cells : must be set to 2.
|
||||
- ranges : standard definition, should translate from local addresses
|
||||
|
||||
Subcomponents:
|
||||
|
||||
L2 Cache ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-a10-l2-ecc"
|
||||
- reg : Address and size for ECC error interrupt clear registers.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
On-Chip RAM ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-a10-ocram-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
Ethernet FIFO ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-eth-mac-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- altr,ecc-parent : phandle to parent Ethernet node.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
NAND FIFO ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-nand-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- altr,ecc-parent : phandle to parent NAND node.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
DMA FIFO ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-dma-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- altr,ecc-parent : phandle to parent DMA node.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
USB FIFO ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-usb-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- altr,ecc-parent : phandle to parent USB node.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
QSPI FIFO ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-qspi-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- altr,ecc-parent : phandle to parent QSPI node.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order.
|
||||
|
||||
SDMMC FIFO ECC
|
||||
Required Properties:
|
||||
- compatible : Should be "altr,socfpga-sdmmc-ecc"
|
||||
- reg : Address and size for ECC block registers.
|
||||
- altr,ecc-parent : phandle to parent SD/MMC node.
|
||||
- interrupts : Should be single bit error interrupt, then double bit error
|
||||
interrupt, in this order for port A, and then single bit error interrupt,
|
||||
then double bit error interrupt in this order for port B.
|
||||
|
||||
Example:
|
||||
|
||||
eccmgr: eccmgr@ffd06000 {
|
||||
compatible = "altr,socfpga-a10-ecc-manager";
|
||||
altr,sysmgr-syscon = <&sysmgr>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
interrupts = <0 2 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<0 0 IRQ_TYPE_LEVEL_HIGH>;
|
||||
interrupt-controller;
|
||||
#interrupt-cells = <2>;
|
||||
ranges;
|
||||
|
||||
l2-ecc@ffd06010 {
|
||||
compatible = "altr,socfpga-a10-l2-ecc";
|
||||
reg = <0xffd06010 0x4>;
|
||||
interrupts = <0 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<32 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
ocram-ecc@ff8c3000 {
|
||||
compatible = "altr,socfpga-a10-ocram-ecc";
|
||||
reg = <0xff8c3000 0x90>;
|
||||
interrupts = <1 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<33 IRQ_TYPE_LEVEL_HIGH> ;
|
||||
};
|
||||
|
||||
emac0-rx-ecc@ff8c0800 {
|
||||
compatible = "altr,socfpga-eth-mac-ecc";
|
||||
reg = <0xff8c0800 0x400>;
|
||||
altr,ecc-parent = <&gmac0>;
|
||||
interrupts = <4 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<36 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
emac0-tx-ecc@ff8c0c00 {
|
||||
compatible = "altr,socfpga-eth-mac-ecc";
|
||||
reg = <0xff8c0c00 0x400>;
|
||||
altr,ecc-parent = <&gmac0>;
|
||||
interrupts = <5 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<37 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
nand-buf-ecc@ff8c2000 {
|
||||
compatible = "altr,socfpga-nand-ecc";
|
||||
reg = <0xff8c2000 0x400>;
|
||||
altr,ecc-parent = <&nand>;
|
||||
interrupts = <11 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<43 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
nand-rd-ecc@ff8c2400 {
|
||||
compatible = "altr,socfpga-nand-ecc";
|
||||
reg = <0xff8c2400 0x400>;
|
||||
altr,ecc-parent = <&nand>;
|
||||
interrupts = <13 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<45 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
nand-wr-ecc@ff8c2800 {
|
||||
compatible = "altr,socfpga-nand-ecc";
|
||||
reg = <0xff8c2800 0x400>;
|
||||
altr,ecc-parent = <&nand>;
|
||||
interrupts = <12 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<44 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
dma-ecc@ff8c8000 {
|
||||
compatible = "altr,socfpga-dma-ecc";
|
||||
reg = <0xff8c8000 0x400>;
|
||||
altr,ecc-parent = <&pdma>;
|
||||
interrupts = <10 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<42 IRQ_TYPE_LEVEL_HIGH>;
|
||||
|
||||
usb0-ecc@ff8c8800 {
|
||||
compatible = "altr,socfpga-usb-ecc";
|
||||
reg = <0xff8c8800 0x400>;
|
||||
altr,ecc-parent = <&usb0>;
|
||||
interrupts = <2 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<34 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
qspi-ecc@ff8c8400 {
|
||||
compatible = "altr,socfpga-qspi-ecc";
|
||||
reg = <0xff8c8400 0x400>;
|
||||
altr,ecc-parent = <&qspi>;
|
||||
interrupts = <14 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<46 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
|
||||
sdmmc-ecc@ff8c2c00 {
|
||||
compatible = "altr,socfpga-sdmmc-ecc";
|
||||
reg = <0xff8c2c00 0x400>;
|
||||
altr,ecc-parent = <&mmc>;
|
||||
interrupts = <15 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<47 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<16 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<48 IRQ_TYPE_LEVEL_HIGH>;
|
||||
};
|
||||
};
|
@ -25,6 +25,10 @@ Boards with the Amlogic Meson8b SoC shall have the following properties:
|
||||
Required root node property:
|
||||
compatible: "amlogic,meson8b";
|
||||
|
||||
Boards with the Amlogic Meson8m2 SoC shall have the following properties:
|
||||
Required root node property:
|
||||
compatible: "amlogic,meson8m2";
|
||||
|
||||
Boards with the Amlogic Meson GXBaby SoC shall have the following properties:
|
||||
Required root node property:
|
||||
compatible: "amlogic,meson-gxbb";
|
||||
@ -54,6 +58,8 @@ Board compatible values (alphabetically, grouped by SoC):
|
||||
- "hardkernel,odroid-c1" (Meson8b)
|
||||
- "tronfy,mxq" (Meson8b)
|
||||
|
||||
- "tronsmart,mxiii-plus" (Meson8m2)
|
||||
|
||||
- "amlogic,p200" (Meson gxbb)
|
||||
- "amlogic,p201" (Meson gxbb)
|
||||
- "friendlyarm,nanopi-k2" (Meson gxbb)
|
||||
|
@ -34,6 +34,10 @@ Raspberry Pi 3 Model B
|
||||
Required root node properties:
|
||||
compatible = "raspberrypi,3-model-b", "brcm,bcm2837";
|
||||
|
||||
Raspberry Pi 3 Model B+
|
||||
Required root node properties:
|
||||
compatible = "raspberrypi,3-model-b-plus", "brcm,bcm2837";
|
||||
|
||||
Raspberry Pi Compute Module
|
||||
Required root node properties:
|
||||
compatible = "raspberrypi,compute-module", "brcm,bcm2835";
|
||||
|
@ -0,0 +1,30 @@
|
||||
MediaTek g3dsys controller
|
||||
============================
|
||||
|
||||
The MediaTek g3dsys controller provides various clocks and reset controller to
|
||||
the GPU.
|
||||
|
||||
Required Properties:
|
||||
|
||||
- compatible: Should be:
|
||||
- "mediatek,mt2701-g3dsys", "syscon":
|
||||
for MT2701 SoC
|
||||
- "mediatek,mt7623-g3dsys", "mediatek,mt2701-g3dsys", "syscon":
|
||||
for MT7623 SoC
|
||||
- #clock-cells: Must be 1
|
||||
- #reset-cells: Must be 1
|
||||
|
||||
The g3dsys controller uses the common clk binding from
|
||||
Documentation/devicetree/bindings/clock/clock-bindings.txt
|
||||
The available clocks are defined in dt-bindings/clock/mt*-clk.h.
|
||||
|
||||
Example:
|
||||
|
||||
g3dsys: clock-controller@13000000 {
|
||||
compatible = "mediatek,mt7623-g3dsys",
|
||||
"mediatek,mt2701-g3dsys",
|
||||
"syscon";
|
||||
reg = <0 0x13000000 0 0x200>;
|
||||
#clock-cells = <1>;
|
||||
#reset-cells = <1>;
|
||||
};
|
@ -21,8 +21,6 @@ Required root node properties:
|
||||
- "samsung,smdk5420" - for Exynos5420-based Samsung SMDK5420 eval board.
|
||||
- "samsung,tm2" - for Exynos5433-based Samsung TM2 board.
|
||||
- "samsung,tm2e" - for Exynos5433-based Samsung TM2E board.
|
||||
- "samsung,sd5v1" - for Exynos5440-based Samsung board.
|
||||
- "samsung,ssdk5440" - for Exynos5440-based Samsung board.
|
||||
|
||||
* Other companies Exynos SoC based
|
||||
* FriendlyARM
|
||||
|
@ -21,6 +21,8 @@ SoCs:
|
||||
compatible = "renesas,r8a7744"
|
||||
- RZ/G1E (R8A77450)
|
||||
compatible = "renesas,r8a7745"
|
||||
- RZ/G1C (R8A77470)
|
||||
compatible = "renesas,r8a77470"
|
||||
- R-Car M1A (R8A77781)
|
||||
compatible = "renesas,r8a7778"
|
||||
- R-Car H1 (R8A77790)
|
||||
@ -45,6 +47,8 @@ SoCs:
|
||||
compatible = "renesas,r8a77970"
|
||||
- R-Car V3H (R8A77980)
|
||||
compatible = "renesas,r8a77980"
|
||||
- R-Car E3 (R8A77990)
|
||||
compatible = "renesas,r8a77990"
|
||||
- R-Car D3 (R8A77995)
|
||||
compatible = "renesas,r8a77995"
|
||||
|
||||
@ -67,6 +71,8 @@ Boards:
|
||||
compatible = "renesas,draak", "renesas,r8a77995"
|
||||
- Eagle (RTP0RC77970SEB0010S)
|
||||
compatible = "renesas,eagle", "renesas,r8a77970"
|
||||
- Ebisu (RTP0RC77990SEB0010S)
|
||||
compatible = "renesas,ebisu", "renesas,r8a77990"
|
||||
- Genmai (RTK772100BC00000BR)
|
||||
compatible = "renesas,genmai", "renesas,r7s72100"
|
||||
- GR-Peach (X28A-M01-E/F)
|
||||
@ -78,6 +84,8 @@ Boards:
|
||||
compatible = "renesas,h3ulcb", "renesas,r8a7795"
|
||||
- Henninger
|
||||
compatible = "renesas,henninger", "renesas,r8a7791"
|
||||
- iWave Systems RZ/G1C Single Board Computer (iW-RainboW-G23S)
|
||||
compatible = "iwave,g23s", "renesas,r8a77470"
|
||||
- iWave Systems RZ/G1E SODIMM SOM Development Platform (iW-RainboW-G22D)
|
||||
compatible = "iwave,g22d", "iwave,g22m", "renesas,r8a7745"
|
||||
- iWave Systems RZ/G1E SODIMM System On Module (iW-RainboW-G22M-SM)
|
||||
@ -108,7 +116,7 @@ Boards:
|
||||
compatible = "renesas,salvator-x", "renesas,r8a7795"
|
||||
- Salvator-X (RTP0RC7796SIPB0011S)
|
||||
compatible = "renesas,salvator-x", "renesas,r8a7796"
|
||||
- Salvator-X (RTP0RC7796SIPB0011S (M3N))
|
||||
- Salvator-X (RTP0RC7796SIPB0011S (M3-N))
|
||||
compatible = "renesas,salvator-x", "renesas,r8a77965"
|
||||
- Salvator-XS (Salvator-X 2nd version, RTP0RC7795SIPB0012S)
|
||||
compatible = "renesas,salvator-xs", "renesas,r8a7795"
|
||||
@ -124,6 +132,8 @@ Boards:
|
||||
compatible = "renesas,sk-rzg1m", "renesas,r8a7743"
|
||||
- Stout (ADAS Starterkit, Y-R-CAR-ADAS-SKH2-BOARD)
|
||||
compatible = "renesas,stout", "renesas,r8a7790"
|
||||
- V3HSK (Y-ASK-RCAR-V3H-WS10)
|
||||
compatible = "renesas,v3hsk", "renesas,r8a77980"
|
||||
- V3MSK (Y-ASK-RCAR-V3M-WS10)
|
||||
compatible = "renesas,v3msk", "renesas,r8a77970"
|
||||
- Wheat (RTP0RC7792ASKB0000JE)
|
||||
|
14
Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
Normal file
@ -0,0 +1,14 @@
|
||||
STMicroelectronics STM32 Platforms System Controller
|
||||
|
||||
Properties:
|
||||
- compatible : should contain two values. First value must be :
|
||||
- " st,stm32mp157-syscfg " - for stm32mp157 based SoCs,
|
||||
second value must be always "syscon".
|
||||
- reg : offset and length of the register set.
|
||||
|
||||
Example:
|
||||
syscfg: syscon@50020000 {
|
||||
compatible = "st,stm32mp157-syscfg", "syscon";
|
||||
reg = <0x50020000 0x400>;
|
||||
};
|
||||
|
@ -1,16 +0,0 @@
|
||||
NVIDIA Tegra20 MC(Memory Controller)
|
||||
|
||||
Required properties:
|
||||
- compatible : "nvidia,tegra20-mc"
|
||||
- reg : Should contain 2 register ranges(address and length); see the
|
||||
example below. Note that the MC registers are interleaved with the
|
||||
GART registers, and hence must be represented as multiple ranges.
|
||||
- interrupts : Should contain MC General interrupt.
|
||||
|
||||
Example:
|
||||
memory-controller@7000f000 {
|
||||
compatible = "nvidia,tegra20-mc";
|
||||
reg = <0x7000f000 0x024
|
||||
0x7000f03c 0x3c4>;
|
||||
interrupts = <0 77 0x04>;
|
||||
};
|
@ -1,18 +0,0 @@
|
||||
NVIDIA Tegra30 MC(Memory Controller)
|
||||
|
||||
Required properties:
|
||||
- compatible : "nvidia,tegra30-mc"
|
||||
- reg : Should contain 4 register ranges(address and length); see the
|
||||
example below. Note that the MC registers are interleaved with the
|
||||
SMMU registers, and hence must be represented as multiple ranges.
|
||||
- interrupts : Should contain MC General interrupt.
|
||||
|
||||
Example:
|
||||
memory-controller {
|
||||
compatible = "nvidia,tegra30-mc";
|
||||
reg = <0x7000f000 0x010
|
||||
0x7000f03c 0x1b4
|
||||
0x7000f200 0x028
|
||||
0x7000f284 0x17c>;
|
||||
interrupts = <0 77 0x04>;
|
||||
};
|
@ -26,7 +26,7 @@ interrupt-controller:
|
||||
see binding for interrupt-controller/arm,gic.txt
|
||||
|
||||
timer:
|
||||
see binding for arm/twd.txt
|
||||
see binding for timer/arm,twd.txt
|
||||
|
||||
clocks:
|
||||
see binding for clocks/ux500.txt
|
||||
|
@ -30,7 +30,6 @@ compatible:
|
||||
Optional properties:
|
||||
- dma-coherent : Present if dma operations are coherent
|
||||
- clocks : a list of phandle + clock specifier pairs
|
||||
- resets : a list of phandle + reset specifier pairs
|
||||
- target-supply : regulator for SATA target power
|
||||
- phys : reference to the SATA PHY node
|
||||
- phy-names : must be "sata-phy"
|
||||
|
@ -79,7 +79,11 @@ Optional properties:
|
||||
mode as for example omap4 L4_CFG_CLKCTRL
|
||||
|
||||
- clock-names should contain at least "fck", and optionally also "ick"
|
||||
depending on the SoC and the interconnect target module
|
||||
depending on the SoC and the interconnect target module,
|
||||
some interconnect target modules also need additional
|
||||
optional clocks that can be specified as listed in TRM
|
||||
for the related CLKCTRL register bits 8 to 15 such as
|
||||
"dbclk" or "clk32k" depending on their role
|
||||
|
||||
- ti,hwmods optional TI interconnect module name to use legacy
|
||||
hwmod platform data
|
||||
|
47
Documentation/devicetree/bindings/clock/actions,s900-cmu.txt
Normal file
@ -0,0 +1,47 @@
|
||||
* Actions S900 Clock Management Unit (CMU)
|
||||
|
||||
The Actions S900 clock management unit generates and supplies clock to various
|
||||
controllers within the SoC. The clock binding described here is applicable to
|
||||
S900 SoC.
|
||||
|
||||
Required Properties:
|
||||
|
||||
- compatible: should be "actions,s900-cmu"
|
||||
- reg: physical base address of the controller and length of memory mapped
|
||||
region.
|
||||
- clocks: Reference to the parent clocks ("hosc", "losc")
|
||||
- #clock-cells: should be 1.
|
||||
|
||||
Each clock is assigned an identifier, and client nodes can use this identifier
|
||||
to specify the clock which they consume.
|
||||
|
||||
All available clocks are defined as preprocessor macros in
|
||||
dt-bindings/clock/actions,s900-cmu.h header and can be used in device
|
||||
tree sources.
|
||||
|
||||
External clocks:
|
||||
|
||||
The hosc clock used as input for the plls is generated outside the SoC. It is
|
||||
expected that it is defined using standard clock bindings as "hosc".
|
||||
|
||||
Actions S900 CMU also requires one more clock:
|
||||
- "losc" - internal low frequency oscillator
|
||||
|
||||
Example: Clock Management Unit node:
|
||||
|
||||
cmu: clock-controller@e0160000 {
|
||||
compatible = "actions,s900-cmu";
|
||||
reg = <0x0 0xe0160000 0x0 0x1000>;
|
||||
clocks = <&hosc>, <&losc>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
Example: UART controller node that consumes clock generated by the clock
|
||||
management unit:
|
||||
|
||||
uart: serial@e012a000 {
|
||||
compatible = "actions,s900-uart", "actions,owl-uart";
|
||||
reg = <0x0 0xe012a000 0x0 0x2000>;
|
||||
interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>;
|
||||
clocks = <&cmu CLK_UART5>;
|
||||
};
|
@ -9,6 +9,7 @@ Required Properties:
|
||||
- GXBB (S905) : "amlogic,meson-gxbb-aoclkc"
|
||||
- GXL (S905X, S905D) : "amlogic,meson-gxl-aoclkc"
|
||||
- GXM (S912) : "amlogic,meson-gxm-aoclkc"
|
||||
- AXG (A113D, A113X) : "amlogic,meson-axg-aoclkc"
|
||||
followed by the common "amlogic,meson-gx-aoclkc"
|
||||
|
||||
- #clock-cells: should be 1.
|
||||
|
@ -10,9 +10,6 @@ Required Properties:
|
||||
"amlogic,gxl-clkc" for GXL and GXM SoC,
|
||||
"amlogic,axg-clkc" for AXG SoC.
|
||||
|
||||
- reg: physical base address of the clock controller and length of memory
|
||||
mapped region.
|
||||
|
||||
- #clock-cells: should be 1.
|
||||
|
||||
Each clock is assigned an identifier and client nodes can use this identifier
|
||||
@ -20,13 +17,22 @@ to specify the clock which they consume. All available clocks are defined as
|
||||
preprocessor macros in the dt-bindings/clock/gxbb-clkc.h header and can be
|
||||
used in device tree sources.
|
||||
|
||||
Parent node should have the following properties :
|
||||
- compatible: "syscon", "simple-mfd, and "amlogic,meson-gx-hhi-sysctrl" or
|
||||
"amlogic,meson-axg-hhi-sysctrl"
|
||||
- reg: base address and size of the HHI system control register space.
|
||||
|
||||
Example: Clock controller node:
|
||||
|
||||
clkc: clock-controller@c883c000 {
|
||||
sysctrl: system-controller@0 {
|
||||
compatible = "amlogic,meson-gx-hhi-sysctrl", "syscon", "simple-mfd";
|
||||
reg = <0 0 0 0x400>;
|
||||
|
||||
clkc: clock-controller {
|
||||
#clock-cells = <1>;
|
||||
compatible = "amlogic,gxbb-clkc";
|
||||
reg = <0x0 0xc883c000 0x0 0x3db>;
|
||||
};
|
||||
};
|
||||
|
||||
Example: UART controller node that consumes the clock generated by the clock
|
||||
controller:
|
||||
|
@ -276,36 +276,38 @@ These clock IDs are defined in:
|
||||
clk_ts_500_ref genpll2 2 BCM_SR_GENPLL2_TS_500_REF_CLK
|
||||
clk_125_nitro genpll2 3 BCM_SR_GENPLL2_125_NITRO_CLK
|
||||
clk_chimp genpll2 4 BCM_SR_GENPLL2_CHIMP_CLK
|
||||
clk_nic_flash genpll2 5 BCM_SR_GENPLL2_NIC_FLASH
|
||||
clk_nic_flash genpll2 5 BCM_SR_GENPLL2_NIC_FLASH_CLK
|
||||
clk_fs genpll2 6 BCM_SR_GENPLL2_FS_CLK
|
||||
|
||||
genpll3 crystal 0 BCM_SR_GENPLL3
|
||||
clk_hsls genpll3 1 BCM_SR_GENPLL3_HSLS_CLK
|
||||
clk_sdio genpll3 2 BCM_SR_GENPLL3_SDIO_CLK
|
||||
|
||||
genpll4 crystal 0 BCM_SR_GENPLL4
|
||||
ccn genpll4 1 BCM_SR_GENPLL4_CCN_CLK
|
||||
clk_ccn genpll4 1 BCM_SR_GENPLL4_CCN_CLK
|
||||
clk_tpiu_pll genpll4 2 BCM_SR_GENPLL4_TPIU_PLL_CLK
|
||||
noc_clk genpll4 3 BCM_SR_GENPLL4_NOC_CLK
|
||||
clk_noc genpll4 3 BCM_SR_GENPLL4_NOC_CLK
|
||||
clk_chclk_fs4 genpll4 4 BCM_SR_GENPLL4_CHCLK_FS4_CLK
|
||||
clk_bridge_fscpu genpll4 5 BCM_SR_GENPLL4_BRIDGE_FSCPU_CLK
|
||||
|
||||
|
||||
genpll5 crystal 0 BCM_SR_GENPLL5
|
||||
fs4_hf_clk genpll5 1 BCM_SR_GENPLL5_FS4_HF_CLK
|
||||
crypto_ae_clk genpll5 2 BCM_SR_GENPLL5_CRYPTO_AE_CLK
|
||||
raid_ae_clk genpll5 3 BCM_SR_GENPLL5_RAID_AE_CLK
|
||||
clk_fs4_hf genpll5 1 BCM_SR_GENPLL5_FS4_HF_CLK
|
||||
clk_crypto_ae genpll5 2 BCM_SR_GENPLL5_CRYPTO_AE_CLK
|
||||
clk_raid_ae genpll5 3 BCM_SR_GENPLL5_RAID_AE_CLK
|
||||
|
||||
genpll6 crystal 0 BCM_SR_GENPLL6
|
||||
48_usb genpll6 1 BCM_SR_GENPLL6_48_USB_CLK
|
||||
clk_48_usb genpll6 1 BCM_SR_GENPLL6_48_USB_CLK
|
||||
|
||||
lcpll0 crystal 0 BCM_SR_LCPLL0
|
||||
clk_sata_refp lcpll0 1 BCM_SR_LCPLL0_SATA_REFP_CLK
|
||||
clk_sata_refn lcpll0 2 BCM_SR_LCPLL0_SATA_REFN_CLK
|
||||
clk_usb_ref lcpll0 3 BCM_SR_LCPLL0_USB_REF_CLK
|
||||
sata_refpn lcpll0 3 BCM_SR_LCPLL0_SATA_REFPN_CLK
|
||||
clk_sata_350 lcpll0 3 BCM_SR_LCPLL0_SATA_350_CLK
|
||||
clk_sata_500 lcpll0 4 BCM_SR_LCPLL0_SATA_500_CLK
|
||||
|
||||
lcpll1 crystal 0 BCM_SR_LCPLL1
|
||||
wan lcpll1 1 BCM_SR_LCPLL0_WAN_CLK
|
||||
clk_wan lcpll1 1 BCM_SR_LCPLL1_WAN_CLK
|
||||
clk_usb_ref lcpll1 2 BCM_SR_LCPLL1_USB_REF_CLK
|
||||
clk_crmu_ts lcpll1 3 BCM_SR_LCPLL1_CRMU_TS_CLK
|
||||
|
||||
lcpll_pcie crystal 0 BCM_SR_LCPLL_PCIE
|
||||
pcie_phy_ref lcpll1 1 BCM_SR_LCPLL_PCIE_PHY_REF_CLK
|
||||
clk_pcie_phy_ref lcpll1 1 BCM_SR_LCPLL_PCIE_PHY_REF_CLK
|
||||
|
100
Documentation/devicetree/bindings/clock/nuvoton,npcm750-clk.txt
Normal file
@ -0,0 +1,100 @@
|
||||
* Nuvoton NPCM7XX Clock Controller
|
||||
|
||||
Nuvoton Poleg BMC NPCM7XX contains an integrated clock controller, which
|
||||
generates and supplies clocks to all modules within the BMC.
|
||||
|
||||
External clocks:
|
||||
|
||||
There are six fixed clocks that are generated outside the BMC. All clocks are of
|
||||
a known fixed value that cannot be changed. clk_refclk, clk_mcbypck and
|
||||
clk_sysbypck are inputs to the clock controller.
|
||||
clk_rg1refck, clk_rg2refck and clk_xin are external clocks supplying the
|
||||
network. They are set in the device tree, but not used by the clock module. The
|
||||
network devices use them directly.
|
||||
Example can be found below.
|
||||
|
||||
All available clocks are defined as preprocessor macros in:
|
||||
dt-bindings/clock/nuvoton,npcm7xx-clock.h
|
||||
and can be reused as DT sources.
|
||||
|
||||
Required Properties of clock controller:
|
||||
|
||||
- compatible: "nuvoton,npcm750-clk" : for clock controller of Nuvoton
|
||||
Poleg BMC NPCM750
|
||||
|
||||
- reg: physical base address of the clock controller and length of
|
||||
memory mapped region.
|
||||
|
||||
- #clock-cells: should be 1.
|
||||
|
||||
Example: Clock controller node:

	clk: clock-controller@f0801000 {
		compatible = "nuvoton,npcm750-clk";
		#clock-cells = <1>;
		reg = <0xf0801000 0x1000>;
		clock-names = "refclk", "sysbypck", "mcbypck";
		clocks = <&clk_refclk>, <&clk_sysbypck>, <&clk_mcbypck>;
	};

Example: Required external clocks for network:

	/* external reference clock */
	clk_refclk: clk-refclk {
		compatible = "fixed-clock";
		#clock-cells = <0>;
		clock-frequency = <25000000>;
		clock-output-names = "refclk";
	};

	/* external reference clock for CPU; floats in normal operation */
	clk_sysbypck: clk-sysbypck {
		compatible = "fixed-clock";
		#clock-cells = <0>;
		clock-frequency = <800000000>;
		clock-output-names = "sysbypck";
	};

	/* external reference clock for MC; floats in normal operation */
	clk_mcbypck: clk-mcbypck {
		compatible = "fixed-clock";
		#clock-cells = <0>;
		clock-frequency = <800000000>;
		clock-output-names = "mcbypck";
	};

	/* external clock signal rg1refck, supplied by the PHY */
	clk_rg1refck: clk-rg1refck {
		compatible = "fixed-clock";
		#clock-cells = <0>;
		clock-frequency = <125000000>;
		clock-output-names = "clk_rg1refck";
	};

	/* external clock signal rg2refck, supplied by the PHY */
	clk_rg2refck: clk-rg2refck {
		compatible = "fixed-clock";
		#clock-cells = <0>;
		clock-frequency = <125000000>;
		clock-output-names = "clk_rg2refck";
	};

	clk_xin: clk-xin {
		compatible = "fixed-clock";
		#clock-cells = <0>;
		clock-frequency = <50000000>;
		clock-output-names = "clk_xin";
	};


Example: GMAC controller node that consumes two clocks: a clock generated by
the clock controller and a fixed clock from DT (clk_rg1refck).

	ethernet0: ethernet@f0802000 {
		compatible = "snps,dwmac";
		reg = <0xf0802000 0x2000>;
		interrupts = <0 14 4>;
		interrupt-names = "macirq";
		clocks = <&clk_rg1refck>, <&clk NPCM7XX_CLK_AHB>;
		clock-names = "stmmaceth", "clk_gmac";
	};
@ -17,7 +17,9 @@ Required properties :
			"qcom,gcc-msm8974pro-ac"
			"qcom,gcc-msm8994"
			"qcom,gcc-msm8996"
			"qcom,gcc-msm8998"
			"qcom,gcc-mdm9615"
			"qcom,gcc-sdm845"

- reg : shall contain base register location and length
- #clock-cells : shall contain 1
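A hedged sketch, not taken from the binding text: a GCC node for the newly
listed "qcom,gcc-sdm845" compatible might look like the node below. The unit
address and register size are assumptions, and the full binding may require
further cells (e.g. #reset-cells) that this hunk does not show:

	gcc: clock-controller@100000 {
		compatible = "qcom,gcc-sdm845";
		/* base address and size are placeholders for illustration */
		reg = <0x100000 0x1f0000>;
		#clock-cells = <1>;
	};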
22	Documentation/devicetree/bindings/clock/qcom,rpmh-clk.txt	Normal file
@ -0,0 +1,22 @@
Qualcomm Technologies, Inc. RPMh Clocks
-------------------------------------------------------

Resource Power Manager Hardened (RPMh) manages shared resources on
some Qualcomm Technologies Inc. SoCs. It accepts clock requests from
other hardware subsystems via RSC to control clocks.

Required properties :
- compatible : shall contain "qcom,sdm845-rpmh-clk"

- #clock-cells : must contain 1

Example :

#include <dt-bindings/clock/qcom,rpmh.h>

	&apps_rsc {
		rpmhcc: clock-controller {
			compatible = "qcom,sdm845-rpmh-clk";
			#clock-cells = <1>;
		};
	};
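As an illustration beyond the binding text above, a consumer would reference
an RPMh clock through the rpmhcc node with a one-cell specifier. The consumer
node and the RPMH_CXO_CLK macro choice are assumptions; the macro is expected
to come from the qcom,rpmh.h header included in the example:

	serial@a84000 {
		/* hypothetical consumer, shown only to illustrate the specifier */
		compatible = "vendor,example-uart";
		clocks = <&rpmhcc RPMH_CXO_CLK>;
		clock-names = "xo";
	};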
19	Documentation/devicetree/bindings/clock/qcom,videocc.txt	Normal file
@ -0,0 +1,19 @@
Qualcomm Video Clock & Reset Controller Binding
-----------------------------------------------

Required properties :
- compatible : shall contain "qcom,sdm845-videocc"
- reg : shall contain base register location and length
- #clock-cells : from common clock binding, shall contain 1.
- #power-domain-cells : from generic power domain binding, shall contain 1.

Optional properties :
- #reset-cells : from common reset binding, shall contain 1.

Example:
	videocc: clock-controller@ab00000 {
		compatible = "qcom,sdm845-videocc";
		reg = <0xab00000 0x10000>;
		#clock-cells = <1>;
		#power-domain-cells = <1>;
	};
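For illustration only (not part of the binding): a video codec node might
consume both a clock and a power domain from the videocc node. The consumer
node, its compatible and the raw indices are placeholders; real device trees
would use the macros from the SDM845 video clock controller header instead:

	video-codec@aa00000 {
		/* hypothetical consumer node */
		compatible = "vendor,example-venus";
		clocks = <&videocc 0>;		/* clock index 0 is a placeholder */
		clock-names = "core";
		power-domains = <&videocc 0>;	/* GDSC index 0 is a placeholder */
	};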
@ -15,6 +15,7 @@ Required Properties:
  - compatible: Must be one of:
      - "renesas,r8a7743-cpg-mssr" for the r8a7743 SoC (RZ/G1M)
      - "renesas,r8a7745-cpg-mssr" for the r8a7745 SoC (RZ/G1E)
      - "renesas,r8a77470-cpg-mssr" for the r8a77470 SoC (RZ/G1C)
      - "renesas,r8a7790-cpg-mssr" for the r8a7790 SoC (R-Car H2)
      - "renesas,r8a7791-cpg-mssr" for the r8a7791 SoC (R-Car M2-W)
      - "renesas,r8a7792-cpg-mssr" for the r8a7792 SoC (R-Car V2H)
@ -25,6 +26,7 @@ Required Properties:
      - "renesas,r8a77965-cpg-mssr" for the r8a77965 SoC (R-Car M3-N)
      - "renesas,r8a77970-cpg-mssr" for the r8a77970 SoC (R-Car V3M)
      - "renesas,r8a77980-cpg-mssr" for the r8a77980 SoC (R-Car V3H)
      - "renesas,r8a77990-cpg-mssr" for the r8a77990 SoC (R-Car E3)
      - "renesas,r8a77995-cpg-mssr" for the r8a77995 SoC (R-Car D3)

  - reg: Base address and length of the memory resource used by the CPG/MSSR
@ -33,10 +35,12 @@ Required Properties:
  - clocks: References to external parent clocks, one entry for each entry in
    clock-names
  - clock-names: List of external parent clock names. Valid names are:
      - "extal" (r8a7743, r8a7745, r8a7790, r8a7791, r8a7792, r8a7793, r8a7794,
                 r8a7795, r8a7796, r8a77965, r8a77970, r8a77980, r8a77995)
      - "extal" (r8a7743, r8a7745, r8a77470, r8a7790, r8a7791, r8a7792,
                 r8a7793, r8a7794, r8a7795, r8a7796, r8a77965, r8a77970,
                 r8a77980, r8a77990, r8a77995)
      - "extalr" (r8a7795, r8a7796, r8a77965, r8a77970, r8a77980)
      - "usb_extal" (r8a7743, r8a7745, r8a7790, r8a7791, r8a7793, r8a7794)
      - "usb_extal" (r8a7743, r8a7745, r8a77470, r8a7790, r8a7791, r8a7793,
                     r8a7794)

  - #clock-cells: Must be 2
    - For CPG core clocks, the two clock specifier cells must be "CPG_CORE"
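A minimal sketch of the two-cell specifier described above (not taken from
this hunk): the first cell selects the clock class, the second the core-clock
index or module number. The consumer node, the CPG_MOD macro name and the
numbers used here are assumptions for illustration only:

	device@e6000000 {
		/* hypothetical consumer of one module clock and one core clock */
		compatible = "vendor,example-device";
		clocks = <&cpg CPG_MOD 310>,	/* module number is a placeholder */
			 <&cpg CPG_CORE 0>;	/* core clock index is a placeholder */
		clock-names = "fck", "core";
	};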
@ -1,77 +0,0 @@
Device Tree Clock bindings for arch-rockchip

This binding uses the common clock binding[1].

[1] Documentation/devicetree/bindings/clock/clock-bindings.txt

== Gate clocks ==

These bindings are deprecated!
Please use the SoC-specific CRU bindings instead.

The gate registers form a continuous block, which makes the DT node
structure a matter of taste: either all gates can be put into one gate
clock spanning all registers, or they can be divided into the 10
individual gates containing 16 clocks each.
The code supports both approaches.

Required properties:
- compatible : "rockchip,rk2928-gate-clk"
- reg : shall be the control register address(es) for the clock.
- #clock-cells : from common clock binding; shall be set to 1
- clock-output-names : the corresponding gate names that the clock controls
- clocks : should contain the parent clock for each individual gate,
  therefore the number of clocks elements should match the number of
  clock-output-names

Example using multiple gate clocks:

	clk_gates0: gate-clk@200000d0 {
		compatible = "rockchip,rk2928-gate-clk";
		reg = <0x200000d0 0x4>;
		clocks = <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>,
			 <&dummy>, <&dummy>;

		clock-output-names =
			"gate_core_periph", "gate_cpu_gpll",
			"gate_ddrphy", "gate_aclk_cpu",
			"gate_hclk_cpu", "gate_pclk_cpu",
			"gate_atclk_cpu", "gate_i2s0",
			"gate_i2s0_frac", "gate_i2s1",
			"gate_i2s1_frac", "gate_i2s2",
			"gate_i2s2_frac", "gate_spdif",
			"gate_spdif_frac", "gate_testclk";

		#clock-cells = <1>;
	};

	clk_gates1: gate-clk@200000d4 {
		compatible = "rockchip,rk2928-gate-clk";
		reg = <0x200000d4 0x4>;
		clocks = <&xin24m>, <&xin24m>,
			 <&xin24m>, <&dummy>,
			 <&dummy>, <&xin24m>,
			 <&xin24m>, <&dummy>,
			 <&xin24m>, <&dummy>,
			 <&xin24m>, <&dummy>,
			 <&xin24m>, <&dummy>,
			 <&xin24m>, <&dummy>;

		clock-output-names =
			"gate_timer0", "gate_timer1",
			"gate_timer2", "gate_jtag",
			"gate_aclk_lcdc1_src", "gate_otgphy0",
			"gate_otgphy1", "gate_ddr_gpll",
			"gate_uart0", "gate_frac_uart0",
			"gate_uart1", "gate_frac_uart1",
			"gate_uart2", "gate_frac_uart2",
			"gate_uart3", "gate_frac_uart3";

		#clock-cells = <1>;
	};
@ -31,10 +31,10 @@ This binding uses the common clock binding[1].
Each subnode should use the binding described in [2]..[7]

[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
[3] Documentation/devicetree/bindings/clock/st,clkgen-mux.txt
[4] Documentation/devicetree/bindings/clock/st,clkgen-pll.txt
[7] Documentation/devicetree/bindings/clock/st,quadfs.txt
[8] Documentation/devicetree/bindings/clock/st,flexgen.txt
[3] Documentation/devicetree/bindings/clock/st/st,clkgen-mux.txt
[4] Documentation/devicetree/bindings/clock/st/st,clkgen-pll.txt
[7] Documentation/devicetree/bindings/clock/st/st,quadfs.txt
[8] Documentation/devicetree/bindings/clock/st/st,flexgen.txt


Required properties: