==============================
Memory Layout on AArch64 Linux
==============================

Author: Catalin Marinas <catalin.marinas@arm.com>

This document describes the virtual memory layout used by the AArch64
Linux kernel. The architecture allows up to 4 levels of translation
tables with a 4KB page size and up to 3 levels with a 64KB page size.

AArch64 Linux uses either 3 or 4 levels of translation tables with the
4KB page configuration, allowing 39-bit (512GB) or 48-bit (256TB)
virtual addresses, respectively, for both user and kernel. With 64KB
pages, only 2 levels of translation tables are used, allowing 42-bit
(4TB) virtual addresses, but the memory layout is the same.

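The address widths quoted above follow from the table geometry: each
table occupies one page of 8-byte descriptors, so every level resolves
(page shift - 3) bits on top of the in-page offset. A minimal sketch of
that arithmetic is shown here; the helper name ``va_bits_for`` is
hypothetical and is not a kernel function.

.. code-block:: c

   #include <stdint.h>

   /*
    * Hypothetical helper, not a kernel API: width of the virtual address
    * covered by `levels` fully populated translation levels.
    * One table fills one page of 8-byte descriptors, so each level
    * resolves (page_shift - 3) bits.
    */
   static inline unsigned int va_bits_for(unsigned int page_shift,
                                          unsigned int levels)
   {
       return page_shift + levels * (page_shift - 3);
   }

``va_bits_for(12, 3)`` gives 39, ``va_bits_for(12, 4)`` gives 48 and
``va_bits_for(16, 2)`` gives 42. Three full levels with 64KB pages would
cover 55 bits, which is why the first level is only partially populated
in the 48-bit and 52-bit 64KB configurations.
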
ARMv8.2 adds optional support for Large Virtual Address space. This is
only available when running with a 64KB page size and expands the
number of descriptors in the first level of translation.

User addresses have bits 63:48 set to 0 while the kernel addresses have
the same bits set to 1. TTBRx selection is given by bit 63 of the
virtual address. The swapper_pg_dir contains only kernel (global)
mappings while the user pgd contains only user (non-global) mappings.
The swapper_pg_dir address is written to TTBR1 and never written to
TTBR0.

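As a small illustration of the rule above for the 48-bit configuration,
the following sketch classifies an address by its top bits. The helper
names are hypothetical and exist only for this example.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Bits 63:48 of a virtual address (48-bit configuration). */
   #define VA_TOP_MASK 0xffff000000000000ULL

   /* Hypothetical helpers for illustration; not kernel functions. */
   static inline bool is_user_va(uint64_t va)
   {
       return (va & VA_TOP_MASK) == 0;
   }

   static inline bool is_kernel_va(uint64_t va)
   {
       return (va & VA_TOP_MASK) == VA_TOP_MASK;
   }

   /* Bit 63 selects the translation table base register: 0 or 1. */
   static inline unsigned int ttbr_index(uint64_t va)
   {
       return (unsigned int)(va >> 63);
   }

An address such as ``0x0001000000000000`` is neither a user nor a kernel
address under this rule, because bits 63:48 are mixed.
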
AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::

  Start             End               Size   Use
  -----------------------------------------------------------------------
  0000000000000000  0000ffffffffffff  256TB  user
  ffff000000000000  ffff7fffffffffff  128TB  kernel logical memory map
 [ffff600000000000  ffff7fffffffffff]  32TB  [kasan shadow region]
  ffff800000000000  ffff80007fffffff    2GB  modules
  ffff800080000000  fffffbffefffffff  124TB  vmalloc
  fffffbfff0000000  fffffbfffdffffff  224MB  fixed mappings (top down)
  fffffbfffe000000  fffffbfffe7fffff    8MB  [guard region]
  fffffbfffe800000  fffffbffff7fffff   16MB  PCI I/O space
  fffffbffff800000  fffffbffffffffff    8MB  [guard region]
  fffffc0000000000  fffffdffffffffff    2TB  vmemmap
  fffffe0000000000  ffffffffffffffff    2TB  [guard region]

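For debugging it can be handy to map an address back to a row of the
48-bit table. The sketch below copies the rows into a lookup array; the
structure and function names are hypothetical, and the bracketed kasan
shadow row is left out because it is nested inside the kernel logical
memory map.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   struct va_region {
       uint64_t start;
       uint64_t end;       /* inclusive, as in the table */
       const char *use;
   };

   /* Rows copied from the 4KB pages + 4 levels (48-bit) table. */
   static const struct va_region layout_48bit[] = {
       { 0x0000000000000000ULL, 0x0000ffffffffffffULL, "user" },
       { 0xffff000000000000ULL, 0xffff7fffffffffffULL, "kernel logical memory map" },
       { 0xffff800000000000ULL, 0xffff80007fffffffULL, "modules" },
       { 0xffff800080000000ULL, 0xfffffbffefffffffULL, "vmalloc" },
       { 0xfffffbfff0000000ULL, 0xfffffbfffdffffffULL, "fixed mappings" },
       { 0xfffffbfffe000000ULL, 0xfffffbfffe7fffffULL, "guard region" },
       { 0xfffffbfffe800000ULL, 0xfffffbffff7fffffULL, "PCI I/O space" },
       { 0xfffffbffff800000ULL, 0xfffffbffffffffffULL, "guard region" },
       { 0xfffffc0000000000ULL, 0xfffffdffffffffffULL, "vmemmap" },
       { 0xfffffe0000000000ULL, 0xffffffffffffffffULL, "guard region" },
   };

   /* Returns the "Use" label, or NULL if no row covers the address. */
   static inline const char *va_region_name(uint64_t va)
   {
       size_t i;

       for (i = 0; i < sizeof(layout_48bit) / sizeof(layout_48bit[0]); i++) {
           if (va >= layout_48bit[i].start && va <= layout_48bit[i].end)
               return layout_48bit[i].use;
       }
       return NULL;
   }

Addresses between the user range and the kernel range match no row, so
the function returns ``NULL`` for them.
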
AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support)::

  Start             End               Size   Use
  -----------------------------------------------------------------------
  0000000000000000  000fffffffffffff    4PB  user
  fff0000000000000  ffff7fffffffffff   ~4PB  kernel logical memory map
 [fffd800000000000  ffff7fffffffffff] 512TB  [kasan shadow region]
  ffff800000000000  ffff80007fffffff    2GB  modules
  ffff800080000000  fffffbffefffffff  124TB  vmalloc
  fffffbfff0000000  fffffbfffdffffff  224MB  fixed mappings (top down)
  fffffbfffe000000  fffffbfffe7fffff    8MB  [guard region]
  fffffbfffe800000  fffffbffff7fffff   16MB  PCI I/O space
  fffffbffff800000  fffffbffffffffff    8MB  [guard region]
  fffffc0000000000  ffffffdfffffffff   ~4TB  vmemmap
  ffffffe000000000  ffffffffffffffff  128GB  [guard region]

Translation table lookup with 4KB pages::

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |         |         |         |         |
   |                 |         |         |         |         v
   |                 |         |         |         |   [11:0]  in-page offset
   |                 |         |         |         +-> [20:12] L3 index
   |                 |         |         +-----------> [29:21] L2 index
   |                 |         +---------------------> [38:30] L1 index
   |                 +-------------------------------> [47:39] L0 index
   +-------------------------------------------------> [63] TTBR0/1

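The bit fields in the 4KB diagram can be extracted with shifts and
masks. This is only a sketch of the field layout; the structure and
function names are hypothetical, and a real table walk is performed by
the MMU.

.. code-block:: c

   #include <stdint.h>

   struct va_indices_4k {
       unsigned int l0, l1, l2, l3;   /* table indices, 9 bits each */
       unsigned int offset;           /* in-page offset, 12 bits */
   };

   /* Split a virtual address according to the 4KB, 4-level layout. */
   static inline struct va_indices_4k split_va_4k(uint64_t va)
   {
       struct va_indices_4k idx = {
           .l0 = (va >> 39) & 0x1ff,
           .l1 = (va >> 30) & 0x1ff,
           .l2 = (va >> 21) & 0x1ff,
           .l3 = (va >> 12) & 0x1ff,
           .offset = va & 0xfff,
       };
       return idx;
   }

For ``0xffff800080001234`` this yields L0 index 256, L1 index 2,
L2 index 0, L3 index 1 and an in-page offset of ``0x234``.
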
Translation table lookup with 64KB pages::

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |    |               |              |
   |                 |    |               |              v
   |                 |    |               |            [15:0]  in-page offset
   |                 |    |               +----------> [28:16] L3 index
   |                 |    +--------------------------> [41:29] L2 index
   |                 +-------------------------------> [47:42] L1 index (48-bit)
   |                                                   [51:42] L1 index (52-bit)
   +-------------------------------------------------> [63] TTBR0/1

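The 64KB layout differs mainly in the width of the L1 index, which
depends on whether 48-bit or 52-bit addressing is in use. The sketch
below takes that width as a parameter; the names are hypothetical and
``va_bits`` is assumed to be either 48 or 52.

.. code-block:: c

   #include <stdint.h>

   struct va_indices_64k {
       unsigned int l1;       /* 6 bits (48-bit VA) or 10 bits (52-bit VA) */
       unsigned int l2, l3;   /* 13 bits each */
       unsigned int offset;   /* in-page offset, 16 bits */
   };

   /* Split a virtual address according to the 64KB, 3-level layout. */
   static inline struct va_indices_64k split_va_64k(uint64_t va,
                                                    unsigned int va_bits)
   {
       /* The L1 index spans bits [va_bits - 1:42]. */
       uint64_t l1_mask = (1ULL << (va_bits - 42)) - 1;
       struct va_indices_64k idx = {
           .l1 = (va >> 42) & l1_mask,
           .l2 = (va >> 29) & 0x1fff,
           .l3 = (va >> 16) & 0x1fff,
           .offset = va & 0xffff,
       };
       return idx;
   }

The same address therefore produces a different L1 index in the two
configurations: ``0xffff800000010000`` gives 32 with 48-bit addressing
and 992 with 52-bit addressing.
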
When using KVM without the Virtualization Host Extensions, the
hypervisor maps kernel pages in EL2 at a fixed (and potentially
random) offset from the linear mapping. See the kern_hyp_va macro and
kvm_update_va_mask function for more details. MMIO devices such as
GICv2 get mapped next to the HYP idmap page, as do vectors when
ARM64_SPECTRE_V3A is enabled for particular CPUs.

When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.

52-bit VA support in the kernel
-------------------------------
If the ARMv8.2-LVA optional feature is present, and we are running
with a 64KB page size, then it is possible to use 52 bits of address
space for both userspace and kernel addresses. However, any kernel
binary that supports 52-bit must also be able to fall back to 48-bit
at early boot time if the hardware feature is not present.

This fallback mechanism requires the kernel .text to be in the
higher addresses so that its addresses are invariant to 48/52-bit VAs.
Due to the kasan shadow being a fraction of the entire kernel VA space,
the end of the kasan shadow must also be in the higher half of the
kernel VA space for both 48/52-bit. (Switching from 48-bit to 52-bit,
the end of the kasan shadow is invariant and dependent on ~0UL,
whilst the start address will "grow" towards the lower addresses).

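The shadow rows in the layout tables earlier in this document show this
behaviour: the region ends at the same address in both tables while its
start moves down. The sizes listed there (32TB and 512TB) are consistent
with one shadow byte per eight bytes of kernel VA space. The sketch
below assumes that ratio and reads the end address off the tables; the
macro and function names are hypothetical.

.. code-block:: c

   #include <stdint.h>

   /* End of the shadow region (exclusive), as read off the layout tables. */
   #define SHADOW_END          0xffff800000000000ULL
   /* Assumption: one shadow byte covers 8 bytes, i.e. a shift of 3. */
   #define SHADOW_SCALE_SHIFT  3

   /* Shadow needed to cover a kernel VA space of 2^va_bits bytes. */
   static inline uint64_t shadow_size(unsigned int va_bits)
   {
       return 1ULL << (va_bits - SHADOW_SCALE_SHIFT);
   }

   /* The end is fixed, so a larger VA space moves the start downwards. */
   static inline uint64_t shadow_start(unsigned int va_bits)
   {
       return SHADOW_END - shadow_size(va_bits);
   }

``shadow_start(48)`` evaluates to ``0xffff600000000000`` and
``shadow_start(52)`` to ``0xfffd800000000000``, matching the start
addresses in the two tables.
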
In order to optimise phys_to_virt and virt_to_phys, the PAGE_OFFSET
is kept constant at 0xFFF0000000000000 (corresponding to 52-bit).
This obviates the need for an extra variable read. The physvirt
offset and vmemmap offsets are computed at early boot to enable
this logic.

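The constant itself is simply the lowest address of a kernel VA space of
the given width. The following sketch shows the arithmetic; the helper
name ``page_offset_for`` is hypothetical and is not the kernel's own
macro.

.. code-block:: c

   #include <stdint.h>

   /*
    * Hypothetical helper: lowest address of the top 2^va_bits bytes of
    * the 64-bit address space. Unsigned negation wraps, so the result
    * is 2^64 - 2^va_bits.
    */
   static inline uint64_t page_offset_for(unsigned int va_bits)
   {
       return -(1ULL << va_bits);
   }

``page_offset_for(52)`` is ``0xfff0000000000000``, the value quoted
above, while ``page_offset_for(48)`` is ``0xffff000000000000``, the
start of the kernel range in the 48-bit layout table.
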
As a single binary will need to support both 48-bit and 52-bit VA
spaces, the VMEMMAP must be sized large enough for 52-bit VAs and
also must be sized large enough to accommodate a fixed PAGE_OFFSET.

Most code in the kernel should not need to consider the VA_BITS. For
code that does need to know the VA size, the variables are
defined as follows:

VA_BITS		constant	the *maximum* VA space size

VA_BITS_MIN	constant	the *minimum* VA space size

vabits_actual	variable	the *actual* VA space size


Maximum and minimum sizes can be useful to ensure that buffers are
sized large enough or that addresses are positioned close enough for
the "worst" case.

52-bit userspace VAs
--------------------
To maintain compatibility with software that relies on the ARMv8.0
VA space maximum size of 48 bits, the kernel will, by default,
return virtual addresses to userspace from a 48-bit range.

Software can "opt-in" to receiving VAs from a 52-bit space by
specifying an mmap hint parameter that is larger than 48-bit.

For example:

.. code-block:: c

   maybe_high_address = mmap((void *)~0UL, size, prot, flags, ...);

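A slightly fuller userspace sketch of the same idea follows. It assumes
an anonymous private mapping, and the two helper names are hypothetical.
Because the hint is passed without ``MAP_FIXED`` it is only a request,
so the call is also valid on systems that cannot hand out 52-bit
addresses.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <sys/mman.h>

   /*
    * Request anonymous memory with a hint above the 48-bit range.
    * Returns MAP_FAILED on error, like mmap() itself.
    */
   static inline void *map_maybe_high(size_t size)
   {
       return mmap((void *)~0UL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   }

   /* True if the address lies outside the default 48-bit user range. */
   static inline bool is_above_48bit(const void *addr)
   {
       return ((uintptr_t)addr >> 48) != 0;
   }

A caller should compare the result against ``MAP_FAILED`` before passing
it to ``is_above_48bit()``, and should not assume that a high address
was actually returned.
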
It is also possible to build a debug kernel that returns addresses
from a 52-bit space by enabling the following kernel config options:

.. code-block:: sh

   CONFIG_EXPERT=y && CONFIG_ARM64_FORCE_52BIT=y

Note that this option is only intended for debugging applications
and should not be used in production.