mirror of
https://github.com/torvalds/linux.git
synced 2024-12-30 14:52:05 +00:00
Bugfixes and improvements to selftests. On top of this, Mauro converted the
KVM documentation to rst format, which was very welcome. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQEcBAABAgAGBQJeRYe2AAoJEL/70l94x66DSdsH/31rMoISGbPReQNKDZtondwr UNLYHBLau46OR5JiLaKDqRODi12CXwrB6KZyRtQg+PeU/aCaFqW09kDOtMdV4AMl 2bEl4VfUuClsGXBQL5WWsb+adaWxHBtRFPcaT0xMAE+JXxAvt2o7rstOydM4yB9f vM9TqEP+oVGyqx0FsuCND+zu/qZqNOOqJ1eZCrzqZKwFS8lrjsYtBG4K8rfnvXnI uGxFnbmZuJKpXpRgCebDdB98m0KEWfAs0S8aOO4NRfmymUBcjMeIeqQz0Mnt17xJ Fz0KktwdGL28qJ+fnE51i/IBqCoBJL2BmIesSiGKVr1alpHA1ngKvUmcbtHkAXw= =To92 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM fixes from Paolo Bonzini: "Bugfixes and improvements to selftests. On top of this, Mauro converted the KVM documentation to rst format, which was very welcome" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits) docs: virt: guest-halt-polling.txt convert to ReST docs: kvm: review-checklist.txt: rename to ReST docs: kvm: Convert timekeeping.txt to ReST format docs: kvm: Convert s390-diag.txt to ReST format docs: kvm: Convert ppc-pv.txt to ReST format docs: kvm: Convert nested-vmx.txt to ReST format docs: kvm: Convert mmu.txt to ReST format docs: kvm: Convert locking.txt to ReST format docs: kvm: Convert hypercalls.txt to ReST format docs: kvm: arm/psci.txt: convert to ReST docs: kvm: convert arm/hyp-abi.txt to ReST docs: kvm: Convert api.txt to ReST format docs: kvm: convert devices/xive.txt to ReST docs: kvm: convert devices/xics.txt to ReST docs: kvm: convert devices/vm.txt to ReST docs: kvm: convert devices/vfio.txt to ReST docs: kvm: convert devices/vcpu.txt to ReST docs: kvm: convert devices/s390_flic.txt to ReST docs: kvm: convert devices/mpic.txt to ReST docs: kvm: convert devices/arm-vgit.txt to ReST ...
This commit is contained in:
commit
44024adb4a
@ -1,9 +1,11 @@
|
||||
==================
|
||||
Guest halt polling
|
||||
==================
|
||||
|
||||
The cpuidle_haltpoll driver, with the haltpoll governor, allows
|
||||
the guest vcpus to poll for a specified amount of time before
|
||||
halting.
|
||||
|
||||
This provides the following benefits to host side polling:
|
||||
|
||||
1) The POLL flag is set while polling is performed, which allows
|
||||
@ -29,18 +31,21 @@ Module Parameters
|
||||
The haltpoll governor has 5 tunable module parameters:
|
||||
|
||||
1) guest_halt_poll_ns:
|
||||
|
||||
Maximum amount of time, in nanoseconds, that polling is
|
||||
performed before halting.
|
||||
|
||||
Default: 200000
|
||||
|
||||
2) guest_halt_poll_shrink:
|
||||
|
||||
Division factor used to shrink per-cpu guest_halt_poll_ns when
|
||||
wakeup event occurs after the global guest_halt_poll_ns.
|
||||
|
||||
Default: 2
|
||||
|
||||
3) guest_halt_poll_grow:
|
||||
|
||||
Multiplication factor used to grow per-cpu guest_halt_poll_ns
|
||||
when event occurs after per-cpu guest_halt_poll_ns
|
||||
but before global guest_halt_poll_ns.
|
||||
@ -48,6 +53,7 @@ but before global guest_halt_poll_ns.
|
||||
Default: 2
|
||||
|
||||
4) guest_halt_poll_grow_start:
|
||||
|
||||
The per-cpu guest_halt_poll_ns eventually reaches zero
|
||||
in case of an idle system. This value sets the initial
|
||||
per-cpu guest_halt_poll_ns when growing. This can
|
||||
@ -66,7 +72,7 @@ high once achieves global guest_halt_poll_ns value).
|
||||
|
||||
Default: Y
|
||||
|
||||
The module parameters can be set from the debugfs files in:
|
||||
The module parameters can be set from the debugfs files in::
|
||||
|
||||
/sys/module/haltpoll/parameters/
|
||||
|
||||
@ -74,5 +80,5 @@ Further Notes
|
||||
=============
|
||||
|
||||
- Care should be taken when setting the guest_halt_poll_ns parameter as a
|
||||
large value has the potential to drive the cpu usage to 100% on a machine which
|
||||
would be almost entirely idle otherwise.
|
||||
large value has the potential to drive the cpu usage to 100% on a machine
|
||||
which would be almost entirely idle otherwise.
|
@ -8,7 +8,9 @@ Linux Virtualization Support
|
||||
:maxdepth: 2
|
||||
|
||||
kvm/index
|
||||
uml/user_mode_linux
|
||||
paravirt_ops
|
||||
guest-halt-polling
|
||||
|
||||
.. only:: html and subproject
|
||||
|
||||
|
File diff suppressed because it is too large
Load Diff
@ -1,4 +1,8 @@
|
||||
* Internal ABI between the kernel and HYP
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======================================
|
||||
Internal ABI between the kernel and HYP
|
||||
=======================================
|
||||
|
||||
This file documents the interaction between the Linux kernel and the
|
||||
hypervisor layer when running Linux as a hypervisor (for example
|
||||
@ -19,25 +23,31 @@ and only act on individual CPUs.
|
||||
Unless specified otherwise, any built-in hypervisor must implement
|
||||
these functions (see arch/arm{,64}/include/asm/virt.h):
|
||||
|
||||
* r0/x0 = HVC_SET_VECTORS
|
||||
r1/x1 = vectors
|
||||
* ::
|
||||
|
||||
r0/x0 = HVC_SET_VECTORS
|
||||
r1/x1 = vectors
|
||||
|
||||
Set HVBAR/VBAR_EL2 to 'vectors' to enable a hypervisor. 'vectors'
|
||||
must be a physical address, and respect the alignment requirements
|
||||
of the architecture. Only implemented by the initial stubs, not by
|
||||
Linux hypervisors.
|
||||
|
||||
* r0/x0 = HVC_RESET_VECTORS
|
||||
* ::
|
||||
|
||||
r0/x0 = HVC_RESET_VECTORS
|
||||
|
||||
Turn HYP/EL2 MMU off, and reset HVBAR/VBAR_EL2 to the initials
|
||||
stubs' exception vector value. This effectively disables an existing
|
||||
hypervisor.
|
||||
|
||||
* r0/x0 = HVC_SOFT_RESTART
|
||||
r1/x1 = restart address
|
||||
x2 = x0's value when entering the next payload (arm64)
|
||||
x3 = x1's value when entering the next payload (arm64)
|
||||
x4 = x2's value when entering the next payload (arm64)
|
||||
* ::
|
||||
|
||||
r0/x0 = HVC_SOFT_RESTART
|
||||
r1/x1 = restart address
|
||||
x2 = x0's value when entering the next payload (arm64)
|
||||
x3 = x1's value when entering the next payload (arm64)
|
||||
x4 = x2's value when entering the next payload (arm64)
|
||||
|
||||
Mask all exceptions, disable the MMU, move the arguments into place
|
||||
(arm64 only), and jump to the restart address while at HYP/EL2. This
|
12
Documentation/virt/kvm/arm/index.rst
Normal file
12
Documentation/virt/kvm/arm/index.rst
Normal file
@ -0,0 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===
|
||||
ARM
|
||||
===
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
hyp-abi
|
||||
psci
|
||||
pvtime
|
@ -1,3 +1,9 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================================
|
||||
Power State Coordination Interface (PSCI)
|
||||
=========================================
|
||||
|
||||
KVM implements the PSCI (Power State Coordination Interface)
|
||||
specification in order to provide services such as CPU on/off, reset
|
||||
and power-off to the guest.
|
||||
@ -30,32 +36,42 @@ The following register is defined:
|
||||
- Affects the whole VM (even if the register view is per-vcpu)
|
||||
|
||||
* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
|
||||
Holds the state of the firmware support to mitigate CVE-2017-5715, as
|
||||
offered by KVM to the guest via a HVC call. The workaround is described
|
||||
under SMCCC_ARCH_WORKAROUND_1 in [1].
|
||||
Holds the state of the firmware support to mitigate CVE-2017-5715, as
|
||||
offered by KVM to the guest via a HVC call. The workaround is described
|
||||
under SMCCC_ARCH_WORKAROUND_1 in [1].
|
||||
|
||||
Accepted values are:
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: KVM does not offer
|
||||
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL:
|
||||
KVM does not offer
|
||||
firmware support for the workaround. The mitigation status for the
|
||||
guest is unknown.
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: The workaround HVC call is
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL:
|
||||
The workaround HVC call is
|
||||
available to the guest and required for the mitigation.
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: The workaround HVC call
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED:
|
||||
The workaround HVC call
|
||||
is available to the guest, but it is not needed on this VCPU.
|
||||
|
||||
* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
|
||||
Holds the state of the firmware support to mitigate CVE-2018-3639, as
|
||||
offered by KVM to the guest via a HVC call. The workaround is described
|
||||
under SMCCC_ARCH_WORKAROUND_2 in [1].
|
||||
Holds the state of the firmware support to mitigate CVE-2018-3639, as
|
||||
offered by KVM to the guest via a HVC call. The workaround is described
|
||||
under SMCCC_ARCH_WORKAROUND_2 in [1]_.
|
||||
|
||||
Accepted values are:
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: A workaround is not
|
||||
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL:
|
||||
A workaround is not
|
||||
available. KVM does not offer firmware support for the workaround.
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: The workaround state is
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN:
|
||||
The workaround state is
|
||||
unknown. KVM does not offer firmware support for the workaround.
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: The workaround is available,
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL:
|
||||
The workaround is available,
|
||||
and can be disabled by a vCPU. If
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
|
||||
this vCPU.
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: The workaround is
|
||||
always active on this vCPU or it is not needed.
|
||||
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED:
|
||||
The workaround is always active on this vCPU or it is not needed.
|
||||
|
||||
[1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
|
||||
.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===============================================
|
||||
ARM Virtual Interrupt Translation Service (ITS)
|
||||
===============================================
|
||||
|
||||
@ -12,22 +15,32 @@ There can be multiple ITS controllers per guest, each of them has to have
|
||||
a separate, non-overlapping MMIO region.
|
||||
|
||||
|
||||
Groups:
|
||||
KVM_DEV_ARM_VGIC_GRP_ADDR
|
||||
Groups
|
||||
======
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_ADDR
|
||||
-------------------------
|
||||
|
||||
Attributes:
|
||||
KVM_VGIC_ITS_ADDR_TYPE (rw, 64-bit)
|
||||
Base address in the guest physical address space of the GICv3 ITS
|
||||
control register frame.
|
||||
This address needs to be 64K aligned and the region covers 128K.
|
||||
|
||||
Errors:
|
||||
-E2BIG: Address outside of addressable IPA range
|
||||
-EINVAL: Incorrectly aligned address
|
||||
-EEXIST: Address already configured
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-ENODEV: Incorrect attribute or the ITS is not supported.
|
||||
|
||||
======= =================================================
|
||||
-E2BIG Address outside of addressable IPA range
|
||||
-EINVAL Incorrectly aligned address
|
||||
-EEXIST Address already configured
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
-ENODEV Incorrect attribute or the ITS is not supported.
|
||||
======= =================================================
|
||||
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_CTRL
|
||||
KVM_DEV_ARM_VGIC_GRP_CTRL
|
||||
-------------------------
|
||||
|
||||
Attributes:
|
||||
KVM_DEV_ARM_VGIC_CTRL_INIT
|
||||
request the initialization of the ITS, no additional parameter in
|
||||
@ -58,16 +71,21 @@ Groups:
|
||||
"ITS Restore Sequence".
|
||||
|
||||
Errors:
|
||||
-ENXIO: ITS not properly configured as required prior to setting
|
||||
this attribute
|
||||
-ENOMEM: Memory shortage when allocating ITS internal data
|
||||
-EINVAL: Inconsistent restored data
|
||||
-EFAULT: Invalid guest ram access
|
||||
-EBUSY: One or more VCPUS are running
|
||||
-EACCES: The virtual ITS is backed by a physical GICv4 ITS, and the
|
||||
state is not available
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_ITS_REGS
|
||||
======= ==========================================================
|
||||
-ENXIO ITS not properly configured as required prior to setting
|
||||
this attribute
|
||||
-ENOMEM Memory shortage when allocating ITS internal data
|
||||
-EINVAL Inconsistent restored data
|
||||
-EFAULT Invalid guest ram access
|
||||
-EBUSY One or more VCPUS are running
|
||||
-EACCES The virtual ITS is backed by a physical GICv4 ITS, and the
|
||||
state is not available
|
||||
======= ==========================================================
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_ITS_REGS
|
||||
-----------------------------
|
||||
|
||||
Attributes:
|
||||
The attr field of kvm_device_attr encodes the offset of the
|
||||
ITS register, relative to the ITS control frame base address
|
||||
@ -78,6 +96,7 @@ Groups:
|
||||
be accessed with full length.
|
||||
|
||||
Writes to read-only registers are ignored by the kernel except for:
|
||||
|
||||
- GITS_CREADR. It must be restored otherwise commands in the queue
|
||||
will be re-executed after restoring CWRITER. GITS_CREADR must be
|
||||
restored before restoring the GITS_CTLR which is likely to enable the
|
||||
@ -91,30 +110,36 @@ Groups:
|
||||
|
||||
For other registers, getting or setting a register has the same
|
||||
effect as reading/writing the register on real hardware.
|
||||
Errors:
|
||||
-ENXIO: Offset does not correspond to any supported register
|
||||
-EFAULT: Invalid user pointer for attr->addr
|
||||
-EINVAL: Offset is not 64-bit aligned
|
||||
-EBUSY: one or more VCPUS are running
|
||||
|
||||
ITS Restore Sequence:
|
||||
-------------------------
|
||||
Errors:
|
||||
|
||||
======= ====================================================
|
||||
-ENXIO Offset does not correspond to any supported register
|
||||
-EFAULT Invalid user pointer for attr->addr
|
||||
-EINVAL Offset is not 64-bit aligned
|
||||
-EBUSY one or more VCPUS are running
|
||||
======= ====================================================
|
||||
|
||||
ITS Restore Sequence:
|
||||
---------------------
|
||||
|
||||
The following ordering must be followed when restoring the GIC and the ITS:
|
||||
|
||||
a) restore all guest memory and create vcpus
|
||||
b) restore all redistributors
|
||||
c) provide the ITS base address
|
||||
(KVM_DEV_ARM_VGIC_GRP_ADDR)
|
||||
d) restore the ITS in the following order:
|
||||
1. Restore GITS_CBASER
|
||||
2. Restore all other GITS_ registers, except GITS_CTLR!
|
||||
3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES)
|
||||
4. Restore GITS_CTLR
|
||||
|
||||
1. Restore GITS_CBASER
|
||||
2. Restore all other ``GITS_`` registers, except GITS_CTLR!
|
||||
3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES)
|
||||
4. Restore GITS_CTLR
|
||||
|
||||
Then vcpus can be started.
|
||||
|
||||
ITS Table ABI REV0:
|
||||
-------------------
|
||||
ITS Table ABI REV0:
|
||||
-------------------
|
||||
|
||||
Revision 0 of the ABI only supports the features of a virtual GICv3, and does
|
||||
not support a virtual GICv4 with support for direct injection of virtual
|
||||
@ -125,12 +150,13 @@ Then vcpus can be started.
|
||||
entries in the collection are listed in no particular order.
|
||||
All entries are 8 bytes.
|
||||
|
||||
Device Table Entry (DTE):
|
||||
Device Table Entry (DTE)::
|
||||
|
||||
bits: | 63| 62 ... 49 | 48 ... 5 | 4 ... 0 |
|
||||
values: | V | next | ITT_addr | Size |
|
||||
bits: | 63| 62 ... 49 | 48 ... 5 | 4 ... 0 |
|
||||
values: | V | next | ITT_addr | Size |
|
||||
|
||||
where:
|
||||
|
||||
where;
|
||||
- V indicates whether the entry is valid. If not, other fields
|
||||
are not meaningful.
|
||||
- next: equals to 0 if this entry is the last one; otherwise it
|
||||
@ -140,32 +166,34 @@ Then vcpus can be started.
|
||||
- Size specifies the supported number of bits for the EventID,
|
||||
minus one
|
||||
|
||||
Collection Table Entry (CTE):
|
||||
Collection Table Entry (CTE)::
|
||||
|
||||
bits: | 63| 62 .. 52 | 51 ... 16 | 15 ... 0 |
|
||||
values: | V | RES0 | RDBase | ICID |
|
||||
bits: | 63| 62 .. 52 | 51 ... 16 | 15 ... 0 |
|
||||
values: | V | RES0 | RDBase | ICID |
|
||||
|
||||
where:
|
||||
|
||||
- V indicates whether the entry is valid. If not, other fields are
|
||||
not meaningful.
|
||||
- RES0: reserved field with Should-Be-Zero-or-Preserved behavior.
|
||||
- RDBase is the PE number (GICR_TYPER.Processor_Number semantic),
|
||||
- ICID is the collection ID
|
||||
|
||||
Interrupt Translation Entry (ITE):
|
||||
Interrupt Translation Entry (ITE)::
|
||||
|
||||
bits: | 63 ... 48 | 47 ... 16 | 15 ... 0 |
|
||||
values: | next | pINTID | ICID |
|
||||
bits: | 63 ... 48 | 47 ... 16 | 15 ... 0 |
|
||||
values: | next | pINTID | ICID |
|
||||
|
||||
where:
|
||||
|
||||
- next: equals to 0 if this entry is the last one; otherwise it corresponds
|
||||
to the EventID offset to the next ITE capped by 2^16 -1.
|
||||
- pINTID is the physical LPI ID; if zero, it means the entry is not valid
|
||||
and other fields are not meaningful.
|
||||
- ICID is the collection ID
|
||||
|
||||
ITS Reset State:
|
||||
----------------
|
||||
ITS Reset State:
|
||||
----------------
|
||||
|
||||
RESET returns the ITS to the same state that it was when first created and
|
||||
initialized. When the RESET command returns, the following things are
|
@ -1,9 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============================================================
|
||||
ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)
|
||||
==============================================================
|
||||
|
||||
|
||||
Device types supported:
|
||||
KVM_DEV_TYPE_ARM_VGIC_V3 ARM Generic Interrupt Controller v3.0
|
||||
- KVM_DEV_TYPE_ARM_VGIC_V3 ARM Generic Interrupt Controller v3.0
|
||||
|
||||
Only one VGIC instance may be instantiated through this API. The created VGIC
|
||||
will act as the VM interrupt controller, requiring emulated user-space devices
|
||||
@ -15,7 +18,8 @@ Creating a guest GICv3 device requires a host GICv3 as well.
|
||||
|
||||
Groups:
|
||||
KVM_DEV_ARM_VGIC_GRP_ADDR
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
KVM_VGIC_V3_ADDR_TYPE_DIST (rw, 64-bit)
|
||||
Base address in the guest physical address space of the GICv3 distributor
|
||||
register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
|
||||
@ -29,21 +33,25 @@ Groups:
|
||||
This address needs to be 64K aligned.
|
||||
|
||||
KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
|
||||
The attribute data pointed to by kvm_device_attr.addr is a __u64 value:
|
||||
bits: | 63 .... 52 | 51 .... 16 | 15 - 12 |11 - 0
|
||||
values: | count | base | flags | index
|
||||
The attribute data pointed to by kvm_device_attr.addr is a __u64 value::
|
||||
|
||||
bits: | 63 .... 52 | 51 .... 16 | 15 - 12 |11 - 0
|
||||
values: | count | base | flags | index
|
||||
|
||||
- index encodes the unique redistributor region index
|
||||
- flags: reserved for future use, currently 0
|
||||
- base field encodes bits [51:16] of the guest physical base address
|
||||
of the first redistributor in the region.
|
||||
- count encodes the number of redistributors in the region. Must be
|
||||
greater than 0.
|
||||
|
||||
There are two 64K pages for each redistributor in the region and
|
||||
redistributors are laid out contiguously within the region. Regions
|
||||
are filled with redistributors in the index order. The sum of all
|
||||
region count fields must be greater than or equal to the number of
|
||||
VCPUs. Redistributor regions must be registered in the incremental
|
||||
index order, starting from index 0.
|
||||
|
||||
The characteristics of a specific redistributor region can be read
|
||||
by presetting the index field in the attr data.
|
||||
Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
|
||||
@ -52,23 +60,27 @@ Groups:
|
||||
KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
|
||||
|
||||
Errors:
|
||||
-E2BIG: Address outside of addressable IPA range
|
||||
-EINVAL: Incorrectly aligned address, bad redistributor region
|
||||
|
||||
======= =============================================================
|
||||
-E2BIG Address outside of addressable IPA range
|
||||
-EINVAL Incorrectly aligned address, bad redistributor region
|
||||
count/index, mixed redistributor region attribute usage
|
||||
-EEXIST: Address already configured
|
||||
-ENOENT: Attempt to read the characteristics of a non existing
|
||||
-EEXIST Address already configured
|
||||
-ENOENT Attempt to read the characteristics of a non existing
|
||||
redistributor region
|
||||
-ENXIO: The group or attribute is unknown/unsupported for this device
|
||||
-ENXIO The group or attribute is unknown/unsupported for this device
|
||||
or hardware support is missing.
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
======= =============================================================
|
||||
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_DIST_REGS
|
||||
KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
|
||||
Attributes:
|
||||
The attr field of kvm_device_attr encodes two values:
|
||||
bits: | 63 .... 32 | 31 .... 0 |
|
||||
values: | mpidr | offset |
|
||||
KVM_DEV_ARM_VGIC_GRP_DIST_REGS, KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
|
||||
Attributes:
|
||||
|
||||
The attr field of kvm_device_attr encodes two values::
|
||||
|
||||
bits: | 63 .... 32 | 31 .... 0 |
|
||||
values: | mpidr | offset |
|
||||
|
||||
All distributor regs are (rw, 32-bit) and kvm_device_attr.addr points to a
|
||||
__u32 value. 64-bit registers must be accessed by separately accessing the
|
||||
@ -93,7 +105,8 @@ Groups:
|
||||
redistributor is accessed. The mpidr is ignored for the distributor.
|
||||
|
||||
The mpidr encoding is based on the affinity information in the
|
||||
architecture defined MPIDR, and the field is encoded as follows:
|
||||
architecture defined MPIDR, and the field is encoded as follows::
|
||||
|
||||
| 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
|
||||
| Aff3 | Aff2 | Aff1 | Aff0 |
|
||||
|
||||
@ -148,24 +161,30 @@ Groups:
|
||||
ignored.
|
||||
|
||||
Errors:
|
||||
-ENXIO: Getting or setting this register is not yet supported
|
||||
-EBUSY: One or more VCPUs are running
|
||||
|
||||
====== =====================================================
|
||||
-ENXIO Getting or setting this register is not yet supported
|
||||
-EBUSY One or more VCPUs are running
|
||||
====== =====================================================
|
||||
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
|
||||
Attributes:
|
||||
The attr field of kvm_device_attr encodes two values:
|
||||
bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 |
|
||||
values: | mpidr | RES | instr |
|
||||
Attributes:
|
||||
|
||||
The attr field of kvm_device_attr encodes two values::
|
||||
|
||||
bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 |
|
||||
values: | mpidr | RES | instr |
|
||||
|
||||
The mpidr field encodes the CPU ID based on the affinity information in the
|
||||
architecture defined MPIDR, and the field is encoded as follows:
|
||||
architecture defined MPIDR, and the field is encoded as follows::
|
||||
|
||||
| 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
|
||||
| Aff3 | Aff2 | Aff1 | Aff0 |
|
||||
|
||||
The instr field encodes the system register to access based on the fields
|
||||
defined in the A64 instruction set encoding for system register access
|
||||
(RES means the bits are reserved for future use and should be zero):
|
||||
(RES means the bits are reserved for future use and should be zero)::
|
||||
|
||||
| 15 ... 14 | 13 ... 11 | 10 ... 7 | 6 ... 3 | 2 ... 0 |
|
||||
| Op 0 | Op1 | CRn | CRm | Op2 |
|
||||
@ -178,26 +197,35 @@ Groups:
|
||||
|
||||
CPU interface registers access is not implemented for AArch32 mode.
|
||||
Error -ENXIO is returned when accessed in AArch32 mode.
|
||||
|
||||
Errors:
|
||||
-ENXIO: Getting or setting this register is not yet supported
|
||||
-EBUSY: VCPU is running
|
||||
-EINVAL: Invalid mpidr or register value supplied
|
||||
|
||||
======= =====================================================
|
||||
-ENXIO Getting or setting this register is not yet supported
|
||||
-EBUSY VCPU is running
|
||||
-EINVAL Invalid mpidr or register value supplied
|
||||
======= =====================================================
|
||||
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_NR_IRQS
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
A value describing the number of interrupts (SGI, PPI and SPI) for
|
||||
this GIC instance, ranging from 64 to 1024, in increments of 32.
|
||||
|
||||
kvm_device_attr.addr points to a __u32 value.
|
||||
|
||||
Errors:
|
||||
-EINVAL: Value set is out of the expected range
|
||||
-EBUSY: Value has already be set.
|
||||
|
||||
======= ======================================
|
||||
-EINVAL Value set is out of the expected range
|
||||
-EBUSY Value has already be set.
|
||||
======= ======================================
|
||||
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_CTRL
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
KVM_DEV_ARM_VGIC_CTRL_INIT
|
||||
request the initialization of the VGIC, no additional parameter in
|
||||
kvm_device_attr.addr.
|
||||
@ -205,20 +233,26 @@ Groups:
|
||||
save all LPI pending bits into guest RAM pending tables.
|
||||
|
||||
The first kB of the pending table is not altered by this operation.
|
||||
|
||||
Errors:
|
||||
-ENXIO: VGIC not properly configured as required prior to calling
|
||||
this attribute
|
||||
-ENODEV: no online VCPU
|
||||
-ENOMEM: memory shortage when allocating vgic internal data
|
||||
-EFAULT: Invalid guest ram access
|
||||
-EBUSY: One or more VCPUS are running
|
||||
|
||||
======= ========================================================
|
||||
-ENXIO VGIC not properly configured as required prior to calling
|
||||
this attribute
|
||||
-ENODEV no online VCPU
|
||||
-ENOMEM memory shortage when allocating vgic internal data
|
||||
-EFAULT Invalid guest ram access
|
||||
-EBUSY One or more VCPUS are running
|
||||
======= ========================================================
|
||||
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO
|
||||
Attributes:
|
||||
The attr field of kvm_device_attr encodes the following values:
|
||||
bits: | 63 .... 32 | 31 .... 10 | 9 .... 0 |
|
||||
values: | mpidr | info | vINTID |
|
||||
Attributes:
|
||||
|
||||
The attr field of kvm_device_attr encodes the following values::
|
||||
|
||||
bits: | 63 .... 32 | 31 .... 10 | 9 .... 0 |
|
||||
values: | mpidr | info | vINTID |
|
||||
|
||||
The vINTID specifies which set of IRQs is reported on.
|
||||
|
||||
@ -228,6 +262,7 @@ Groups:
|
||||
VGIC_LEVEL_INFO_LINE_LEVEL:
|
||||
Get/Set the input level of the IRQ line for a set of 32 contiguously
|
||||
numbered interrupts.
|
||||
|
||||
vINTID must be a multiple of 32.
|
||||
|
||||
kvm_device_attr.addr points to a __u32 value which will contain a
|
||||
@ -243,9 +278,14 @@ Groups:
|
||||
reported with the same value regardless of the mpidr specified.
|
||||
|
||||
The mpidr field encodes the CPU ID based on the affinity information in the
|
||||
architecture defined MPIDR, and the field is encoded as follows:
|
||||
architecture defined MPIDR, and the field is encoded as follows::
|
||||
|
||||
| 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
|
||||
| Aff3 | Aff2 | Aff1 | Aff0 |
|
||||
|
||||
Errors:
|
||||
-EINVAL: vINTID is not multiple of 32 or
|
||||
info field is not VGIC_LEVEL_INFO_LINE_LEVEL
|
||||
|
||||
======= =============================================
|
||||
-EINVAL vINTID is not multiple of 32 or info field is
|
||||
not VGIC_LEVEL_INFO_LINE_LEVEL
|
||||
======= =============================================
|
@ -1,8 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==================================================
|
||||
ARM Virtual Generic Interrupt Controller v2 (VGIC)
|
||||
==================================================
|
||||
|
||||
Device types supported:
|
||||
KVM_DEV_TYPE_ARM_VGIC_V2 ARM Generic Interrupt Controller v2.0
|
||||
|
||||
- KVM_DEV_TYPE_ARM_VGIC_V2 ARM Generic Interrupt Controller v2.0
|
||||
|
||||
Only one VGIC instance may be instantiated through either this API or the
|
||||
legacy KVM_CREATE_IRQCHIP API. The created VGIC will act as the VM interrupt
|
||||
@ -17,7 +21,8 @@ create both a GICv3 and GICv2 device on the same VM.
|
||||
|
||||
Groups:
|
||||
KVM_DEV_ARM_VGIC_GRP_ADDR
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
|
||||
Base address in the guest physical address space of the GIC distributor
|
||||
register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
|
||||
@ -27,19 +32,25 @@ Groups:
|
||||
Base address in the guest physical address space of the GIC virtual cpu
|
||||
interface register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
|
||||
This address needs to be 4K aligned and the region covers 4 KByte.
|
||||
|
||||
Errors:
|
||||
-E2BIG: Address outside of addressable IPA range
|
||||
-EINVAL: Incorrectly aligned address
|
||||
-EEXIST: Address already configured
|
||||
-ENXIO: The group or attribute is unknown/unsupported for this device
|
||||
|
||||
======= =============================================================
|
||||
-E2BIG Address outside of addressable IPA range
|
||||
-EINVAL Incorrectly aligned address
|
||||
-EEXIST Address already configured
|
||||
-ENXIO The group or attribute is unknown/unsupported for this device
|
||||
or hardware support is missing.
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
======= =============================================================
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_DIST_REGS
|
||||
Attributes:
|
||||
The attr field of kvm_device_attr encodes two values:
|
||||
bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 |
|
||||
values: | reserved | vcpu_index | offset |
|
||||
Attributes:
|
||||
|
||||
The attr field of kvm_device_attr encodes two values::
|
||||
|
||||
bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 |
|
||||
values: | reserved | vcpu_index | offset |
|
||||
|
||||
All distributor regs are (rw, 32-bit)
|
||||
|
||||
@ -58,16 +69,22 @@ Groups:
|
||||
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
|
||||
the expected behavior. Unless GICD_IIDR has been set from userspace, writes
|
||||
to the interrupt group registers (GICD_IGROUPR) are ignored.
|
||||
|
||||
Errors:
|
||||
-ENXIO: Getting or setting this register is not yet supported
|
||||
-EBUSY: One or more VCPUs are running
|
||||
-EINVAL: Invalid vcpu_index supplied
|
||||
|
||||
======= =====================================================
|
||||
-ENXIO Getting or setting this register is not yet supported
|
||||
-EBUSY One or more VCPUs are running
|
||||
-EINVAL Invalid vcpu_index supplied
|
||||
======= =====================================================
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_CPU_REGS
|
||||
Attributes:
|
||||
The attr field of kvm_device_attr encodes two values:
|
||||
bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 |
|
||||
values: | reserved | vcpu_index | offset |
|
||||
Attributes:
|
||||
|
||||
The attr field of kvm_device_attr encodes two values::
|
||||
|
||||
bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 |
|
||||
values: | reserved | vcpu_index | offset |
|
||||
|
||||
All CPU interface regs are (rw, 32-bit)
|
||||
|
||||
@ -101,27 +118,39 @@ Groups:
|
||||
value left by 3 places to obtain the actual priority mask level.
|
||||
|
||||
Errors:
|
||||
-ENXIO: Getting or setting this register is not yet supported
|
||||
-EBUSY: One or more VCPUs are running
|
||||
-EINVAL: Invalid vcpu_index supplied
|
||||
|
||||
======= =====================================================
|
||||
-ENXIO Getting or setting this register is not yet supported
|
||||
-EBUSY One or more VCPUs are running
|
||||
-EINVAL Invalid vcpu_index supplied
|
||||
======= =====================================================
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_NR_IRQS
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
A value describing the number of interrupts (SGI, PPI and SPI) for
|
||||
this GIC instance, ranging from 64 to 1024, in increments of 32.
|
||||
|
||||
Errors:
|
||||
-EINVAL: Value set is out of the expected range
|
||||
-EBUSY: Value has already be set, or GIC has already been initialized
|
||||
with default values.
|
||||
|
||||
======= =============================================================
|
||||
-EINVAL Value set is out of the expected range
|
||||
-EBUSY Value has already be set, or GIC has already been initialized
|
||||
with default values.
|
||||
======= =============================================================
|
||||
|
||||
KVM_DEV_ARM_VGIC_GRP_CTRL
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
KVM_DEV_ARM_VGIC_CTRL_INIT
|
||||
request the initialization of the VGIC or ITS, no additional parameter
|
||||
in kvm_device_attr.addr.
|
||||
|
||||
Errors:
|
||||
-ENXIO: VGIC not properly configured as required prior to calling
|
||||
this attribute
|
||||
-ENODEV: no online VCPU
|
||||
-ENOMEM: memory shortage when allocating vgic internal data
|
||||
|
||||
======= =========================================================
|
||||
-ENXIO VGIC not properly configured as required prior to calling
|
||||
this attribute
|
||||
-ENODEV no online VCPU
|
||||
-ENOMEM memory shortage when allocating vgic internal data
|
||||
======= =========================================================
|
19
Documentation/virt/kvm/devices/index.rst
Normal file
19
Documentation/virt/kvm/devices/index.rst
Normal file
@ -0,0 +1,19 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======
|
||||
Devices
|
||||
=======
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
arm-vgic-its
|
||||
arm-vgic
|
||||
arm-vgic-v3
|
||||
mpic
|
||||
s390_flic
|
||||
vcpu
|
||||
vfio
|
||||
vm
|
||||
xics
|
||||
xive
|
@ -1,9 +1,13 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================
|
||||
MPIC interrupt controller
|
||||
=========================
|
||||
|
||||
Device types supported:
|
||||
KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0
|
||||
KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2
|
||||
|
||||
- KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0
|
||||
- KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2
|
||||
|
||||
Only one MPIC instance, of any type, may be instantiated. The created
|
||||
MPIC will act as the system interrupt controller, connecting to each
|
||||
@ -11,7 +15,8 @@ vcpu's interrupt inputs.
|
||||
|
||||
Groups:
|
||||
KVM_DEV_MPIC_GRP_MISC
|
||||
Attributes:
|
||||
Attributes:
|
||||
|
||||
KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
|
||||
Base address of the 256 KiB MPIC register space. Must be
|
||||
naturally aligned. A value of zero disables the mapping.
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
====================================
|
||||
FLIC (floating interrupt controller)
|
||||
====================================
|
||||
|
||||
@ -31,8 +34,10 @@ Groups:
|
||||
Copies all floating interrupts into a buffer provided by userspace.
|
||||
When the buffer is too small it returns -ENOMEM, which is the indication
|
||||
for userspace to try again with a bigger buffer.
|
||||
|
||||
-ENOBUFS is returned when the allocation of a kernelspace buffer has
|
||||
failed.
|
||||
|
||||
-EFAULT is returned when copying data to userspace failed.
|
||||
All interrupts remain pending, i.e. are not deleted from the list of
|
||||
currently pending interrupts.
|
||||
@ -60,38 +65,41 @@ Groups:
|
||||
|
||||
KVM_DEV_FLIC_ADAPTER_REGISTER
|
||||
Register an I/O adapter interrupt source. Takes a kvm_s390_io_adapter
|
||||
describing the adapter to register:
|
||||
describing the adapter to register::
|
||||
|
||||
struct kvm_s390_io_adapter {
|
||||
__u32 id;
|
||||
__u8 isc;
|
||||
__u8 maskable;
|
||||
__u8 swap;
|
||||
__u8 flags;
|
||||
};
|
||||
struct kvm_s390_io_adapter {
|
||||
__u32 id;
|
||||
__u8 isc;
|
||||
__u8 maskable;
|
||||
__u8 swap;
|
||||
__u8 flags;
|
||||
};
|
||||
|
||||
id contains the unique id for the adapter, isc the I/O interruption subclass
|
||||
to use, maskable whether this adapter may be masked (interrupts turned off),
|
||||
swap whether the indicators need to be byte swapped, and flags contains
|
||||
further characteristics of the adapter.
|
||||
|
||||
Currently defined values for 'flags' are:
|
||||
|
||||
- KVM_S390_ADAPTER_SUPPRESSIBLE: adapter is subject to AIS
|
||||
(adapter-interrupt-suppression) facility. This flag only has an effect if
|
||||
the AIS capability is enabled.
|
||||
|
||||
Unknown flag values are ignored.
|
||||
|
||||
|
||||
KVM_DEV_FLIC_ADAPTER_MODIFY
|
||||
Modifies attributes of an existing I/O adapter interrupt source. Takes
|
||||
a kvm_s390_io_adapter_req specifying the adapter and the operation:
|
||||
a kvm_s390_io_adapter_req specifying the adapter and the operation::
|
||||
|
||||
struct kvm_s390_io_adapter_req {
|
||||
__u32 id;
|
||||
__u8 type;
|
||||
__u8 mask;
|
||||
__u16 pad0;
|
||||
__u64 addr;
|
||||
};
|
||||
struct kvm_s390_io_adapter_req {
|
||||
__u32 id;
|
||||
__u8 type;
|
||||
__u8 mask;
|
||||
__u16 pad0;
|
||||
__u64 addr;
|
||||
};
|
||||
|
||||
id specifies the adapter and type the operation. The supported operations
|
||||
are:
|
||||
@ -103,8 +111,9 @@ struct kvm_s390_io_adapter_req {
|
||||
perform a gmap translation for the guest address provided in addr,
|
||||
pin a userspace page for the translated address and add it to the
|
||||
list of mappings
|
||||
Note: A new mapping will be created unconditionally; therefore,
|
||||
the calling code should avoid making duplicate mappings.
|
||||
|
||||
.. note:: A new mapping will be created unconditionally; therefore,
|
||||
the calling code should avoid making duplicate mappings.
|
||||
|
||||
KVM_S390_IO_ADAPTER_UNMAP
|
||||
release a userspace page for the translated address specified in addr
|
||||
@ -112,16 +121,17 @@ struct kvm_s390_io_adapter_req {
|
||||
|
||||
KVM_DEV_FLIC_AISM
|
||||
modify the adapter-interruption-suppression mode for a given isc if the
|
||||
AIS capability is enabled. Takes a kvm_s390_ais_req describing:
|
||||
AIS capability is enabled. Takes a kvm_s390_ais_req describing::
|
||||
|
||||
struct kvm_s390_ais_req {
|
||||
__u8 isc;
|
||||
__u16 mode;
|
||||
};
|
||||
struct kvm_s390_ais_req {
|
||||
__u8 isc;
|
||||
__u16 mode;
|
||||
};
|
||||
|
||||
isc contains the target I/O interruption subclass, mode the target
|
||||
adapter-interruption-suppression mode. The following modes are
|
||||
currently supported:
|
||||
|
||||
- KVM_S390_AIS_MODE_ALL: ALL-Interruptions Mode, i.e. airq injection
|
||||
is always allowed;
|
||||
- KVM_S390_AIS_MODE_SINGLE: SINGLE-Interruption Mode, i.e. airq
|
||||
@ -139,12 +149,12 @@ struct kvm_s390_ais_req {
|
||||
|
||||
KVM_DEV_FLIC_AISM_ALL
|
||||
Gets or sets the adapter-interruption-suppression mode for all ISCs. Takes
|
||||
a kvm_s390_ais_all describing:
|
||||
a kvm_s390_ais_all describing::
|
||||
|
||||
struct kvm_s390_ais_all {
|
||||
__u8 simm; /* Single-Interruption-Mode mask */
|
||||
__u8 nimm; /* No-Interruption-Mode mask *
|
||||
};
|
||||
struct kvm_s390_ais_all {
|
||||
__u8 simm; /* Single-Interruption-Mode mask */
|
||||
__u8 nimm; /* No-Interruption-Mode mask *
|
||||
};
|
||||
|
||||
simm contains Single-Interruption-Mode mask for all ISCs, nimm contains
|
||||
No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds
|
||||
@ -159,5 +169,5 @@ ENXIO, as specified in the API documentation). It is not possible to conclude
|
||||
that a FLIC operation is unavailable based on the error code resulting from a
|
||||
usage attempt.
|
||||
|
||||
Note: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a zero
|
||||
schid is specified.
|
||||
.. note:: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a
|
||||
zero schid is specified.
|
114
Documentation/virt/kvm/devices/vcpu.rst
Normal file
114
Documentation/virt/kvm/devices/vcpu.rst
Normal file
@ -0,0 +1,114 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
======================
|
||||
Generic vcpu interface
|
||||
======================
|
||||
|
||||
The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
|
||||
KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct
|
||||
kvm_device_attr as other devices, but targets VCPU-wide settings and controls.
|
||||
|
||||
The groups and attributes per virtual cpu, if any, are architecture specific.
|
||||
|
||||
1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL
|
||||
==================================
|
||||
|
||||
:Architectures: ARM64
|
||||
|
||||
1.1. ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ
|
||||
---------------------------------------
|
||||
|
||||
:Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
|
||||
pointer to an int
|
||||
|
||||
Returns:
|
||||
|
||||
======= ========================================================
|
||||
-EBUSY The PMU overflow interrupt is already set
|
||||
-ENXIO The overflow interrupt not set when attempting to get it
|
||||
-ENODEV PMUv3 not supported
|
||||
-EINVAL Invalid PMU overflow interrupt number supplied or
|
||||
trying to set the IRQ number without using an in-kernel
|
||||
irqchip.
|
||||
======= ========================================================
|
||||
|
||||
A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
|
||||
number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
|
||||
type must be same for each vcpu. As a PPI, the interrupt number is the same for
|
||||
all vcpus, while as an SPI it must be a separate number per vcpu.
|
||||
|
||||
1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
|
||||
---------------------------------------
|
||||
|
||||
:Parameters: no additional parameter in kvm_device_attr.addr
|
||||
|
||||
Returns:
|
||||
|
||||
======= ======================================================
|
||||
-ENODEV PMUv3 not supported or GIC not initialized
|
||||
-ENXIO PMUv3 not properly configured or in-kernel irqchip not
|
||||
configured as required prior to calling this attribute
|
||||
-EBUSY PMUv3 already initialized
|
||||
======= ======================================================
|
||||
|
||||
Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel
|
||||
virtual GIC implementation, this must be done after initializing the in-kernel
|
||||
irqchip.
|
||||
|
||||
|
||||
2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
|
||||
=================================
|
||||
|
||||
:Architectures: ARM, ARM64
|
||||
|
||||
2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER
|
||||
-----------------------------------------------------------------------------
|
||||
|
||||
:Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
|
||||
pointer to an int
|
||||
|
||||
Returns:
|
||||
|
||||
======= =================================
|
||||
-EINVAL Invalid timer interrupt number
|
||||
-EBUSY One or more VCPUs has already run
|
||||
======= =================================
|
||||
|
||||
A value describing the architected timer interrupt number when connected to an
|
||||
in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the
|
||||
attribute overrides the default values (see below).
|
||||
|
||||
============================= ==========================================
|
||||
KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27)
|
||||
KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30)
|
||||
============================= ==========================================
|
||||
|
||||
Setting the same PPI for different timers will prevent the VCPUs from running.
|
||||
Setting the interrupt number on a VCPU configures all VCPUs created at that
|
||||
time to use the number provided for a given timer, overwriting any previously
|
||||
configured values on other VCPUs. Userspace should configure the interrupt
|
||||
numbers on at least one VCPU after creating all VCPUs and before running any
|
||||
VCPUs.
|
||||
|
||||
3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
|
||||
==================================
|
||||
|
||||
:Architectures: ARM64
|
||||
|
||||
3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA
|
||||
--------------------------------------
|
||||
|
||||
:Parameters: 64-bit base address
|
||||
|
||||
Returns:
|
||||
|
||||
======= ======================================
|
||||
-ENXIO Stolen time not implemented
|
||||
-EEXIST Base address already set for this VCPU
|
||||
-EINVAL Base address not 64 byte aligned
|
||||
======= ======================================
|
||||
|
||||
Specifies the base address of the stolen time structure for this VCPU. The
|
||||
base address must be 64 byte aligned and exist within a valid guest memory
|
||||
region. See Documentation/virt/kvm/arm/pvtime.txt for more information
|
||||
including the layout of the stolen time structure.
|
@ -1,76 +0,0 @@
|
||||
Generic vcpu interface
|
||||
====================================
|
||||
|
||||
The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
|
||||
KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct
|
||||
kvm_device_attr as other devices, but targets VCPU-wide settings and controls.
|
||||
|
||||
The groups and attributes per virtual cpu, if any, are architecture specific.
|
||||
|
||||
1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL
|
||||
Architectures: ARM64
|
||||
|
||||
1.1. ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ
|
||||
Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
|
||||
pointer to an int
|
||||
Returns: -EBUSY: The PMU overflow interrupt is already set
|
||||
-ENXIO: The overflow interrupt not set when attempting to get it
|
||||
-ENODEV: PMUv3 not supported
|
||||
-EINVAL: Invalid PMU overflow interrupt number supplied or
|
||||
trying to set the IRQ number without using an in-kernel
|
||||
irqchip.
|
||||
|
||||
A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
|
||||
number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
|
||||
type must be same for each vcpu. As a PPI, the interrupt number is the same for
|
||||
all vcpus, while as an SPI it must be a separate number per vcpu.
|
||||
|
||||
1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
|
||||
Parameters: no additional parameter in kvm_device_attr.addr
|
||||
Returns: -ENODEV: PMUv3 not supported or GIC not initialized
|
||||
-ENXIO: PMUv3 not properly configured or in-kernel irqchip not
|
||||
configured as required prior to calling this attribute
|
||||
-EBUSY: PMUv3 already initialized
|
||||
|
||||
Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel
|
||||
virtual GIC implementation, this must be done after initializing the in-kernel
|
||||
irqchip.
|
||||
|
||||
|
||||
2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
|
||||
Architectures: ARM,ARM64
|
||||
|
||||
2.1. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_VTIMER
|
||||
2.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_PTIMER
|
||||
Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
|
||||
pointer to an int
|
||||
Returns: -EINVAL: Invalid timer interrupt number
|
||||
-EBUSY: One or more VCPUs has already run
|
||||
|
||||
A value describing the architected timer interrupt number when connected to an
|
||||
in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the
|
||||
attribute overrides the default values (see below).
|
||||
|
||||
KVM_ARM_VCPU_TIMER_IRQ_VTIMER: The EL1 virtual timer intid (default: 27)
|
||||
KVM_ARM_VCPU_TIMER_IRQ_PTIMER: The EL1 physical timer intid (default: 30)
|
||||
|
||||
Setting the same PPI for different timers will prevent the VCPUs from running.
|
||||
Setting the interrupt number on a VCPU configures all VCPUs created at that
|
||||
time to use the number provided for a given timer, overwriting any previously
|
||||
configured values on other VCPUs. Userspace should configure the interrupt
|
||||
numbers on at least one VCPU after creating all VCPUs and before running any
|
||||
VCPUs.
|
||||
|
||||
3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
|
||||
Architectures: ARM64
|
||||
|
||||
3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA
|
||||
Parameters: 64-bit base address
|
||||
Returns: -ENXIO: Stolen time not implemented
|
||||
-EEXIST: Base address already set for this VCPU
|
||||
-EINVAL: Base address not 64 byte aligned
|
||||
|
||||
Specifies the base address of the stolen time structure for this VCPU. The
|
||||
base address must be 64 byte aligned and exist within a valid guest memory
|
||||
region. See Documentation/virt/kvm/arm/pvtime.txt for more information
|
||||
including the layout of the stolen time structure.
|
@ -1,8 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===================
|
||||
VFIO virtual device
|
||||
===================
|
||||
|
||||
Device types supported:
|
||||
KVM_DEV_TYPE_VFIO
|
||||
|
||||
- KVM_DEV_TYPE_VFIO
|
||||
|
||||
Only one VFIO instance may be created per VM. The created device
|
||||
tracks VFIO groups in use by the VM and features of those groups
|
||||
@ -23,14 +27,15 @@ KVM_DEV_VFIO_GROUP attributes:
|
||||
for the VFIO group.
|
||||
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
|
||||
allocated by sPAPR KVM.
|
||||
kvm_device_attr.addr points to a struct:
|
||||
kvm_device_attr.addr points to a struct::
|
||||
|
||||
struct kvm_vfio_spapr_tce {
|
||||
__s32 groupfd;
|
||||
__s32 tablefd;
|
||||
};
|
||||
struct kvm_vfio_spapr_tce {
|
||||
__s32 groupfd;
|
||||
__s32 tablefd;
|
||||
};
|
||||
|
||||
where
|
||||
@groupfd is a file descriptor for a VFIO group;
|
||||
@tablefd is a file descriptor for a TCE table allocated via
|
||||
KVM_CREATE_SPAPR_TCE.
|
||||
where:
|
||||
|
||||
- @groupfd is a file descriptor for a VFIO group;
|
||||
- @tablefd is a file descriptor for a TCE table allocated via
|
||||
KVM_CREATE_SPAPR_TCE.
|
@ -1,5 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
====================
|
||||
Generic vm interface
|
||||
====================================
|
||||
====================
|
||||
|
||||
The virtual machine "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
|
||||
KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same
|
||||
@ -10,30 +13,38 @@ The groups and attributes per virtual machine, if any, are architecture
|
||||
specific.
|
||||
|
||||
1. GROUP: KVM_S390_VM_MEM_CTRL
|
||||
Architectures: s390
|
||||
==============================
|
||||
|
||||
:Architectures: s390
|
||||
|
||||
1.1. ATTRIBUTE: KVM_S390_VM_MEM_ENABLE_CMMA
|
||||
Parameters: none
|
||||
Returns: -EBUSY if a vcpu is already defined, otherwise 0
|
||||
-------------------------------------------
|
||||
|
||||
:Parameters: none
|
||||
:Returns: -EBUSY if a vcpu is already defined, otherwise 0
|
||||
|
||||
Enables Collaborative Memory Management Assist (CMMA) for the virtual machine.
|
||||
|
||||
1.2. ATTRIBUTE: KVM_S390_VM_MEM_CLR_CMMA
|
||||
Parameters: none
|
||||
Returns: -EINVAL if CMMA was not enabled
|
||||
0 otherwise
|
||||
----------------------------------------
|
||||
|
||||
:Parameters: none
|
||||
:Returns: -EINVAL if CMMA was not enabled;
|
||||
0 otherwise
|
||||
|
||||
Clear the CMMA status for all guest pages, so any pages the guest marked
|
||||
as unused are again used any may not be reclaimed by the host.
|
||||
|
||||
1.3. ATTRIBUTE KVM_S390_VM_MEM_LIMIT_SIZE
|
||||
Parameters: in attr->addr the address for the new limit of guest memory
|
||||
Returns: -EFAULT if the given address is not accessible
|
||||
-EINVAL if the virtual machine is of type UCONTROL
|
||||
-E2BIG if the given guest memory is to big for that machine
|
||||
-EBUSY if a vcpu is already defined
|
||||
-ENOMEM if not enough memory is available for a new shadow guest mapping
|
||||
0 otherwise
|
||||
-----------------------------------------
|
||||
|
||||
:Parameters: in attr->addr the address for the new limit of guest memory
|
||||
:Returns: -EFAULT if the given address is not accessible;
|
||||
-EINVAL if the virtual machine is of type UCONTROL;
|
||||
-E2BIG if the given guest memory is to big for that machine;
|
||||
-EBUSY if a vcpu is already defined;
|
||||
-ENOMEM if not enough memory is available for a new shadow guest mapping;
|
||||
0 otherwise.
|
||||
|
||||
Allows userspace to query the actual limit and set a new limit for
|
||||
the maximum guest memory size. The limit will be rounded up to
|
||||
@ -42,78 +53,92 @@ the number of page table levels. In the case that there is no limit we will set
|
||||
the limit to KVM_S390_NO_MEM_LIMIT (U64_MAX).
|
||||
|
||||
2. GROUP: KVM_S390_VM_CPU_MODEL
|
||||
Architectures: s390
|
||||
===============================
|
||||
|
||||
:Architectures: s390
|
||||
|
||||
2.1. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE (r/o)
|
||||
---------------------------------------------
|
||||
|
||||
Allows user space to retrieve machine and kvm specific cpu related information:
|
||||
Allows user space to retrieve machine and kvm specific cpu related information::
|
||||
|
||||
struct kvm_s390_vm_cpu_machine {
|
||||
struct kvm_s390_vm_cpu_machine {
|
||||
__u64 cpuid; # CPUID of host
|
||||
__u32 ibc; # IBC level range offered by host
|
||||
__u8 pad[4];
|
||||
__u64 fac_mask[256]; # set of cpu facilities enabled by KVM
|
||||
__u64 fac_list[256]; # set of cpu facilities offered by host
|
||||
}
|
||||
}
|
||||
|
||||
Parameters: address of buffer to store the machine related cpu data
|
||||
of type struct kvm_s390_vm_cpu_machine*
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space
|
||||
-ENOMEM if not enough memory is available to process the ioctl
|
||||
0 in case of success
|
||||
:Parameters: address of buffer to store the machine related cpu data
|
||||
of type struct kvm_s390_vm_cpu_machine*
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
-ENOMEM if not enough memory is available to process the ioctl;
|
||||
0 in case of success.
|
||||
|
||||
2.2. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR (r/w)
|
||||
===============================================
|
||||
|
||||
Allows user space to retrieve or request to change cpu related information for a vcpu:
|
||||
Allows user space to retrieve or request to change cpu related information for a vcpu::
|
||||
|
||||
struct kvm_s390_vm_cpu_processor {
|
||||
struct kvm_s390_vm_cpu_processor {
|
||||
__u64 cpuid; # CPUID currently (to be) used by this vcpu
|
||||
__u16 ibc; # IBC level currently (to be) used by this vcpu
|
||||
__u8 pad[6];
|
||||
__u64 fac_list[256]; # set of cpu facilities currently (to be) used
|
||||
# by this vcpu
|
||||
}
|
||||
# by this vcpu
|
||||
}
|
||||
|
||||
KVM does not enforce or limit the cpu model data in any form. Take the information
|
||||
retrieved by means of KVM_S390_VM_CPU_MACHINE as hint for reasonable configuration
|
||||
setups. Instruction interceptions triggered by additionally set facility bits that
|
||||
are not handled by KVM need to by imlemented in the VM driver code.
|
||||
|
||||
Parameters: address of buffer to store/set the processor related cpu
|
||||
data of type struct kvm_s390_vm_cpu_processor*.
|
||||
Returns: -EBUSY in case 1 or more vcpus are already activated (only in write case)
|
||||
-EFAULT if the given address is not accessible from kernel space
|
||||
-ENOMEM if not enough memory is available to process the ioctl
|
||||
0 in case of success
|
||||
:Parameters: address of buffer to store/set the processor related cpu
|
||||
data of type struct kvm_s390_vm_cpu_processor*.
|
||||
:Returns: -EBUSY in case 1 or more vcpus are already activated (only in write case);
|
||||
-EFAULT if the given address is not accessible from kernel space;
|
||||
-ENOMEM if not enough memory is available to process the ioctl;
|
||||
0 in case of success.
|
||||
|
||||
.. _KVM_S390_VM_CPU_MACHINE_FEAT:
|
||||
|
||||
2.3. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_FEAT (r/o)
|
||||
--------------------------------------------------
|
||||
|
||||
Allows user space to retrieve available cpu features. A feature is available if
|
||||
provided by the hardware and supported by kvm. In theory, cpu features could
|
||||
even be completely emulated by kvm.
|
||||
|
||||
struct kvm_s390_vm_cpu_feat {
|
||||
__u64 feat[16]; # Bitmap (1 = feature available), MSB 0 bit numbering
|
||||
};
|
||||
::
|
||||
|
||||
Parameters: address of a buffer to load the feature list from.
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space.
|
||||
0 in case of success.
|
||||
struct kvm_s390_vm_cpu_feat {
|
||||
__u64 feat[16]; # Bitmap (1 = feature available), MSB 0 bit numbering
|
||||
};
|
||||
|
||||
:Parameters: address of a buffer to load the feature list from.
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
0 in case of success.
|
||||
|
||||
2.4. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_FEAT (r/w)
|
||||
----------------------------------------------------
|
||||
|
||||
Allows user space to retrieve or change enabled cpu features for all VCPUs of a
|
||||
VM. Features that are not available cannot be enabled.
|
||||
|
||||
See 2.3. for a description of the parameter struct.
|
||||
See :ref:`KVM_S390_VM_CPU_MACHINE_FEAT` for
|
||||
a description of the parameter struct.
|
||||
|
||||
Parameters: address of a buffer to store/load the feature list from.
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space.
|
||||
-EINVAL if a cpu feature that is not available is to be enabled.
|
||||
-EBUSY if at least one VCPU has already been defined.
|
||||
:Parameters: address of a buffer to store/load the feature list from.
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
-EINVAL if a cpu feature that is not available is to be enabled;
|
||||
-EBUSY if at least one VCPU has already been defined;
|
||||
0 in case of success.
|
||||
|
||||
.. _KVM_S390_VM_CPU_MACHINE_SUBFUNC:
|
||||
|
||||
2.5. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_SUBFUNC (r/o)
|
||||
-----------------------------------------------------
|
||||
|
||||
Allows user space to retrieve available cpu subfunctions without any filtering
|
||||
done by a set IBC. These subfunctions are indicated to the guest VCPU via
|
||||
@ -126,7 +151,9 @@ contained in the returned struct. If the affected instruction
|
||||
indicates subfunctions via a "test bit" mechanism, the subfunction codes are
|
||||
contained in the returned struct in MSB 0 bit numbering.
|
||||
|
||||
struct kvm_s390_vm_cpu_subfunc {
|
||||
::
|
||||
|
||||
struct kvm_s390_vm_cpu_subfunc {
|
||||
u8 plo[32]; # always valid (ESA/390 feature)
|
||||
u8 ptff[16]; # valid with TOD-clock steering
|
||||
u8 kmac[16]; # valid with Message-Security-Assist
|
||||
@ -143,13 +170,14 @@ struct kvm_s390_vm_cpu_subfunc {
|
||||
u8 kma[16]; # valid with Message-Security-Assist-Extension 8
|
||||
u8 kdsa[16]; # valid with Message-Security-Assist-Extension 9
|
||||
u8 reserved[1792]; # reserved for future instructions
|
||||
};
|
||||
};
|
||||
|
||||
Parameters: address of a buffer to load the subfunction blocks from.
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space.
|
||||
:Parameters: address of a buffer to load the subfunction blocks from.
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
0 in case of success.
|
||||
|
||||
2.6. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_SUBFUNC (r/w)
|
||||
-------------------------------------------------------
|
||||
|
||||
Allows user space to retrieve or change cpu subfunctions to be indicated for
|
||||
all VCPUs of a VM. This attribute will only be available if kernel and
|
||||
@ -164,107 +192,125 @@ As long as no data has been written, a read will fail. The IBC will be used
|
||||
to determine available subfunctions in this case, this will guarantee backward
|
||||
compatibility.
|
||||
|
||||
See 2.5. for a description of the parameter struct.
|
||||
See :ref:`KVM_S390_VM_CPU_MACHINE_SUBFUNC` for a
|
||||
description of the parameter struct.
|
||||
|
||||
Parameters: address of a buffer to store/load the subfunction blocks from.
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space.
|
||||
-EINVAL when reading, if there was no write yet.
|
||||
-EBUSY if at least one VCPU has already been defined.
|
||||
:Parameters: address of a buffer to store/load the subfunction blocks from.
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
-EINVAL when reading, if there was no write yet;
|
||||
-EBUSY if at least one VCPU has already been defined;
|
||||
0 in case of success.
|
||||
|
||||
3. GROUP: KVM_S390_VM_TOD
|
||||
Architectures: s390
|
||||
=========================
|
||||
|
||||
:Architectures: s390
|
||||
|
||||
3.1. ATTRIBUTE: KVM_S390_VM_TOD_HIGH
|
||||
------------------------------------
|
||||
|
||||
Allows user space to set/get the TOD clock extension (u8) (superseded by
|
||||
KVM_S390_VM_TOD_EXT).
|
||||
|
||||
Parameters: address of a buffer in user space to store the data (u8) to
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space
|
||||
:Parameters: address of a buffer in user space to store the data (u8) to
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
-EINVAL if setting the TOD clock extension to != 0 is not supported
|
||||
|
||||
3.2. ATTRIBUTE: KVM_S390_VM_TOD_LOW
|
||||
-----------------------------------
|
||||
|
||||
Allows user space to set/get bits 0-63 of the TOD clock register as defined in
|
||||
the POP (u64).
|
||||
|
||||
Parameters: address of a buffer in user space to store the data (u64) to
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space
|
||||
:Parameters: address of a buffer in user space to store the data (u64) to
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space
|
||||
|
||||
3.3. ATTRIBUTE: KVM_S390_VM_TOD_EXT
|
||||
-----------------------------------
|
||||
|
||||
Allows user space to set/get bits 0-63 of the TOD clock register as defined in
|
||||
the POP (u64). If the guest CPU model supports the TOD clock extension (u8), it
|
||||
also allows user space to get/set it. If the guest CPU model does not support
|
||||
it, it is stored as 0 and not allowed to be set to a value != 0.
|
||||
|
||||
Parameters: address of a buffer in user space to store the data
|
||||
(kvm_s390_vm_tod_clock) to
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space
|
||||
:Parameters: address of a buffer in user space to store the data
|
||||
(kvm_s390_vm_tod_clock) to
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
-EINVAL if setting the TOD clock extension to != 0 is not supported
|
||||
|
||||
4. GROUP: KVM_S390_VM_CRYPTO
|
||||
Architectures: s390
|
||||
============================
|
||||
|
||||
:Architectures: s390
|
||||
|
||||
4.1. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_AES_KW (w/o)
|
||||
------------------------------------------------------
|
||||
|
||||
Allows user space to enable aes key wrapping, including generating a new
|
||||
wrapping key.
|
||||
|
||||
Parameters: none
|
||||
Returns: 0
|
||||
:Parameters: none
|
||||
:Returns: 0
|
||||
|
||||
4.2. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_DEA_KW (w/o)
|
||||
------------------------------------------------------
|
||||
|
||||
Allows user space to enable dea key wrapping, including generating a new
|
||||
wrapping key.
|
||||
|
||||
Parameters: none
|
||||
Returns: 0
|
||||
:Parameters: none
|
||||
:Returns: 0
|
||||
|
||||
4.3. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_AES_KW (w/o)
|
||||
-------------------------------------------------------
|
||||
|
||||
Allows user space to disable aes key wrapping, clearing the wrapping key.
|
||||
|
||||
Parameters: none
|
||||
Returns: 0
|
||||
:Parameters: none
|
||||
:Returns: 0
|
||||
|
||||
4.4. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_DEA_KW (w/o)
|
||||
-------------------------------------------------------
|
||||
|
||||
Allows user space to disable dea key wrapping, clearing the wrapping key.
|
||||
|
||||
Parameters: none
|
||||
Returns: 0
|
||||
:Parameters: none
|
||||
:Returns: 0
|
||||
|
||||
5. GROUP: KVM_S390_VM_MIGRATION
|
||||
Architectures: s390
|
||||
===============================
|
||||
|
||||
:Architectures: s390
|
||||
|
||||
5.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o)
|
||||
------------------------------------------------
|
||||
|
||||
Allows userspace to stop migration mode, needed for PGSTE migration.
|
||||
Setting this attribute when migration mode is not active will have no
|
||||
effects.
|
||||
|
||||
Parameters: none
|
||||
Returns: 0
|
||||
:Parameters: none
|
||||
:Returns: 0
|
||||
|
||||
5.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o)
|
||||
-------------------------------------------------
|
||||
|
||||
Allows userspace to start migration mode, needed for PGSTE migration.
|
||||
Setting this attribute when migration mode is already active will have
|
||||
no effects.
|
||||
|
||||
Parameters: none
|
||||
Returns: -ENOMEM if there is not enough free memory to start migration mode
|
||||
-EINVAL if the state of the VM is invalid (e.g. no memory defined)
|
||||
:Parameters: none
|
||||
:Returns: -ENOMEM if there is not enough free memory to start migration mode;
|
||||
-EINVAL if the state of the VM is invalid (e.g. no memory defined);
|
||||
0 in case of success.
|
||||
|
||||
5.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o)
|
||||
--------------------------------------------------
|
||||
|
||||
Allows userspace to query the status of migration mode.
|
||||
|
||||
Parameters: address of a buffer in user space to store the data (u64) to;
|
||||
the data itself is either 0 if migration mode is disabled or 1
|
||||
if it is enabled
|
||||
Returns: -EFAULT if the given address is not accessible from kernel space
|
||||
:Parameters: address of a buffer in user space to store the data (u64) to;
|
||||
the data itself is either 0 if migration mode is disabled or 1
|
||||
if it is enabled
|
||||
:Returns: -EFAULT if the given address is not accessible from kernel space;
|
||||
0 in case of success.
|
@ -1,20 +1,31 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================
|
||||
XICS interrupt controller
|
||||
=========================
|
||||
|
||||
Device type supported: KVM_DEV_TYPE_XICS
|
||||
|
||||
Groups:
|
||||
1. KVM_DEV_XICS_GRP_SOURCES
|
||||
Attributes: One per interrupt source, indexed by the source number.
|
||||
Attributes:
|
||||
|
||||
One per interrupt source, indexed by the source number.
|
||||
2. KVM_DEV_XICS_GRP_CTRL
|
||||
Attributes:
|
||||
2.1 KVM_DEV_XICS_NR_SERVERS (write only)
|
||||
Attributes:
|
||||
|
||||
2.1 KVM_DEV_XICS_NR_SERVERS (write only)
|
||||
|
||||
The kvm_device_attr.addr points to a __u32 value which is the number of
|
||||
interrupt server numbers (ie, highest possible vcpu id plus one).
|
||||
|
||||
Errors:
|
||||
-EINVAL: Value greater than KVM_MAX_VCPU_ID.
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-EBUSY: A vcpu is already connected to the device.
|
||||
|
||||
======= ==========================================
|
||||
-EINVAL Value greater than KVM_MAX_VCPU_ID.
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
-EBUSY A vcpu is already connected to the device.
|
||||
======= ==========================================
|
||||
|
||||
This device emulates the XICS (eXternal Interrupt Controller
|
||||
Specification) defined in PAPR. The XICS has a set of interrupt
|
||||
@ -53,24 +64,29 @@ the interrupt source number. The 64 bit state word has the following
|
||||
bitfields, starting from the least-significant end of the word:
|
||||
|
||||
* Destination (server number), 32 bits
|
||||
|
||||
This specifies where the interrupt should be sent, and is the
|
||||
interrupt server number specified for the destination vcpu.
|
||||
|
||||
* Priority, 8 bits
|
||||
|
||||
This is the priority specified for this interrupt source, where 0 is
|
||||
the highest priority and 255 is the lowest. An interrupt with a
|
||||
priority of 255 will never be delivered.
|
||||
|
||||
* Level sensitive flag, 1 bit
|
||||
|
||||
This bit is 1 for a level-sensitive interrupt source, or 0 for
|
||||
edge-sensitive (or MSI).
|
||||
|
||||
* Masked flag, 1 bit
|
||||
|
||||
This bit is set to 1 if the interrupt is masked (cannot be delivered
|
||||
regardless of its priority), for example by the ibm,int-off RTAS
|
||||
call, or 0 if it is not masked.
|
||||
|
||||
* Pending flag, 1 bit
|
||||
|
||||
This bit is 1 if the source has a pending interrupt, otherwise 0.
|
||||
|
||||
Only one XICS instance may be created per VM.
|
@ -1,8 +1,11 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================================================
|
||||
POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1)
|
||||
==========================================================
|
||||
===========================================================
|
||||
|
||||
Device types supported:
|
||||
KVM_DEV_TYPE_XIVE POWER9 XIVE Interrupt Controller generation 1
|
||||
- KVM_DEV_TYPE_XIVE POWER9 XIVE Interrupt Controller generation 1
|
||||
|
||||
This device acts as a VM interrupt controller. It provides the KVM
|
||||
interface to configure the interrupt sources of a VM in the underlying
|
||||
@ -64,72 +67,100 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
|
||||
|
||||
* Groups:
|
||||
|
||||
1. KVM_DEV_XIVE_GRP_CTRL
|
||||
Provides global controls on the device
|
||||
1. KVM_DEV_XIVE_GRP_CTRL
|
||||
Provides global controls on the device
|
||||
|
||||
Attributes:
|
||||
1.1 KVM_DEV_XIVE_RESET (write only)
|
||||
Resets the interrupt controller configuration for sources and event
|
||||
queues. To be used by kexec and kdump.
|
||||
|
||||
Errors: none
|
||||
|
||||
1.2 KVM_DEV_XIVE_EQ_SYNC (write only)
|
||||
Sync all the sources and queues and mark the EQ pages dirty. This
|
||||
to make sure that a consistent memory state is captured when
|
||||
migrating the VM.
|
||||
|
||||
Errors: none
|
||||
|
||||
1.3 KVM_DEV_XIVE_NR_SERVERS (write only)
|
||||
The kvm_device_attr.addr points to a __u32 value which is the number of
|
||||
interrupt server numbers (ie, highest possible vcpu id plus one).
|
||||
Errors:
|
||||
-EINVAL: Value greater than KVM_MAX_VCPU_ID.
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-EBUSY: A vCPU is already connected to the device.
|
||||
|
||||
2. KVM_DEV_XIVE_GRP_SOURCE (write only)
|
||||
Initializes a new source in the XIVE device and mask it.
|
||||
Errors:
|
||||
|
||||
======= ==========================================
|
||||
-EINVAL Value greater than KVM_MAX_VCPU_ID.
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
-EBUSY A vCPU is already connected to the device.
|
||||
======= ==========================================
|
||||
|
||||
2. KVM_DEV_XIVE_GRP_SOURCE (write only)
|
||||
Initializes a new source in the XIVE device and mask it.
|
||||
|
||||
Attributes:
|
||||
Interrupt source number (64-bit)
|
||||
The kvm_device_attr.addr points to a __u64 value:
|
||||
bits: | 63 .... 2 | 1 | 0
|
||||
values: | unused | level | type
|
||||
|
||||
The kvm_device_attr.addr points to a __u64 value::
|
||||
|
||||
bits: | 63 .... 2 | 1 | 0
|
||||
values: | unused | level | type
|
||||
|
||||
- type: 0:MSI 1:LSI
|
||||
- level: assertion level in case of an LSI.
|
||||
Errors:
|
||||
-E2BIG: Interrupt source number is out of range
|
||||
-ENOMEM: Could not create a new source block
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-ENXIO: Could not allocate underlying HW interrupt
|
||||
|
||||
3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
|
||||
Configures source targeting
|
||||
Errors:
|
||||
|
||||
======= ==========================================
|
||||
-E2BIG Interrupt source number is out of range
|
||||
-ENOMEM Could not create a new source block
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
-ENXIO Could not allocate underlying HW interrupt
|
||||
======= ==========================================
|
||||
|
||||
3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
|
||||
Configures source targeting
|
||||
|
||||
Attributes:
|
||||
Interrupt source number (64-bit)
|
||||
The kvm_device_attr.addr points to a __u64 value:
|
||||
bits: | 63 .... 33 | 32 | 31 .. 3 | 2 .. 0
|
||||
values: | eisn | mask | server | priority
|
||||
|
||||
The kvm_device_attr.addr points to a __u64 value::
|
||||
|
||||
bits: | 63 .... 33 | 32 | 31 .. 3 | 2 .. 0
|
||||
values: | eisn | mask | server | priority
|
||||
|
||||
- priority: 0-7 interrupt priority level
|
||||
- server: CPU number chosen to handle the interrupt
|
||||
- mask: mask flag (unused)
|
||||
- eisn: Effective Interrupt Source Number
|
||||
Errors:
|
||||
-ENOENT: Unknown source number
|
||||
-EINVAL: Not initialized source number
|
||||
-EINVAL: Invalid priority
|
||||
-EINVAL: Invalid CPU number.
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-ENXIO: CPU event queues not configured or configuration of the
|
||||
underlying HW interrupt failed
|
||||
-EBUSY: No CPU available to serve interrupt
|
||||
|
||||
4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
|
||||
Configures an event queue of a CPU
|
||||
Errors:
|
||||
|
||||
======= =======================================================
|
||||
-ENOENT Unknown source number
|
||||
-EINVAL Not initialized source number
|
||||
-EINVAL Invalid priority
|
||||
-EINVAL Invalid CPU number.
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
-ENXIO CPU event queues not configured or configuration of the
|
||||
underlying HW interrupt failed
|
||||
-EBUSY No CPU available to serve interrupt
|
||||
======= =======================================================
|
||||
|
||||
4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
|
||||
Configures an event queue of a CPU
|
||||
|
||||
Attributes:
|
||||
EQ descriptor identifier (64-bit)
|
||||
The EQ descriptor identifier is a tuple (server, priority) :
|
||||
bits: | 63 .... 32 | 31 .. 3 | 2 .. 0
|
||||
values: | unused | server | priority
|
||||
The kvm_device_attr.addr points to :
|
||||
|
||||
The EQ descriptor identifier is a tuple (server, priority)::
|
||||
|
||||
bits: | 63 .... 32 | 31 .. 3 | 2 .. 0
|
||||
values: | unused | server | priority
|
||||
|
||||
The kvm_device_attr.addr points to::
|
||||
|
||||
struct kvm_ppc_xive_eq {
|
||||
__u32 flags;
|
||||
__u32 qshift;
|
||||
@ -138,8 +169,9 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
|
||||
__u32 qindex;
|
||||
__u8 pad[40];
|
||||
};
|
||||
|
||||
- flags: queue flags
|
||||
KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
|
||||
KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
|
||||
forces notification without using the coalescing mechanism
|
||||
provided by the XIVE END ESBs.
|
||||
- qshift: queue size (power of 2)
|
||||
@ -147,22 +179,31 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
|
||||
- qtoggle: current queue toggle bit
|
||||
- qindex: current queue index
|
||||
- pad: reserved for future use
|
||||
Errors:
|
||||
-ENOENT: Invalid CPU number
|
||||
-EINVAL: Invalid priority
|
||||
-EINVAL: Invalid flags
|
||||
-EINVAL: Invalid queue size
|
||||
-EINVAL: Invalid queue address
|
||||
-EFAULT: Invalid user pointer for attr->addr.
|
||||
-EIO: Configuration of the underlying HW failed
|
||||
|
||||
5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
|
||||
Synchronize the source to flush event notifications
|
||||
Errors:
|
||||
|
||||
======= =========================================
|
||||
-ENOENT Invalid CPU number
|
||||
-EINVAL Invalid priority
|
||||
-EINVAL Invalid flags
|
||||
-EINVAL Invalid queue size
|
||||
-EINVAL Invalid queue address
|
||||
-EFAULT Invalid user pointer for attr->addr.
|
||||
-EIO Configuration of the underlying HW failed
|
||||
======= =========================================
|
||||
|
||||
5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
|
||||
Synchronize the source to flush event notifications
|
||||
|
||||
Attributes:
|
||||
Interrupt source number (64-bit)
|
||||
|
||||
Errors:
|
||||
-ENOENT: Unknown source number
|
||||
-EINVAL: Not initialized source number
|
||||
|
||||
======= =============================
|
||||
-ENOENT Unknown source number
|
||||
-EINVAL Not initialized source number
|
||||
======= =============================
|
||||
|
||||
* VCPU state
|
||||
|
||||
@ -175,11 +216,12 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
|
||||
as it synthesizes the priorities of the pending interrupts. We
|
||||
capture a bit more to report debug information.
|
||||
|
||||
KVM_REG_PPC_VP_STATE (2 * 64bits)
|
||||
bits: | 63 .... 32 | 31 .... 0 |
|
||||
values: | TIMA word0 | TIMA word1 |
|
||||
bits: | 127 .......... 64 |
|
||||
values: | unused |
|
||||
KVM_REG_PPC_VP_STATE (2 * 64bits)::
|
||||
|
||||
bits: | 63 .... 32 | 31 .... 0 |
|
||||
values: | TIMA word0 | TIMA word1 |
|
||||
bits: | 127 .......... 64 |
|
||||
values: | unused |
|
||||
|
||||
* Migration:
|
||||
|
||||
@ -196,7 +238,7 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
|
||||
3. Capture the state of the source targeting, the EQs configuration
|
||||
and the state of thread interrupt context registers.
|
||||
|
||||
Restore is similar :
|
||||
Restore is similar:
|
||||
|
||||
1. Restore the EQ configuration. As targeting depends on it.
|
||||
2. Restore targeting
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================
|
||||
The KVM halt polling system
|
||||
===========================
|
||||
|
||||
@ -68,7 +71,8 @@ steady state polling interval but will only really do a good job for wakeups
|
||||
which come at an approximately constant rate, otherwise there will be constant
|
||||
adjustment of the polling interval.
|
||||
|
||||
[0] total block time: the time between when the halt polling function is
|
||||
[0] total block time:
|
||||
the time between when the halt polling function is
|
||||
invoked and a wakeup source received (irrespective of
|
||||
whether the scheduler is invoked within that function).
|
||||
|
||||
@ -81,31 +85,32 @@ shrunk. These variables are defined in include/linux/kvm_host.h and as module
|
||||
parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the
|
||||
powerpc kvm-hv case.
|
||||
|
||||
Module Parameter | Description | Default Value
|
||||
--------------------------------------------------------------------------------
|
||||
halt_poll_ns | The global max polling | KVM_HALT_POLL_NS_DEFAULT
|
||||
| interval which defines |
|
||||
| the ceiling value of the |
|
||||
| polling interval for | (per arch value)
|
||||
| each vcpu. |
|
||||
--------------------------------------------------------------------------------
|
||||
halt_poll_ns_grow | The value by which the | 2
|
||||
| halt polling interval is |
|
||||
| multiplied in the |
|
||||
| grow_halt_poll_ns() |
|
||||
| function. |
|
||||
--------------------------------------------------------------------------------
|
||||
halt_poll_ns_grow_start | The initial value to grow | 10000
|
||||
| to from zero in the |
|
||||
| grow_halt_poll_ns() |
|
||||
| function. |
|
||||
--------------------------------------------------------------------------------
|
||||
halt_poll_ns_shrink | The value by which the | 0
|
||||
| halt polling interval is |
|
||||
| divided in the |
|
||||
| shrink_halt_poll_ns() |
|
||||
| function. |
|
||||
--------------------------------------------------------------------------------
|
||||
+-----------------------+---------------------------+-------------------------+
|
||||
|Module Parameter | Description | Default Value |
|
||||
+-----------------------+---------------------------+-------------------------+
|
||||
|halt_poll_ns | The global max polling | KVM_HALT_POLL_NS_DEFAULT|
|
||||
| | interval which defines | |
|
||||
| | the ceiling value of the | |
|
||||
| | polling interval for | (per arch value) |
|
||||
| | each vcpu. | |
|
||||
+-----------------------+---------------------------+-------------------------+
|
||||
|halt_poll_ns_grow | The value by which the | 2 |
|
||||
| | halt polling interval is | |
|
||||
| | multiplied in the | |
|
||||
| | grow_halt_poll_ns() | |
|
||||
| | function. | |
|
||||
+-----------------------+---------------------------+-------------------------+
|
||||
|halt_poll_ns_grow_start| The initial value to grow | 10000 |
|
||||
| | to from zero in the | |
|
||||
| | grow_halt_poll_ns() | |
|
||||
| | function. | |
|
||||
+-----------------------+---------------------------+-------------------------+
|
||||
|halt_poll_ns_shrink | The value by which the | 0 |
|
||||
| | halt polling interval is | |
|
||||
| | divided in the | |
|
||||
| | shrink_halt_poll_ns() | |
|
||||
| | function. | |
|
||||
+-----------------------+---------------------------+-------------------------+
|
||||
|
||||
These module parameters can be set from the debugfs files in:
|
||||
|
||||
@ -117,20 +122,19 @@ Note: that these module parameters are system wide values and are not able to
|
||||
Further Notes
|
||||
=============
|
||||
|
||||
- Care should be taken when setting the halt_poll_ns module parameter as a
|
||||
large value has the potential to drive the cpu usage to 100% on a machine which
|
||||
would be almost entirely idle otherwise. This is because even if a guest has
|
||||
wakeups during which very little work is done and which are quite far apart, if
|
||||
the period is shorter than the global max polling interval (halt_poll_ns) then
|
||||
the host will always poll for the entire block time and thus cpu utilisation
|
||||
will go to 100%.
|
||||
- Care should be taken when setting the halt_poll_ns module parameter as a large value
|
||||
has the potential to drive the cpu usage to 100% on a machine which would be almost
|
||||
entirely idle otherwise. This is because even if a guest has wakeups during which very
|
||||
little work is done and which are quite far apart, if the period is shorter than the
|
||||
global max polling interval (halt_poll_ns) then the host will always poll for the
|
||||
entire block time and thus cpu utilisation will go to 100%.
|
||||
|
||||
- Halt polling essentially presents a trade off between power usage and latency
|
||||
and the module parameters should be used to tune the affinity for this. Idle
|
||||
cpu time is essentially converted to host kernel time with the aim of decreasing
|
||||
latency when entering the guest.
|
||||
- Halt polling essentially presents a trade off between power usage and latency and
|
||||
the module parameters should be used to tune the affinity for this. Idle cpu time is
|
||||
essentially converted to host kernel time with the aim of decreasing latency when
|
||||
entering the guest.
|
||||
|
||||
- Halt polling will only be conducted by the host when no other tasks are
|
||||
runnable on that cpu, otherwise the polling will cease immediately and
|
||||
schedule will be invoked to allow that other task to run. Thus this doesn't
|
||||
allow a guest to denial of service the cpu.
|
||||
- Halt polling will only be conducted by the host when no other tasks are runnable on
|
||||
that cpu, otherwise the polling will cease immediately and schedule will be invoked to
|
||||
allow that other task to run. Thus this doesn't allow a guest to denial of service the
|
||||
cpu.
|
@ -1,5 +1,9 @@
|
||||
Linux KVM Hypercall:
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===================
|
||||
Linux KVM Hypercall
|
||||
===================
|
||||
|
||||
X86:
|
||||
KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall
|
||||
instruction. The hypervisor can replace it with instructions that are
|
||||
@ -20,7 +24,7 @@ S390:
|
||||
For further information on the S390 diagnose call as supported by KVM,
|
||||
refer to Documentation/virt/kvm/s390-diag.txt.
|
||||
|
||||
PowerPC:
|
||||
PowerPC:
|
||||
It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
|
||||
Return value is placed in R3.
|
||||
|
||||
@ -34,7 +38,8 @@ MIPS:
|
||||
the return value is placed in $2 (v0).
|
||||
|
||||
KVM Hypercalls Documentation
|
||||
===========================
|
||||
============================
|
||||
|
||||
The template for each hypercall is:
|
||||
1. Hypercall name.
|
||||
2. Architecture(s)
|
||||
@ -43,56 +48,64 @@ The template for each hypercall is:
|
||||
|
||||
1. KVM_HC_VAPIC_POLL_IRQ
|
||||
------------------------
|
||||
Architecture: x86
|
||||
Status: active
|
||||
Purpose: Trigger guest exit so that the host can check for pending
|
||||
interrupts on reentry.
|
||||
|
||||
:Architecture: x86
|
||||
:Status: active
|
||||
:Purpose: Trigger guest exit so that the host can check for pending
|
||||
interrupts on reentry.
|
||||
|
||||
2. KVM_HC_MMU_OP
|
||||
------------------------
|
||||
Architecture: x86
|
||||
Status: deprecated.
|
||||
Purpose: Support MMU operations such as writing to PTE,
|
||||
flushing TLB, release PT.
|
||||
----------------
|
||||
|
||||
:Architecture: x86
|
||||
:Status: deprecated.
|
||||
:Purpose: Support MMU operations such as writing to PTE,
|
||||
flushing TLB, release PT.
|
||||
|
||||
3. KVM_HC_FEATURES
|
||||
------------------------
|
||||
Architecture: PPC
|
||||
Status: active
|
||||
Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
|
||||
used to enumerate which hypercalls are available. On PPC, either device tree
|
||||
based lookup ( which is also what EPAPR dictates) OR KVM specific enumeration
|
||||
mechanism (which is this hypercall) can be used.
|
||||
------------------
|
||||
|
||||
:Architecture: PPC
|
||||
:Status: active
|
||||
:Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
|
||||
used to enumerate which hypercalls are available. On PPC, either
|
||||
device tree based lookup ( which is also what EPAPR dictates)
|
||||
OR KVM specific enumeration mechanism (which is this hypercall)
|
||||
can be used.
|
||||
|
||||
4. KVM_HC_PPC_MAP_MAGIC_PAGE
|
||||
------------------------
|
||||
Architecture: PPC
|
||||
Status: active
|
||||
Purpose: To enable communication between the hypervisor and guest there is a
|
||||
shared page that contains parts of supervisor visible register state.
|
||||
The guest can map this shared page to access its supervisor register through
|
||||
memory using this hypercall.
|
||||
----------------------------
|
||||
|
||||
:Architecture: PPC
|
||||
:Status: active
|
||||
:Purpose: To enable communication between the hypervisor and guest there is a
|
||||
shared page that contains parts of supervisor visible register state.
|
||||
The guest can map this shared page to access its supervisor register
|
||||
through memory using this hypercall.
|
||||
|
||||
5. KVM_HC_KICK_CPU
|
||||
------------------------
|
||||
Architecture: x86
|
||||
Status: active
|
||||
Purpose: Hypercall used to wakeup a vcpu from HLT state
|
||||
Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest
|
||||
kernel mode for an event to occur (ex: a spinlock to become available) can
|
||||
execute HLT instruction once it has busy-waited for more than a threshold
|
||||
time-interval. Execution of HLT instruction would cause the hypervisor to put
|
||||
the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
|
||||
same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
|
||||
specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
|
||||
is used in the hypercall for future use.
|
||||
------------------
|
||||
|
||||
:Architecture: x86
|
||||
:Status: active
|
||||
:Purpose: Hypercall used to wakeup a vcpu from HLT state
|
||||
:Usage example:
|
||||
A vcpu of a paravirtualized guest that is busywaiting in guest
|
||||
kernel mode for an event to occur (ex: a spinlock to become available) can
|
||||
execute HLT instruction once it has busy-waited for more than a threshold
|
||||
time-interval. Execution of HLT instruction would cause the hypervisor to put
|
||||
the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
|
||||
same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
|
||||
specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
|
||||
is used in the hypercall for future use.
|
||||
|
||||
|
||||
6. KVM_HC_CLOCK_PAIRING
|
||||
------------------------
|
||||
Architecture: x86
|
||||
Status: active
|
||||
Purpose: Hypercall used to synchronize host and guest clocks.
|
||||
-----------------------
|
||||
:Architecture: x86
|
||||
:Status: active
|
||||
:Purpose: Hypercall used to synchronize host and guest clocks.
|
||||
|
||||
Usage:
|
||||
|
||||
a0: guest physical address where host copies
|
||||
@ -101,6 +114,8 @@ a0: guest physical address where host copies
|
||||
a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0)
|
||||
is supported (corresponding to the host's CLOCK_REALTIME clock).
|
||||
|
||||
::
|
||||
|
||||
struct kvm_clock_pairing {
|
||||
__s64 sec;
|
||||
__s64 nsec;
|
||||
@ -123,15 +138,16 @@ Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
|
||||
or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
|
||||
|
||||
6. KVM_HC_SEND_IPI
|
||||
------------------------
|
||||
Architecture: x86
|
||||
Status: active
|
||||
Purpose: Send IPIs to multiple vCPUs.
|
||||
------------------
|
||||
|
||||
a0: lower part of the bitmap of destination APIC IDs
|
||||
a1: higher part of the bitmap of destination APIC IDs
|
||||
a2: the lowest APIC ID in bitmap
|
||||
a3: APIC ICR
|
||||
:Architecture: x86
|
||||
:Status: active
|
||||
:Purpose: Send IPIs to multiple vCPUs.
|
||||
|
||||
- a0: lower part of the bitmap of destination APIC IDs
|
||||
- a1: higher part of the bitmap of destination APIC IDs
|
||||
- a2: the lowest APIC ID in bitmap
|
||||
- a3: APIC ICR
|
||||
|
||||
The hypercall lets a guest send multicast IPIs, with at most 128
|
||||
128 destinations per hypercall in 64-bit mode and 64 vCPUs per
|
||||
@ -143,12 +159,13 @@ corresponds to the APIC ID a2+1, and so on.
|
||||
Returns the number of CPUs to which the IPIs were delivered successfully.
|
||||
|
||||
7. KVM_HC_SCHED_YIELD
|
||||
------------------------
|
||||
Architecture: x86
|
||||
Status: active
|
||||
Purpose: Hypercall used to yield if the IPI target vCPU is preempted
|
||||
---------------------
|
||||
|
||||
:Architecture: x86
|
||||
:Status: active
|
||||
:Purpose: Hypercall used to yield if the IPI target vCPU is preempted
|
||||
|
||||
a0: destination APIC ID
|
||||
|
||||
Usage example: When sending a call-function IPI-many to vCPUs, yield if
|
||||
any of the IPI target vCPUs was preempted.
|
||||
:Usage example: When sending a call-function IPI-many to vCPUs, yield if
|
||||
any of the IPI target vCPUs was preempted.
|
@ -7,6 +7,22 @@ KVM
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
api
|
||||
amd-memory-encryption
|
||||
cpuid
|
||||
halt-polling
|
||||
hypercalls
|
||||
locking
|
||||
mmu
|
||||
msr
|
||||
nested-vmx
|
||||
ppc-pv
|
||||
s390-diag
|
||||
timekeeping
|
||||
vcpu-requests
|
||||
|
||||
review-checklist
|
||||
|
||||
arm/index
|
||||
|
||||
devices/index
|
||||
|
243
Documentation/virt/kvm/locking.rst
Normal file
243
Documentation/virt/kvm/locking.rst
Normal file
@ -0,0 +1,243 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================
|
||||
KVM Lock Overview
|
||||
=================
|
||||
|
||||
1. Acquisition Orders
|
||||
---------------------
|
||||
|
||||
The acquisition orders for mutexes are as follows:
|
||||
|
||||
- kvm->lock is taken outside vcpu->mutex
|
||||
|
||||
- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
|
||||
|
||||
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
|
||||
them together is quite rare.
|
||||
|
||||
On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
|
||||
|
||||
Everything else is a leaf: no other lock is taken inside the critical
|
||||
sections.
|
||||
|
||||
2. Exception
|
||||
------------
|
||||
|
||||
Fast page fault:
|
||||
|
||||
Fast page fault is the fast path which fixes the guest page fault out of
|
||||
the mmu-lock on x86. Currently, the page fault can be fast in one of the
|
||||
following two cases:
|
||||
|
||||
1. Access Tracking: The SPTE is not present, but it is marked for access
|
||||
tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
|
||||
restore the saved R/X bits. This is described in more detail later below.
|
||||
|
||||
2. Write-Protection: The SPTE is present and the fault is
|
||||
caused by write-protect. That means we just need to change the W bit of
|
||||
the spte.
|
||||
|
||||
What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
|
||||
SPTE_MMU_WRITEABLE bit on the spte:
|
||||
|
||||
- SPTE_HOST_WRITEABLE means the gfn is writable on host.
|
||||
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
|
||||
the gfn is writable on guest mmu and it is not write-protected by shadow
|
||||
page write-protection.
|
||||
|
||||
On fast page fault path, we will use cmpxchg to atomically set the spte W
|
||||
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
|
||||
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
|
||||
is safe because whenever changing these bits can be detected by cmpxchg.
|
||||
|
||||
But we need carefully check these cases:
|
||||
|
||||
1) The mapping from gfn to pfn
|
||||
|
||||
The mapping from gfn to pfn may be changed since we can only ensure the pfn
|
||||
is not changed during cmpxchg. This is a ABA problem, for example, below case
|
||||
will happen:
|
||||
|
||||
+------------------------------------------------------------------------+
|
||||
| At the beginning:: |
|
||||
| |
|
||||
| gpte = gfn1 |
|
||||
| gfn1 is mapped to pfn1 on host |
|
||||
| spte is the shadow page table entry corresponding with gpte and |
|
||||
| spte = pfn1 |
|
||||
+------------------------------------------------------------------------+
|
||||
| On fast page fault path: |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| CPU 0: | CPU 1: |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| :: | |
|
||||
| | |
|
||||
| old_spte = *spte; | |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| | pfn1 is swapped out:: |
|
||||
| | |
|
||||
| | spte = 0; |
|
||||
| | |
|
||||
| | pfn1 is re-alloced for gfn2. |
|
||||
| | |
|
||||
| | gpte is changed to point to |
|
||||
| | gfn2 by the guest:: |
|
||||
| | |
|
||||
| | spte = pfn1; |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| :: |
|
||||
| |
|
||||
| if (cmpxchg(spte, old_spte, old_spte+W) |
|
||||
| mark_page_dirty(vcpu->kvm, gfn1) |
|
||||
| OOPS!!! |
|
||||
+------------------------------------------------------------------------+
|
||||
|
||||
We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
|
||||
|
||||
For direct sp, we can easily avoid it since the spte of direct sp is fixed
|
||||
to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
|
||||
to pin gfn to pfn, because after gfn_to_pfn_atomic():
|
||||
|
||||
- We have held the refcount of pfn that means the pfn can not be freed and
|
||||
be reused for another gfn.
|
||||
- The pfn is writable that means it can not be shared between different gfns
|
||||
by KSM.
|
||||
|
||||
Then, we can ensure the dirty bitmaps is correctly set for a gfn.
|
||||
|
||||
Currently, to simplify the whole things, we disable fast page fault for
|
||||
indirect shadow page.
|
||||
|
||||
2) Dirty bit tracking
|
||||
|
||||
In the origin code, the spte can be fast updated (non-atomically) if the
|
||||
spte is read-only and the Accessed bit has already been set since the
|
||||
Accessed bit and Dirty bit can not be lost.
|
||||
|
||||
But it is not true after fast page fault since the spte can be marked
|
||||
writable between reading spte and updating spte. Like below case:
|
||||
|
||||
+------------------------------------------------------------------------+
|
||||
| At the beginning:: |
|
||||
| |
|
||||
| spte.W = 0 |
|
||||
| spte.Accessed = 1 |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| CPU 0: | CPU 1: |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| In mmu_spte_clear_track_bits():: | |
|
||||
| | |
|
||||
| old_spte = *spte; | |
|
||||
| | |
|
||||
| | |
|
||||
| /* 'if' condition is satisfied. */| |
|
||||
| if (old_spte.Accessed == 1 && | |
|
||||
| old_spte.W == 0) | |
|
||||
| spte = 0ull; | |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| | on fast page fault path:: |
|
||||
| | |
|
||||
| | spte.W = 1 |
|
||||
| | |
|
||||
| | memory write on the spte:: |
|
||||
| | |
|
||||
| | spte.Dirty = 1 |
|
||||
+------------------------------------+-----------------------------------+
|
||||
| :: | |
|
||||
| | |
|
||||
| else | |
|
||||
| old_spte = xchg(spte, 0ull) | |
|
||||
| if (old_spte.Accessed == 1) | |
|
||||
| kvm_set_pfn_accessed(spte.pfn);| |
|
||||
| if (old_spte.Dirty == 1) | |
|
||||
| kvm_set_pfn_dirty(spte.pfn); | |
|
||||
| OOPS!!! | |
|
||||
+------------------------------------+-----------------------------------+
|
||||
|
||||
The Dirty bit is lost in this case.
|
||||
|
||||
In order to avoid this kind of issue, we always treat the spte as "volatile"
|
||||
if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
|
||||
the spte is always atomically updated in this case.
|
||||
|
||||
3) flush tlbs due to spte updated
|
||||
|
||||
If the spte is updated from writable to readonly, we should flush all TLBs,
|
||||
otherwise rmap_write_protect will find a read-only spte, even though the
|
||||
writable spte might be cached on a CPU's TLB.
|
||||
|
||||
As mentioned before, the spte can be updated to writable out of mmu-lock on
|
||||
fast page fault path, in order to easily audit the path, we see if TLBs need
|
||||
be flushed caused by this reason in mmu_spte_update() since this is a common
|
||||
function to update spte (present -> present).
|
||||
|
||||
Since the spte is "volatile" if it can be updated out of mmu-lock, we always
|
||||
atomically update the spte, the race caused by fast page fault can be avoided,
|
||||
See the comments in spte_has_volatile_bits() and mmu_spte_update().
|
||||
|
||||
Lockless Access Tracking:
|
||||
|
||||
This is used for Intel CPUs that are using EPT but do not support the EPT A/D
|
||||
bits. In this case, when the KVM MMU notifier is called to track accesses to a
|
||||
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
|
||||
by clearing the RWX bits in the PTE and storing the original R & X bits in
|
||||
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
|
||||
PTE (using the ignored bit 62). When the VM tries to access the page later on,
|
||||
a fault is generated and the fast page fault mechanism described above is used
|
||||
to atomically restore the PTE to a Present state. The W bit is not saved when
|
||||
the PTE is marked for access tracking and during restoration to the Present
|
||||
state, the W bit is set depending on whether or not it was a write access. If
|
||||
it wasn't, then the W bit will remain clear until a write access happens, at
|
||||
which time it will be set using the Dirty tracking mechanism described above.
|
||||
|
||||
3. Reference
|
||||
------------
|
||||
|
||||
:Name: kvm_lock
|
||||
:Type: mutex
|
||||
:Arch: any
|
||||
:Protects: - vm_list
|
||||
|
||||
:Name: kvm_count_lock
|
||||
:Type: raw_spinlock_t
|
||||
:Arch: any
|
||||
:Protects: - hardware virtualization enable/disable
|
||||
:Comment: 'raw' because hardware enabling/disabling must be atomic /wrt
|
||||
migration.
|
||||
|
||||
:Name: kvm_arch::tsc_write_lock
|
||||
:Type: raw_spinlock
|
||||
:Arch: x86
|
||||
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
|
||||
- tsc offset in vmcb
|
||||
:Comment: 'raw' because updating the tsc offsets must not be preempted.
|
||||
|
||||
:Name: kvm->mmu_lock
|
||||
:Type: spinlock_t
|
||||
:Arch: any
|
||||
:Protects: -shadow page/shadow tlb entry
|
||||
:Comment: it is a spinlock since it is used in mmu notifier.
|
||||
|
||||
:Name: kvm->srcu
|
||||
:Type: srcu lock
|
||||
:Arch: any
|
||||
:Protects: - kvm->memslots
|
||||
- kvm->buses
|
||||
:Comment: The srcu read lock must be held while accessing memslots (e.g.
|
||||
when using gfn_to_* functions) and while accessing in-kernel
|
||||
MMIO/PIO address->device structure mapping (kvm->buses).
|
||||
The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
|
||||
if it is needed by multiple functions.
|
||||
|
||||
:Name: blocked_vcpu_on_cpu_lock
|
||||
:Type: spinlock_t
|
||||
:Arch: x86
|
||||
:Protects: blocked_vcpu_on_cpu
|
||||
:Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
|
||||
When VT-d posted-interrupts is supported and the VM has assigned
|
||||
devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
|
||||
protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues
|
||||
wakeup notification event since external interrupts from the
|
||||
assigned devices happens, we will find the vCPU on the list to
|
||||
wakeup.
|
@ -1,215 +0,0 @@
|
||||
KVM Lock Overview
|
||||
=================
|
||||
|
||||
1. Acquisition Orders
|
||||
---------------------
|
||||
|
||||
The acquisition orders for mutexes are as follows:
|
||||
|
||||
- kvm->lock is taken outside vcpu->mutex
|
||||
|
||||
- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
|
||||
|
||||
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
|
||||
them together is quite rare.
|
||||
|
||||
On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
|
||||
|
||||
Everything else is a leaf: no other lock is taken inside the critical
|
||||
sections.
|
||||
|
||||
2: Exception
|
||||
------------
|
||||
|
||||
Fast page fault:
|
||||
|
||||
Fast page fault is the fast path which fixes the guest page fault out of
|
||||
the mmu-lock on x86. Currently, the page fault can be fast in one of the
|
||||
following two cases:
|
||||
|
||||
1. Access Tracking: The SPTE is not present, but it is marked for access
|
||||
tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
|
||||
restore the saved R/X bits. This is described in more detail later below.
|
||||
|
||||
2. Write-Protection: The SPTE is present and the fault is
|
||||
caused by write-protect. That means we just need to change the W bit of the
|
||||
spte.
|
||||
|
||||
What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
|
||||
SPTE_MMU_WRITEABLE bit on the spte:
|
||||
- SPTE_HOST_WRITEABLE means the gfn is writable on host.
|
||||
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
|
||||
the gfn is writable on guest mmu and it is not write-protected by shadow
|
||||
page write-protection.
|
||||
|
||||
On fast page fault path, we will use cmpxchg to atomically set the spte W
|
||||
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
|
||||
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
|
||||
is safe because whenever changing these bits can be detected by cmpxchg.
|
||||
|
||||
But we need carefully check these cases:
|
||||
1): The mapping from gfn to pfn
|
||||
The mapping from gfn to pfn may be changed since we can only ensure the pfn
|
||||
is not changed during cmpxchg. This is a ABA problem, for example, below case
|
||||
will happen:
|
||||
|
||||
At the beginning:
|
||||
gpte = gfn1
|
||||
gfn1 is mapped to pfn1 on host
|
||||
spte is the shadow page table entry corresponding with gpte and
|
||||
spte = pfn1
|
||||
|
||||
VCPU 0 VCPU0
|
||||
on fast page fault path:
|
||||
|
||||
old_spte = *spte;
|
||||
pfn1 is swapped out:
|
||||
spte = 0;
|
||||
|
||||
pfn1 is re-alloced for gfn2.
|
||||
|
||||
gpte is changed to point to
|
||||
gfn2 by the guest:
|
||||
spte = pfn1;
|
||||
|
||||
if (cmpxchg(spte, old_spte, old_spte+W)
|
||||
mark_page_dirty(vcpu->kvm, gfn1)
|
||||
OOPS!!!
|
||||
|
||||
We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
|
||||
|
||||
For direct sp, we can easily avoid it since the spte of direct sp is fixed
|
||||
to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
|
||||
to pin gfn to pfn, because after gfn_to_pfn_atomic():
|
||||
- We have held the refcount of pfn that means the pfn can not be freed and
|
||||
be reused for another gfn.
|
||||
- The pfn is writable that means it can not be shared between different gfns
|
||||
by KSM.
|
||||
|
||||
Then, we can ensure the dirty bitmaps is correctly set for a gfn.
|
||||
|
||||
Currently, to simplify the whole things, we disable fast page fault for
|
||||
indirect shadow page.
|
||||
|
||||
2): Dirty bit tracking
|
||||
In the origin code, the spte can be fast updated (non-atomically) if the
|
||||
spte is read-only and the Accessed bit has already been set since the
|
||||
Accessed bit and Dirty bit can not be lost.
|
||||
|
||||
But it is not true after fast page fault since the spte can be marked
|
||||
writable between reading spte and updating spte. Like below case:
|
||||
|
||||
At the beginning:
|
||||
spte.W = 0
|
||||
spte.Accessed = 1
|
||||
|
||||
VCPU 0 VCPU0
|
||||
In mmu_spte_clear_track_bits():
|
||||
|
||||
old_spte = *spte;
|
||||
|
||||
/* 'if' condition is satisfied. */
|
||||
if (old_spte.Accessed == 1 &&
|
||||
old_spte.W == 0)
|
||||
spte = 0ull;
|
||||
on fast page fault path:
|
||||
spte.W = 1
|
||||
memory write on the spte:
|
||||
spte.Dirty = 1
|
||||
|
||||
|
||||
else
|
||||
old_spte = xchg(spte, 0ull)
|
||||
|
||||
|
||||
if (old_spte.Accessed == 1)
|
||||
kvm_set_pfn_accessed(spte.pfn);
|
||||
if (old_spte.Dirty == 1)
|
||||
kvm_set_pfn_dirty(spte.pfn);
|
||||
OOPS!!!
|
||||
|
||||
The Dirty bit is lost in this case.
|
||||
|
||||
In order to avoid this kind of issue, we always treat the spte as "volatile"
|
||||
if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
|
||||
the spte is always atomically updated in this case.
|
||||
|
||||
3): flush tlbs due to spte updated
|
||||
If the spte is updated from writable to readonly, we should flush all TLBs,
|
||||
otherwise rmap_write_protect will find a read-only spte, even though the
|
||||
writable spte might be cached on a CPU's TLB.
|
||||
|
||||
As mentioned before, the spte can be updated to writable out of mmu-lock on
|
||||
fast page fault path, in order to easily audit the path, we see if TLBs need
|
||||
be flushed caused by this reason in mmu_spte_update() since this is a common
|
||||
function to update spte (present -> present).
|
||||
|
||||
Since the spte is "volatile" if it can be updated out of mmu-lock, we always
|
||||
atomically update the spte, the race caused by fast page fault can be avoided,
|
||||
See the comments in spte_has_volatile_bits() and mmu_spte_update().
|
||||
|
||||
Lockless Access Tracking:
|
||||
|
||||
This is used for Intel CPUs that are using EPT but do not support the EPT A/D
|
||||
bits. In this case, when the KVM MMU notifier is called to track accesses to a
|
||||
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
|
||||
by clearing the RWX bits in the PTE and storing the original R & X bits in
|
||||
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
|
||||
PTE (using the ignored bit 62). When the VM tries to access the page later on,
|
||||
a fault is generated and the fast page fault mechanism described above is used
|
||||
to atomically restore the PTE to a Present state. The W bit is not saved when
|
||||
the PTE is marked for access tracking and during restoration to the Present
|
||||
state, the W bit is set depending on whether or not it was a write access. If
|
||||
it wasn't, then the W bit will remain clear until a write access happens, at
|
||||
which time it will be set using the Dirty tracking mechanism described above.
|
||||
|
||||
3. Reference
|
||||
------------
|
||||
|
||||
Name: kvm_lock
|
||||
Type: mutex
|
||||
Arch: any
|
||||
Protects: - vm_list
|
||||
|
||||
Name: kvm_count_lock
|
||||
Type: raw_spinlock_t
|
||||
Arch: any
|
||||
Protects: - hardware virtualization enable/disable
|
||||
Comment: 'raw' because hardware enabling/disabling must be atomic /wrt
|
||||
migration.
|
||||
|
||||
Name: kvm_arch::tsc_write_lock
|
||||
Type: raw_spinlock
|
||||
Arch: x86
|
||||
Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
|
||||
- tsc offset in vmcb
|
||||
Comment: 'raw' because updating the tsc offsets must not be preempted.
|
||||
|
||||
Name: kvm->mmu_lock
|
||||
Type: spinlock_t
|
||||
Arch: any
|
||||
Protects: -shadow page/shadow tlb entry
|
||||
Comment: it is a spinlock since it is used in mmu notifier.
|
||||
|
||||
Name: kvm->srcu
|
||||
Type: srcu lock
|
||||
Arch: any
|
||||
Protects: - kvm->memslots
|
||||
- kvm->buses
|
||||
Comment: The srcu read lock must be held while accessing memslots (e.g.
|
||||
when using gfn_to_* functions) and while accessing in-kernel
|
||||
MMIO/PIO address->device structure mapping (kvm->buses).
|
||||
The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
|
||||
if it is needed by multiple functions.
|
||||
|
||||
Name: blocked_vcpu_on_cpu_lock
|
||||
Type: spinlock_t
|
||||
Arch: x86
|
||||
Protects: blocked_vcpu_on_cpu
|
||||
Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
|
||||
When VT-d posted-interrupts is supported and the VM has assigned
|
||||
devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
|
||||
protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues
|
||||
wakeup notification event since external interrupts from the
|
||||
assigned devices happens, we will find the vCPU on the list to
|
||||
wakeup.
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
======================
|
||||
The x86 kvm shadow mmu
|
||||
======================
|
||||
|
||||
@ -7,27 +10,37 @@ physical addresses to host physical addresses.
|
||||
|
||||
The mmu code attempts to satisfy the following requirements:
|
||||
|
||||
- correctness: the guest should not be able to determine that it is running
|
||||
- correctness:
|
||||
the guest should not be able to determine that it is running
|
||||
on an emulated mmu except for timing (we attempt to comply
|
||||
with the specification, not emulate the characteristics of
|
||||
a particular implementation such as tlb size)
|
||||
- security: the guest must not be able to touch host memory not assigned
|
||||
- security:
|
||||
the guest must not be able to touch host memory not assigned
|
||||
to it
|
||||
- performance: minimize the performance penalty imposed by the mmu
|
||||
- scaling: need to scale to large memory and large vcpu guests
|
||||
- hardware: support the full range of x86 virtualization hardware
|
||||
- integration: Linux memory management code must be in control of guest memory
|
||||
- performance:
|
||||
minimize the performance penalty imposed by the mmu
|
||||
- scaling:
|
||||
need to scale to large memory and large vcpu guests
|
||||
- hardware:
|
||||
support the full range of x86 virtualization hardware
|
||||
- integration:
|
||||
Linux memory management code must be in control of guest memory
|
||||
so that swapping, page migration, page merging, transparent
|
||||
hugepages, and similar features work without change
|
||||
- dirty tracking: report writes to guest memory to enable live migration
|
||||
- dirty tracking:
|
||||
report writes to guest memory to enable live migration
|
||||
and framebuffer-based displays
|
||||
- footprint: keep the amount of pinned kernel memory low (most memory
|
||||
- footprint:
|
||||
keep the amount of pinned kernel memory low (most memory
|
||||
should be shrinkable)
|
||||
- reliability: avoid multipage or GFP_ATOMIC allocations
|
||||
- reliability:
|
||||
avoid multipage or GFP_ATOMIC allocations
|
||||
|
||||
Acronyms
|
||||
========
|
||||
|
||||
==== ====================================================================
|
||||
pfn host page frame number
|
||||
hpa host physical address
|
||||
hva host virtual address
|
||||
@ -41,6 +54,7 @@ pte page table entry (used also to refer generically to paging structure
|
||||
gpte guest pte (referring to gfns)
|
||||
spte shadow pte (referring to pfns)
|
||||
tdp two dimensional paging (vendor neutral term for NPT and EPT)
|
||||
==== ====================================================================
|
||||
|
||||
Virtual and real hardware supported
|
||||
===================================
|
||||
@ -90,11 +104,13 @@ Events
|
||||
The mmu is driven by events, some from the guest, some from the host.
|
||||
|
||||
Guest generated events:
|
||||
|
||||
- writes to control registers (especially cr3)
|
||||
- invlpg/invlpga instruction execution
|
||||
- access to missing or protected translations
|
||||
|
||||
Host generated events:
|
||||
|
||||
- changes in the gpa->hpa translation (either through gpa->hva changes or
|
||||
through hva->hpa changes)
|
||||
- memory pressure (the shrinker)
|
||||
@ -117,16 +133,19 @@ Leaf ptes point at guest pages.
|
||||
The following table shows translations encoded by leaf ptes, with higher-level
|
||||
translations in parentheses:
|
||||
|
||||
Non-nested guests:
|
||||
Non-nested guests::
|
||||
|
||||
nonpaging: gpa->hpa
|
||||
paging: gva->gpa->hpa
|
||||
paging, tdp: (gva->)gpa->hpa
|
||||
Nested guests:
|
||||
|
||||
Nested guests::
|
||||
|
||||
non-tdp: ngva->gpa->hpa (*)
|
||||
tdp: (ngva->)ngpa->gpa->hpa
|
||||
|
||||
(*) the guest hypervisor will encode the ngva->gpa translation into its page
|
||||
tables if npt is not present
|
||||
(*) the guest hypervisor will encode the ngva->gpa translation into its page
|
||||
tables if npt is not present
|
||||
|
||||
Shadow pages contain the following information:
|
||||
role.level:
|
||||
@ -291,28 +310,41 @@ Handling a page fault is performed as follows:
|
||||
|
||||
- if the RSV bit of the error code is set, the page fault is caused by guest
|
||||
accessing MMIO and cached MMIO information is available.
|
||||
|
||||
- walk shadow page table
|
||||
- check for valid generation number in the spte (see "Fast invalidation of
|
||||
MMIO sptes" below)
|
||||
- cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
|
||||
vcpu->arch.mmio_gfn, and call the emulator
|
||||
|
||||
- If both P bit and R/W bit of error code are set, this could possibly
|
||||
be handled as a "fast page fault" (fixed without taking the MMU lock). See
|
||||
the description in Documentation/virt/kvm/locking.txt.
|
||||
|
||||
- if needed, walk the guest page tables to determine the guest translation
|
||||
(gva->gpa or ngpa->gpa)
|
||||
|
||||
- if permissions are insufficient, reflect the fault back to the guest
|
||||
|
||||
- determine the host page
|
||||
|
||||
- if this is an mmio request, there is no host page; cache the info to
|
||||
vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn
|
||||
|
||||
- walk the shadow page table to find the spte for the translation,
|
||||
instantiating missing intermediate page tables as necessary
|
||||
|
||||
- If this is an mmio request, cache the mmio info to the spte and set some
|
||||
reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
|
||||
|
||||
- try to unsynchronize the page
|
||||
|
||||
- if successful, we can let the guest continue and modify the gpte
|
||||
|
||||
- emulate the instruction
|
||||
|
||||
- if failed, unshadow the page and let the guest continue
|
||||
|
||||
- update any translations that were modified by the instruction
|
||||
|
||||
invlpg handling:
|
||||
@ -324,10 +356,12 @@ invlpg handling:
|
||||
Guest control register updates:
|
||||
|
||||
- mov to cr3
|
||||
|
||||
- look up new shadow roots
|
||||
- synchronize newly reachable shadow pages
|
||||
|
||||
- mov to cr0/cr4/efer
|
||||
|
||||
- set up mmu context for new paging mode
|
||||
- look up new shadow roots
|
||||
- synchronize newly reachable shadow pages
|
||||
@ -358,6 +392,7 @@ on fault type:
|
||||
(user write faults generate a #PF)
|
||||
|
||||
In the first case there are two additional complications:
|
||||
|
||||
- if CR4.SMEP is enabled: since we've turned the page into a kernel page,
|
||||
the kernel may now execute it. We handle this by also setting spte.nx.
|
||||
If we get a user fetch or read fault, we'll change spte.u=1 and
|
||||
@ -446,4 +481,3 @@ Further reading
|
||||
|
||||
- NPT presentation from KVM Forum 2008
|
||||
http://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
|
||||
|
@ -1,6 +1,10 @@
|
||||
KVM-specific MSRs.
|
||||
Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
|
||||
=====================================================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================
|
||||
KVM-specific MSRs
|
||||
=================
|
||||
|
||||
:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
|
||||
|
||||
KVM makes use of some custom MSRs to service some requests.
|
||||
|
||||
@ -9,34 +13,39 @@ Custom MSRs have a range reserved for them, that goes from
|
||||
but they are deprecated and their use is discouraged.
|
||||
|
||||
Custom MSR list
|
||||
--------
|
||||
---------------
|
||||
|
||||
The current supported Custom MSR list is:
|
||||
|
||||
MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00
|
||||
MSR_KVM_WALL_CLOCK_NEW:
|
||||
0x4b564d00
|
||||
|
||||
data: 4-byte alignment physical address of a memory area which must be
|
||||
data:
|
||||
4-byte alignment physical address of a memory area which must be
|
||||
in guest RAM. This memory is expected to hold a copy of the following
|
||||
structure:
|
||||
structure::
|
||||
|
||||
struct pvclock_wall_clock {
|
||||
struct pvclock_wall_clock {
|
||||
u32 version;
|
||||
u32 sec;
|
||||
u32 nsec;
|
||||
} __attribute__((__packed__));
|
||||
} __attribute__((__packed__));
|
||||
|
||||
whose data will be filled in by the hypervisor. The hypervisor is only
|
||||
guaranteed to update this data at the moment of MSR write.
|
||||
Users that want to reliably query this information more than once have
|
||||
to write more than once to this MSR. Fields have the following meanings:
|
||||
|
||||
version: guest has to check version before and after grabbing
|
||||
version:
|
||||
guest has to check version before and after grabbing
|
||||
time information and check that they are both equal and even.
|
||||
An odd version indicates an in-progress update.
|
||||
|
||||
sec: number of seconds for wallclock at time of boot.
|
||||
sec:
|
||||
number of seconds for wallclock at time of boot.
|
||||
|
||||
nsec: number of nanoseconds for wallclock at time of boot.
|
||||
nsec:
|
||||
number of nanoseconds for wallclock at time of boot.
|
||||
|
||||
In order to get the current wallclock time, the system_time from
|
||||
MSR_KVM_SYSTEM_TIME_NEW needs to be added.
|
||||
@ -47,13 +56,15 @@ MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00
|
||||
Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
|
||||
leaf prior to usage.
|
||||
|
||||
MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
|
||||
MSR_KVM_SYSTEM_TIME_NEW:
|
||||
0x4b564d01
|
||||
|
||||
data: 4-byte aligned physical address of a memory area which must be in
|
||||
data:
|
||||
4-byte aligned physical address of a memory area which must be in
|
||||
guest RAM, plus an enable bit in bit 0. This memory is expected to hold
|
||||
a copy of the following structure:
|
||||
a copy of the following structure::
|
||||
|
||||
struct pvclock_vcpu_time_info {
|
||||
struct pvclock_vcpu_time_info {
|
||||
u32 version;
|
||||
u32 pad0;
|
||||
u64 tsc_timestamp;
|
||||
@ -62,7 +73,7 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
|
||||
s8 tsc_shift;
|
||||
u8 flags;
|
||||
u8 pad[2];
|
||||
} __attribute__((__packed__)); /* 32 bytes */
|
||||
} __attribute__((__packed__)); /* 32 bytes */
|
||||
|
||||
whose data will be filled in by the hypervisor periodically. Only one
|
||||
write, or registration, is needed for each VCPU. The interval between
|
||||
@ -72,23 +83,28 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
|
||||
|
||||
Fields have the following meanings:
|
||||
|
||||
version: guest has to check version before and after grabbing
|
||||
version:
|
||||
guest has to check version before and after grabbing
|
||||
time information and check that they are both equal and even.
|
||||
An odd version indicates an in-progress update.
|
||||
|
||||
tsc_timestamp: the tsc value at the current VCPU at the time
|
||||
tsc_timestamp:
|
||||
the tsc value at the current VCPU at the time
|
||||
of the update of this structure. Guests can subtract this value
|
||||
from current tsc to derive a notion of elapsed time since the
|
||||
structure update.
|
||||
|
||||
system_time: a host notion of monotonic time, including sleep
|
||||
system_time:
|
||||
a host notion of monotonic time, including sleep
|
||||
time at the time this structure was last updated. Unit is
|
||||
nanoseconds.
|
||||
|
||||
tsc_to_system_mul: multiplier to be used when converting
|
||||
tsc_to_system_mul:
|
||||
multiplier to be used when converting
|
||||
tsc-related quantity to nanoseconds
|
||||
|
||||
tsc_shift: shift to be used when converting tsc-related
|
||||
tsc_shift:
|
||||
shift to be used when converting tsc-related
|
||||
quantity to nanoseconds. This shift will ensure that
|
||||
multiplication with tsc_to_system_mul does not overflow.
|
||||
A positive value denotes a left shift, a negative value
|
||||
@ -96,7 +112,7 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
|
||||
|
||||
The conversion from tsc to nanoseconds involves an additional
|
||||
right shift by 32 bits. With this information, guests can
|
||||
derive per-CPU time by doing:
|
||||
derive per-CPU time by doing::
|
||||
|
||||
time = (current_tsc - tsc_timestamp)
|
||||
if (tsc_shift >= 0)
|
||||
@ -106,29 +122,34 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
|
||||
time = (time * tsc_to_system_mul) >> 32
|
||||
time = time + system_time
|
||||
|
||||
flags: bits in this field indicate extended capabilities
|
||||
flags:
|
||||
bits in this field indicate extended capabilities
|
||||
coordinated between the guest and the hypervisor. Availability
|
||||
of specific flags has to be checked in 0x40000001 cpuid leaf.
|
||||
Current flags are:
|
||||
|
||||
flag bit | cpuid bit | meaning
|
||||
-------------------------------------------------------------
|
||||
| | time measures taken across
|
||||
0 | 24 | multiple cpus are guaranteed to
|
||||
| | be monotonic
|
||||
-------------------------------------------------------------
|
||||
| | guest vcpu has been paused by
|
||||
1 | N/A | the host
|
||||
| | See 4.70 in api.txt
|
||||
-------------------------------------------------------------
|
||||
|
||||
+-----------+--------------+----------------------------------+
|
||||
| flag bit | cpuid bit | meaning |
|
||||
+-----------+--------------+----------------------------------+
|
||||
| | | time measures taken across |
|
||||
| 0 | 24 | multiple cpus are guaranteed to |
|
||||
| | | be monotonic |
|
||||
+-----------+--------------+----------------------------------+
|
||||
| | | guest vcpu has been paused by |
|
||||
| 1 | N/A | the host |
|
||||
| | | See 4.70 in api.txt |
|
||||
+-----------+--------------+----------------------------------+
|
||||
|
||||
Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
|
||||
leaf prior to usage.
|
||||
|
||||
|
||||
MSR_KVM_WALL_CLOCK: 0x11
|
||||
MSR_KVM_WALL_CLOCK:
|
||||
0x11
|
||||
|
||||
data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
|
||||
data and functioning:
|
||||
same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
|
||||
|
||||
This MSR falls outside the reserved KVM range and may be removed in the
|
||||
future. Its usage is deprecated.
|
||||
@ -136,9 +157,11 @@ MSR_KVM_WALL_CLOCK: 0x11
|
||||
Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
|
||||
leaf prior to usage.
|
||||
|
||||
MSR_KVM_SYSTEM_TIME: 0x12
|
||||
MSR_KVM_SYSTEM_TIME:
|
||||
0x12
|
||||
|
||||
data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
|
||||
data and functioning:
|
||||
same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
|
||||
|
||||
This MSR falls outside the reserved KVM range and may be removed in the
|
||||
future. Its usage is deprecated.
|
||||
@ -146,7 +169,7 @@ MSR_KVM_SYSTEM_TIME: 0x12
|
||||
Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
|
||||
leaf prior to usage.
|
||||
|
||||
The suggested algorithm for detecting kvmclock presence is then:
|
||||
The suggested algorithm for detecting kvmclock presence is then::
|
||||
|
||||
if (!kvm_para_available()) /* refer to cpuid.txt */
|
||||
return NON_PRESENT;
|
||||
@ -163,8 +186,11 @@ MSR_KVM_SYSTEM_TIME: 0x12
|
||||
} else
|
||||
return NON_PRESENT;
|
||||
|
||||
MSR_KVM_ASYNC_PF_EN: 0x4b564d02
|
||||
data: Bits 63-6 hold 64-byte aligned physical address of a
|
||||
MSR_KVM_ASYNC_PF_EN:
|
||||
0x4b564d02
|
||||
|
||||
data:
|
||||
Bits 63-6 hold 64-byte aligned physical address of a
|
||||
64 byte memory area which must be in guest RAM and must be
|
||||
zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
|
||||
when asynchronous page faults are enabled on the vcpu 0 when
|
||||
@ -200,20 +226,22 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
|
||||
Currently type 2 APF will be always delivered on the same vcpu as
|
||||
type 1 was, but guest should not rely on that.
|
||||
|
||||
MSR_KVM_STEAL_TIME: 0x4b564d03
|
||||
MSR_KVM_STEAL_TIME:
|
||||
0x4b564d03
|
||||
|
||||
data: 64-byte alignment physical address of a memory area which must be
|
||||
data:
|
||||
64-byte alignment physical address of a memory area which must be
|
||||
in guest RAM, plus an enable bit in bit 0. This memory is expected to
|
||||
hold a copy of the following structure:
|
||||
hold a copy of the following structure::
|
||||
|
||||
struct kvm_steal_time {
|
||||
struct kvm_steal_time {
|
||||
__u64 steal;
|
||||
__u32 version;
|
||||
__u32 flags;
|
||||
__u8 preempted;
|
||||
__u8 u8_pad[3];
|
||||
__u32 pad[11];
|
||||
}
|
||||
}
|
||||
|
||||
whose data will be filled in by the hypervisor periodically. Only one
|
||||
write, or registration, is needed for each VCPU. The interval between
|
||||
@ -224,25 +252,32 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
|
||||
|
||||
Fields have the following meanings:
|
||||
|
||||
version: a sequence counter. In other words, guest has to check
|
||||
version:
|
||||
a sequence counter. In other words, guest has to check
|
||||
this field before and after grabbing time information and make
|
||||
sure they are both equal and even. An odd version indicates an
|
||||
in-progress update.
|
||||
|
||||
flags: At this point, always zero. May be used to indicate
|
||||
flags:
|
||||
At this point, always zero. May be used to indicate
|
||||
changes in this structure in the future.
|
||||
|
||||
steal: the amount of time in which this vCPU did not run, in
|
||||
steal:
|
||||
the amount of time in which this vCPU did not run, in
|
||||
nanoseconds. Time during which the vcpu is idle, will not be
|
||||
reported as steal time.
|
||||
|
||||
preempted: indicate the vCPU who owns this struct is running or
|
||||
preempted:
|
||||
indicate the vCPU who owns this struct is running or
|
||||
not. Non-zero values mean the vCPU has been preempted. Zero
|
||||
means the vCPU is not preempted. NOTE, it is always zero if the
|
||||
the hypervisor doesn't support this field.
|
||||
|
||||
MSR_KVM_EOI_EN: 0x4b564d04
|
||||
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
|
||||
MSR_KVM_EOI_EN:
|
||||
0x4b564d04
|
||||
|
||||
data:
|
||||
Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
|
||||
when disabled. Bit 1 is reserved and must be zero. When PV end of
|
||||
interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
|
||||
physical address of a 4 byte memory area which must be in guest RAM and
|
||||
@ -274,11 +309,13 @@ MSR_KVM_EOI_EN: 0x4b564d04
|
||||
clear it using a single CPU instruction, such as test and clear, or
|
||||
compare and exchange.
|
||||
|
||||
MSR_KVM_POLL_CONTROL: 0x4b564d05
|
||||
MSR_KVM_POLL_CONTROL:
|
||||
0x4b564d05
|
||||
|
||||
Control host-side polling.
|
||||
|
||||
data: Bit 0 enables (1) or disables (0) host-side HLT polling logic.
|
||||
data:
|
||||
Bit 0 enables (1) or disables (0) host-side HLT polling logic.
|
||||
|
||||
KVM guests can request the host not to poll on HLT, for example if
|
||||
they are performing polling themselves.
|
||||
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========
|
||||
Nested VMX
|
||||
==========
|
||||
|
||||
@ -41,9 +44,9 @@ No modifications are required to user space (qemu). However, qemu's default
|
||||
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
|
||||
explicitly enabled, by giving qemu one of the following options:
|
||||
|
||||
-cpu host (emulated CPU has all features of the real CPU)
|
||||
- cpu host (emulated CPU has all features of the real CPU)
|
||||
|
||||
-cpu qemu64,+vmx (add just the vmx feature to a named CPU type)
|
||||
- cpu qemu64,+vmx (add just the vmx feature to a named CPU type)
|
||||
|
||||
|
||||
ABIs
|
||||
@ -75,6 +78,8 @@ of this structure changes, this can break live migration across KVM versions.
|
||||
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
|
||||
struct shadow_vmcs is ever changed.
|
||||
|
||||
::
|
||||
|
||||
typedef u64 natural_width;
|
||||
struct __packed vmcs12 {
|
||||
/* According to the Intel spec, a VMCS region must start with
|
||||
@ -220,21 +225,21 @@ Authors
|
||||
-------
|
||||
|
||||
These patches were written by:
|
||||
Abel Gordon, abelg <at> il.ibm.com
|
||||
Nadav Har'El, nyh <at> il.ibm.com
|
||||
Orit Wasserman, oritw <at> il.ibm.com
|
||||
Ben-Ami Yassor, benami <at> il.ibm.com
|
||||
Muli Ben-Yehuda, muli <at> il.ibm.com
|
||||
- Abel Gordon, abelg <at> il.ibm.com
|
||||
- Nadav Har'El, nyh <at> il.ibm.com
|
||||
- Orit Wasserman, oritw <at> il.ibm.com
|
||||
- Ben-Ami Yassor, benami <at> il.ibm.com
|
||||
- Muli Ben-Yehuda, muli <at> il.ibm.com
|
||||
|
||||
With contributions by:
|
||||
Anthony Liguori, aliguori <at> us.ibm.com
|
||||
Mike Day, mdday <at> us.ibm.com
|
||||
Michael Factor, factor <at> il.ibm.com
|
||||
Zvi Dubitzky, dubi <at> il.ibm.com
|
||||
- Anthony Liguori, aliguori <at> us.ibm.com
|
||||
- Mike Day, mdday <at> us.ibm.com
|
||||
- Michael Factor, factor <at> il.ibm.com
|
||||
- Zvi Dubitzky, dubi <at> il.ibm.com
|
||||
|
||||
And valuable reviews by:
|
||||
Avi Kivity, avi <at> redhat.com
|
||||
Gleb Natapov, gleb <at> redhat.com
|
||||
Marcelo Tosatti, mtosatti <at> redhat.com
|
||||
Kevin Tian, kevin.tian <at> intel.com
|
||||
and others.
|
||||
- Avi Kivity, avi <at> redhat.com
|
||||
- Gleb Natapov, gleb <at> redhat.com
|
||||
- Marcelo Tosatti, mtosatti <at> redhat.com
|
||||
- Kevin Tian, kevin.tian <at> intel.com
|
||||
- and others.
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================================
|
||||
The PPC KVM paravirtual interface
|
||||
=================================
|
||||
|
||||
@ -34,8 +37,9 @@ up the hypercall. To call a hypercall, just call these instructions.
|
||||
|
||||
The parameters are as follows:
|
||||
|
||||
======== ================ ================
|
||||
Register IN OUT
|
||||
|
||||
======== ================ ================
|
||||
r0 - volatile
|
||||
r3 1st parameter Return code
|
||||
r4 2nd parameter 1st output value
|
||||
@ -47,6 +51,7 @@ The parameters are as follows:
|
||||
r10 8th parameter 7th output value
|
||||
r11 hypercall number 8th output value
|
||||
r12 - volatile
|
||||
======== ================ ================
|
||||
|
||||
Hypercall definitions are shared in generic code, so the same hypercall numbers
|
||||
apply for x86 and powerpc alike with the exception that each KVM hypercall
|
||||
@ -54,11 +59,13 @@ also needs to be ORed with the KVM vendor code which is (42 << 16).
|
||||
|
||||
Return codes can be as follows:
|
||||
|
||||
==== =========================
|
||||
Code Meaning
|
||||
|
||||
==== =========================
|
||||
0 Success
|
||||
12 Hypercall not implemented
|
||||
<0 Error
|
||||
==== =========================
|
||||
|
||||
The magic page
|
||||
==============
|
||||
@ -72,7 +79,7 @@ desired location. The first parameter indicates the effective address when the
|
||||
MMU is enabled. The second parameter indicates the address in real mode, if
|
||||
applicable to the target. For now, we always map the page to -4096. This way we
|
||||
can access it using absolute load and store functions. The following
|
||||
instruction reads the first field of the magic page:
|
||||
instruction reads the first field of the magic page::
|
||||
|
||||
ld rX, -4096(0)
|
||||
|
||||
@ -93,8 +100,10 @@ a bitmap of available features inside the magic page.
|
||||
|
||||
The following enhancements to the magic page are currently available:
|
||||
|
||||
============================ =======================================
|
||||
KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page
|
||||
KVM_MAGIC_FEAT_MAS0_TO_SPRG7 Maps MASn, ESR, PIR and high SPRGs
|
||||
============================ =======================================
|
||||
|
||||
For enhanced features in the magic page, please check for the existence of the
|
||||
feature before using them!
|
||||
@ -121,8 +130,8 @@ when entering the guest or don't have any impact on the hypervisor's behavior.
|
||||
|
||||
The following bits are safe to be set inside the guest:
|
||||
|
||||
MSR_EE
|
||||
MSR_RI
|
||||
- MSR_EE
|
||||
- MSR_RI
|
||||
|
||||
If any other bit changes in the MSR, please still use mtmsr(d).
|
||||
|
||||
@ -138,9 +147,9 @@ guest. Implementing any of those mappings is optional, as the instruction traps
|
||||
also act on the shared page. So calling privileged instructions still works as
|
||||
before.
|
||||
|
||||
======================= ================================
|
||||
From To
|
||||
==== ==
|
||||
|
||||
======================= ================================
|
||||
mfmsr rX ld rX, magic_page->msr
|
||||
mfsprg rX, 0 ld rX, magic_page->sprg0
|
||||
mfsprg rX, 1 ld rX, magic_page->sprg1
|
||||
@ -173,7 +182,7 @@ mtsrin rX, rY b <special mtsrin section>
|
||||
|
||||
[BookE only]
|
||||
wrteei [0|1] b <special wrteei section>
|
||||
|
||||
======================= ================================
|
||||
|
||||
Some instructions require more logic to determine what's going on than a load
|
||||
or store instruction can deliver. To enable patching of those, we keep some
|
||||
@ -191,6 +200,7 @@ for example.
|
||||
|
||||
Hypercall ABIs in KVM on PowerPC
|
||||
=================================
|
||||
|
||||
1) KVM hypercalls (ePAPR)
|
||||
|
||||
These are ePAPR compliant hypercall implementation (mentioned above). Even
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
================================
|
||||
Review checklist for kvm patches
|
||||
================================
|
||||
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=============================
|
||||
The s390 DIAGNOSE call on KVM
|
||||
=============================
|
||||
|
||||
@ -16,12 +19,12 @@ DIAGNOSE calls by the guest cause a mandatory intercept. This implies
|
||||
all supported DIAGNOSE calls need to be handled by either KVM or its
|
||||
userspace.
|
||||
|
||||
All DIAGNOSE calls supported by KVM use the RS-a format:
|
||||
All DIAGNOSE calls supported by KVM use the RS-a format::
|
||||
|
||||
--------------------------------------
|
||||
| '83' | R1 | R3 | B2 | D2 |
|
||||
--------------------------------------
|
||||
0 8 12 16 20 31
|
||||
--------------------------------------
|
||||
| '83' | R1 | R3 | B2 | D2 |
|
||||
--------------------------------------
|
||||
0 8 12 16 20 31
|
||||
|
||||
The second-operand address (obtained by the base/displacement calculation)
|
||||
is not used to address data. Instead, bits 48-63 of this address specify
|
@ -1,17 +1,21 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Timekeeping Virtualization for X86-Based Architectures
|
||||
======================================================
|
||||
Timekeeping Virtualization for X86-Based Architectures
|
||||
======================================================
|
||||
|
||||
Zachary Amsden <zamsden@redhat.com>
|
||||
Copyright (c) 2010, Red Hat. All rights reserved.
|
||||
:Author: Zachary Amsden <zamsden@redhat.com>
|
||||
:Copyright: (c) 2010, Red Hat. All rights reserved.
|
||||
|
||||
1) Overview
|
||||
2) Timing Devices
|
||||
3) TSC Hardware
|
||||
4) Virtualization Problems
|
||||
.. Contents
|
||||
|
||||
=========================================================================
|
||||
1) Overview
|
||||
2) Timing Devices
|
||||
3) TSC Hardware
|
||||
4) Virtualization Problems
|
||||
|
||||
1) Overview
|
||||
1. Overview
|
||||
===========
|
||||
|
||||
One of the most complicated parts of the X86 platform, and specifically,
|
||||
the virtualization of this platform is the plethora of timing devices available
|
||||
@ -27,15 +31,15 @@ The purpose of this document is to collect data and information relevant to
|
||||
timekeeping which may be difficult to find elsewhere, specifically,
|
||||
information relevant to KVM and hardware-based virtualization.
|
||||
|
||||
=========================================================================
|
||||
|
||||
2) Timing Devices
|
||||
2. Timing Devices
|
||||
=================
|
||||
|
||||
First we discuss the basic hardware devices available. TSC and the related
|
||||
KVM clock are special enough to warrant a full exposition and are described in
|
||||
the following section.
|
||||
|
||||
2.1) i8254 - PIT
|
||||
2.1. i8254 - PIT
|
||||
----------------
|
||||
|
||||
One of the first timer devices available is the programmable interrupt timer,
|
||||
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
|
||||
@ -50,13 +54,13 @@ The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
|
||||
using single or multiple byte access to the I/O ports. There are 6 modes
|
||||
available, but not all modes are available to all timers, as only timer 2
|
||||
has a connected gate input, required for modes 1 and 5. The gate line is
|
||||
controlled by port 61h, bit 0, as illustrated in the following diagram.
|
||||
controlled by port 61h, bit 0, as illustrated in the following diagram::
|
||||
|
||||
-------------- ----------------
|
||||
| | | |
|
||||
| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
|
||||
| Clock | | | |
|
||||
-------------- | +->| GATE TIMER 0 |
|
||||
-------------- ----------------
|
||||
| | | |
|
||||
| 1.1932 MHz|---------->| CLOCK OUT | ---------> IRQ 0
|
||||
| Clock | | | |
|
||||
-------------- | +->| GATE TIMER 0 |
|
||||
| ----------------
|
||||
|
|
||||
| ----------------
|
||||
@ -70,29 +74,33 @@ controlled by port 61h, bit 0, as illustrated in the following diagram.
|
||||
| | |
|
||||
|------>| CLOCK OUT | ---------> Port 61h, bit 5
|
||||
| | |
|
||||
Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____
|
||||
Port 61h, bit 0 -------->| GATE TIMER 2 | \_.---- ____
|
||||
---------------- _| )--|LPF|---Speaker
|
||||
/ *---- \___/
|
||||
Port 61h, bit 1 -----------------------------------/
|
||||
Port 61h, bit 1 ---------------------------------/
|
||||
|
||||
The timer modes are now described.
|
||||
|
||||
Mode 0: Single Timeout. This is a one-shot software timeout that counts down
|
||||
Mode 0: Single Timeout.
|
||||
This is a one-shot software timeout that counts down
|
||||
when the gate is high (always true for timers 0 and 1). When the count
|
||||
reaches zero, the output goes high.
|
||||
|
||||
Mode 1: Triggered One-shot. The output is initially set high. When the gate
|
||||
Mode 1: Triggered One-shot.
|
||||
The output is initially set high. When the gate
|
||||
line is set high, a countdown is initiated (which does not stop if the gate is
|
||||
lowered), during which the output is set low. When the count reaches zero,
|
||||
the output goes high.
|
||||
|
||||
Mode 2: Rate Generator. The output is initially set high. When the countdown
|
||||
Mode 2: Rate Generator.
|
||||
The output is initially set high. When the countdown
|
||||
reaches 1, the output goes low for one count and then returns high. The value
|
||||
is reloaded and the countdown automatically resumes. If the gate line goes
|
||||
low, the count is halted. If the output is low when the gate is lowered, the
|
||||
output automatically goes high (this only affects timer 2).
|
||||
|
||||
Mode 3: Square Wave. This generates a high / low square wave. The count
|
||||
Mode 3: Square Wave.
|
||||
This generates a high / low square wave. The count
|
||||
determines the length of the pulse, which alternates between high and low
|
||||
when zero is reached. The count only proceeds when gate is high and is
|
||||
automatically reloaded on reaching zero. The count is decremented twice at
|
||||
@ -103,12 +111,14 @@ Mode 3: Square Wave. This generates a high / low square wave. The count
|
||||
values are not observed when reading. This is the intended mode for timer 2,
|
||||
which generates sine-like tones by low-pass filtering the square wave output.
|
||||
|
||||
Mode 4: Software Strobe. After programming this mode and loading the counter,
|
||||
Mode 4: Software Strobe.
|
||||
After programming this mode and loading the counter,
|
||||
the output remains high until the counter reaches zero. Then the output
|
||||
goes low for 1 clock cycle and returns high. The counter is not reloaded.
|
||||
Counting only occurs when gate is high.
|
||||
|
||||
Mode 5: Hardware Strobe. After programming and loading the counter, the
|
||||
Mode 5: Hardware Strobe.
|
||||
After programming and loading the counter, the
|
||||
output remains high. When the gate is raised, a countdown is initiated
|
||||
(which does not stop if the gate is lowered). When the counter reaches zero,
|
||||
the output goes low for 1 clock cycle and then returns high. The counter is
|
||||
@ -118,49 +128,49 @@ In addition to normal binary counting, the PIT supports BCD counting. The
|
||||
command port, 0x43 is used to set the counter and mode for each of the three
|
||||
timers.
|
||||
|
||||
PIT commands, issued to port 0x43, using the following bit encoding:
|
||||
PIT commands, issued to port 0x43, using the following bit encoding::
|
||||
|
||||
Bit 7-4: Command (See table below)
|
||||
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
|
||||
Bit 0 : Binary (0) / BCD (1)
|
||||
Bit 7-4: Command (See table below)
|
||||
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
|
||||
Bit 0 : Binary (0) / BCD (1)
|
||||
|
||||
Command table:
|
||||
Command table::
|
||||
|
||||
0000 - Latch Timer 0 count for port 0x40
|
||||
0000 - Latch Timer 0 count for port 0x40
|
||||
sample and hold the count to be read in port 0x40;
|
||||
additional commands ignored until counter is read;
|
||||
mode bits ignored.
|
||||
|
||||
0001 - Set Timer 0 LSB mode for port 0x40
|
||||
0001 - Set Timer 0 LSB mode for port 0x40
|
||||
set timer to read LSB only and force MSB to zero;
|
||||
mode bits set timer mode
|
||||
|
||||
0010 - Set Timer 0 MSB mode for port 0x40
|
||||
0010 - Set Timer 0 MSB mode for port 0x40
|
||||
set timer to read MSB only and force LSB to zero;
|
||||
mode bits set timer mode
|
||||
|
||||
0011 - Set Timer 0 16-bit mode for port 0x40
|
||||
0011 - Set Timer 0 16-bit mode for port 0x40
|
||||
set timer to read / write LSB first, then MSB;
|
||||
mode bits set timer mode
|
||||
|
||||
0100 - Latch Timer 1 count for port 0x41 - as described above
|
||||
0101 - Set Timer 1 LSB mode for port 0x41 - as described above
|
||||
0110 - Set Timer 1 MSB mode for port 0x41 - as described above
|
||||
0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
|
||||
0100 - Latch Timer 1 count for port 0x41 - as described above
|
||||
0101 - Set Timer 1 LSB mode for port 0x41 - as described above
|
||||
0110 - Set Timer 1 MSB mode for port 0x41 - as described above
|
||||
0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
|
||||
|
||||
1000 - Latch Timer 2 count for port 0x42 - as described above
|
||||
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
|
||||
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
|
||||
1011 - Set Timer 2 16-bit mode for port 0x42 as described above
|
||||
1000 - Latch Timer 2 count for port 0x42 - as described above
|
||||
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
|
||||
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
|
||||
1011 - Set Timer 2 16-bit mode for port 0x42 as described above
|
||||
|
||||
1101 - General counter latch
|
||||
1101 - General counter latch
|
||||
Latch combination of counters into corresponding ports
|
||||
Bit 3 = Counter 2
|
||||
Bit 2 = Counter 1
|
||||
Bit 1 = Counter 0
|
||||
Bit 0 = Unused
|
||||
|
||||
1110 - Latch timer status
|
||||
1110 - Latch timer status
|
||||
Latch combination of counter mode into corresponding ports
|
||||
Bit 3 = Counter 2
|
||||
Bit 2 = Counter 1
|
||||
@ -177,7 +187,8 @@ Command table:
|
||||
Bit 3-1 = Mode
|
||||
Bit 0 = Binary (0) / BCD mode (1)
|
||||
|
||||
2.2) RTC
|
||||
2.2. RTC
|
||||
--------
|
||||
|
||||
The second device which was available in the original PC was the MC146818 real
|
||||
time clock. The original device is now obsolete, and usually emulated by the
|
||||
@ -201,21 +212,21 @@ in progress, as indicated in the status register.
|
||||
The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
|
||||
programmed to a 32kHz divider if the RTC is to count seconds.
|
||||
|
||||
This is the RAM map originally used for the RTC/CMOS:
|
||||
This is the RAM map originally used for the RTC/CMOS::
|
||||
|
||||
Location Size Description
|
||||
------------------------------------------
|
||||
00h byte Current second (BCD)
|
||||
01h byte Seconds alarm (BCD)
|
||||
02h byte Current minute (BCD)
|
||||
03h byte Minutes alarm (BCD)
|
||||
04h byte Current hour (BCD)
|
||||
05h byte Hours alarm (BCD)
|
||||
06h byte Current day of week (BCD)
|
||||
07h byte Current day of month (BCD)
|
||||
08h byte Current month (BCD)
|
||||
09h byte Current year (BCD)
|
||||
0Ah byte Register A
|
||||
Location Size Description
|
||||
------------------------------------------
|
||||
00h byte Current second (BCD)
|
||||
01h byte Seconds alarm (BCD)
|
||||
02h byte Current minute (BCD)
|
||||
03h byte Minutes alarm (BCD)
|
||||
04h byte Current hour (BCD)
|
||||
05h byte Hours alarm (BCD)
|
||||
06h byte Current day of week (BCD)
|
||||
07h byte Current day of month (BCD)
|
||||
08h byte Current month (BCD)
|
||||
09h byte Current year (BCD)
|
||||
0Ah byte Register A
|
||||
bit 7 = Update in progress
|
||||
bit 6-4 = Divider for clock
|
||||
000 = 4.194 MHz
|
||||
@ -234,7 +245,7 @@ Location Size Description
|
||||
1101 = 125 mS
|
||||
1110 = 250 mS
|
||||
1111 = 500 mS
|
||||
0Bh byte Register B
|
||||
0Bh byte Register B
|
||||
bit 7 = Run (0) / Halt (1)
|
||||
bit 6 = Periodic interrupt enable
|
||||
bit 5 = Alarm interrupt enable
|
||||
@ -243,19 +254,20 @@ Location Size Description
|
||||
bit 2 = BCD calendar (0) / Binary (1)
|
||||
bit 1 = 12-hour mode (0) / 24-hour mode (1)
|
||||
bit 0 = 0 (DST off) / 1 (DST enabled)
|
||||
OCh byte Register C (read only)
|
||||
OCh byte Register C (read only)
|
||||
bit 7 = interrupt request flag (IRQF)
|
||||
bit 6 = periodic interrupt flag (PF)
|
||||
bit 5 = alarm interrupt flag (AF)
|
||||
bit 4 = update interrupt flag (UF)
|
||||
bit 3-0 = reserved
|
||||
ODh byte Register D (read only)
|
||||
ODh byte Register D (read only)
|
||||
bit 7 = RTC has power
|
||||
bit 6-0 = reserved
|
||||
32h byte Current century BCD (*)
|
||||
32h byte Current century BCD (*)
|
||||
(*) location vendor specific and now determined from ACPI global tables
|
||||
|
||||
2.3) APIC
|
||||
2.3. APIC
|
||||
---------
|
||||
|
||||
On Pentium and later processors, an on-board timer is available to each CPU
|
||||
as part of the Advanced Programmable Interrupt Controller. The APIC is
|
||||
@ -276,7 +288,8 @@ timer is programmed through the LVT (local vector timer) register, is capable
|
||||
of one-shot or periodic operation, and is based on the bus clock divided down
|
||||
by the programmable divider register.
|
||||
|
||||
2.4) HPET
|
||||
2.4. HPET
|
||||
---------
|
||||
|
||||
HPET is quite complex, and was originally intended to replace the PIT / RTC
|
||||
support of the X86 PC. It remains to be seen whether that will be the case, as
|
||||
@ -297,7 +310,8 @@ indicated through ACPI tables by the BIOS.
|
||||
Detailed specification of the HPET is beyond the current scope of this
|
||||
document, as it is also very well documented elsewhere.
|
||||
|
||||
2.5) Offboard Timers
|
||||
2.5. Offboard Timers
|
||||
--------------------
|
||||
|
||||
Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
|
||||
timing chips built into the cards which may have registers which are accessible
|
||||
@ -307,9 +321,8 @@ general frowned upon as not playing by the agreed rules of the game. Such a
|
||||
timer device would require additional support to be virtualized properly and is
|
||||
not considered important at this time as no known operating system does this.
|
||||
|
||||
=========================================================================
|
||||
|
||||
3) TSC Hardware
|
||||
3. TSC Hardware
|
||||
===============
|
||||
|
||||
The TSC or time stamp counter is relatively simple in theory; it counts
|
||||
instruction cycles issued by the processor, which can be used as a measure of
|
||||
@ -340,7 +353,8 @@ allows the guest visible TSC to be offset by a constant. Newer implementations
|
||||
promise to allow the TSC to additionally be scaled, but this hardware is not
|
||||
yet widely available.
|
||||
|
||||
3.1) TSC synchronization
|
||||
3.1. TSC synchronization
|
||||
------------------------
|
||||
|
||||
The TSC is a CPU-local clock in most implementations. This means, on SMP
|
||||
platforms, the TSCs of different CPUs may start at different times depending
|
||||
@ -357,7 +371,8 @@ practice, getting a perfectly synchronized TSC will not be possible unless all
|
||||
values are read from the same clock, which generally only is possible on single
|
||||
socket systems or those with special hardware support.
|
||||
|
||||
3.2) TSC and CPU hotplug
|
||||
3.2. TSC and CPU hotplug
|
||||
------------------------
|
||||
|
||||
As touched on already, CPUs which arrive later than the boot time of the system
|
||||
may not have a TSC value that is synchronized with the rest of the system.
|
||||
@ -367,7 +382,8 @@ a guarantee. This can have the effect of bringing a system from a state where
|
||||
TSC is synchronized back to a state where TSC synchronization flaws, however
|
||||
small, may be exposed to the OS and any virtualization environment.
|
||||
|
||||
3.3) TSC and multi-socket / NUMA
|
||||
3.3. TSC and multi-socket / NUMA
|
||||
--------------------------------
|
||||
|
||||
Multi-socket systems, especially large multi-socket systems are likely to have
|
||||
individual clocksources rather than a single, universally distributed clock.
|
||||
@ -385,7 +401,8 @@ standards for telecommunications and computer equipment.
|
||||
It is recommended not to trust the TSCs to remain synchronized on NUMA or
|
||||
multiple socket systems for these reasons.
|
||||
|
||||
3.4) TSC and C-states
|
||||
3.4. TSC and C-states
|
||||
---------------------
|
||||
|
||||
C-states, or idling states of the processor, especially C1E and deeper sleep
|
||||
states may be problematic for TSC as well. The TSC may stop advancing in such
|
||||
@ -396,7 +413,8 @@ based on CPU and chipset identifications.
|
||||
The TSC in such a case may be corrected by catching it up to a known external
|
||||
clocksource.
|
||||
|
||||
3.5) TSC frequency change / P-states
|
||||
3.5. TSC frequency change / P-states
|
||||
------------------------------------
|
||||
|
||||
To make things slightly more interesting, some CPUs may change frequency. They
|
||||
may or may not run the TSC at the same rate, and because the frequency change
|
||||
@ -416,14 +434,16 @@ other processors. In such cases, the TSC on halted CPUs could advance faster
|
||||
than that of non-halted processors. AMD Turion processors are known to have
|
||||
this problem.
|
||||
|
||||
3.6) TSC and STPCLK / T-states
|
||||
3.6. TSC and STPCLK / T-states
|
||||
------------------------------
|
||||
|
||||
External signals given to the processor may also have the effect of stopping
|
||||
the TSC. This is typically done for thermal emergency power control to prevent
|
||||
an overheating condition, and typically, there is no way to detect that this
|
||||
condition has happened.
|
||||
|
||||
3.7) TSC virtualization - VMX
|
||||
3.7. TSC virtualization - VMX
|
||||
-----------------------------
|
||||
|
||||
VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||||
instructions, which is enough for full virtualization of TSC in any manner. In
|
||||
@ -431,14 +451,16 @@ addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
|
||||
field specified in the VMCS. Special instructions must be used to read and
|
||||
write the VMCS field.
|
||||
|
||||
3.8) TSC virtualization - SVM
|
||||
3.8. TSC virtualization - SVM
|
||||
-----------------------------
|
||||
|
||||
SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||||
instructions, which is enough for full virtualization of TSC in any manner. In
|
||||
addition, SVM allows passing through the host TSC plus an additional offset
|
||||
field specified in the SVM control block.
|
||||
|
||||
3.9) TSC feature bits in Linux
|
||||
3.9. TSC feature bits in Linux
|
||||
------------------------------
|
||||
|
||||
In summary, there is no way to guarantee the TSC remains in perfect
|
||||
synchronization unless it is explicitly guaranteed by the architecture. Even
|
||||
@ -448,13 +470,16 @@ despite being locally consistent.
|
||||
The following feature bits are used by Linux to signal various TSC attributes,
|
||||
but they can only be taken to be meaningful for UP or single node systems.
|
||||
|
||||
X86_FEATURE_TSC : The TSC is available in hardware
|
||||
X86_FEATURE_RDTSCP : The RDTSCP instruction is available
|
||||
X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
|
||||
X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
|
||||
X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
|
||||
========================= =======================================
|
||||
X86_FEATURE_TSC The TSC is available in hardware
|
||||
X86_FEATURE_RDTSCP The RDTSCP instruction is available
|
||||
X86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states
|
||||
X86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states
|
||||
X86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware)
|
||||
========================= =======================================
|
||||
|
||||
4) Virtualization Problems
|
||||
4. Virtualization Problems
|
||||
==========================
|
||||
|
||||
Timekeeping is especially problematic for virtualization because a number of
|
||||
challenges arise. The most obvious problem is that time is now shared between
|
||||
@ -473,7 +498,8 @@ BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
|
||||
cause similar problems to virtualization makes it a good justification for
|
||||
solving many of these problems on bare metal.
|
||||
|
||||
4.1) Interrupt clocking
|
||||
4.1. Interrupt clocking
|
||||
-----------------------
|
||||
|
||||
One of the most immediate problems that occurs with legacy operating systems
|
||||
is that the system timekeeping routines are often designed to keep track of
|
||||
@ -502,7 +528,8 @@ thus requires interrupt slewing to keep proper time. It does use a low enough
|
||||
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
|
||||
practice.
|
||||
|
||||
4.2) TSC sampling and serialization
|
||||
4.2. TSC sampling and serialization
|
||||
-----------------------------------
|
||||
|
||||
As the highest precision time source available, the cycle counter of the CPU
|
||||
has aroused much interest from developers. As explained above, this timer has
|
||||
@ -524,7 +551,8 @@ it may be necessary for an implementation to guard against "backwards" reads of
|
||||
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
|
||||
system.
|
||||
|
||||
4.3) Timespec aliasing
|
||||
4.3. Timespec aliasing
|
||||
----------------------
|
||||
|
||||
Additionally, this lack of serialization from the TSC poses another challenge
|
||||
when using results of the TSC when measured against another time source. As
|
||||
@ -548,7 +576,8 @@ This aliasing requires care in the computation and recalibration of kvmclock
|
||||
and any other values derived from TSC computation (such as TSC virtualization
|
||||
itself).
|
||||
|
||||
4.4) Migration
|
||||
4.4. Migration
|
||||
--------------
|
||||
|
||||
Migration of a virtual machine raises problems for timekeeping in two ways.
|
||||
First, the migration itself may take time, during which interrupts cannot be
|
||||
@ -566,7 +595,8 @@ always be caught up to the original rate. KVM clock avoids these problems by
|
||||
simply storing multipliers and offsets against the TSC for the guest to convert
|
||||
back into nanosecond resolution values.
|
||||
|
||||
4.5) Scheduling
|
||||
4.5. Scheduling
|
||||
---------------
|
||||
|
||||
Since scheduling may be based on precise timing and firing of interrupts, the
|
||||
scheduling algorithms of an operating system may be adversely affected by
|
||||
@ -579,7 +609,8 @@ In an attempt to work around this, several implementations have provided a
|
||||
paravirtualized scheduler clock, which reveals the true amount of CPU time for
|
||||
which a virtual machine has been running.
|
||||
|
||||
4.6) Watchdogs
|
||||
4.6. Watchdogs
|
||||
--------------
|
||||
|
||||
Watchdog timers, such as the lock detector in Linux may fire accidentally when
|
||||
running under hardware virtualization due to timer interrupts being delayed or
|
||||
@ -587,7 +618,8 @@ misinterpretation of the passage of real time. Usually, these warnings are
|
||||
spurious and can be ignored, but in some circumstances it may be necessary to
|
||||
disable such detection.
|
||||
|
||||
4.7) Delays and precision timing
|
||||
4.7. Delays and precision timing
|
||||
--------------------------------
|
||||
|
||||
Precise timing and delays may not be possible in a virtualized system. This
|
||||
can happen if the system is controlling physical hardware, or issues delays to
|
||||
@ -600,7 +632,8 @@ The second issue may cause performance problems, but this is unlikely to be a
|
||||
significant issue. In many cases these delays may be eliminated through
|
||||
configuration or paravirtualization.
|
||||
|
||||
4.8) Covert channels and leaks
|
||||
4.8. Covert channels and leaks
|
||||
------------------------------
|
||||
|
||||
In addition to the above problems, time information will inevitably leak to the
|
||||
guest about the host in anything but a perfect implementation of virtualized
|
File diff suppressed because it is too large
Load Diff
@ -781,9 +781,19 @@ struct kvm_vcpu_arch {
|
||||
u64 msr_kvm_poll_control;
|
||||
|
||||
/*
|
||||
* Indicate whether the access faults on its page table in guest
|
||||
* which is set when fix page fault and used to detect unhandeable
|
||||
* instruction.
|
||||
* Indicates the guest is trying to write a gfn that contains one or
|
||||
* more of the PTEs used to translate the write itself, i.e. the access
|
||||
* is changing its own translation in the guest page tables. KVM exits
|
||||
* to userspace if emulation of the faulting instruction fails and this
|
||||
* flag is set, as KVM cannot make forward progress.
|
||||
*
|
||||
* If emulation fails for a write to guest page tables, KVM unprotects
|
||||
* (zaps) the shadow page for the target gfn and resumes the guest to
|
||||
* retry the non-emulatable instruction (on hardware). Unprotecting the
|
||||
* gfn doesn't allow forward progress for a self-changing access because
|
||||
* doing so also zaps the translation for the gfn, i.e. retrying the
|
||||
* instruction will hit a !PRESENT fault, which results in a new shadow
|
||||
* page and sends KVM back to square one.
|
||||
*/
|
||||
bool write_fault_to_shadow_pgtable;
|
||||
|
||||
|
@ -1080,9 +1080,6 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
|
||||
result = 1;
|
||||
/* assumes that there are only KVM_APIC_INIT/SIPI */
|
||||
apic->pending_events = (1UL << KVM_APIC_INIT);
|
||||
/* make sure pending_events is visible before sending
|
||||
* the request */
|
||||
smp_wmb();
|
||||
kvm_make_request(KVM_REQ_EVENT, vcpu);
|
||||
kvm_vcpu_kick(vcpu);
|
||||
}
|
||||
|
@ -102,6 +102,19 @@ static inline void kvm_mmu_load_cr3(struct kvm_vcpu *vcpu)
|
||||
kvm_get_active_pcid(vcpu));
|
||||
}
|
||||
|
||||
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
|
||||
bool prefault);
|
||||
|
||||
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
|
||||
u32 err, bool prefault)
|
||||
{
|
||||
#ifdef CONFIG_RETPOLINE
|
||||
if (likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault))
|
||||
return kvm_tdp_page_fault(vcpu, cr2_or_gpa, err, prefault);
|
||||
#endif
|
||||
return vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa, err, prefault);
|
||||
}
|
||||
|
||||
/*
|
||||
* Currently, we have two sorts of write-protection, a) the first one
|
||||
* write-protects guest page to sync the guest modification, b) another one is
|
||||
|
@ -4219,8 +4219,8 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
|
||||
|
||||
static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
|
||||
bool prefault)
|
||||
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
|
||||
bool prefault)
|
||||
{
|
||||
int max_level;
|
||||
|
||||
@ -4925,7 +4925,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
|
||||
return;
|
||||
|
||||
context->mmu_role.as_u64 = new_role.as_u64;
|
||||
context->page_fault = tdp_page_fault;
|
||||
context->page_fault = kvm_tdp_page_fault;
|
||||
context->sync_page = nonpaging_sync_page;
|
||||
context->invlpg = nonpaging_invlpg;
|
||||
context->update_pte = nonpaging_update_pte;
|
||||
@ -5436,9 +5436,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
|
||||
}
|
||||
|
||||
if (r == RET_PF_INVALID) {
|
||||
r = vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa,
|
||||
lower_32_bits(error_code),
|
||||
false);
|
||||
r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
|
||||
lower_32_bits(error_code), false);
|
||||
WARN_ON(r == RET_PF_INVALID);
|
||||
}
|
||||
|
||||
|
@ -33,7 +33,7 @@
|
||||
#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
|
||||
#define PT_HAVE_ACCESSED_DIRTY(mmu) true
|
||||
#ifdef CONFIG_X86_64
|
||||
#define PT_MAX_FULL_LEVELS 4
|
||||
#define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL
|
||||
#define CMPXCHG cmpxchg
|
||||
#else
|
||||
#define CMPXCHG cmpxchg64
|
||||
|
@ -2175,7 +2175,6 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
|
||||
u32 dummy;
|
||||
u32 eax = 1;
|
||||
|
||||
vcpu->arch.microcode_version = 0x01000065;
|
||||
svm->spec_ctrl = 0;
|
||||
svm->virt_spec_ctrl = 0;
|
||||
|
||||
@ -2266,6 +2265,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
|
||||
init_vmcb(svm);
|
||||
|
||||
svm_init_osvw(vcpu);
|
||||
vcpu->arch.microcode_version = 0x01000065;
|
||||
|
||||
return 0;
|
||||
|
||||
|
@ -544,7 +544,8 @@ static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
|
||||
}
|
||||
}
|
||||
|
||||
static inline void enable_x2apic_msr_intercepts(unsigned long *msr_bitmap) {
|
||||
static inline void enable_x2apic_msr_intercepts(unsigned long *msr_bitmap)
|
||||
{
|
||||
int msr;
|
||||
|
||||
for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
|
||||
@ -1981,7 +1982,7 @@ static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu,
|
||||
}
|
||||
|
||||
/*
|
||||
* Clean fields data can't de used on VMLAUNCH and when we switch
|
||||
* Clean fields data can't be used on VMLAUNCH and when we switch
|
||||
* between different L2 guests as KVM keeps a single VMCS12 per L1.
|
||||
*/
|
||||
if (from_launch || evmcs_gpa_changed)
|
||||
@ -3575,6 +3576,33 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
|
||||
nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, intr_info, exit_qual);
|
||||
}
|
||||
|
||||
/*
|
||||
* Returns true if a debug trap is pending delivery.
|
||||
*
|
||||
* In KVM, debug traps bear an exception payload. As such, the class of a #DB
|
||||
* exception may be inferred from the presence of an exception payload.
|
||||
*/
|
||||
static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
|
||||
{
|
||||
return vcpu->arch.exception.pending &&
|
||||
vcpu->arch.exception.nr == DB_VECTOR &&
|
||||
vcpu->arch.exception.payload;
|
||||
}
|
||||
|
||||
/*
|
||||
* Certain VM-exits set the 'pending debug exceptions' field to indicate a
|
||||
* recognized #DB (data or single-step) that has yet to be delivered. Since KVM
|
||||
* represents these debug traps with a payload that is said to be compatible
|
||||
* with the 'pending debug exceptions' field, write the payload to the VMCS
|
||||
* field if a VM-exit is delivered before the debug trap.
|
||||
*/
|
||||
static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
|
||||
{
|
||||
if (vmx_pending_dbg_trap(vcpu))
|
||||
vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
|
||||
vcpu->arch.exception.payload);
|
||||
}
|
||||
|
||||
static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
|
||||
{
|
||||
struct vcpu_vmx *vmx = to_vmx(vcpu);
|
||||
@ -3587,6 +3615,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
|
||||
test_bit(KVM_APIC_INIT, &apic->pending_events)) {
|
||||
if (block_nested_events)
|
||||
return -EBUSY;
|
||||
nested_vmx_update_pending_dbg(vcpu);
|
||||
clear_bit(KVM_APIC_INIT, &apic->pending_events);
|
||||
nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
|
||||
return 0;
|
||||
|
@ -2947,6 +2947,9 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
|
||||
|
||||
static int get_ept_level(struct kvm_vcpu *vcpu)
|
||||
{
|
||||
/* Nested EPT currently only supports 4-level walks. */
|
||||
if (is_guest_mode(vcpu) && nested_cpu_has_ept(get_vmcs12(vcpu)))
|
||||
return 4;
|
||||
if (cpu_has_vmx_ept_5levels() && (cpuid_maxphyaddr(vcpu) > 48))
|
||||
return 5;
|
||||
return 4;
|
||||
@ -4238,7 +4241,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
|
||||
|
||||
vmx->msr_ia32_umwait_control = 0;
|
||||
|
||||
vcpu->arch.microcode_version = 0x100000000ULL;
|
||||
vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
|
||||
vmx->hv_deadline_tsc = -1;
|
||||
kvm_set_cr8(vcpu, 0);
|
||||
@ -6763,6 +6765,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
|
||||
vmx->nested.posted_intr_nv = -1;
|
||||
vmx->nested.current_vmptr = -1ull;
|
||||
|
||||
vcpu->arch.microcode_version = 0x100000000ULL;
|
||||
vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
|
||||
|
||||
/*
|
||||
|
@ -438,6 +438,14 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
|
||||
* for #DB exceptions under VMX.
|
||||
*/
|
||||
vcpu->arch.dr6 ^= payload & DR6_RTM;
|
||||
|
||||
/*
|
||||
* The #DB payload is defined as compatible with the 'pending
|
||||
* debug exceptions' field under VMX, not DR6. While bit 12 is
|
||||
* defined in the 'pending debug exceptions' field (enabled
|
||||
* breakpoint), it is reserved and must be zero in DR6.
|
||||
*/
|
||||
vcpu->arch.dr6 &= ~BIT(12);
|
||||
break;
|
||||
case PF_VECTOR:
|
||||
vcpu->arch.cr2 = payload;
|
||||
@ -490,19 +498,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
|
||||
vcpu->arch.exception.error_code = error_code;
|
||||
vcpu->arch.exception.has_payload = has_payload;
|
||||
vcpu->arch.exception.payload = payload;
|
||||
/*
|
||||
* In guest mode, payload delivery should be deferred,
|
||||
* so that the L1 hypervisor can intercept #PF before
|
||||
* CR2 is modified (or intercept #DB before DR6 is
|
||||
* modified under nVMX). However, for ABI
|
||||
* compatibility with KVM_GET_VCPU_EVENTS and
|
||||
* KVM_SET_VCPU_EVENTS, we can't delay payload
|
||||
* delivery unless userspace has enabled this
|
||||
* functionality via the per-VM capability,
|
||||
* KVM_CAP_EXCEPTION_PAYLOAD.
|
||||
*/
|
||||
if (!vcpu->kvm->arch.exception_payload_enabled ||
|
||||
!is_guest_mode(vcpu))
|
||||
if (!is_guest_mode(vcpu))
|
||||
kvm_deliver_exception_payload(vcpu);
|
||||
return;
|
||||
}
|
||||
@ -2448,7 +2444,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
|
||||
vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
|
||||
vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
|
||||
vcpu->last_guest_tsc = tsc_timestamp;
|
||||
WARN_ON(vcpu->hv_clock.system_time < 0);
|
||||
WARN_ON((s64)vcpu->hv_clock.system_time < 0);
|
||||
|
||||
/* If the host uses TSC clocksource, then it is stable */
|
||||
pvclock_flags = 0;
|
||||
@ -3795,6 +3791,21 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
|
||||
{
|
||||
process_nmi(vcpu);
|
||||
|
||||
/*
|
||||
* In guest mode, payload delivery should be deferred,
|
||||
* so that the L1 hypervisor can intercept #PF before
|
||||
* CR2 is modified (or intercept #DB before DR6 is
|
||||
* modified under nVMX). Unless the per-VM capability,
|
||||
* KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
|
||||
* an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
|
||||
* opportunistically defer the exception payload, deliver it if the
|
||||
* capability hasn't been requested before processing a
|
||||
* KVM_GET_VCPU_EVENTS.
|
||||
*/
|
||||
if (!vcpu->kvm->arch.exception_payload_enabled &&
|
||||
vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
|
||||
kvm_deliver_exception_payload(vcpu);
|
||||
|
||||
/*
|
||||
* The API doesn't provide the instruction length for software
|
||||
* exceptions, so don't report them. As long as the guest RIP
|
||||
@ -8942,7 +8953,6 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
|
||||
|
||||
kvm_rip_write(vcpu, ctxt->eip);
|
||||
kvm_set_rflags(vcpu, ctxt->eflags);
|
||||
kvm_make_request(KVM_REQ_EVENT, vcpu);
|
||||
return 1;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(kvm_task_switch);
|
||||
@ -10182,7 +10192,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
|
||||
work->arch.cr3 != vcpu->arch.mmu->get_cr3(vcpu))
|
||||
return;
|
||||
|
||||
vcpu->arch.mmu->page_fault(vcpu, work->cr2_or_gpa, 0, true);
|
||||
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
|
||||
}
|
||||
|
||||
static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
|
||||
|
@ -8,7 +8,7 @@ KSFT_KHDR_INSTALL := 1
|
||||
UNAME_M := $(shell uname -m)
|
||||
|
||||
LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/sparsebit.c
|
||||
LIBKVM_x86_64 = lib/x86_64/processor.c lib/x86_64/vmx.c lib/x86_64/ucall.c
|
||||
LIBKVM_x86_64 = lib/x86_64/processor.c lib/x86_64/vmx.c lib/x86_64/svm.c lib/x86_64/ucall.c
|
||||
LIBKVM_aarch64 = lib/aarch64/processor.c lib/aarch64/ucall.c
|
||||
LIBKVM_s390x = lib/s390x/processor.c lib/s390x/ucall.c
|
||||
|
||||
@ -26,6 +26,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
|
||||
TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
|
||||
TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
|
||||
TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
|
||||
TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test
|
||||
TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
|
||||
TEST_GEN_PROGS_x86_64 += dirty_log_test
|
||||
TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
|
||||
|
@ -36,24 +36,24 @@
|
||||
#define X86_CR4_SMAP (1ul << 21)
|
||||
#define X86_CR4_PKE (1ul << 22)
|
||||
|
||||
/* The enum values match the intruction encoding of each register */
|
||||
enum x86_register {
|
||||
RAX = 0,
|
||||
RCX,
|
||||
RDX,
|
||||
RBX,
|
||||
RSP,
|
||||
RBP,
|
||||
RSI,
|
||||
RDI,
|
||||
R8,
|
||||
R9,
|
||||
R10,
|
||||
R11,
|
||||
R12,
|
||||
R13,
|
||||
R14,
|
||||
R15,
|
||||
/* General Registers in 64-Bit Mode */
|
||||
struct gpr64_regs {
|
||||
u64 rax;
|
||||
u64 rcx;
|
||||
u64 rdx;
|
||||
u64 rbx;
|
||||
u64 rsp;
|
||||
u64 rbp;
|
||||
u64 rsi;
|
||||
u64 rdi;
|
||||
u64 r8;
|
||||
u64 r9;
|
||||
u64 r10;
|
||||
u64 r11;
|
||||
u64 r12;
|
||||
u64 r13;
|
||||
u64 r14;
|
||||
u64 r15;
|
||||
};
|
||||
|
||||
struct desc64 {
|
||||
@ -220,20 +220,20 @@ static inline void set_cr4(uint64_t val)
|
||||
__asm__ __volatile__("mov %0, %%cr4" : : "r" (val) : "memory");
|
||||
}
|
||||
|
||||
static inline uint64_t get_gdt_base(void)
|
||||
static inline struct desc_ptr get_gdt(void)
|
||||
{
|
||||
struct desc_ptr gdt;
|
||||
__asm__ __volatile__("sgdt %[gdt]"
|
||||
: /* output */ [gdt]"=m"(gdt));
|
||||
return gdt.address;
|
||||
return gdt;
|
||||
}
|
||||
|
||||
static inline uint64_t get_idt_base(void)
|
||||
static inline struct desc_ptr get_idt(void)
|
||||
{
|
||||
struct desc_ptr idt;
|
||||
__asm__ __volatile__("sidt %[idt]"
|
||||
: /* output */ [idt]"=m"(idt));
|
||||
return idt.address;
|
||||
return idt;
|
||||
}
|
||||
|
||||
#define SET_XMM(__var, __xmm) \
|
||||
|
297
tools/testing/selftests/kvm/include/x86_64/svm.h
Normal file
297
tools/testing/selftests/kvm/include/x86_64/svm.h
Normal file
@ -0,0 +1,297 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
/*
|
||||
* tools/testing/selftests/kvm/include/x86_64/svm.h
|
||||
* This is a copy of arch/x86/include/asm/svm.h
|
||||
*
|
||||
*/
|
||||
|
||||
#ifndef SELFTEST_KVM_SVM_H
|
||||
#define SELFTEST_KVM_SVM_H
|
||||
|
||||
enum {
|
||||
INTERCEPT_INTR,
|
||||
INTERCEPT_NMI,
|
||||
INTERCEPT_SMI,
|
||||
INTERCEPT_INIT,
|
||||
INTERCEPT_VINTR,
|
||||
INTERCEPT_SELECTIVE_CR0,
|
||||
INTERCEPT_STORE_IDTR,
|
||||
INTERCEPT_STORE_GDTR,
|
||||
INTERCEPT_STORE_LDTR,
|
||||
INTERCEPT_STORE_TR,
|
||||
INTERCEPT_LOAD_IDTR,
|
||||
INTERCEPT_LOAD_GDTR,
|
||||
INTERCEPT_LOAD_LDTR,
|
||||
INTERCEPT_LOAD_TR,
|
||||
INTERCEPT_RDTSC,
|
||||
INTERCEPT_RDPMC,
|
||||
INTERCEPT_PUSHF,
|
||||
INTERCEPT_POPF,
|
||||
INTERCEPT_CPUID,
|
||||
INTERCEPT_RSM,
|
||||
INTERCEPT_IRET,
|
||||
INTERCEPT_INTn,
|
||||
INTERCEPT_INVD,
|
||||
INTERCEPT_PAUSE,
|
||||
INTERCEPT_HLT,
|
||||
INTERCEPT_INVLPG,
|
||||
INTERCEPT_INVLPGA,
|
||||
INTERCEPT_IOIO_PROT,
|
||||
INTERCEPT_MSR_PROT,
|
||||
INTERCEPT_TASK_SWITCH,
|
||||
INTERCEPT_FERR_FREEZE,
|
||||
INTERCEPT_SHUTDOWN,
|
||||
INTERCEPT_VMRUN,
|
||||
INTERCEPT_VMMCALL,
|
||||
INTERCEPT_VMLOAD,
|
||||
INTERCEPT_VMSAVE,
|
||||
INTERCEPT_STGI,
|
||||
INTERCEPT_CLGI,
|
||||
INTERCEPT_SKINIT,
|
||||
INTERCEPT_RDTSCP,
|
||||
INTERCEPT_ICEBP,
|
||||
INTERCEPT_WBINVD,
|
||||
INTERCEPT_MONITOR,
|
||||
INTERCEPT_MWAIT,
|
||||
INTERCEPT_MWAIT_COND,
|
||||
INTERCEPT_XSETBV,
|
||||
INTERCEPT_RDPRU,
|
||||
};
|
||||
|
||||
|
||||
struct __attribute__ ((__packed__)) vmcb_control_area {
|
||||
u32 intercept_cr;
|
||||
u32 intercept_dr;
|
||||
u32 intercept_exceptions;
|
||||
u64 intercept;
|
||||
u8 reserved_1[40];
|
||||
u16 pause_filter_thresh;
|
||||
u16 pause_filter_count;
|
||||
u64 iopm_base_pa;
|
||||
u64 msrpm_base_pa;
|
||||
u64 tsc_offset;
|
||||
u32 asid;
|
||||
u8 tlb_ctl;
|
||||
u8 reserved_2[3];
|
||||
u32 int_ctl;
|
||||
u32 int_vector;
|
||||
u32 int_state;
|
||||
u8 reserved_3[4];
|
||||
u32 exit_code;
|
||||
u32 exit_code_hi;
|
||||
u64 exit_info_1;
|
||||
u64 exit_info_2;
|
||||
u32 exit_int_info;
|
||||
u32 exit_int_info_err;
|
||||
u64 nested_ctl;
|
||||
u64 avic_vapic_bar;
|
||||
u8 reserved_4[8];
|
||||
u32 event_inj;
|
||||
u32 event_inj_err;
|
||||
u64 nested_cr3;
|
||||
u64 virt_ext;
|
||||
u32 clean;
|
||||
u32 reserved_5;
|
||||
u64 next_rip;
|
||||
u8 insn_len;
|
||||
u8 insn_bytes[15];
|
||||
u64 avic_backing_page; /* Offset 0xe0 */
|
||||
u8 reserved_6[8]; /* Offset 0xe8 */
|
||||
u64 avic_logical_id; /* Offset 0xf0 */
|
||||
u64 avic_physical_id; /* Offset 0xf8 */
|
||||
u8 reserved_7[768];
|
||||
};
|
||||
|
||||
|
||||
#define TLB_CONTROL_DO_NOTHING 0
|
||||
#define TLB_CONTROL_FLUSH_ALL_ASID 1
|
||||
#define TLB_CONTROL_FLUSH_ASID 3
|
||||
#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
|
||||
|
||||
#define V_TPR_MASK 0x0f
|
||||
|
||||
#define V_IRQ_SHIFT 8
|
||||
#define V_IRQ_MASK (1 << V_IRQ_SHIFT)
|
||||
|
||||
#define V_GIF_SHIFT 9
|
||||
#define V_GIF_MASK (1 << V_GIF_SHIFT)
|
||||
|
||||
#define V_INTR_PRIO_SHIFT 16
|
||||
#define V_INTR_PRIO_MASK (0x0f << V_INTR_PRIO_SHIFT)
|
||||
|
||||
#define V_IGN_TPR_SHIFT 20
|
||||
#define V_IGN_TPR_MASK (1 << V_IGN_TPR_SHIFT)
|
||||
|
||||
#define V_INTR_MASKING_SHIFT 24
|
||||
#define V_INTR_MASKING_MASK (1 << V_INTR_MASKING_SHIFT)
|
||||
|
||||
#define V_GIF_ENABLE_SHIFT 25
|
||||
#define V_GIF_ENABLE_MASK (1 << V_GIF_ENABLE_SHIFT)
|
||||
|
||||
#define AVIC_ENABLE_SHIFT 31
|
||||
#define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT)
|
||||
|
||||
#define LBR_CTL_ENABLE_MASK BIT_ULL(0)
|
||||
#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1)
|
||||
|
||||
#define SVM_INTERRUPT_SHADOW_MASK 1
|
||||
|
||||
#define SVM_IOIO_STR_SHIFT 2
|
||||
#define SVM_IOIO_REP_SHIFT 3
|
||||
#define SVM_IOIO_SIZE_SHIFT 4
|
||||
#define SVM_IOIO_ASIZE_SHIFT 7
|
||||
|
||||
#define SVM_IOIO_TYPE_MASK 1
|
||||
#define SVM_IOIO_STR_MASK (1 << SVM_IOIO_STR_SHIFT)
|
||||
#define SVM_IOIO_REP_MASK (1 << SVM_IOIO_REP_SHIFT)
|
||||
#define SVM_IOIO_SIZE_MASK (7 << SVM_IOIO_SIZE_SHIFT)
|
||||
#define SVM_IOIO_ASIZE_MASK (7 << SVM_IOIO_ASIZE_SHIFT)
|
||||
|
||||
#define SVM_VM_CR_VALID_MASK 0x001fULL
|
||||
#define SVM_VM_CR_SVM_LOCK_MASK 0x0008ULL
|
||||
#define SVM_VM_CR_SVM_DIS_MASK 0x0010ULL
|
||||
|
||||
#define SVM_NESTED_CTL_NP_ENABLE BIT(0)
|
||||
#define SVM_NESTED_CTL_SEV_ENABLE BIT(1)
|
||||
|
||||
struct __attribute__ ((__packed__)) vmcb_seg {
|
||||
u16 selector;
|
||||
u16 attrib;
|
||||
u32 limit;
|
||||
u64 base;
|
||||
};
|
||||
|
||||
struct __attribute__ ((__packed__)) vmcb_save_area {
|
||||
struct vmcb_seg es;
|
||||
struct vmcb_seg cs;
|
||||
struct vmcb_seg ss;
|
||||
struct vmcb_seg ds;
|
||||
struct vmcb_seg fs;
|
||||
struct vmcb_seg gs;
|
||||
struct vmcb_seg gdtr;
|
||||
struct vmcb_seg ldtr;
|
||||
struct vmcb_seg idtr;
|
||||
struct vmcb_seg tr;
|
||||
u8 reserved_1[43];
|
||||
u8 cpl;
|
||||
u8 reserved_2[4];
|
||||
u64 efer;
|
||||
u8 reserved_3[112];
|
||||
u64 cr4;
|
||||
u64 cr3;
|
||||
u64 cr0;
|
||||
u64 dr7;
|
||||
u64 dr6;
|
||||
u64 rflags;
|
||||
u64 rip;
|
||||
u8 reserved_4[88];
|
||||
u64 rsp;
|
||||
u8 reserved_5[24];
|
||||
u64 rax;
|
||||
u64 star;
|
||||
u64 lstar;
|
||||
u64 cstar;
|
||||
u64 sfmask;
|
||||
u64 kernel_gs_base;
|
||||
u64 sysenter_cs;
|
||||
u64 sysenter_esp;
|
||||
u64 sysenter_eip;
|
||||
u64 cr2;
|
||||
u8 reserved_6[32];
|
||||
u64 g_pat;
|
||||
u64 dbgctl;
|
||||
u64 br_from;
|
||||
u64 br_to;
|
||||
u64 last_excp_from;
|
||||
u64 last_excp_to;
|
||||
};
|
||||
|
||||
struct __attribute__ ((__packed__)) vmcb {
|
||||
struct vmcb_control_area control;
|
||||
struct vmcb_save_area save;
|
||||
};
|
||||
|
||||
#define SVM_CPUID_FUNC 0x8000000a
|
||||
|
||||
#define SVM_VM_CR_SVM_DISABLE 4
|
||||
|
||||
#define SVM_SELECTOR_S_SHIFT 4
|
||||
#define SVM_SELECTOR_DPL_SHIFT 5
|
||||
#define SVM_SELECTOR_P_SHIFT 7
|
||||
#define SVM_SELECTOR_AVL_SHIFT 8
|
||||
#define SVM_SELECTOR_L_SHIFT 9
|
||||
#define SVM_SELECTOR_DB_SHIFT 10
|
||||
#define SVM_SELECTOR_G_SHIFT 11
|
||||
|
||||
#define SVM_SELECTOR_TYPE_MASK (0xf)
|
||||
#define SVM_SELECTOR_S_MASK (1 << SVM_SELECTOR_S_SHIFT)
|
||||
#define SVM_SELECTOR_DPL_MASK (3 << SVM_SELECTOR_DPL_SHIFT)
|
||||
#define SVM_SELECTOR_P_MASK (1 << SVM_SELECTOR_P_SHIFT)
|
||||
#define SVM_SELECTOR_AVL_MASK (1 << SVM_SELECTOR_AVL_SHIFT)
|
||||
#define SVM_SELECTOR_L_MASK (1 << SVM_SELECTOR_L_SHIFT)
|
||||
#define SVM_SELECTOR_DB_MASK (1 << SVM_SELECTOR_DB_SHIFT)
|
||||
#define SVM_SELECTOR_G_MASK (1 << SVM_SELECTOR_G_SHIFT)
|
||||
|
||||
#define SVM_SELECTOR_WRITE_MASK (1 << 1)
|
||||
#define SVM_SELECTOR_READ_MASK SVM_SELECTOR_WRITE_MASK
|
||||
#define SVM_SELECTOR_CODE_MASK (1 << 3)
|
||||
|
||||
#define INTERCEPT_CR0_READ 0
|
||||
#define INTERCEPT_CR3_READ 3
|
||||
#define INTERCEPT_CR4_READ 4
|
||||
#define INTERCEPT_CR8_READ 8
|
||||
#define INTERCEPT_CR0_WRITE (16 + 0)
|
||||
#define INTERCEPT_CR3_WRITE (16 + 3)
|
||||
#define INTERCEPT_CR4_WRITE (16 + 4)
|
||||
#define INTERCEPT_CR8_WRITE (16 + 8)
|
||||
|
||||
#define INTERCEPT_DR0_READ 0
|
||||
#define INTERCEPT_DR1_READ 1
|
||||
#define INTERCEPT_DR2_READ 2
|
||||
#define INTERCEPT_DR3_READ 3
|
||||
#define INTERCEPT_DR4_READ 4
|
||||
#define INTERCEPT_DR5_READ 5
|
||||
#define INTERCEPT_DR6_READ 6
|
||||
#define INTERCEPT_DR7_READ 7
|
||||
#define INTERCEPT_DR0_WRITE (16 + 0)
|
||||
#define INTERCEPT_DR1_WRITE (16 + 1)
|
||||
#define INTERCEPT_DR2_WRITE (16 + 2)
|
||||
#define INTERCEPT_DR3_WRITE (16 + 3)
|
||||
#define INTERCEPT_DR4_WRITE (16 + 4)
|
||||
#define INTERCEPT_DR5_WRITE (16 + 5)
|
||||
#define INTERCEPT_DR6_WRITE (16 + 6)
|
||||
#define INTERCEPT_DR7_WRITE (16 + 7)
|
||||
|
||||
#define SVM_EVTINJ_VEC_MASK 0xff
|
||||
|
||||
#define SVM_EVTINJ_TYPE_SHIFT 8
|
||||
#define SVM_EVTINJ_TYPE_MASK (7 << SVM_EVTINJ_TYPE_SHIFT)
|
||||
|
||||
#define SVM_EVTINJ_TYPE_INTR (0 << SVM_EVTINJ_TYPE_SHIFT)
|
||||
#define SVM_EVTINJ_TYPE_NMI (2 << SVM_EVTINJ_TYPE_SHIFT)
|
||||
#define SVM_EVTINJ_TYPE_EXEPT (3 << SVM_EVTINJ_TYPE_SHIFT)
|
||||
#define SVM_EVTINJ_TYPE_SOFT (4 << SVM_EVTINJ_TYPE_SHIFT)
|
||||
|
||||
#define SVM_EVTINJ_VALID (1 << 31)
|
||||
#define SVM_EVTINJ_VALID_ERR (1 << 11)
|
||||
|
||||
#define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
|
||||
#define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK
|
||||
|
||||
#define SVM_EXITINTINFO_TYPE_INTR SVM_EVTINJ_TYPE_INTR
|
||||
#define SVM_EXITINTINFO_TYPE_NMI SVM_EVTINJ_TYPE_NMI
|
||||
#define SVM_EXITINTINFO_TYPE_EXEPT SVM_EVTINJ_TYPE_EXEPT
|
||||
#define SVM_EXITINTINFO_TYPE_SOFT SVM_EVTINJ_TYPE_SOFT
|
||||
|
||||
#define SVM_EXITINTINFO_VALID SVM_EVTINJ_VALID
|
||||
#define SVM_EXITINTINFO_VALID_ERR SVM_EVTINJ_VALID_ERR
|
||||
|
||||
#define SVM_EXITINFOSHIFT_TS_REASON_IRET 36
|
||||
#define SVM_EXITINFOSHIFT_TS_REASON_JMP 38
|
||||
#define SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE 44
|
||||
|
||||
#define SVM_EXITINFO_REG_MASK 0x0F
|
||||
|
||||
#define SVM_CR0_SELECTIVE_MASK (X86_CR0_TS | X86_CR0_MP)
|
||||
|
||||
#endif /* SELFTEST_KVM_SVM_H */
|
38
tools/testing/selftests/kvm/include/x86_64/svm_util.h
Normal file
38
tools/testing/selftests/kvm/include/x86_64/svm_util.h
Normal file
@ -0,0 +1,38 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0-only */
|
||||
/*
|
||||
* tools/testing/selftests/kvm/include/x86_64/svm_utils.h
|
||||
* Header for nested SVM testing
|
||||
*
|
||||
* Copyright (C) 2020, Red Hat, Inc.
|
||||
*/
|
||||
|
||||
#ifndef SELFTEST_KVM_SVM_UTILS_H
|
||||
#define SELFTEST_KVM_SVM_UTILS_H
|
||||
|
||||
#include <stdint.h>
|
||||
#include "svm.h"
|
||||
#include "processor.h"
|
||||
|
||||
#define CPUID_SVM_BIT 2
|
||||
#define CPUID_SVM BIT_ULL(CPUID_SVM_BIT)
|
||||
|
||||
#define SVM_EXIT_VMMCALL 0x081
|
||||
|
||||
struct svm_test_data {
|
||||
/* VMCB */
|
||||
struct vmcb *vmcb; /* gva */
|
||||
void *vmcb_hva;
|
||||
uint64_t vmcb_gpa;
|
||||
|
||||
/* host state-save area */
|
||||
struct vmcb_save_area *save_area; /* gva */
|
||||
void *save_area_hva;
|
||||
uint64_t save_area_gpa;
|
||||
};
|
||||
|
||||
struct svm_test_data *vcpu_alloc_svm(struct kvm_vm *vm, vm_vaddr_t *p_svm_gva);
|
||||
void generic_svm_setup(struct svm_test_data *svm, void *guest_rip, void *guest_rsp);
|
||||
void run_guest(struct vmcb *vmcb, uint64_t vmcb_gpa);
|
||||
void nested_svm_check_supported(void);
|
||||
|
||||
#endif /* SELFTEST_KVM_SVM_UTILS_H */
|
161
tools/testing/selftests/kvm/lib/x86_64/svm.c
Normal file
161
tools/testing/selftests/kvm/lib/x86_64/svm.c
Normal file
@ -0,0 +1,161 @@
|
||||
// SPDX-License-Identifier: GPL-2.0-only
|
||||
/*
|
||||
* tools/testing/selftests/kvm/lib/x86_64/svm.c
|
||||
* Helpers used for nested SVM testing
|
||||
* Largely inspired from KVM unit test svm.c
|
||||
*
|
||||
* Copyright (C) 2020, Red Hat, Inc.
|
||||
*/
|
||||
|
||||
#include "test_util.h"
|
||||
#include "kvm_util.h"
|
||||
#include "../kvm_util_internal.h"
|
||||
#include "processor.h"
|
||||
#include "svm_util.h"
|
||||
|
||||
struct gpr64_regs guest_regs;
|
||||
u64 rflags;
|
||||
|
||||
/* Allocate memory regions for nested SVM tests.
|
||||
*
|
||||
* Input Args:
|
||||
* vm - The VM to allocate guest-virtual addresses in.
|
||||
*
|
||||
* Output Args:
|
||||
* p_svm_gva - The guest virtual address for the struct svm_test_data.
|
||||
*
|
||||
* Return:
|
||||
* Pointer to structure with the addresses of the SVM areas.
|
||||
*/
|
||||
struct svm_test_data *
|
||||
vcpu_alloc_svm(struct kvm_vm *vm, vm_vaddr_t *p_svm_gva)
|
||||
{
|
||||
vm_vaddr_t svm_gva = vm_vaddr_alloc(vm, getpagesize(),
|
||||
0x10000, 0, 0);
|
||||
struct svm_test_data *svm = addr_gva2hva(vm, svm_gva);
|
||||
|
||||
svm->vmcb = (void *)vm_vaddr_alloc(vm, getpagesize(),
|
||||
0x10000, 0, 0);
|
||||
svm->vmcb_hva = addr_gva2hva(vm, (uintptr_t)svm->vmcb);
|
||||
svm->vmcb_gpa = addr_gva2gpa(vm, (uintptr_t)svm->vmcb);
|
||||
|
||||
svm->save_area = (void *)vm_vaddr_alloc(vm, getpagesize(),
|
||||
0x10000, 0, 0);
|
||||
svm->save_area_hva = addr_gva2hva(vm, (uintptr_t)svm->save_area);
|
||||
svm->save_area_gpa = addr_gva2gpa(vm, (uintptr_t)svm->save_area);
|
||||
|
||||
*p_svm_gva = svm_gva;
|
||||
return svm;
|
||||
}
|
||||
|
||||
static void vmcb_set_seg(struct vmcb_seg *seg, u16 selector,
|
||||
u64 base, u32 limit, u32 attr)
|
||||
{
|
||||
seg->selector = selector;
|
||||
seg->attrib = attr;
|
||||
seg->limit = limit;
|
||||
seg->base = base;
|
||||
}
|
||||
|
||||
void generic_svm_setup(struct svm_test_data *svm, void *guest_rip, void *guest_rsp)
|
||||
{
|
||||
struct vmcb *vmcb = svm->vmcb;
|
||||
uint64_t vmcb_gpa = svm->vmcb_gpa;
|
||||
struct vmcb_save_area *save = &vmcb->save;
|
||||
struct vmcb_control_area *ctrl = &vmcb->control;
|
||||
u32 data_seg_attr = 3 | SVM_SELECTOR_S_MASK | SVM_SELECTOR_P_MASK
|
||||
| SVM_SELECTOR_DB_MASK | SVM_SELECTOR_G_MASK;
|
||||
u32 code_seg_attr = 9 | SVM_SELECTOR_S_MASK | SVM_SELECTOR_P_MASK
|
||||
| SVM_SELECTOR_L_MASK | SVM_SELECTOR_G_MASK;
|
||||
uint64_t efer;
|
||||
|
||||
efer = rdmsr(MSR_EFER);
|
||||
wrmsr(MSR_EFER, efer | EFER_SVME);
|
||||
wrmsr(MSR_VM_HSAVE_PA, svm->save_area_gpa);
|
||||
|
||||
memset(vmcb, 0, sizeof(*vmcb));
|
||||
asm volatile ("vmsave\n\t" : : "a" (vmcb_gpa) : "memory");
|
||||
vmcb_set_seg(&save->es, get_es(), 0, -1U, data_seg_attr);
|
||||
vmcb_set_seg(&save->cs, get_cs(), 0, -1U, code_seg_attr);
|
||||
vmcb_set_seg(&save->ss, get_ss(), 0, -1U, data_seg_attr);
|
||||
vmcb_set_seg(&save->ds, get_ds(), 0, -1U, data_seg_attr);
|
||||
vmcb_set_seg(&save->gdtr, 0, get_gdt().address, get_gdt().size, 0);
|
||||
vmcb_set_seg(&save->idtr, 0, get_idt().address, get_idt().size, 0);
|
||||
|
||||
ctrl->asid = 1;
|
||||
save->cpl = 0;
|
||||
save->efer = rdmsr(MSR_EFER);
|
||||
asm volatile ("mov %%cr4, %0" : "=r"(save->cr4) : : "memory");
|
||||
asm volatile ("mov %%cr3, %0" : "=r"(save->cr3) : : "memory");
|
||||
asm volatile ("mov %%cr0, %0" : "=r"(save->cr0) : : "memory");
|
||||
asm volatile ("mov %%dr7, %0" : "=r"(save->dr7) : : "memory");
|
||||
asm volatile ("mov %%dr6, %0" : "=r"(save->dr6) : : "memory");
|
||||
asm volatile ("mov %%cr2, %0" : "=r"(save->cr2) : : "memory");
|
||||
save->g_pat = rdmsr(MSR_IA32_CR_PAT);
|
||||
save->dbgctl = rdmsr(MSR_IA32_DEBUGCTLMSR);
|
||||
ctrl->intercept = (1ULL << INTERCEPT_VMRUN) |
|
||||
(1ULL << INTERCEPT_VMMCALL);
|
||||
|
||||
vmcb->save.rip = (u64)guest_rip;
|
||||
vmcb->save.rsp = (u64)guest_rsp;
|
||||
guest_regs.rdi = (u64)svm;
|
||||
}
|
||||
|
||||
/*
|
||||
* save/restore 64-bit general registers except rax, rip, rsp
|
||||
* which are directly handed through the VMCB guest processor state
|
||||
*/
|
||||
#define SAVE_GPR_C \
|
||||
"xchg %%rbx, guest_regs+0x20\n\t" \
|
||||
"xchg %%rcx, guest_regs+0x10\n\t" \
|
||||
"xchg %%rdx, guest_regs+0x18\n\t" \
|
||||
"xchg %%rbp, guest_regs+0x30\n\t" \
|
||||
"xchg %%rsi, guest_regs+0x38\n\t" \
|
||||
"xchg %%rdi, guest_regs+0x40\n\t" \
|
||||
"xchg %%r8, guest_regs+0x48\n\t" \
|
||||
"xchg %%r9, guest_regs+0x50\n\t" \
|
||||
"xchg %%r10, guest_regs+0x58\n\t" \
|
||||
"xchg %%r11, guest_regs+0x60\n\t" \
|
||||
"xchg %%r12, guest_regs+0x68\n\t" \
|
||||
"xchg %%r13, guest_regs+0x70\n\t" \
|
||||
"xchg %%r14, guest_regs+0x78\n\t" \
|
||||
"xchg %%r15, guest_regs+0x80\n\t"
|
||||
|
||||
#define LOAD_GPR_C SAVE_GPR_C
|
||||
|
||||
/*
|
||||
* selftests do not use interrupts so we dropped clgi/sti/cli/stgi
|
||||
* for now. registers involved in LOAD/SAVE_GPR_C are eventually
|
||||
* unmodified so they do not need to be in the clobber list.
|
||||
*/
|
||||
void run_guest(struct vmcb *vmcb, uint64_t vmcb_gpa)
|
||||
{
|
||||
asm volatile (
|
||||
"vmload\n\t"
|
||||
"mov rflags, %%r15\n\t" // rflags
|
||||
"mov %%r15, 0x170(%[vmcb])\n\t"
|
||||
"mov guest_regs, %%r15\n\t" // rax
|
||||
"mov %%r15, 0x1f8(%[vmcb])\n\t"
|
||||
LOAD_GPR_C
|
||||
"vmrun\n\t"
|
||||
SAVE_GPR_C
|
||||
"mov 0x170(%[vmcb]), %%r15\n\t" // rflags
|
||||
"mov %%r15, rflags\n\t"
|
||||
"mov 0x1f8(%[vmcb]), %%r15\n\t" // rax
|
||||
"mov %%r15, guest_regs\n\t"
|
||||
"vmsave\n\t"
|
||||
: : [vmcb] "r" (vmcb), [vmcb_gpa] "a" (vmcb_gpa)
|
||||
: "r15", "memory");
|
||||
}
|
||||
|
||||
void nested_svm_check_supported(void)
|
||||
{
|
||||
struct kvm_cpuid_entry2 *entry =
|
||||
kvm_get_supported_cpuid_entry(0x80000001);
|
||||
|
||||
if (!(entry->ecx & CPUID_SVM)) {
|
||||
fprintf(stderr, "nested SVM not enabled, skipping test\n");
|
||||
exit(KSFT_SKIP);
|
||||
}
|
||||
}
|
||||
|
@ -288,9 +288,9 @@ static inline void init_vmcs_host_state(void)
|
||||
vmwrite(HOST_FS_BASE, rdmsr(MSR_FS_BASE));
|
||||
vmwrite(HOST_GS_BASE, rdmsr(MSR_GS_BASE));
|
||||
vmwrite(HOST_TR_BASE,
|
||||
get_desc64_base((struct desc64 *)(get_gdt_base() + get_tr())));
|
||||
vmwrite(HOST_GDTR_BASE, get_gdt_base());
|
||||
vmwrite(HOST_IDTR_BASE, get_idt_base());
|
||||
get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));
|
||||
vmwrite(HOST_GDTR_BASE, get_gdt().address);
|
||||
vmwrite(HOST_IDTR_BASE, get_idt().address);
|
||||
vmwrite(HOST_IA32_SYSENTER_ESP, rdmsr(MSR_IA32_SYSENTER_ESP));
|
||||
vmwrite(HOST_IA32_SYSENTER_EIP, rdmsr(MSR_IA32_SYSENTER_EIP));
|
||||
}
|
||||
|
79
tools/testing/selftests/kvm/x86_64/svm_vmcall_test.c
Normal file
79
tools/testing/selftests/kvm/x86_64/svm_vmcall_test.c
Normal file
@ -0,0 +1,79 @@
|
||||
// SPDX-License-Identifier: GPL-2.0-only
|
||||
/*
|
||||
* svm_vmcall_test
|
||||
*
|
||||
* Copyright (C) 2020, Red Hat, Inc.
|
||||
*
|
||||
* Nested SVM testing: VMCALL
|
||||
*/
|
||||
|
||||
#include "test_util.h"
|
||||
#include "kvm_util.h"
|
||||
#include "processor.h"
|
||||
#include "svm_util.h"
|
||||
|
||||
#define VCPU_ID 5
|
||||
|
||||
static struct kvm_vm *vm;
|
||||
|
||||
static void l2_guest_code(struct svm_test_data *svm)
|
||||
{
|
||||
__asm__ __volatile__("vmcall");
|
||||
}
|
||||
|
||||
static void l1_guest_code(struct svm_test_data *svm)
|
||||
{
|
||||
#define L2_GUEST_STACK_SIZE 64
|
||||
unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
|
||||
struct vmcb *vmcb = svm->vmcb;
|
||||
|
||||
/* Prepare for L2 execution. */
|
||||
generic_svm_setup(svm, l2_guest_code,
|
||||
&l2_guest_stack[L2_GUEST_STACK_SIZE]);
|
||||
|
||||
run_guest(vmcb, svm->vmcb_gpa);
|
||||
|
||||
GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL);
|
||||
GUEST_DONE();
|
||||
}
|
||||
|
||||
int main(int argc, char *argv[])
|
||||
{
|
||||
vm_vaddr_t svm_gva;
|
||||
|
||||
nested_svm_check_supported();
|
||||
|
||||
vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
|
||||
vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
|
||||
|
||||
vcpu_alloc_svm(vm, &svm_gva);
|
||||
vcpu_args_set(vm, VCPU_ID, 1, svm_gva);
|
||||
|
||||
for (;;) {
|
||||
volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
|
||||
struct ucall uc;
|
||||
|
||||
vcpu_run(vm, VCPU_ID);
|
||||
TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
|
||||
"Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
|
||||
run->exit_reason,
|
||||
exit_reason_str(run->exit_reason));
|
||||
|
||||
switch (get_ucall(vm, VCPU_ID, &uc)) {
|
||||
case UCALL_ABORT:
|
||||
TEST_ASSERT(false, "%s",
|
||||
(const char *)uc.args[0]);
|
||||
/* NOT REACHED */
|
||||
case UCALL_SYNC:
|
||||
break;
|
||||
case UCALL_DONE:
|
||||
goto done;
|
||||
default:
|
||||
TEST_ASSERT(false,
|
||||
"Unknown ucall 0x%x.", uc.cmd);
|
||||
}
|
||||
}
|
||||
done:
|
||||
kvm_vm_free(vm);
|
||||
return 0;
|
||||
}
|
@ -179,18 +179,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
|
||||
return value;
|
||||
}
|
||||
|
||||
/*
|
||||
* This function will return the VCPU that performed the MMIO access and
|
||||
* trapped from within the VM, and will return NULL if this is a userspace
|
||||
* access.
|
||||
*
|
||||
* We can disable preemption locally around accessing the per-CPU variable,
|
||||
* and use the resolved vcpu pointer after enabling preemption again, because
|
||||
* even if the current thread is migrated to another CPU, reading the per-CPU
|
||||
* value later will give us the same value as we update the per-CPU variable
|
||||
* in the preempt notifier handlers.
|
||||
*/
|
||||
|
||||
/* Must be called with irq->irq_lock held */
|
||||
static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
|
||||
bool is_uaccess)
|
||||
|
@ -4409,12 +4409,22 @@ static void kvm_sched_out(struct preempt_notifier *pn,
|
||||
|
||||
/**
|
||||
* kvm_get_running_vcpu - get the vcpu running on the current CPU.
|
||||
* Thanks to preempt notifiers, this can also be called from
|
||||
* preemptible context.
|
||||
*
|
||||
* We can disable preemption locally around accessing the per-CPU variable,
|
||||
* and use the resolved vcpu pointer after enabling preemption again,
|
||||
* because even if the current thread is migrated to another CPU, reading
|
||||
* the per-CPU value later will give us the same value as we update the
|
||||
* per-CPU variable in the preempt notifier handlers.
|
||||
*/
|
||||
struct kvm_vcpu *kvm_get_running_vcpu(void)
|
||||
{
|
||||
return __this_cpu_read(kvm_running_vcpu);
|
||||
struct kvm_vcpu *vcpu;
|
||||
|
||||
preempt_disable();
|
||||
vcpu = __this_cpu_read(kvm_running_vcpu);
|
||||
preempt_enable();
|
||||
|
||||
return vcpu;
|
||||
}
|
||||
|
||||
/**
|
||||
|
Loading…
Reference in New Issue
Block a user