AVIC dependency on CONFIG_X86_LOCAL_APIC is dead code since
commit e42eef4ba3 ("KVM: add X86_LOCAL_APIC dependency").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
sP2LC4iv
=57tg
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
- Reorganize SEV code to streamline and simplify future development
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCg1XQACgkQEsHwGGHe
VUpRKA//dwzDD1QU16JucfhgFlv/9OTm48ukSwAb9lZjDEy4H1CtVL3xEHFd7L3G
LJp0LTW+OQf0/0aGlQp/cP6sBF6G9Bf4mydx70Id4SyCQt8eZDodB+ZOOWbeteWq
p92fJPbX8CzAglutbE+3v/MD8CCAllTiLZnJZPVj4Kux2/wF6EryDgF1+rb5q8jp
ObTT9817mHVwWVUYzbgceZtd43IocOlKZRmF1qivwScMGylQTe1wfMjunpD5pVt8
Zg4UDNknNfYduqpaG546E6e1zerGNaJK7SHnsuzHRUVU5icNqtgBk061CehP9Ksq
DvYXLUl4xF16j6xJAqIZPNrBkJGdQf4q1g5x2FiBm7rSQU5owzqh5rkVk4EBFFzn
UtzeXpqbStbsZHXycyxBNdq2HXxkFPf2NXZ+bkripPg+DifOGots1uwvAft+6iAE
GudK6qxAvr8phR1cRyy6BahGtgOStXbZYEz0ZdU6t7qFfZMz+DomD5Jimj0kAe6B
s6ras5xm8q3/Py87N/KNjKtSEpgsHv/7F+idde7ODtHhpRL5HCBqhkZOSRkMMZqI
ptX1oSTvBXwRKyi5x9YhkKHUFqfFSUTfJhiRFCWK+IEAv3Y7SipJtfkqxRbI6fEV
FfCeueKDDdViBtseaRceVLJ8Tlr6Qjy27fkPPTqJpthqPpCdoZ0=
=ENfF
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"The three SEV commits are not really urgent material. But we figured
since getting them in now will avoid a huge amount of conflicts
between future SEV changes touching tip, the kvm and probably other
trees, sending them to you now would be best.
The idea is that the tip, kvm etc branches for 5.14 will all base
ontop of -rc2 and thus everything will be peachy. What is more, those
changes are purely mechanical and defines movement so they should be
fine to go now (famous last words).
Summary:
- Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development"
* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/compressed: Enable -Wundef
x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
x86/sev-es: Rename sev-es.{ch} to sev.{ch}
* Fix virtualization of RDPID
* Virtualization of DR6_BUS_LOCK, which on bare metal is new in
the 5.13 merge window
* More nested virtualization migration fixes (nSVM and eVMCS)
* Fix for KVM guest hibernation
* Fix for warning in SEV-ES SRCU usage
* Block KVM from loading on AMD machines with 5-level page tables,
due to the APM not mentioning how host CR4.LA57 exactly impacts
the guest.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCZWwgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE9wgAk7Io8cuvnhC9ogVqzZWrPweWqFg8
fJcPMB584JRnMqYHBVYbkTPGe8SsCHKR2MKsNdc4cEP111cyr3suWsxOdmjJn58i
7ahy6PcKx7wWeWwEt7O599l6CeoX5XB9ExvA6eiXAv7iZeOJHFa+Ny2GlWgauy6Y
DELryEomx1r4IUkZaSR+2fYjzvOWTXQixwU/jwx8NcTJz0DrzknzLE7XOciPBfn0
t0Q2rCXdL2nF1uPksZbntx8Qoa6t6GDVIyrH/ZCPQYJtAX6cjxNAh3zwCe+hMnOd
fW8ntBH1nZRiNnberA4IICAzqnUokgPWdKBrZT2ntWHBK+aqxXHznrlPJA==
=e+gD
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Lots of bug fixes.
- Fix virtualization of RDPID
- Virtualization of DR6_BUS_LOCK, which on bare metal is new to this
release
- More nested virtualization migration fixes (nSVM and eVMCS)
- Fix for KVM guest hibernation
- Fix for warning in SEV-ES SRCU usage
- Block KVM from loading on AMD machines with 5-level page tables, due
to the APM not mentioning how host CR4.LA57 exactly impacts the
guest.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (48 commits)
KVM: SVM: Move GHCB unmapping to fix RCU warning
KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers
kvm: Cap halt polling at kvm->max_halt_poll_ns
tools/kvm_stat: Fix documentation typo
KVM: x86: Prevent deadlock against tk_core.seq
KVM: x86: Cancel pvclock_gtod_work on module removal
KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
KVM: X86: Expose bus lock debug exception to guest
KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed
KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
KVM: x86: Move uret MSR slot management to common x86
KVM: x86: Export the number of uret MSRs to vendor modules
KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
KVM: VMX: Use common x86's uret MSR list as the one true list
KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
KVM: VMX: Configure list of user return MSRs at module init
KVM: x86: Add support for RDPID without RDTSCP
KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in host
...
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
When an SEV-ES guest is running, the GHCB is unmapped as part of the
vCPU run support. However, kvm_vcpu_unmap() triggers an RCU dereference
warning with CONFIG_PROVE_LOCKING=y because the SRCU lock is released
before invoking the vCPU run support.
Move the GHCB unmapping into the prepare_guest_switch callback, which is
invoked while still holding the SRCU lock, eliminating the RCU dereference
warning.
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b2f9b79d15166f2c3e4375c0d9bc3268b7696455.1620332081.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow loading KVM SVM if 5-level paging is supported. In theory, NPT
for L1 should simply work, but there unknowns with respect to how the
guest's MAXPHYADDR will be handled by hardware.
Nested NPT is more problematic, as running an L1 VMM that is using
2-level page tables requires stacking single-entry PDP and PML4 tables in
KVM's NPT for L2, as there are no equivalent entries in L1's NPT to
shadow. Barring hardware magic, for 5-level paging, KVM would need stack
another layer to handle PML5.
Opportunistically rename the lm_root pointer, which is used for the
aforementioned stacking when shadowing 2-level L1 NPT, to pml4_root to
call out that it's specifically for PML4.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210505204221.1934471-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior. While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.
Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX. On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow userspace to enable RDPID for a guest without also enabling RDTSCP.
Aside from checking for RDPID support in the obvious flows, VMX also needs
to set ENABLE_RDTSCP=1 when RDPID is exposed.
For the record, there is no known scenario where enabling RDPID without
RDTSCP is desirable. But, both AMD and Intel architectures allow for the
condition, i.e. this is purely to make KVM more architecturally accurate.
Fixes: 41cd02c6f7 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID")
Cc: stable@vger.kernel.org
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Probe MSR_TSC_AUX whether or not RDTSCP is supported in the host, and
if probing succeeds, load the guest's MSR_TSC_AUX into hardware prior to
VMRUN. Because SVM doesn't support interception of RDPID, RDPID cannot
be disallowed in the guest (without resorting to binary translation).
Leaving the host's MSR_TSC_AUX in hardware would leak the host's value to
the guest if RDTSCP is not supported.
Note, there is also a kernel bug that prevents leaking the host's value.
The host kernel initializes MSR_TSC_AUX if and only if RDTSCP is
supported, even though the vDSO usage consumes MSR_TSC_AUX via RDPID.
I.e. if RDTSCP is not supported, there is no host value to leak. But,
if/when the host kernel bug is fixed, KVM would start leaking MSR_TSC_AUX
in the case where hardware supports RDPID but RDTSCP is unavailable for
whatever reason.
Probing MSR_TSC_AUX will also allow consolidating the probe and define
logic in common x86, and will make it simpler to condition the existence
of MSR_TSX_AUX (from the guest's perspective) on RDTSCP *or* RDPID.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intercept RDTSCP to inject #UD if RDTSC is disabled in the guest.
Note, SVM does not support intercepting RDPID. Unlike VMX's
ENABLE_RDTSCP control, RDTSCP interception does not apply to RDPID. This
is a benign virtualization hole as the host kernel (incorrectly) sets
MSR_TSC_AUX if RDTSCP is supported, and KVM loads the guest's MSR_TSC_AUX
into hardware if RDTSCP is supported in the host, i.e. KVM will not leak
the host's MSR_TSC_AUX to the guest.
But, when the kernel bug is fixed, KVM will start leaking the host's
MSR_TSC_AUX if RDPID is supported in hardware, but RDTSCP isn't available
for whatever reason. This leak will be remedied in a future commit.
Fixes: 46896c73c1 ("KVM: svm: add support for RDTSCP")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-4-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers. Opportunistically update the somewhat stale comment about the
updates needing to occur immediately after VM-Exit.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210505002735.1684165-9-seanjc@google.com
Defer the call to account guest time until after servicing any IRQ(s)
that happened in the guest or immediately after VM-Exit. Tick-based
accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
handler runs, and IRQs are blocked throughout the main sequence of
vcpu_enter_guest(), including the call into vendor code to actually
enter and exit the guest.
This fixes a bug where reported guest time remains '0', even when
running an infinite loop in the guest:
https://bugzilla.kernel.org/show_bug.cgi?id=209831
Fixes: 87fa7f3e98 ("x86/kvm: Move context tracking where it belongs")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210505002735.1684165-4-seanjc@google.com
* Define and use an invalid GPA (all ones) for init value of last
and current nested vmcb physical addresses.
* Reset the current vmcb12 gpa to the invalid value when leaving
the nested mode, similar to what is done on nested vmexit.
* Reset the last seen vmcb12 address when disabling the nested SVM,
as it relies on vmcb02 fields which are freed at that point.
Fixes: 4995a3685f ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210503125446.1353307-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
Move the allocation of the SEV VMCB array to sev.c to help pave the way
toward encapsulating SEV enabling wholly within sev.c.
No functional change intended.
Reviewed by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the redundant svm_sev_enabled() check when calling
sev_hardware_teardown(), the teardown helper itself does the check.
Removing the check from svm.c will eventually allow dropping
svm_sev_enabled() entirely.
No functional change intended.
Reviewed by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a reverse-CPUID entry for the memory encryption word, 0x8000001F.EAX,
and use it to override the supported CPUID flags reported to userspace.
Masking the reported CPUID flags avoids over-reporting KVM support, e.g.
without the mask a SEV-SNP capable CPU may incorrectly advertise SNP
support to userspace.
Clear SEV/SEV-ES if their corresponding module parameters are disabled,
and clear the memory encryption leaf completely if SEV is not fully
supported in KVM. Advertise SME_COHERENT in addition to SEV and SEV-ES,
as the guest can use SME_COHERENT to avoid CLFLUSH operations.
Explicitly omit SME and VM_PAGE_FLUSH from the reporting. These features
are used by KVM, but are not exposed to the guest, e.g. guest access to
related MSRs will fault.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally invoke sev_hardware_setup() when configuring SVM and
handle clearing the module params/variable 'sev' and 'sev_es' in
sev_hardware_setup(). This allows making said variables static within
sev.c and reduces the odds of a collision with guest code, e.g. the guest
side of things has already laid claim to 'sev_enabled'.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable SEV and SEV-ES if NPT is disabled. While the APM doesn't clearly
state that NPT is mandatory, it's alluded to by:
The guest page tables, managed by the guest, may mark data memory pages
as either private or shared, thus allowing selected pages to be shared
outside the guest.
And practically speaking, shadow paging can't work since KVM can't read
the guest's page tables.
Fixes: e9df094289 ("KVM: SVM: Add sev module_param")
Cc: Brijesh Singh <brijesh.singh@amd.com
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zero out the array of VMCB pointers so that pre_sev_run() won't see
garbage when querying the array to detect when an SEV ASID is being
associated with a new VMCB. In practice, reading random values is all
but guaranteed to be benign as a false negative (which is extremely
unlikely on its own) can only happen on CPU0 on the first VMRUN and would
only cause KVM to skip the ASID flush. For anything bad to happen, a
previous instance of KVM would have to exit without flushing the ASID,
_and_ KVM would have to not flush the ASID at any time while building the
new SEV guest.
Cc: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Fixes: 70cd94e60c ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Append raw to the direct variants of kvm_register_read/write(), and
drop the "l" from the mode-aware variants. I.e. make the mode-aware
variants the default, and make the direct variants scary sounding so as
to discourage use. Accessing the full 64-bit values irrespective of
mode is rarely the desired behavior.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bits 63:32 of RAX when grabbing the address for INVLPGA emulation
outside of 64-bit mode to make KVM's emulation slightly less wrong. The
address for INVLPGA is determined by the effective address size, i.e.
it's not hardcoded to 64/32 bits for a given mode. Add a FIXME to call
out that the emulation is wrong.
Opportunistically tweak the ASID handling to make it clear that it's
defined by ECX, not rCX.
Per the APM:
The portion of rAX used to form the address is determined by the
effective address size (current execution mode and optional address
size prefix). The ASID is taken from ECX.
Fixes: ff092385e8 ("KVM: SVM: Implement INVLPGA")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bits 63:32 on loads/stores to/from DRs and CRs when the vCPU is not
in 64-bit mode. The APM states bits 63:32 are dropped for both DRs and
CRs:
In 64-bit mode, the operand size is fixed at 64 bits without the need
for a REX prefix. In non-64-bit mode, the operand size is fixed at 32
bits and the upper 32 bits of the destination are forced to 0.
Fixes: 7ff76d58a9 ("KVM: SVM: enhance MOV CR intercept handler")
Fixes: cae3797a46 ("KVM: SVM: enhance mov DR intercept handler")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use KVM's "user return MSRs" framework to defer restoring the host's
MSR_TSC_AUX until the CPU returns to userspace. Add/improve comments to
clarify why MSR_TSC_AUX is intercepted on both RDMSR and WRMSR, and why
it's safe for KVM to keep the guest's value loaded even if KVM is
scheduled out.
Cc: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210423223404.3860547-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Force clear bits 63:32 of MSR_TSC_AUX on write to emulate current AMD
CPUs, which completely ignore the upper 32 bits, including dropping them
on write. Emulating AMD hardware will also allow migrating a vCPU from
AMD hardware to Intel hardware without requiring userspace to manually
clear the upper bits, which are reserved on Intel hardware.
Presumably, MSR_TSC_AUX[63:32] are intended to be reserved on AMD, but
sadly the APM doesn't say _anything_ about those bits in the context of
MSR access. The RDTSCP entry simply states that RCX contains bits 31:0
of the MSR, zero extended. And even worse is that the RDPID description
implies that it can consume all 64 bits of the MSR:
RDPID reads the value of TSC_AUX MSR used by the RDTSCP instruction
into the specified destination register. Normal operand size prefixes
do not apply and the update is either 32 bit or 64 bit based on the
current mode.
Emulate current hardware behavior to give KVM the best odds of playing
nice with whatever the behavior of future AMD CPUs happens to be.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210423223404.3860547-3-seanjc@google.com>
[Fix broken patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inject #GP on guest accesses to MSR_TSC_AUX if RDTSCP is unsupported in
the guest's CPUID model.
Fixes: 46896c73c1 ("KVM: svm: add support for RDTSCP")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210423223404.3860547-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.
The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).
This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.
This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.
For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.
Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define the actual size of the IOPM and MSRPM tables so that the actual size
can be used when initializing them and when checking the consistency of their
physical address.
These #defines are placed in svm.h so that they can be shared.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20210412215611.110095-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly document why a vmcb must be marked dirty and assigned a new
asid when it will be run on a different cpu. The "what" is relatively
obvious, whereas the "why" requires reading the APM and/or KVM code.
Opportunistically remove a spurious period and several unnecessary
newlines in the comment.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove vmcb_pa from vcpu_svm and simply read current_vmcb->pa directly in
the one path where it is consumed. Unlike svm->vmcb, use of the current
vmcb's address is very limited, as evidenced by the fact that its use
can be trimmed to a single dereference.
Opportunistically add a comment about using vmcb01 for VMLOAD/VMSAVE, at
first glance using vmcb01 instead of vmcb_pa looks wrong.
No functional change intended.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not update the new vmcb's last-run cpu when switching to a different
vmcb. If the vCPU is migrated between its last run and a vmcb switch,
e.g. for nested VM-Exit, then setting the cpu without marking the vmcb
dirty will lead to KVM running the vCPU on a different physical cpu with
stale clean bit settings.
vcpu->cpu current_vmcb->cpu hardware
pre_svm_run() cpu0 cpu0 cpu0,clean
kvm_arch_vcpu_load() cpu1 cpu0 cpu0,clean
svm_switch_vmcb() cpu1 cpu1 cpu0,clean
pre_svm_run() cpu1 cpu1 kaboom
Simply delete the offending code; unlike VMX, which needs to update the
cpu at switch time due to the need to do VMPTRLD, SVM only cares about
which cpu last ran the vCPU.
Fixes: af18fa775d ("KVM: nSVM: Track the physical cpu of the vmcb vmrun through the vmcb")
Cc: Cathy Avery <cavery@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Access to the GHCB is mainly in the VMGEXIT path and it is known that the
GHCB will be mapped. But there are two paths where it is possible the GHCB
might not be mapped.
The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
However, if a SIPI is performed without a corresponding AP Reset Hold,
then the GHCB might not be mapped (depending on the previous VMEXIT),
which will result in a NULL pointer dereference.
The svm_complete_emulated_msr() routine will update the GHCB to inform
the caller of a RDMSR/WRMSR operation about any errors. While it is likely
that the GHCB will be mapped in this situation, add a safe guard
in this path to be certain a NULL pointer dereference is not encountered.
Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Fixes: 647daca25d ("KVM: SVM: Add support for booting APs in an SEV-ES guest")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Message-Id: <a5d3ebb600a91170fc88599d5a575452b3e31036.1617979121.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently to support Intel->AMD migration, if CPU vendor is GenuineIntel,
we emulate the full 64 value for MSR_IA32_SYSENTER_{EIP|ESP}
msrs, and we also emulate the sysenter/sysexit instruction in long mode.
(Emulator does still refuse to emulate sysenter in 64 bit mode, on the
ground that the code for that wasn't tested and likely has no users)
However when virtual vmload/vmsave is enabled, the vmload instruction will
update these 32 bit msrs without triggering their msr intercept,
which will lead to having stale values in kvm's shadow copy of these msrs,
which relies on the intercept to be up to date.
Fix/optimize this by doing the following:
1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel
(This is both a tiny optimization and also ensures that in case
the guest cpu vendor is AMD, the msrs will be 32 bit wide as
AMD defined).
2. Store only high 32 bit part of these msrs on interception and combine
it with hardware msr value on intercepted read/writes
iff vendor=GenuineIntel.
3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
(It is somewhat insane to set vendor=GenuineIntel and still enable
SVM for the guest but well whatever).
Then zero the high 32 bit parts when kvm intercepts and emulates vmload.
Thanks a lot to Paulo Bonzini for helping me with fixing this in the most
correct way.
This patch fixes nested migration of 32 bit nested guests, that was
broken because incorrect cached values of SYSENTER msrs were stored in
the migration stream if L1 changed these msrs with
vmload prior to L2 entry.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix ~144 single-word typos in arch/x86/ code comments.
Doing this in a single commit should reduce the churn.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
Set the PAE roots used as decrypted to play nice with SME when KVM is
using shadow paging. Explicitly skip setting the C-bit when loading
CR3 for PAE shadow paging, even though it's completely ignored by the
CPU. The extra documentation is nice to have.
Note, there are several subtleties at play with NPT. In addition to
legacy shadow paging, the PAE roots are used for SVM's NPT when either
KVM is 32-bit (uses PAE paging) or KVM is 64-bit and shadowing 32-bit
NPT. However, 32-bit Linux, and thus KVM, doesn't support SME. And
64-bit KVM can happily set the C-bit in CR3. This also means that
keeping __sme_set(root) for 32-bit KVM when NPT is enabled is
conceptually wrong, but functionally ok since SME is 64-bit only.
Leave it as is to avoid unnecessary pollution.
Fixes: d0ec49d4de ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210309224207.1218275-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Retrieve the active PCID only when writing a guest CR3 value, i.e. don't
get the PCID when using EPT or NPT. The PCID is especially problematic
for EPT as the bits have different meaning, and so the PCID and must be
manually stripped, which is annoying and unnecessary. And on VMX,
getting the active PCID also involves reading the guest's CR3 and
CR4.PCIDE, i.e. may add pointless VMREADs.
Opportunistically rename the pgd/pgd_level params to root_hpa and
root_level to better reflect their new roles. Keep the function names,
as "load the guest PGD" is still accurate/correct.
Last, and probably least, pass root_hpa as a hpa_t/u64 instead of an
unsigned long. The EPTP holds a 64-bit value, even in 32-bit mode, so
in theory EPT could support HIGHMEM for 32-bit KVM. Never mind that
doing so would require changing the MMU page allocators and reworking
the MMU to use kmap().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stop tagging MMIO SPTEs with specific available bits and instead detect
MMIO SPTEs by checking for their unique SPTE value. The value is
guaranteed to be unique on shadow paging and NPT as setting reserved
physical address bits on any other type of SPTE would consistute a KVM
bug. Ditto for EPT, as creating a WX non-MMIO would also be a bug.
Note, this approach is also future-compatibile with TDX, which will need
to reflect MMIO EPT violations as #VEs into the guest. To create an EPT
violation instead of a misconfig, TDX EPTs will need to have RWX=0, But,
MMIO SPTEs will also be the only case where KVM clears SUPPRESS_VE, so
MMIO SPTEs will still be guaranteed to have a unique value within a given
MMU context.
The main motivation is to make it easier to reason about which types of
SPTEs use which available bits. As a happy side effect, this frees up
two more bits for storing the MMIO generation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the vmcb12 control clean field to determine which vmcb12.save
registers were marked dirty in order to minimize register copies
when switching from L1 to L2. Those vmcb12 registers marked as dirty need
to be copied to L0's vmcb02 as they will be used to update the vmcb
state cache for the L2 VMRUN. In the case where we have a different
vmcb12 from the last L2 VMRUN all vmcb12.save registers must be
copied over to L2.save.
Tested:
kvm-unit-tests
kvm selftests
Fedora L1 L2
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210301200844.2000-1-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Newer AMD processors have a feature to virtualize the use of the
SPEC_CTRL MSR. Presence of this feature is indicated via CPUID
function 0x8000000A_EDX[20]: GuestSpecCtrl. Hypervisors are not
required to enable this feature since it is automatically enabled on
processors that support it.
A hypervisor may wish to impose speculation controls on guest
execution or a guest may want to impose its own speculation controls.
Therefore, the processor implements both host and guest
versions of SPEC_CTRL.
When in host mode, the host SPEC_CTRL value is in effect and writes
update only the host version of SPEC_CTRL. On a VMRUN, the processor
loads the guest version of SPEC_CTRL from the VMCB. When the guest
writes SPEC_CTRL, only the guest version is updated. On a VMEXIT,
the guest version is saved into the VMCB and the processor returns
to only using the host SPEC_CTRL for speculation control. The guest
SPEC_CTRL is located at offset 0x2E0 in the VMCB.
The effective SPEC_CTRL setting is the guest SPEC_CTRL setting or'ed
with the hypervisor SPEC_CTRL setting. This allows the hypervisor to
ensure a minimum SPEC_CTRL if desired.
This support also fixes an issue where a guest may sometimes see an
inconsistent value for the SPEC_CTRL MSR on processors that support
this feature. With the current SPEC_CTRL support, the first write to
SPEC_CTRL is intercepted and the virtualized version of the SPEC_CTRL
MSR is not updated. When the guest reads back the SPEC_CTRL MSR, it
will be 0x0, instead of the actual expected value. There isn’t a
security concern here, because the host SPEC_CTRL value is or’ed with
the Guest SPEC_CTRL value to generate the effective SPEC_CTRL value.
KVM writes with the guest's virtualized SPEC_CTRL value to SPEC_CTRL
MSR just before the VMRUN, so it will always have the actual value
even though it doesn’t appear that way in the guest. The guest will
only see the proper value for the SPEC_CTRL register if the guest was
to write to the SPEC_CTRL register again. With Virtual SPEC_CTRL
support, the save area spec_ctrl is properly saved and restored.
So, the guest will always see the proper value when it is read back.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <161188100955.28787.11816849358413330720.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows to avoid copying of these fields between vmcb01
and vmcb02 on nested guest entry/exit.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Thanks to the new macros that handle exception handling for SVM
instructions, it is easier to just do the VMLOAD/VMSAVE in C.
This is safe, as shown by the fact that the host reload is
already done outside the assembly source.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip PAUSE after interception to avoid unnecessarily re-executing the
instruction in the guest, e.g. after regaining control post-yield.
This is a benign bug as KVM disables PAUSE interception if filtering is
off, including the case where pause_filter_count is set to zero.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove bizarre code that causes KVM to run RDPMC through the emulator
when nrips is disabled. Accelerated emulation of RDPMC doesn't rely on
any additional data from the VMCB, and SVM has generic handling for
updating RIP to skip instructions when nrips is disabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the entirety of the accelerated RDPMC emulation to x86.c, and assign
the common handler directly to the exit handler array for VMX. SVM has
bizarre nrips behavior that prevents it from directly invoking the common
handler. The nrips goofiness will be addressed in a future patch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the trivial exit handlers, e.g. for instructions that KVM
"emulates" as nops, to common x86 code. Assign the common handlers
directly to the exit handler arrays and drop the vendor trampolines.
Opportunistically use pr_warn_once() where appropriate.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the entirety of XSETBV emulation to x86.c, and assign the
function directly to both VMX's and SVM's exit handlers, i.e. drop the
unnecessary trampolines.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add another helper layer for VMLOAD+VMSAVE, the code is identical except
for the one line that determines which VMCB is the source and which is
the destination.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper to consolidate boilerplate for nested VM-Exits that don't
provide any data in exit_info_*.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor the svm_exit_handlers API to pass @vcpu instead of @svm to
allow directly invoking common x86 exit handlers (in a future patch).
Opportunistically convert an absurd number of instances of 'svm->vcpu'
to direct uses of 'vcpu' to avoid pointless casting.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The logic of update_cr0_intercept is pointlessly complicated.
All svm_set_cr0 is compute the effective cr0 and compare it with
the guest value.
Inlining the function and simplifying the condition
clarifies what it is doing.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that SVM is using a separate vmcb01 and vmcb02 (and also uses the vmcb12
naming) we can give clearer names to functions that write to and read
from those VMCBs. Likewise, variables and parameters can be renamed
from nested_vmcb to vmcb12.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves the asid_generation from the vcpu to the vmcb
in order to track the ASID generation that was active the last
time the vmcb was run. If sd->asid_generation changes between
two runs, the old ASID is invalid and must be changed.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210112164313.4204-3-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves the physical cpu tracking from the vcpu
to the vmcb in svm_switch_vmcb. If either vmcb01 or vmcb02
change physical cpus from one vmrun to the next the vmcb's
previous cpu is preserved for comparison with the current
cpu and the vmcb is marked dirty if different. This prevents
the processor from using old cached data for a vmcb that may
have been updated on a prior run on a different processor.
It also moves the physical cpu check from svm_vcpu_load
to pre_svm_run as the check only needs to be done at run.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210112164313.4204-2-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm->vmcb will now point to a separate vmcb for L1 (not nested) or L2
(nested).
The main advantages are removing get_host_vmcb and hsave, in favor of
concepts that are shared with VMX.
We don't need anymore to stash the L1 registers in hsave while L2
runs, but we need to copy the VMLOAD/VMSAVE registers from VMCB01 to
VMCB02 and back. This more or less has the same cost, but code-wise
nested_svm_vmloadsave can be reused.
This patch omits several optimizations that are possible:
- for simplicity there is some wholesale copying of vmcb.control areas
which can go away.
- we should be able to better use the VMCB01 and VMCB02 clean bits.
- another possibility is to always use VMCB01 for VMLOAD and VMSAVE,
thus avoiding the copy of VMLOAD/VMSAVE registers from VMCB01 to
VMCB02 and back.
Tested:
kvm-unit-tests
kvm self tests
Loaded fedora nested guest on fedora
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20201011184818.3609-3-cavery@redhat.com>
[Fix conflicts; keep VMCB02 G_PAT up to date whenever guest writes the
PAT MSR; do not copy CR4 over from VMCB01 as it is not needed anymore; add
a few more comments. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't strip the C-bit from the faulting address on an intercepted #PF,
the address is a virtual address, not a physical address.
Fixes: 0ede79e132 ("KVM: SVM: Clear C-bit from the page fault address")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Directly connect the 'npt' param to the 'npt_enabled' variable so that
runtime adjustments to npt_enabled are reflected in sysfs. Move the
!PAE restriction to a runtime check to ensure NPT is forced off if the
host is using 2-level paging, and add a comment explicitly stating why
NPT requires a 64-bit kernel or a kernel with PAE enabled.
Opportunistically switch the param to octal permissions.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305021637.3768573-1-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This problem was reported on a SVM guest while executing kexec.
Kexec fails to load the new kernel when the PCID feature is enabled.
When kexec starts loading the new kernel, it starts the process by
resetting the vCPU's and then bringing each vCPU online one by one.
The vCPU reset is supposed to reset all the register states before the
vCPUs are brought online. However, the CR4 register is not reset during
this process. If this register is already setup during the last boot,
all the flags can remain intact. The X86_CR4_PCIDE bit can only be
enabled in long mode. So, it must be enabled much later in SMP
initialization. Having the X86_CR4_PCIDE bit set during SMP boot can
cause a boot failures.
Fix the issue by resetting the CR4 register in init_vmcb().
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <161471109108.30811.6392805173629704166.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix the interpreation of nested_svm_vmexit()'s return value when
synthesizing a nested VM-Exit after intercepting an SVM instruction while
L2 was running. The helper returns '0' on success, whereas a return
value of '0' in the exit handler path means "exit to userspace". The
incorrect return value causes KVM to exit to userspace without filling
the run state, e.g. QEMU logs "KVM: unknown exit, hardware reason 0".
Fixes: 14c2bf81fc ("KVM: SVM: Fix #GP handling for doubly-nested virtualization")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210224005627.657028-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Advertise INVPCID by default (if supported by the host kernel) instead
of having both SVM and VMX opt in. INVPCID was opt in when it was a
VMX only feature so that KVM wouldn't prematurely advertise support
if/when it showed up in the kernel on AMD hardware.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210212003411.1102677-3-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intercept INVPCID if it's disabled in the guest, even when using NPT,
as KVM needs to inject #UD in this case.
Fixes: 4407a797e9 ("KVM: SVM: Enable INVPCID feature on AMD")
Cc: Babu Moger <babu.moger@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210212003411.1102677-2-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The sparse tool complains as follows:
arch/x86/kvm/svm/svm.c:204:6: warning:
symbol 'svm_gp_erratum_intercept' was not declared. Should it be static?
This symbol is not used outside of svm.c, so this
commit marks it static.
Fixes: 82a11e9c6f ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Message-Id: <20210210075958.1096317-1-weiyongjun1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Push the injection of #GP up to the callers, so that they can just use
kvm_complete_insn_gp. __kvm_set_dr is pretty much what the callers can use
together with kvm_complete_insn_gp, so rename it to kvm_set_dr and drop
the old kvm_set_dr wrapper.
This also allows nested VMX code, which really wanted to use __kvm_set_dr,
to use the right function.
While at it, remove the kvm_require_dr() check from the SVM interception.
The APM states:
All normal exception checks take precedence over the SVM intercepts.
which includes the CR4.DE=1 #UD.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop a defunct forward declaration of svm_complete_interrupts().
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename cr3_lm_rsvd_bits to reserved_gpa_bits, and use it for all GPA
legality checks. AMD's APM states:
If the C-bit is an address bit, this bit is masked from the guest
physical address when it is translated through the nested page tables.
Thus, any access that can conceivably be run through NPT should ignore
the C-bit when checking for validity.
For features that KVM emulates in software, e.g. MTRRs, there is no
clear direction in the APM for how the C-bit should be handled. For
such cases, follow the SME behavior inasmuch as possible, since SEV is
is essentially a VM-specific variant of SME. For SME, the APM states:
In this case the upper physical address bits are treated as reserved
when the feature is enabled except where otherwise indicated.
Collecting the various relavant SME snippets in the APM and cross-
referencing the omissions with Linux kernel code, this leaves MTTRs and
APIC_BASE as the only flows that KVM emulates that should _not_ ignore
the C-bit.
Note, this means the reserved bit checks in the page tables are
technically broken. This will be remedied in a future patch.
Although the page table checks are technically broken, in practice, it's
all but guaranteed to be irrelevant. NPT is required for SEV, i.e.
shadowing page tables isn't needed in the common case. Theoretically,
the checks could be in play for nested NPT, but it's extremely unlikely
that anyone is running nested VMs on SEV, as doing so would require L1
to expose sensitive data to L0, e.g. the entire VMCB. And if anyone is
running nested VMs, L0 can't read the guest's encrypted memory, i.e. L1
would need to put its NPT in shared memory, in which case the C-bit will
never be set. Or, L1 could use shadow paging, but again, if L0 needs to
read page tables, e.g. to load PDPTRs, the memory can't be encrypted if
L1 has any expectation of L0 doing the right thing.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the hard-coded value for bit# 1 in EFLAGS, with the available
#define.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20210203012842.101447-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently we save host state like user-visible host MSRs, and do some
initial guest register setup for MSR_TSC_AUX and MSR_AMD64_TSC_RATIO
in svm_vcpu_load(). Defer this until just before we enter the guest by
moving the handling to kvm_x86_ops.prepare_guest_switch() similarly to
how it is done for the VMX implementation.
Additionally, since handling of saving/restoring host user MSRs is the
same both with/without SEV-ES enabled, move that handling to common
code.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20210202190126.2185715-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that the set of host user MSRs that need to be individually
saved/restored are the same with/without SEV-ES, we can drop the
.sev_es_restored flag and just iterate through the list unconditionally
for both cases. A subsequent patch can then move these loops to a
common path.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20210202190126.2185715-3-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using a guest workload which simply issues 'hlt' in a tight loop to
generate VMEXITs, it was observed (on a recent EPYC processor) that a
significant amount of the VMEXIT overhead measured on the host was the
result of MSR reads/writes in svm_vcpu_load/svm_vcpu_put according to
perf:
67.49%--kvm_arch_vcpu_ioctl_run
|
|--23.13%--vcpu_put
| kvm_arch_vcpu_put
| |
| |--21.31%--native_write_msr
| |
| --1.27%--svm_set_cr4
|
|--16.11%--vcpu_load
| |
| --15.58%--kvm_arch_vcpu_load
| |
| |--13.97%--svm_set_cr4
| | |
| | |--12.64%--native_read_msr
Most of these MSRs relate to 'syscall'/'sysenter' and segment bases, and
can be saved/restored using 'vmsave'/'vmload' instructions rather than
explicit MSR reads/writes. In doing so there is a significant reduction
in the svm_vcpu_load/svm_vcpu_put overhead measured for the above
workload:
50.92%--kvm_arch_vcpu_ioctl_run
|
|--19.28%--disable_nmi_singlestep
|
|--13.68%--vcpu_load
| kvm_arch_vcpu_load
| |
| |--9.19%--svm_set_cr4
| | |
| | --6.44%--native_read_msr
| |
| --3.55%--native_write_msr
|
|--6.05%--kvm_inject_nmi
|--2.80%--kvm_sev_es_mmio_read
|--2.19%--vcpu_put
| |
| --1.25%--kvm_arch_vcpu_put
| native_write_msr
Quantifying this further, if we look at the raw cycle counts for a
normal iteration of the above workload (according to 'rdtscp'),
kvm_arch_vcpu_ioctl_run() takes ~4600 cycles from start to finish with
the current behavior. Using 'vmsave'/'vmload', this is reduced to
~2800 cycles, a savings of 39%.
While this approach doesn't seem to manifest in any noticeable
improvement for more realistic workloads like UnixBench, netperf, and
kernel builds, likely due to their exit paths generally involving IO
with comparatively high latencies, it does improve overall overhead
of KVM_RUN significantly, which may still be noticeable for certain
situations. It also simplifies some aspects of the code.
With this change, explicit save/restore is no longer needed for the
following host MSRs, since they are documented[1] as being part of the
VMCB State Save Area:
MSR_STAR, MSR_LSTAR, MSR_CSTAR,
MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE,
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
MSR_IA32_SYSENTER_EIP,
MSR_FS_BASE, MSR_GS_BASE
and only the following MSR needs individual handling in
svm_vcpu_put/svm_vcpu_load:
MSR_TSC_AUX
We could drop the host_save_user_msrs array/loop and instead handle
MSR read/write of MSR_TSC_AUX directly, but we leave that for now as
a potential follow-up.
Since 'vmsave'/'vmload' also handles the LDTR and FS/GS segment
registers (and associated hidden state)[2], some of the code
previously used to handle this is no longer needed, so we drop it
as well.
The first public release of the SVM spec[3] also documents the same
handling for the host state in question, so we make these changes
unconditionally.
Also worth noting is that we 'vmsave' to the same page that is
subsequently used by 'vmrun' to record some host additional state. This
is okay, since, in accordance with the spec[2], the additional state
written to the page by 'vmrun' does not overwrite any fields written by
'vmsave'. This has also been confirmed through testing (for the above
CPU, at least).
[1] AMD64 Architecture Programmer's Manual, Rev 3.33, Volume 2, Appendix B, Table B-2
[2] AMD64 Architecture Programmer's Manual, Rev 3.31, Volume 3, Chapter 4, VMSAVE/VMLOAD
[3] Secure Virtual Machine Architecture Reference Manual, Rev 3.01
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20210202190126.2185715-2-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add svm_asm*() macros, a la the existing vmx_asm*() macros, to handle
faults on SVM instructions instead of using the generic __ex(), a.k.a.
__kvm_handle_fault_on_reboot(). Using asm goto generates slightly
better code as it eliminates the in-line JMP+CALL sequences that are
needed by __kvm_handle_fault_on_reboot() to avoid triggering BUG()
from fixup (which generates bad stack traces).
Using SVM specific macros also drops the last user of __ex() and the
the last asm linkage to kvm_spurious_fault(), and adds a helper for
VMSAVE, which may gain an addition call site in the future (as part
of optimizing the SVM context switching).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A subsequent patch introduces macros in preparation for simplifying the
definition for vmx_x86_ops and svm_x86_ops. Making the naming more uniform
expands the coverage of the macros. Add vmx/svm prefix to the following
functions: update_exception_bitmap(), enable_nmi_window(),
enable_irq_window(), update_cr8_intercept and enable_smi_window().
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Under the case of nested on nested (L0, L1, L2 are all hypervisors),
we do not support emulation of the vVMLOAD/VMSAVE feature, the
L0 hypervisor can inject the proper #VMEXIT to inform L1 of what is
happening and L1 can avoid invoking the #GP workaround. For this
reason we turns on guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM
running inside VM to receive the notification and change behavior.
Similarly we check if vcpu is under guest mode before emulating the
vmware-backdoor instructions. For the case of nested on nested, we
let the guest handle it.
Co-developed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210126081831.570253-5-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
New AMD CPUs have a change that checks #VMEXIT intercept on special SVM
instructions before checking their EAX against reserved memory region.
This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT
is triggered before #GP. KVM doesn't need to intercept and emulate #GP
faults as #GP is supposed to be triggered.
Co-developed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210126081831.570253-4-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
CPUs check EAX against reserved memory regions (e.g. SMM memory on host)
before checking VMCB's instruction intercept. If EAX falls into such
memory areas, #GP is triggered before VMEXIT. This causes problem under
nested virtualization. To solve this problem, KVM needs to trap #GP and
check the instructions triggering #GP. For VM execution instructions,
KVM emulates these instructions.
Co-developed-by: Wei Huang <wei.huang2@amd.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Message-Id: <20210126081831.570253-3-wei.huang2@amd.com>
[Conditionally enable #GP intercept. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
to 0 when the condition (e.g. RTM) happens. The value can be used to
initialize dr6 and also be the XOR mask between the #DB exit
qualification (or payload) and DR6.
Concerning that DR6_INIT is used as initial value only once, rename it
to DR6_ACTIVE_LOW and apply it in other places, which would make the
incoming changes for bus lock debug exception more simple.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
[Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't let KVM load when running as an SEV guest, regardless of what
CPUID says. Memory is encrypted with a key that is not accessible to
the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll
see garbage when reading the VMCB.
Technically, KVM could decrypt all memory that needs to be accessible to
the L0 and use shadow paging so that L0 does not need to shadow NPT, but
exposing such information to L0 largely defeats the purpose of running as
an SEV guest. This can always be revisited if someone comes up with a
use case for running VMs inside SEV guests.
Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM
is doomed even if the SEV guest is debuggable and the hypervisor is willing
to decrypt the VMCB. This may or may not be fixed on CPUs that have the
SVME_ADDR_CHK fix.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210202212017.2486595-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:
CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE
CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE
It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.
Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).
A trace obtained with this patch applied looks like this:
CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE
CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT
Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.
Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
where the guest vCPU register state is updated and then the vCPU is VMRUN
to begin execution of the AP. For an SEV-ES guest, this won't work because
the guest register state is encrypted.
Following the GHCB specification, the hypervisor must not alter the guest
register state, so KVM must track an AP/vCPU boot. Should the guest want
to park the AP, it must use the AP Reset Hold exit event in place of, for
example, a HLT loop.
First AP boot (first INIT-SIPI-SIPI sequence):
Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
support. It is up to the guest to transfer control of the AP to the
proper location.
Subsequent AP boot:
KVM will expect to receive an AP Reset Hold exit event indicating that
the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
awaken it. When the AP Reset Hold exit event is received, KVM will place
the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
sequence, KVM will make the vCPU runnable. It is again up to the guest
to then transfer control of the AP to the proper location.
To differentiate between an actual HLT and an AP Reset Hold, a new MP
state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
placed in upon receiving the AP Reset Hold exit event. Additionally, to
communicate the AP Reset Hold exit event up to userspace (if needed), a
new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
a new function that, for non SEV-ES guests, invokes the original SIPI
delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
implements the logic above.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 16809ecdc1 moved __svm_vcpu_run the prototype to svm.h,
but forgot to remove the original from svm.c.
Fixes: 16809ecdc1 ("KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201220200339.65115-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The run sequence is different for an SEV-ES guest compared to a legacy or
even an SEV guest. The guest vCPU register state of an SEV-ES guest will
be restored on VMRUN and saved on VMEXIT. There is no need to restore the
guest registers directly and through VMLOAD before VMRUN and no need to
save the guest registers directly and through VMSAVE on VMEXIT.
Update the svm_vcpu_run() function to skip register state saving and
restoring and provide an alternative function for running an SEV-ES guest
in vmenter.S
Additionally, certain host state is restored across an SEV-ES VMRUN. As
a result certain register states are not required to be restored upon
VMEXIT (e.g. FS, GS, etc.), so only do that if the guest is not an SEV-ES
guest.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fb1c66d32f2194e171b95fc1a8affd6d326e10c1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
hardware will restore certain registers on VMEXIT, but not save them on
VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
following changes:
General vCPU load changes:
- During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
area as these registers will be restored on VMEXIT.
General vCPU put changes:
- Do not attempt to restore registers that SEV-ES hardware has already
restored on VMEXIT.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An SEV-ES vCPU requires additional VMCB initialization requirements for
vCPU creation and vCPU load/put requirements. This includes:
General VMCB initialization changes:
- Set a VMCB control bit to enable SEV-ES support on the vCPU.
- Set the VMCB encrypted VM save area address.
- CRx registers are part of the encrypted register state and cannot be
updated. Remove the CRx register read and write intercepts and replace
them with CRx register write traps to track the CRx register values.
- Certain MSR values are part of the encrypted register state and cannot
be updated. Remove certain MSR intercepts (EFER, CR_PAT, etc.).
- Remove the #GP intercept (no support for "enable_vmware_backdoor").
- Remove the XSETBV intercept since the hypervisor cannot modify XCR0.
General vCPU creation changes:
- Set the initial GHCB gpa value as per the GHCB specification.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <3a8aef366416eddd5556dfa3fdc212aafa1ad0a2.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SVM host save area is used to restore some host state on VMEXIT of an
SEV-ES guest. After allocating the save area, clear it and add the
encryption mask to the SVM host save area physical address that is
programmed into the VM_HSAVE_PA MSR.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b77aa28af6d7f1a0cb545959e08d6dc75e0c3cba.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The GHCB specification defines how NMIs are to be handled for an SEV-ES
guest. To detect the completion of an NMI the hypervisor must not
intercept the IRET instruction (because a #VC while running the NMI will
issue an IRET) and, instead, must receive an NMI Complete exit event from
the guest.
Update the KVM support for detecting the completion of NMIs in the guest
to follow the GHCB specification. When an SEV-ES guest is active, the
IRET instruction will no longer be intercepted. Now, when the NMI Complete
exit event is received, the iret_interception() function will be called
to simulate the completion of the NMI.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5ea3dd69b8d4396cefdc9048ebc1ab7caa70a847.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The guest FPU state is automatically restored on VMRUN and saved on VMEXIT
by the hardware, so there is no reason to do this in KVM. Eliminate the
allocation of the guest_fpu save area and key off that to skip operations
related to the guest FPU state.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <173e429b4d0d962c6a443c4553ffdaf31b7665a4.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
kvm_x86_ops function to take a struct kvm parameter so that the capability
can be reported at a VM level.
Since this op is also called during KVM initialization and before a struct
kvm instance is available, comments will be added to each implementation
of has_emulated_msr() to indicate the kvm parameter can be null.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR8 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5a01033f4c8b3106ca9374b7cadf8e33da852df1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR4 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES support introduces new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR0 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of EFER write access is not
recommended. EFER interception occurs prior to EFER being modified and
the hypervisor is unable to modify EFER itself because the register is
located in the encrypted register state.
SEV-ES support introduces a new EFER write trap. This trap provides
intercept support of an EFER write after it has been modified. The new
EFER value is provided in the VMCB EXITINFO1 field, allowing the
hypervisor to track the setting of the guest EFER.
Add support to track the value of the guest EFER value using the EFER
write trap so that the hypervisor understands the guest operating mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8993149352a3a87cd0625b3b61bfd31ab28977e1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.
For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV-ES adds a new VMEXIT reason code, VMGEXIT. Initial support for a
VMGEXIT includes mapping the GHCB based on the guest GPA, which is
obtained from a new VMCB field, and then validating the required inputs
for the VMGEXIT exit reason.
Since many of the VMGEXIT exit reasons correspond to existing VMEXIT
reasons, the information from the GHCB is copied into the VMCB control
exit code areas and KVM register areas. The standard exit handlers are
invoked, similar to standard VMEXIT processing. Before restarting the
vCPU, the GHCB is updated with any registers that have been updated by
the hypervisor.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c6a4ed4294a369bd75c44d03bd7ce0f0c3840e50.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is a pre-patch to consolidate some exit handling code into callable
functions. Follow-on patches for SEV-ES exit handling will then be able
to use them from the sev.c file.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5b8b0ffca8137f3e1e257f83df9f5c881c8a96a3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a SHUTDOWN VMEXIT is encountered, normally the VMCB is re-initialized
so that the guest can be re-launched. But when a guest is running as an
SEV-ES guest, the VMSA cannot be re-initialized because it has been
encrypted. For now, just return -EINVAL to prevent a possible attempt at
a guest reset.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <aa6506000f6f3a574de8dbcdab0707df844cb00c.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest is running as an SEV-ES guest, it is not possible to emulate
instructions. Add support to prevent instruction emulation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f6355ea3024fda0a3eb5eb99c6b62dca10d792bd.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the guest register state of an SEV-ES guest is encrypted, debugging
is not supported. Update the code to prevent guest debugging when the
guest has protected state.
Additionally, an SEV-ES guest must only and always intercept DR7 reads and
writes. Update set_dr_intercepts() and clr_dr_intercepts() to account for
this.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8db966fa2f9803d6454ce773863025d0e2e7f3cc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest is running under SEV-ES, the hypervisor cannot access the
guest register state. There are numerous places in the KVM code where
certain registers are accessed that are not allowed to be accessed (e.g.
RIP, CR0, etc). Add checks to prevent register accesses and add intercept
update support at various points within the KVM code.
Also, when handling a VMGEXIT, exceptions are passed back through the
GHCB. Since the RDMSR/WRMSR intercepts (may) inject a #GP on error,
update the SVM intercepts to handle this for SEV-ES guests.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[Redo MSR part using the .complete_emulated_msr callback. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>