Commit Graph

1144 Commits

Author SHA1 Message Date
Sean Christopherson
23f079c249 KVM: VMX: Refuse to load kvm_intel if EPT and NX are disabled
Refuse to load KVM if NX support is not available and EPT is not enabled.
Shadow paging has assumed NX support since commit 9167ab7993 ("KVM:
vmx, svm: always run with EFER.NXE=1 when shadow paging is active"), so
for all intents and purposes this has been a de facto requirement for
over a year.

Do not require NX support if EPT is enabled purely because Intel CPUs let
firmware disable NX support via MSR_IA32_MISC_ENABLES.  If not for that,
VMX (and KVM as a whole) could require NX support with minimal risk to
breaking userspace.

Fixes: 9167ab7993 ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210615164535.2146172-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-18 06:24:49 -04:00
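Illustrative only: a minimal sketch of such a load-time check, assuming it lives in VMX's hardware_setup() and that the message and return value are placeholders, not taken from the actual patch:

  	/*
  	 * Sketch (placement and message assumed): refuse to load kvm_intel
  	 * when firmware has disabled NX and EPT is not enabled, since
  	 * shadow paging relies on EFER.NX.
  	 */
  	if (!enable_ept && !boot_cpu_has(X86_FEATURE_NX)) {
  		pr_err_ratelimited("NX (Execute Disable) not supported\n");
  		return -EOPNOTSUPP;
  	}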
Sean Christopherson
c5ffd408cd KVM: nVMX: Drop redundant checks on vmcs12 in EPTP switching emulation
Drop the explicit check on EPTP switching being enabled.  The EPTP
switching check is handled in the generic VMFUNC function check, while
the underlying VMFUNC enablement check is done by hardware and redone
by generic VMFUNC emulation.

The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a
consistency check; keep it, but add a WARN.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
546e8398bc KVM: nVMX: WARN if subtly-impossible VMFUNC conditions occur
WARN and inject #UD when emulating VMFUNC for L2 if the function is
out-of-bounds or if VMFUNC is not enabled in vmcs12.  Neither condition
should occur in practice, as the CPU is supposed to prioritize the #UD
over VM-Exit for out-of-bounds input and KVM is supposed to enable
VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of
those dependencies is obvious.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
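A rough sketch of such a guard as it might appear in the VMFUNC emulation path (placement and exact conditions assumed from the text above):

  	if (WARN_ON_ONCE(function > 63 ||
  			 !nested_cpu_has_vmfunc(vmcs12))) {
  		/*
  		 * Sketch: #UD is what the CPU would deliver for an
  		 * out-of-bounds function, and VMFUNC should never be
  		 * enabled in vmcs02 without being enabled in vmcs12.
  		 */
  		kvm_queue_exception(vcpu, UD_VECTOR);
  		return 1;
  	}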
Sean Christopherson
c906066288 KVM: x86: Drop pointless @reset_roots from kvm_init_mmu()
Remove the @reset_roots param from kvm_init_mmu(); its one user,
kvm_mmu_reset_context(), has already unloaded the MMU and thus freed and
invalidated all roots.  This also happens to be why the reset_roots=true
path doesn't leak roots; they're already invalid.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
39353ab579 KVM: nVMX: Use fast PGD switch when emulating VMFUNC[EPTP_SWITCH]
Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate
VMFUNC[EPTP_SWITCH] instead of nuking all MMUs.  EPTP_SWITCH is the EPT
equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow,
the only hiccup being that A/D enabling is buried in the EPTP.  But, that
is easily handled by bouncing through kvm_init_shadow_ept_mmu().

Explicitly request a guest TLB flush if VPID is disabled.  Per Intel's
SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined
mappings associated with VPID 0000H (for all PCIDs and for all EP4TA
values, where EP4TA is the value of bits 51:12 of EPTP)".

Note, this technically is a very bizarre bug fix of sorts if L2 is using
PAE paging, as avoiding the full MMU reload also avoids incorrectly
reloading the PDPTEs, which the SDM explicitly states are not touched:

  If PAE paging is in use, an EPTP-switching VMFUNC does not load the
  four page-directory-pointer-table entries (PDPTEs) from the
  guest-physical address in CR3. The logical processor continues to use
  the four guest-physical addresses already present in the PDPTEs. The
  guest-physical address in CR3 is not translated through the new EPT
  paging structures (until some operation that would load the PDPTEs).

In addition to optimizing L2's MMU shenanigans, avoiding the full reload
also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both
root_mmu and guest_mmu.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
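A condensed sketch of the resulting EPTP-switching flow; nested_ept_new_eptp() is a stand-in name for re-initializing the EPT MMU via kvm_init_shadow_ept_mmu():

  	if (vmcs12->ept_pointer != new_eptp) {
  		vmcs12->ept_pointer = new_eptp;

  		/* Re-init the shadow EPT MMU; A/D enabling comes from the EPTP. */
  		nested_ept_new_eptp(vcpu);

  		/* SDM: without VPID, EPTP switching invalidates VPID 0000H mappings. */
  		if (!nested_cpu_has_vpid(vmcs12))
  			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
  	}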
Sean Christopherson
25b62c6274 KVM: nVMX: Free only guest_mode (L2) roots on INVVPID w/o EPT
When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
tag in the MMU role to identify L2+ roots.  From L1's perspective, its
own TLB entries use VPID=0, and INVVPID is not required to invalidate such
entries.  Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
but it is not required to do so.

Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
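A sketch of how L2+ roots can be identified by the guest_mode bit in the MMU role (loop shown for the previous-roots cache only; the actual helper may differ):

  	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
  		hpa_t root = mmu->prev_roots[i].hpa;

  		if (VALID_PAGE(root) && to_shadow_page(root)->role.guest_mode)
  			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
  	}
  	kvm_mmu_free_roots(vcpu, mmu, roots_to_free);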
Sean Christopherson
50a417962a KVM: nVMX: Consolidate VM-Enter/VM-Exit TLB flush and MMU sync logic
Drop the dedicated nested_vmx_transition_mmu_sync() now that the MMU sync
is handled via KVM_REQ_TLB_FLUSH_GUEST, and fold that flush into the
all-encompassing nested_vmx_transition_tlb_flush().

Opportunistically add a comment explaining why nested EPT never needs to
sync the MMU on VM-Enter.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
Sean Christopherson
b512910039 KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:52 -04:00
Sean Christopherson
272b0a998d KVM: nVMX: Don't clobber nested MMU's A/D status on EPTP switch
Drop bogus logic that incorrectly clobbers the accessed/dirty enabling
status of the nested MMU on an EPTP switch.  When nested EPT is enabled,
walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2.

This is likely a benign bug, as mmu->ept_ad is never consumed (since the
MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will
never propagate into future shadow pages since the nested MMU isn't used
to map anything, just to walk L2's page tables.

Note, KVM also does a full MMU reload, i.e. the guest_mmu will be
recreated using the new EPTP, and thus any change in A/D enabling will be
properly recognized in the relevant MMU.

Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Sean Christopherson
0e75225dfa KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmap
Use BIT_ULL() instead of an open-coded shift to check whether or not a
function is enabled in L1's VMFUNC bitmap.  This is a benign bug as KVM
supports only bit 0, and will fail VM-Enter if any other bits are set,
i.e. bits 63:32 are guaranteed to be zero.

Note, "function" is bounded by hardware as VMFUNC will #UD before taking
a VM-Exit if the function is greater than 63.

Before:
  if ((vmcs12->vm_function_control & (1 << function)) == 0)
   0x000000000001a916 <+118>:	mov    $0x1,%eax
   0x000000000001a91b <+123>:	shl    %cl,%eax
   0x000000000001a91d <+125>:	cltq
   0x000000000001a91f <+127>:	and    0x128(%rbx),%rax

After:
  if (!(vmcs12->vm_function_control & BIT_ULL(function & 63)))
   0x000000000001a955 <+117>:	mov    0x128(%rbx),%rdx
   0x000000000001a95c <+124>:	bt     %rax,%rdx

Fixes: 27c42a1bb8 ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Sean Christopherson
07ffaf343e KVM: nVMX: Sync all PGDs on nested transition with shadow paging
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2.  kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.

Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.

Note, there is no change to the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT.  When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.

Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which request is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings.  And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().

Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Vitaly Kuznetsov
8629b625e0 KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
VMCS12 is used to keep the authoritative state during nested state
migration. In case the 'need_vmcs12_to_shadow_sync' flag is set, we're
in between an L2->L1 vmexit and the L1 guest run, when the actual sync to
the enlightened (or shadow) VMCS happens. Nested state, however, has
no flag for 'need_vmcs12_to_shadow_sync', so vmx_set_nested_state()->
set_current_vmptr() always sets it. The enlightened vmptrld path, however,
doesn't have the quirk, so some VMCS12 changes may not get properly
reflected to eVMCS and L1 will see an incorrect state.

Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
set, the change is effectively a nop: in the former case all changes
will get reflected during the first L2->L1 vmexit, and in the latter
case VMCS12 and eVMCS are already in sync (thanks to
copy_enlightened_to_vmcs12() in vmx_get_nested_state()).

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov
dc31338552 KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
When nested state migration happens during L1's execution, it
is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
At least genuine Hyper-V seems to not be very happy when 'clean fields'
data changes underneath it.

'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
and prepare_vmcs02_rare(), so we can reset it from prepare_vmcs02() instead.

While at it, update a comment stating why exactly we need to reset
'hv_clean_fields' data from L0.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-10-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov
b7685cfd5e KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
VMCS sync when we exit to L1. The comment in nested_vmx_failValid()
validly states why shadow vmcs sync can be omitted but this doesn't
apply to enlightened VMCS as it 'shadows' all VMCS12 fields.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov
d6bf71a18c KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
'Clean fields' data from the enlightened VMCS is only valid upon vmentry: the
L1 hypervisor is not obliged to keep it up-to-date while it is mangling L2's
state, so a KVM_GET_NESTED_STATE request may come at a wrong moment, when the
actual eVMCS changes are not yet reflected in 'hv_clean_fields'. As VMCS12 is
used as the source of ultimate truth upon migration, we must make sure we pick
up all the changes to eVMCS, and thus 'clean fields' data must be ignored.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
3b19b81acf KVM: nVMX: Release enlightened VMCS on VMCLEAR
Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when
enlightened VMCS is in use. TLFS has the following brief description:
"The L1 hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state". Normally,
this change can be ignored as unmapping active eVMCS can be postponed
until the next VMLAUNCH instruction but in case nested state is migrated
with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped
may result in its synchronization with VMCS12 and this is incorrect:
L1 hypervisor is free to reuse inactive eVMCS memory for something else.

Inactive eVMCS after VMCLEAR can just be unmapped.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
278499686b KVM: nVMX: Introduce 'EVMPTR_MAP_PENDING' post-migration state
Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
can not be called directly from vmx_set_nested_state() as KVM may not have
all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
restored yet). Enlightened VMCS is mapped later while getting nested state
pages. In the meantime, vmx->nested.hv_evmcs_vmptr remains 'EVMPTR_INVALID'
and it's indistinguishable from the 'evmcs is not in use' case. This leads to
certain issues; in particular, if KVM_GET_NESTED_STATE is called right
after KVM_SET_NESTED_STATE, the KVM_STATE_NESTED_EVMCS flag in the resulting
state will be unset (and such state will later fail to load).

Introduce 'EVMPTR_MAP_PENDING' state to detect not-yet-mapped eVMCS after
restore. With this, the 'is_guest_mode(vcpu)' hack in vmx_has_valid_vmcs12()
is no longer needed.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
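Taken together with the EVMPTR_INVALID change below, the special values and the validity check can be sketched as follows (exact encodings assumed):

  #define EVMPTR_INVALID		(-1ULL)
  #define EVMPTR_MAP_PENDING	(-2ULL)

  static inline bool evmptr_is_valid(u64 vmptr)
  {
  	return vmptr != EVMPTR_INVALID && vmptr != EVMPTR_MAP_PENDING;
  }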
Vitaly Kuznetsov
25641cafab KVM: nVMX: Make copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() return 'void'
copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() don't return any result;
make them return 'void'.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
0276171680 KVM: nVMX: Release eVMCS when enlightened VMENTRY was disabled
In theory, L1 can disable enlightened VMENTRY in the VP assist page and then
try to issue VMLAUNCH/VMRESUME. While nested_vmx_handle_enlightened_vmptrld()
properly handles this as 'EVMPTRLD_DISABLED', a previously mapped eVMCS
remains mapped and thus all evmptr_is_valid() checks will still pass and
nested_vmx_run() will proceed when it shouldn't.

Release eVMCS immediately when we detect that enlightened vmentry was
disabled by L1.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Vitaly Kuznetsov
6a789ca5d5 KVM: nVMX: Don't set 'dirty_vmcs12' flag on enlightened VMPTRLD
'dirty_vmcs12' is only checked in prepare_vmcs02_early()/prepare_vmcs02()
and both checks look like:

 'vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)'

so for eVMCS case the flag changes nothing. Drop the assignment to avoid
the confusion.

No functional change intended.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Vitaly Kuznetsov
1e9dfbd748 KVM: nVMX: Use '-1' in 'hv_evmcs_vmptr' to indicate that eVMCS is not in use
Instead of checking 'vmx->nested.hv_evmcs' use '-1' in
'vmx->nested.hv_evmcs_vmptr' to indicate 'evmcs is not in use' state. This
matches how we check 'vmx->nested.current_vmptr'. Introduce EVMPTR_INVALID
and evmptr_is_valid() and use it instead of raw '-1' check as a preparation
to adding other 'special' values.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Maxim Levitsky
158a48ecf7 KVM: x86: avoid loading PDPTRs after migration when possible
If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are
part of the migration state and are correctly
restored by those ioctls.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Maxim Levitsky
0f85722341 KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
Similar to the rest of guest page accesses after a migration,
this access should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:47 -04:00
Sean Christopherson
bcb72d0627 KVM: nVMX: Drop obsolete (and pointless) pdptrs_changed() check
Remove the pdptrs_changed() check when loading L2's CR3.  The set of
available registers is always reset when switching VMCSes (see commit
e5d03de593, "KVM: nVMX: Reset register cache (available and dirty
masks) on VMCS switch"), thus the "are PDPTRs available" check will
always fail.  And even if it didn't fail, reading guest memory to check
the PDPTRs is just as expensive as reading guest memory to load 'em.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:46 -04:00
Vineeth Pillai
3c86c0d3db KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx
Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Krish Sadhukhan
b93af02c67 KVM: nVMX: nSVM: 'nested_run' should count guest-entry attempts that make it to guest code
Currently, the 'nested_run' statistic counts all guest-entry attempts,
including those that fail during vmentry checks on Intel and during
consistency checks on AMD. Convert this statistic to count only those
guest-entries that make it past these state checks and make it to guest
code. This will tell us the number of guest-entries that actually executed
or tried to execute guest code.

Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Sean Christopherson
ecc513e5bb KVM: x86: Drop "pre_" from enter/leave_smm() helpers
Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers.  The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Vitaly Kuznetsov
4651fc56ba KVM: x86: Drop vendor specific functions for APICv/AVIC enablement
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov
fdf513e37a KVM: x86: Use common 'enable_apicv' variable for both APICv and AVIC
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Jim Mattson
966eefb896 KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable
Don't allow posted interrupts to modify a stale posted interrupt
descriptor (including the initial value of 0).

Empirical tests on real hardware reveal that a posted interrupt
descriptor referencing an unbacked address has PCI bus error semantics
(reads as all 1's; writes are ignored). However, kvm can't distinguish
unbacked addresses from device-backed (MMIO) addresses, so it should
really ask userspace for an MMIO completion. That's overly
complicated, so just punt with KVM_INTERNAL_ERROR.

Don't return the error until the posted interrupt descriptor is
actually accessed. We don't want to break the existing kvm-unit-tests
that assume they can launch an L2 VM with a posted interrupt
descriptor that references MMIO space in L1.

Fixes: 6beb7bd52e ("kvm: nVMX: Refactor nested_get_vmcs12_pages()")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-8-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:31 -04:00
Jim Mattson
0fe998b295 KVM: nVMX: Fail on MMIO completion for nested posted interrupts
When the kernel has no mapping for the vmcs02 virtual APIC page,
userspace MMIO completion is necessary to process nested posted
interrupts. This is not a configuration that KVM supports. Rather than
silently ignoring the problem, try to exit to userspace with
KVM_INTERNAL_ERROR.

Note that the event that triggers this error is consumed as a
side-effect of a call to kvm_check_nested_events. On some paths
(notably through kvm_vcpu_check_block), the error is dropped. In any
case, this is an incremental improvement over always ignoring the
error.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-7-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:31 -04:00
Jim Mattson
650293c3de KVM: nVMX: Add a return code to vmx_complete_nested_posted_interrupt
No functional change intended.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210604172611.281819-4-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:30 -04:00
Ilias Stamatis
d041b5ea93 KVM: nVMX: Enable nested TSC scaling
Calculate the TSC offset and multiplier on nested transitions and expose
the TSC scaling feature to L1.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-11-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
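The merging of L1's and L2's TSC parameters on nested transitions can be sketched as follows, assuming VMX's fixed-point multiplier format (48 fractional bits) and the math64.h helpers:

  	/*
  	 * Sketch:
  	 *   mult02   = (mult01 * mult12) >> frac_bits
  	 *   offset02 = ((offset01 * mult12) >> frac_bits) + offset12
  	 */
  	u64 mult02 = mul_u64_u64_shr(l1_multiplier, l2_multiplier,
  				     kvm_tsc_scaling_ratio_frac_bits);
  	u64 offset02 = mul_s64_u64_shr((s64)l1_offset, l2_multiplier,
  				       kvm_tsc_scaling_ratio_frac_bits) + l2_offset;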
Ilias Stamatis
1ab9287add KVM: X86: Add vendor callbacks for writing the TSC multiplier
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.

Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
edcfe54058 KVM: X86: Move write_l1_tsc_offset() logic to common code and rename it
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.

Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
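A sketch of the common caller after the rename; the helper and static_call names here are illustrative, pieced together from this and the surrounding patches:

  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
  {
  	vcpu->arch.l1_tsc_offset = l1_offset;

  	/* When L2 is running, combine L1's offset with what L1 set for L2. */
  	if (is_guest_mode(vcpu))
  		vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset(l1_offset,
  				static_call(kvm_x86_get_l2_tsc_offset)(vcpu),
  				static_call(kvm_x86_get_l2_tsc_multiplier)(vcpu));
  	else
  		vcpu->arch.tsc_offset = l1_offset;

  	/* Vendor code now merely writes the given value to hardware. */
  	static_call(kvm_x86_write_tsc_offset)(vcpu, vcpu->arch.tsc_offset);
  }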
Ilias Stamatis
307a94c721 KVM: X86: Add functions for retrieving L2 TSC fields from common code
In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-7-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
3c0f99366e KVM: nVMX: Add a TSC multiplier field in VMCS12
This is required for supporting nested TSC scaling.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-6-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
805d705ff8 KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Gustavo A. R. Silva
551912d286 KVM: x86: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding break statements instead of just letting
the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20210528200756.GA39320@embeddedor>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-10 07:48:45 -04:00
Yuan Yao
e87e46d5f3 KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
kvm_get_linear_rip() handles the x86/long mode cases well and has
better readability. __kvm_set_rflags() also uses the paired
function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip
set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the
"CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and
handle_exception_nmi() to use it.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-28 12:57:53 -04:00
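In sketch form, the change amounts to the following (old expression shown for contrast; exact call sites assumed):

  	/* Before: open-coded linear RIP. */
  	vcpu->arch.singlestep_rip = kvm_rip_read(vcpu) +
  				    get_segment_base(vcpu, VCPU_SREG_CS);

  	/* After: use the helper paired with kvm_is_linear_rip(). */
  	vcpu->arch.singlestep_rip = kvm_get_linear_rip(vcpu);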
Marcelo Tosatti
a2486020a8 KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
For VMX, when a vcpu enters HLT emulation, pi_post_block will:

1) Add vcpu to per-cpu list of blocked vcpus.

2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR

With interrupt remapping, an interrupt will set the PIR bit for the
vector programmed for the device on the CPU, test-and-set the
ON bit on the posted interrupt descriptor, and if the ON bit is clear
generate an interrupt for the notification vector.

This way, the target CPU wakes upon a device interrupt and wakes up
the target vcpu.

The problem is that pi_post_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. It's possible for the
following to happen:

1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit
(notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until
RCU watchdog NMIs all vCPUs.

To fix this, use the start_assignment kvm_x86_ops callback to kick
vcpus out of the halt loop, so the notification vector is
properly reprogrammed to the wakeup vector.

Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210526172014.GA29007@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:58:23 -04:00
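A hedged sketch of such a start_assignment callback; the wakeup request used here is an assumption, not necessarily what the patch does:

  static void vmx_pi_start_assignment(struct kvm *kvm)
  {
  	/* Posted interrupts from devices require IOMMU posting support. */
  	if (!irq_remapping_cap(IRQ_POSTING_CAP))
  		return;

  	/*
  	 * Kick every vCPU out of its blocked/HLT state so that blocked
  	 * vCPUs reprogram the PI notification vector to the wakeup vector.
  	 * (Request name assumed.)
  	 */
  	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
  }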
Vitaly Kuznetsov
377872b335 KVM: VMX: Drop unneeded CONFIG_X86_LOCAL_APIC check
CONFIG_X86_LOCAL_APIC is always on when CONFIG_KVM (on x86) since
commit e42eef4ba3 ("KVM: add X86_LOCAL_APIC dependency").

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
2021-05-24 18:47:39 +02:00
Paolo Bonzini
a4345a7cec Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.13, take #1

- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to userspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
2021-05-17 09:55:12 +02:00
Paolo Bonzini
76ea438b4a KVM: X86: Expose bus lock debug exception to guest
Bus lock debug exception is the ability to notify the kernel with a #DB
trap after an instruction acquires a bus lock and is executed when
CPL>0. This allows the kernel to enforce throttling or mitigations on
user applications.

Existence of bus lock debug exception is enumerated via
CPUID.(EAX=7,ECX=0).ECX[24]. Software can enable these exceptions by
setting bit 2 of MSR_IA32_DEBUGCTL. Expose the CPUID bit to the guest and
emulate the MSR handling when the guest enables it.

Support for this feature was originally developed by Xiaoyao Li and
Chenyi Qiang, but code has since changed enough that this patch has
nothing in common with theirs, except for this commit message.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:20 -04:00
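The guest-facing pieces, sketched with assumed constant names (the bit positions come from the text above):

  /* Enumerated by CPUID.(EAX=7,ECX=0).ECX[24]; enabled via IA32_DEBUGCTL[2]. */
  #define DEBUGCTLMSR_BUS_LOCK_DETECT	(1UL << 2)

  	/* Sketch: reject the bit unless the feature is exposed to the guest. */
  	if ((data & DEBUGCTLMSR_BUS_LOCK_DETECT) &&
  	    !guest_cpuid_has(vcpu, X86_FEATURE_BUS_LOCK_DETECT))
  		return 1;	/* WRMSR #GP */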
Sean Christopherson
61a05d444d KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior.  While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.

Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX.  On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:19 -04:00
Sean Christopherson
e5fda4bbad KVM: x86: Move uret MSR slot management to common x86
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:19 -04:00
Sean Christopherson
5e17c62401 KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
Tag TSX_CTRL as not needing to be loaded when RTM isn't supported in the
host.  Crushing the write mask to '0' has the same effect, but requires
more mental gymnastics to understand.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
Sean Christopherson
8ea8b8d6f8 KVM: VMX: Use common x86's uret MSR list as the one true list
Drop VMX's global list of user return MSRs now that VMX doesn't re-sort said
list to isolate "active" MSRs, i.e. now that VMX's list and x86's list have
the same MSRs in the same order.

In addition to eliminating the redundant list, this will also allow moving
more of the list management into common x86.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
Sean Christopherson
ee9d22e08d KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
Explicitly flag a uret MSR as needing to be loaded into hardware instead of
re-sorting the list of "active" MSRs and tracking how many MSRs in total
need to be loaded.  The only benefit to sorting the list is that the loop
to load MSRs during vmx_prepare_switch_to_guest() doesn't need to iterate
over all supported uret MSRs, only those that are active.  But that is a
pointless optimization, as the most common case, running a 64-bit guest,
will load the vast majority of MSRs.  Not to mention that a single WRMSR is
far more expensive than iterating over the list.

Providing a stable list order obviates the need to track a given MSR's
"slot" in the per-CPU list of user return MSRs; all lists simply use the
same ordering.  Future patches will take advantage of the stable order to
further simplify the related code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
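A minimal sketch of the flag-based approach; the struct layout and loop bounds are assumptions:

  struct vmx_uret_msr {
  	bool load_into_hardware;	/* replaces sorting + "active" count */
  	u64 data;
  	u64 mask;
  };

  	/* Switch-to-guest loop: iterate the full list, skip inactive MSRs. */
  	for (i = 0; i < ARRAY_SIZE(vmx->guest_uret_msrs); ++i) {
  		if (!vmx->guest_uret_msrs[i].load_into_hardware)
  			continue;
  		kvm_set_user_return_msr(i, vmx->guest_uret_msrs[i].data,
  					vmx->guest_uret_msrs[i].mask);
  	}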
Sean Christopherson
b6194b94a2 KVM: VMX: Configure list of user return MSRs at module init
Configure the list of user return MSRs that are actually supported at
module init instead of reprobing the list of possible MSRs every time a
vCPU is created.  Curating the list on a per-vCPU basis is pointless; KVM
is completely hosed if the set of supported MSRs changes after module init,
or if the set of MSRs differs per physical CPU.

The per-vCPU lists also increase complexity (see __vmx_find_uret_msr()) and
create corner cases that _should_ be impossible, but theoretically exist
in KVM, e.g. advertising RDTSCP to userspace without actually being able to
virtualize RDTSCP if probing MSR_TSC_AUX fails.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:17 -04:00