Add an IS_ENABLED() check in setup_arch() and call pat_disable()
directly if MTRRs are not supported. This allows to remove the
<asm/memtype.h> include in <asm/mtrr.h>, which pull in lowlevel x86
headers that should not be included for UML builds and will cause build
warnings with a following patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211215165612.554426-2-hch@lst.de
Pick commit fdba608f15 ("KVM: VMX: Wake vCPU when delivering posted
IRQ even if vCPU == this vCPU"). In addition to fixing a bug, it
also aligns the non-nested and nested usage of triggering posted
interrupts, allowing for additional cleanups.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
If a vIRQ is delivered from asynchronous context, the target vCPU can be
the currently running vCPU and can also be blocking, in which case
skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
be a wake event for the vCPU.
The "do nothing" logic when "vcpu == running_vcpu" mostly works only
because the majority of calls to ->deliver_posted_interrupt(), especially
when using posted interrupts, come from synchronous KVM context. But if
a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
use posted interrupts, wake events from the device will be delivered to
KVM from IRQ context, e.g.
vfio_msihandler()
|
|-> eventfd_signal()
|
|-> ...
|
|-> irqfd_wakeup()
|
|->kvm_arch_set_irq_inatomic()
|
|-> kvm_irq_delivery_to_apic_fast()
|
|-> kvm_apic_set_irq()
This also aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.
Fixes: 379a3c8ee4 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
Cc: stable@vger.kernel.org
Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthesize a triple fault if L2 guest state is invalid at the time of
VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
guest state via ioctls(), e.g. KVM_SET_SREGS. KVM should never emulate
invalid guest state, since from L1's perspective, it's architecturally
impossible for L2 to have invalid state while L2 is running in hardware.
E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
or #GP.
Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
can trigger this scenario, as nested VM-Enter correctly rejects any
attempt to enter L2 with invalid state.
RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
Following AMD's SMRAM layout is important as AMD's layout saves/restores
the descriptor cache information, including CS.RPL and SS.RPL, and also
defines all the fields relevant to invalid guest state as read-only, i.e.
so long as the vCPU had valid state before the SMI, which is guaranteed
for L2, RSM will generate valid state unless SMRAM was modified. Intel's
layout saves/restores only the selector, which means that scenarios where
the selector and cached RPL don't match, e.g. conforming code segments,
would yield invalid guest state. Intel CPUs fudge around this issued by
stuffing SS.RPL and CS.RPL on RSM. Per Intel's SDM on the "Default
Treatment of RSM", paraphrasing for brevity:
IF internal storage indicates that the [CPU was post-VMXON]
THEN
enter VMX operation (root or non-root);
restore VMX-critical state as defined in Section 34.14.1;
set to their fixed values any bits in CR0 and CR4 whose values must
be fixed in VMX operation [unless coming from an unrestricted guest];
IF RFLAGS.VM = 0 AND (in VMX root operation OR the
“unrestricted guest” VM-execution control is 0)
THEN
CS.RPL := SS.DPL;
SS.RPL := SS.DPL;
FI;
restore current VMCS pointer;
FI;
Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
will sythesize TRIPLE_FAULT in this scenario. KVM's behavior is allowed
as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
only way for CR0 and/or CR4 to have illegal values is if they were
modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
section states "modifying these registers will result in unpredictable
behavior".
KVM's ioctl() behavior is less straightforward. Because KVM allows
ioctls() to be executed in any order, rejecting an ioctl() if it would
result in invalid L2 guest state is not an option as KVM cannot know if
a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE. Ideally, KVM would
reject KVM_RUN if L2 contained invalid guest state, but that carries the
risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
exited to userspace. Setting a flag/request to detect such a scenario is
undesirable because (a) it's extremely unlikely to add value to KVM as a
whole, and (b) KVM would need to consider ioctl() interactions with such
a flag, e.g. if userspace migrated the vCPU while the flag were set.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-3-seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert a relatively recent change that set vmx->fail if the vCPU is in L2
and emulation_required is true, as that behavior is completely bogus.
Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:
(a) it's impossible to have both a VM-Fail and VM-Exit
(b) vmcs.EXIT_REASON is not modified on VM-Fail
(c) emulation_required refers to guest state and guest state checks are
always VM-Exits, not VM-Fails.
For KVM specifically, emulation_required is handled before nested exits
in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
firing when tearing down the VM as KVM never expects vmx->fail to be set
when L2 is active, KVM always reflects those errors into L1.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
nested_vmx_vmexit+0x16bd/0x17e0
arch/x86/kvm/vmx/nested.c:4547
Modules linked in:
CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
Call Trace:
vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
__fput+0x3fc/0x870 fs/file_table.c:280
task_work_run+0x146/0x1c0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0x705/0x24f0 kernel/exit.c:832
do_group_exit+0x168/0x2d0 kernel/exit.c:929
get_signal+0x1740/0x2120 kernel/signal.c:2852
arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: c8607e4a08 ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_run struct's if_flag is a part of the userspace/kernel API. The
SEV-ES patches failed to set this flag because it's no longer needed by
QEMU (according to the comment in the source code). However, other
hypervisors may make use of this flag. Therefore, set the flag for
guests with encrypted registers (i.e., with guest_state_protected set).
Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Signed-off-by: Marc Orr <marcorr@google.com>
Message-Id: <20211209155257.128747-1-marcorr@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The fixed counter 3 is used for the Topdown metrics, which hasn't been
enabled for KVM guests. Userspace accessing to it will fail as it's not
included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines,
which have this counter.
To reproduce it on ICX+ machines, ./state_test reports:
==== Test Assertion Failure ====
lib/x86_64/processor.c:1078: r == nmsrs
pid=4564 tid=4564 - Argument list too long
1 0x000000000040b1b9: vcpu_save_state at processor.c:1077
2 0x0000000000402478: main at state_test.c:209 (discriminator 6)
3 0x00007fbe21ed5f92: ?? ??:0
4 0x000000000040264d: _start at ??:?
Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c)
With this patch, it works well.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Play nice with a NULL shadow page when checking for an obsolete root in
the page fault handler by flagging the page fault as stale if there's no
shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
Invalidating memslots, which is the only case where _all_ roots need to
be reloaded, requests all vCPUs to reload their MMUs while holding
mmu_lock for lock.
The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
backed by a shadow page. Running with TDP disabled or with nested NPT
explodes spectaculary due to dereferencing a NULL shadow page pointer.
Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
root. Zapping shadow pages in response to guest activity, e.g. when the
guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
vCPU isn't using the affected root. I.e. KVM_REQ_MMU_RELOAD can be seen
with a completely valid root shadow page. This is a bit of a moot point
as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
be cleaned up in the future.
Fixes: a955cad84c ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211209060552.2956723-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The ability to write to MSR_IA32_PERF_CAPABILITIES from the host should
not depend on guest visible CPUID entries, even if just to allow
creating/restoring guest MSRs and CPUIDs in any sequence.
Fixes: 27461da310 ("KVM: x86/pmu: Support full width counting")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211216165213.338923-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the vmread exceptin path, use the, thus far, unused output register
to push the @fault argument onto the stack. This, in turn, enables the
exception handler to not do pushes and only modify that register when
an exception does occur.
As noted by Sean the input constraint needs to be changed to "=&r" to
avoid the value and field occupying the same register.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.781308550@infradead.org
KVM instruction emulation has a gnarly hack where the .fixup does a
return, however there's already a ret right after the 10b label, so
mark that as 11 and have the exception clear %esi to remove the
.fixup.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.722157053@infradead.org
Replace a WARN with a comment to call out that userspace can modify RCX
during an exit to userspace to handle string I/O. KVM doesn't actually
support changing the rep count during an exit, i.e. the scenario can be
ignored, but the WARN needs to go as it's trivial to trigger from
userspace.
Cc: stable@vger.kernel.org
Fixes: 3b27de2718 ("KVM: x86: split the two parts of emulator_pio_in")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211025201311.1881846-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the SDM:
If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an
attempt to clear CR0.PG causes a general-protection exception (#GP).
Software should transition to compatibility mode and clear CR4.PCIDE
before attempting to disable paging.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211207095230.53437-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not bail early if there are no bits set in the sparse banks for a
non-sparse, a.k.a. "all CPUs", IPI request. Per the Hyper-V spec, it is
legal to have a variable length of '0', e.g. VP_SET's BankContents in
this case, if the request can be serviced without the extra info.
It is possible that for a given invocation of a hypercall that does
accept variable sized input headers that all the header input fits
entirely within the fixed size header. In such cases the variable sized
input header is zero-sized and the corresponding bits in the hypercall
input should be set to zero.
Bailing early results in KVM failing to send IPIs to all CPUs as expected
by the guest.
Fixes: 214ff83d44 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Splitting kvm_main.c out into smaller and better-organized files is
slightly non-trivial when it involves editing a bunch of per-arch
KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211121125451.9489-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
I'd like to make the build include dirty_ring.c based on whether the
arch wants it or not. That's a whole lot simpler if there's a config
symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET
being set to something non-zero.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211121125451.9489-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the WARN sanity checks out of the PI descriptor update loop so as
not to spam the kernel log if the condition is violated and the update
takes multiple attempts due to another writer. This also eliminates a
few extra uops from the retry path.
Technically not checking every attempt could mean KVM will now fail to
WARN in a scenario that would have failed before, but any such failure
would be inherently racy as some other agent (CPU or device) would have
to concurrent modify the PI descriptor.
Add a helper to handle the actual write and more importantly to document
why the write may need to be retried.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a memory barrier between writing vcpu->requests and reading
vcpu->guest_mode to ensure the read is ordered after the write when
(potentially) delivering an IRQ to L2 via nested posted interrupt. If
the request were to be completed after reading vcpu->mode, it would be
possible for the target vCPU to enter the guest without posting the
interrupt and without handling the event request.
Note, the barrier is only for documentation since atomic operations are
serializing on x86.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 6b6977117f ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2")
Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows to see how many interrupts were delivered via the
APICv/AVIC from the host.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211209115440.394441-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace all ret/retq instructions with ASM_RET in preparation of
making it more than a single instruction.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.964635458@infradead.org
Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V
offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to
inform L0 when it changes MSR bitmap, this eliminates the need to examine
L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets
constructed.
Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.
Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
nested hypervisor on top of Hyper-V. The newly introduced feature is going
to be used by Hyper-V guests on KVM.
When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
cycles from a nested vmexit cost (tight cpuid loop test).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
rebuilt due to changes in MSR bitmap for L1 or switching to a different
L2. This information will be used for Enlightened MSR Bitmap feature for
Hyper-V guests.
Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is
not really needed for Enlightened MSR Bitmap as the feature can only
be used in conjunction with Enlightened VMCS but let's keep tracking
information complete, it's cheap and in the future similar PV feature can
easily be implemented for KVM on KVM too.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V
guests move MSR bitmap update tracking to a dedicated helper.
Note: vmx_msr_bitmap_l01_changed() is called when MSR bitmap might be
updated. KVM doesn't check if the bit we're trying to set is already set
(or the bit it's trying to clear is already cleared). Such situations
should not be common and a few false positives should not be a problem.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211129094704.326635-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace all ret/retq instructions with RET in preparation of making
RET a macro. Since AS is case insensitive it's a big no-op without
RET defined.
find arch/x86/ -name \*.S | while read file
do
sed -i 's/\<ret[q]*\>/RET/' $file
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org
em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if MSR accesses
required an exit to userspace. However, x86_emulate_insn() doesn't return
X86EMUL_*, so x86_emulate_instruction() doesn't directly act on
X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate
between PIO, MMIO, etc. causing RDMSR/WRMSR emulation not to
exit to userspace now.
Nevertheless, if the userspace_msr_exit_test testcase in selftests
is changed to test RDMSR/WRMSR with a forced emulation prefix,
the test passes. What happens is that first userspace exit
information is filled but the userspace exit does not happen.
Because x86_emulate_instruction() returns 1, the guest retries
the instruction---but this time RIP has already been adjusted
past the forced emulation prefix, so the guest executes RDMSR/WRMSR
and the userspace exit finally happens.
Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io
callback, x86_emulate_instruction() can just return 0 if the
callback is not NULL. Then RDMSR/WRMSR instruction emulation will
exit to userspace directly, without the RDMSR/WRMSR vmexit.
Fixes: 1ae099540e ("KVM: x86: Allow deflecting unknown MSR accesses to user space")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679.git.houwenlong93@linux.alibaba.com>
When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened
VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which
are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is
updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from
clean fields to make Hyper-V aware of the change. For KVM's L1s, this is
done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr().
MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending
MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't
check if the resulting bitmap is different and never cleans
HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and
may result in Hyper-V missing the update.
The issue could've been solved by calling evmcs_touch_msr_bitmap() for
eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so
would not give any performance benefits (compared to not using Enlightened
MSR Bitmap at all). 3-level nesting is also not a very common setup
nowadays.
Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for
now.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If msr access triggers an exit to userspace, the
complete_userspace_io callback would skip instruction by vendor
callback for kvm_skip_emulated_instruction(). However, when msr
access comes from the emulator, e.g. if kvm.force_emulation_prefix
is enabled and the guest uses rdmsr/wrmsr with kvm prefix,
VM_EXIT_INSTRUCTION_LEN in vmcs is invalid and
kvm_emulate_instruction() should be used to skip instruction
instead.
As Sean noted, unlike the previous case, there's no #UD if
unrestricted guest is disabled and the guest accesses an MSR in
Big RM. So the correct way to fix this is to attach a different
callback when the msr access comes from the emulator.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679.git.houwenlong93@linux.alibaba.com>
The next patch would use kvm_emulate_instruction() with
EMULTYPE_SKIP in complete_userspace_io callback to fix a
problem in msr access emulation. However, EMULTYPE_SKIP
only updates RIP, more things like updating interruptibility
state and injecting single-step #DBs would be done in the
callback. Since the emulator also does those things after
x86_emulate_insn(), add a new emulation type to pair with
EMULTYPE_SKIP to do those things for completion of user exits
within the emulator.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the
decode phase does not truncate _eip. Wrapping the 32-bit boundary is
legal if and only if CS is a flat code segment, but that check is
implicitly handled in the form of limit checks in the decode phase.
Opportunstically prepare for a future fix by storing the result of any
truncation in "eip" instead of "_eip".
Fixes: 1957aa63be ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679.git.houwenlong93@linux.alibaba.com>
merge pv_eoi_get_pending and pv_eoi_clr_pending into a single
function pv_eoi_test_and_clear_pending, which returns and clear
the value of the pending bit.
This makes it possible to clear the pending bit only if the guest
had set it, and otherwise skip the call to pv_eoi_put_user().
This can save up to 300 nsec on AMD EPYC processors.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1636026974-50555-2-git-send-email-lirongqing@baidu.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If guest gives MSR_KVM_PV_EOI_EN a wrong value, this printk() will
be trigged, and kernel log is spammed with the useless message
Fixes: 0d88800d54 ("kvm: x86: ioapic and apic debug macros cleanup")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: stable@kernel.org
Message-Id: <1636026974-50555-1-git-send-email-lirongqing@baidu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
and nested NPT treats them like all other non-leaf page table levels
instead of caching them.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This bit is very close to mean "role.quadrant is not in use", except that
it is false also when the MMU is mapping guest physical addresses
directly. In that case, role.quadrant is indeed not in use, but there
are no guest PTEs at all.
Changing the name and direction of the bit removes the special case,
since a guest with paging disabled, or not considering guest paging
structures as is the case for two-dimensional paging, does not have
to deal with 4-byte guest PTEs.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using ept_caps_to_lpage_level is simpler.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-9-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The level of supported large page on nEPT affects the rsvds_bits_mask.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bit 7 on pte depends on the level of supported large page.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reduce an indirect function call (retpoline) and some intialization
code.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra
FNAME(gva_to_gpa_nested) is needed.
Add the parameter can simplify the code. And it makes it explicit that
the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2
gpa in translate_nested_gpa() via the new parameter.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
role.quadrant is only valid when gpte size is 4 bytes and only be
calculated when gpte size is 4 bytes.
Although "vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL" also means
gpte size is 4 bytes, but using "!role.gpte_is_8_bytes" is clearer
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-15-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
role.gpte_is_8_bytes is unused when role.direct; there is no
point in changing a bit in the role, the value that was set
when the MMU is initialized is just fine.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-14-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The allocation of special roots is moved to mmu_alloc_special_roots().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-12-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is never used.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
And remove clear_page() on it.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-10-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It returns the only proper NPT level, so the "max" in the name
is not appropriate.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-9-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The variable name is changed in the code.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_IA32_TSC_ADJUST can be left to the default way which also uese
kvm_set_msr_common().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The host CR3 in the vcpu thread can only be changed when scheduling.
Moving the code in vmx_prepare_switch_to_guest() makes the code
simpler.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The value of host MSR_IA32_SYSENTER_ESP is known to be constant for
each CPU: (cpu_entry_stack(cpu) + 1) when 32 bit syscall is enabled or
NULL is 32 bit syscall is not enabled.
So rdmsrl() can be avoided for the first case and both rdmsrl() and
vmcs_writel() can be avoided for the second case.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is unchanged in most cases.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211111144527.88852-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It has no user.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-15-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Document the meaning of the three combinations of regs_avail and
regs_dirty. Update regs_dirty just after writeback instead of
doing it later after vmexit. After vmexit, instead, we clear the
regs_avail bits corresponding to lazily-loaded registers.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-12-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When vcpu->arch.cr3 is changed, it should be marked dirty unless it
is being updated to the value of the architecture guest CR3 (i.e.
VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled).
This patch has no functionality changed because
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional
change to vcpu->arch.regs_dirty, but no code uses regs_dirty for
VCPU_EXREG_CR3. (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead
to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses
regs_avail for VCPU_EXREG_CR3 dirty information.)
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or
vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is
merely a NOP, and testing it will likewise always succeed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-9-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The comment had been added in the commit 689f3bf216 ("KVM: x86: unify
callbacks to load paging root") and its related code was removed later,
and it has nothing to do with the next line of code.
So the comment should be removed too.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Not functionality changed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be
intercepted in an unclear way.
Add X86_CR4_PDPTR_BITS to make it clear and self-documented.
No functionality changed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-6-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when
!enable_ept just because X86_CR4_PGE is the only bit that is
responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS.
It is clearer and self-documented to use X86_CR4_TLBFLUSH_BITS instead.
No functionality changed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the same logic to handle the availability of VCPU_EXREG_PDPTR
as VMX, also removing a branch in svm_vcpu_run().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu->pdptrs[] and vmcs.GUEST_PDPTR[0-3] are synced, so mmu->pdptrs is
available and GUEST_PDPTR[0-3] is not dirty.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For VMX with EPT, dirty PDPTRs need to be loaded before the next vmentry
via vmx_load_mmu_pgd()
But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd()
to be invoked. Normally, kvm_mmu_reset_context() is used to cause
KVM_REQ_LOAD_MMU_PGD, but sometimes it is skipped:
* commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.
* commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after
load_pdptrs() when rewriting the CR3 with the same value.
* commit a91a7c709600("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.
Fixes: d81135a57a ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda5 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c7096 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For Intel, the guest PMU can be disabled via clearing the PMU CPUID.
For AMD, all hw implementations support the base set of four
performance counters, with current mainstream hardware indicating
the presence of two additional counters via X86_FEATURE_PERFCTR_CORE.
In the virtualized world, the AMD guest driver may detect
the presence of at least one counter MSR. Most hypervisor
vendors would introduce a module param (like lbrv for svm)
to disable PMU for all guests.
Another control proposal per-VM is to pass PMU disable information
via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00].
Both of methods require some guest-side changes, so a module
parameter may not be sufficiently granular, but practical enough.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211117080304.38989-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the vCPU from the wakeup list before updating the notification
vector in the posted interrupt post-block helper. There is no need to
wake the current vCPU as it is by definition not blocking. Practically
speaking this is a nop as it only shaves a few meager cycles in the
unlikely case that the vCPU was migrated and the previous pCPU gets a
wakeup IRQ right before PID.NV is updated. The real motivation is to
allow for more readable code in the future, when post-block is merged
with vmx_vcpu_pi_load(), at which point removal from the list will be
conditional on the old notification vector.
Opportunistically add comments to document why KVM has a per-CPU spinlock
that, at first glance, appears to be taken only on the owning CPU.
Explicitly call out that the spinlock must be taken with IRQs disabled, a
detail that was "lost" when KVM switched from spin_lock_irqsave() to
spin_lock(), with IRQs disabled for the entirety of the relevant path.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-29-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
out of the loop to write the descriptor, preemption is disabled so the
CPU won't change, and if the APIC ID changes KVM has bigger problems.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-28-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use READ_ONCE() when loading the posted interrupt descriptor control
field to ensure "old" and "new" have the same base value. If the
compiler emits separate loads, and loads into "new" before "old", KVM
could theoretically drop the ON bit if it were set between the loads.
Fixes: 28b835d60f ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Save/restore IRQs when disabling IRQs in posted interrupt pre/post block
in preparation for moving the code into vcpu_put/load(), where it would be
called with IRQs already disabled.
No functional changed intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-26-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the
pre-block path, as NDST is guaranteed to be up-to-date. The comment
about the vCPU being preempted during the update is simply wrong, as the
update path runs with IRQs disabled (from before snapshotting vcpu->cpu,
until after the update completes).
Since commit 8b306e2f3c ("KVM: VMX: avoid double list add with VT-d
posted interrupts", 2017-09-27) The vCPU can get preempted _before_
the update starts, but not during. And if the vCPU is preempted before,
vmx_vcpu_pi_load() is responsible for updating NDST when the vCPU is
scheduled back in. In that case, the check against the wakeup vector in
vmx_vcpu_pi_load() cannot be true as that would require the notification
vector to have been set to the wakeup vector _before_ blocking.
Opportunistically switch to using vcpu->cpu for the list/lock lookups,
which do not need pre_pcpu since the same commit.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return bools instead of ints for the posted interrupt "test" helpers.
The bit position of the flag being test does not matter to the callers,
and is in fact lost by virtue of test_bit() itself returning a bool.
Returning ints is potentially dangerous, e.g. "pi_test_on(pi_desc) == 1"
is safe-ish because ON is bit 0 and thus any sane implementation of
pi_test_on() will work, but for SN (bit 1), checking "== 1" would rely on
pi_test_on() to return 0 or 1, a.k.a. bools, as opposed to 0 or 2 (the
positive bit position).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop sanity checks on the validity of the previous pCPU when handling
vCPU block/unlock for posted interrupts. The intention behind the sanity
checks is to avoid memory corruption in case of a race or incorrect locking,
but the code has been stable for a few years now and the checks get in
the way of eliminating kvm_vcpu.pre_cpu.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly skip posted interrupt updates if APICv is disabled in all of
KVM, or if the guest doesn't have an in-kernel APIC. The PI descriptor
is kept up-to-date if APICv is inhibited, e.g. so that re-enabling APICv
doesn't require a bunch of updates, but neither the module param nor the
APIC type can be changed on-the-fly.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add helpers to wake and query a blocking vCPU. In addition to providing
nice names, the helpers reduce the probability of KVM neglecting to use
kvm_arch_vcpu_get_wait().
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call kvm_vcpu_block() directly for all wait states except HALTED so that
kvm_vcpu_halt() is no longer a misnomer on x86.
Functionally, this means KVM will never attempt halt-polling or adjust
vcpu->halt_poll_ns for INIT_RECEIVED (a.k.a. Wait-For-SIPI (WFS)) or
AP_RESET_HOLD; UNINITIALIZED is handled in kvm_arch_vcpu_ioctl_run(),
and x86 doesn't use any other "wait" states.
As mentioned above, the motivation of this is purely so that "halt" isn't
overloaded on x86, e.g. in KVM's stats. Skipping halt-polling for WFS
(and RESET_HOLD) has no meaningful effect on guest performance as there
are typically single-digit numbers of INIT-SIPI sequences per AP vCPU,
per boot, versus thousands of HLTs just to boot to console.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Go directly to kvm_vcpu_block() when handling the case where userspace
attempts to run an UNINITIALIZED vCPU. The vCPU is not halted, nor is it
likely that halt-polling will be successful in this case.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting
the actual "block" sequences into a separate helper (to be named
kvm_vcpu_block()). x86 will use the standalone block-only path to handle
non-halt cases where the vCPU is not runnable.
Rename block_ns to halt_ns to match the new function name.
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ensure vcpu->cpu is read once when signalling the AVIC doorbell. If the
compiler rereads the field and the vCPU is migrated between the check and
writing the doorbell, KVM would signal the wrong physical CPU.
Functionally, signalling the wrong CPU in this case is not an issue as
task migration means the vCPU has exited and will pick up any pending
interrupts on the next VMRUN. Add the READ_ONCE() purely to clean up the
code.
Opportunistically add a comment explaining the task migration behavior,
and rename cpuid=>cpu to avoid conflating the CPU number with KVM's more
common usage of CPUID.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't configure the wakeup handler when a vCPU is blocking with IRQs
disabled, in which case any IRQ, posted or otherwise, should not be
recognized and thus should not wake the vCPU.
Fixes: bf9f6ac8d7 ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
change 0 to false and 1 to true to fix following cocci warnings:
arch/x86/kvm/mmu/mmu.c:1485:9-10: WARNING: return of 0/1 in function 'kvm_set_pte_rmapp' with return type bool
arch/x86/kvm/mmu/mmu.c:1636:10-11: WARNING: return of 0/1 in function 'kvm_test_age_rmapp' with return type bool
Signed-off-by: Vihas Mak <makvihas@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Message-Id: <20211114164312.GA28736@makvihas>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the gotos from vmx_get_mt_mask. It's easier to build the whole
memory type at once, than it is to combine separate cacheability and
ipat fields.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-12-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation for implementing in-place hugepage promotion, various
functions will need to be called from zap_collapsible_spte_range, which
has the const qualifier on its memslot argument. Propagate the const
qualifier to the various functions which will be needed. This just serves
to simplify the following patch.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The vCPU argument to mmu_try_to_unsync_pages is now only used to get a
pointer to the associated struct kvm, so pass in the kvm pointer from
the beginning to remove the need for a vCPU when calling the function.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_slot_page_track_is_active only uses its vCPU argument to get a
pointer to the assoicated struct kvm, so just pass in the struct KVM to
remove the need for a vCPU pointer.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework make_spte() to query the shadow page's role, specifically whether
or not it's a guest_mode page, a.k.a. a page for L2, when determining if
the SPTE is compatible with PML. This eliminates a dependency on @vcpu,
with a future goal of being able to create SPTEs without a specific vCPU.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This structure will replace vmcb_control_area in
svm_nested_state, providing only the fields that are actually
used by the nested state. This avoids having and copying around
uninitialized fields. The cost of this, however, is that all
functions (in this case vmcb_is_intercept) expect the old
structure, so they need to be duplicated.
In addition, in svm_get_nested_state() user space expects a
vmcb_control_area struct, so we need to copy back all fields
in a temporary structure before copying it to userspace.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211103140527.752797-7-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the struct vmcb_control_area parameter from nested_vmcb_check_controls,
for consistency with the functions that operate on the save area. This
way, VMRUN uses the version without underscores for both areas, while
KVM_SET_NESTED_STATE uses the version with underscores.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the already checked svm->nested.save cached fields
(EFER, CR0, CR4, ...) instead of vmcb12's in
nested_vmcb02_prepare_save().
This prevents from creating TOC/TOU races, since the
guest could modify the vmcb12 fields.
This also avoids the need of force-setting EFER_SVME in
nested_vmcb02_prepare_save.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211103140527.752797-6-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that struct vmcb_save_area_cached contains the required
vmcb fields values (done in nested_load_save_from_vmcb12()),
check them to see if they are correct in nested_vmcb_valid_sregs().
While at it, rename nested_vmcb_valid_sregs in nested_vmcb_check_save.
__nested_vmcb_check_save takes the additional @save parameter, so it
is helpful when we want to check a non-svm save state, like in
svm_set_nested_state. The reason for that is that save is the L1
state, not L2, so we check it without moving it to svm->nested.save.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20211103140527.752797-5-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Following the same naming convention of the previous patch,
rename nested_load_control_from_vmcb12.
In addition, inline copy_vmcb_control_area as it is only called
by this function.
__nested_copy_vmcb_control_to_cache() works with vmcb_control_area
parameters and it will be useful in next patches, when we use
local variables instead of svm cached state.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20211103140527.752797-4-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is useful in the next patch, to keep a saved copy
of vmcb12 registers and pass it around more easily.
Instead of blindly copying everything, we just copy EFER, CR0, CR3, CR4,
DR6 and DR7 which are needed by the VMRUN checks. If more fields will
need to be checked, it will be quite obvious to see that they must be added
in struct vmcb_save_area_cached and in nested_copy_vmcb_save_to_cache().
__nested_copy_vmcb_save_to_cache() takes a vmcb_save_area_cached
parameter, which is useful in order to save the state to a local
variable.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20211103140527.752797-3-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inline nested_vmcb_check_cr3_cr4 as it is not called by anyone else.
Doing so simplifies next patches.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211103140527.752797-2-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a memslots gfn upper bound operation and use it to optimize
kvm_zap_gfn_range().
This way this handler can do a quick lookup for intersecting gfns and won't
have to do a linear scan of the whole memslot set.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <ef242146a87a335ee93b441dcf01665cb847c902.1638817641.git.maciej.szmigiero@oracle.com>
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
The current memslots implementation only allows quick binary search by gfn,
quick lookup by hva is not possible - the implementation has to do a linear
scan of the whole memslots array, even though the operation being performed
might apply just to a single memslot.
This significantly hurts performance of per-hva operations with higher
memslot counts.
Since hva ranges can overlap between memslots an interval tree is needed
for tracking them.
[sean: handle interval tree updates in kvm_replace_memslot()]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <d66b9974becaa9839be9c4e1a5de97b177b4ac20.1638817640.git.maciej.szmigiero@oracle.com>
There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted. Use KVM's
cached nr_memslot_pages to compute the default max number of MMU pages.
Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely
multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can
possibly overflow an unsigned long variable.
Write this "* 20 / 1000" operation as "/ 50" instead to avoid such
overflow.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[sean: use common KVM field and rework changelog accordingly]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>
There is no point in calling kvm_mmu_change_mmu_pages() for memslot
operations that don't change the total page count, so do it just for
KVM_MR_CREATE and KVM_MR_DELETE.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <9e56b7616a11f5654e4ab486b3237366b7ba9f2a.1638817640.git.maciej.szmigiero@oracle.com>
Play nice with a NULL @old or @new when handling memslot updates so that
common KVM can pass NULL for one or the other in CREATE and DELETE cases
instead of having to synthesize a dummy memslot.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <2eb7788adbdc2bc9a9c5f86844dd8ee5c8428732.1638817640.git.maciej.szmigiero@oracle.com>
Drop the @mem param from kvm_arch_{prepare,commit}_memory_region() now
that its use has been removed in all architectures.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <aa5ed3e62c27e881d0d8bc0acbc1572bc336dc19.1638817640.git.maciej.szmigiero@oracle.com>
Get the number of pages directly from the new memslot instead of
computing the same from the userspace memory region when allocating
memslot metadata. This will allow a future patch to drop @mem.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <ef44892eb615f5c28e682bbe06af96aff9ce2a9f.1638817639.git.maciej.szmigiero@oracle.com>
Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary. This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.
For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY. This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.
For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.
Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[mss: add missing RISCV kvm_arch_prepare_memory_region() change]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu
index. Unfortunately, we're about to move rework the iterator,
which requires this to be upgrade to an unsigned long.
Let's bite the bullet and repaint all of it in one go.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-7-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All architectures have similar loops iterating over the vcpus,
freeing one vcpu at a time, and eventually wiping the reference
off the vcpus array. They are also inconsistently taking
the kvm->lock mutex when wiping the references from the array.
Make this code common, which will simplify further changes.
The locking is dropped altogether, as this should only be called
when there is no further references on the kvm structure.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-2-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
exit code or exit parameters fails.
The VMGEXIT instruction can be issued from userspace, even though
userspace (likely) can't update the GHCB. To prevent userspace from being
able to kill the guest, return an error through the GHCB when validation
fails rather than terminating the guest. For cases where the GHCB can't be
updated (e.g. the GHCB can't be mapped, etc.), just return back to the
guest.
The new error codes are documented in the lasest update to the GHCB
specification.
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so
that KVM falls back to __vmalloc() if physically contiguous memory isn't
available. The buffer is purely a KVM software construct, i.e. there's
no need for it to be physically contiguous.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return appropriate error codes if setting up the GHCB scratch area for an
SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM
when allocating the kernel buffer could be confusing as userspace would
likely suspect a guest issue.
Fixes: 8f423a80d2 ("KVM: SVM: Support MMIO for an SEV-ES guest")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.
For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.
For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
can be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointess (the root will be unloaded), apply
the check to all MMUs.
Fixes: b7cccd397f ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The error paths in the prepare_vmcs02() function are supposed to set
*entry_failure_code but this path does not. It leads to using an
uninitialized variable in the caller.
Fixes: 71f7347025 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20211130125337.GB24578@kili>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
local APIC, however kvm_apicv_activated might still be true if there are
no reasons to disable APICv; in fact it is quite likely that there is none
because APICv is inhibited by specific configurations of the local APIC
and those configurations cannot be programmed. This triggers a WARN:
WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
To avoid this, introduce another cause for APICv inhibition, namely the
absence of an in-kernel local APIC. This cause is enabled by default,
and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
KVM_CAP_IRQCHIP_SPLIT.
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Fixes: ee49a89329 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>
Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we run the following perf command in an AMD Milan guest:
perf stat \
-e cpu/event=0x1d0/ \
-e cpu/event=0x1c7/ \
-e cpu/umask=0x1f,event=0x18e/ \
-e cpu/umask=0x7,event=0x18e/ \
-e cpu/umask=0x18,event=0x18e/ \
./workload
dmesg will report a #GP warning from an unchecked MSR access
error on MSR_F15H_PERF_CTLx.
This is because according to APM (Revision: 4.03) Figure 13-7,
the bits [35:32] of AMD PerfEvtSeln register is a part of the
event select encoding, which extends the EVENT_SELECT field
from 8 bits to 12 bits.
Opportunistically update pmu->reserved_bits for reserved bit 19.
Reported-by: Jim Mattson <jmattson@google.com>
Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211118130320.95997-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
avic_set_running() passes the current CPU to avic_vcpu_load(), albeit
via vcpu->cpu rather than smp_processor_id(). If the thread is migrated
while avic_set_running runs, the call to avic_vcpu_load() can use a stale
value for the processor id. Avoid this by blocking preemption over the
entire execution of avic_set_running().
Reported-by: Sean Christopherson <seanjc@google.com>
Fixes: 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is nothing to synchronize if APICv is disabled, since neither
other vCPUs nor assigned devices can set PIR.ON.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Generally, kvm->lock is not taken for a long time, but
sev_lock_two_vms is different: it takes vCPU locks
inside, so userspace can hold it back just by calling
a vCPU ioctl. Play it safe and use mutex_lock_killable.
Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Taking the lock is useless since there are no other references,
and there are already accesses (e.g. to sev->enc_context_owner)
that do not take it. So get rid of it.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMs that mirror an encryption context rely on the owner to keep the
ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
would cause a dangling ASID:
1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)
The right way to do the handoff instead is to create a fresh mirror VM
on the destination first:
1. copy context from A to B (gets ref to A)
[later] 2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M
So, catch the situation by adding a count of how many VMs are
mirroring this one's encryption context.
Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that we have a facility to lock two VMs with deadlock
protection, use it for the creation of mirror VMs as well. One of
COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst)
would always fail, so the combination is nonsensical and it is okay to
return -EBUSY if it is attempted.
This sidesteps the question of what happens if a VM is
MOVE_ENC_CONTEXT_FROM'd at the same time as it is
COPY_ENC_CONTEXT_FROM'd: the locking prevents that from
happening.
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow intra-host migration of a mirror VM; the destination VM will be
a mirror of the same ASID as the source.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM,
but technically harmless because the region list was unused for a mirror
VM. However, it is untidy and it now causes a NULL pointer access when
attempting to move the encryption context of a mirror VM.
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context")
Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Encapsulate the handling of the migration_in_progress flag for both VMs in
two functions sev_lock_two_vms and sev_unlock_two_vms. It does not matter
if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit
later, and this change 1) keeps the cleanup chain of labels smaller 2)
makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic.
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
list_replace_init cannot be used if the source is an empty list,
because "new->next->prev = new" will overwrite "old->next":
new old
prev = new, next = new prev = old, next = old
new->next = old->next prev = new, next = old prev = old, next = old
new->next->prev = new prev = new, next = old prev = old, next = new
new->prev = old->prev prev = old, next = old prev = old, next = old
new->next->prev = new prev = old, next = old prev = new, next = new
The desired outcome instead would be to leave both old and new the same
as they were (two empty circular lists). Use list_cut_before, which
already has the necessary check and is documented to discard the
previous contents of the list that will hold the result.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, checks for whether VT-d PI can be used refer to the current
status of the feature in the current vCPU; or they more or less pick
vCPU 0 in case a specific vCPU is not available.
However, these checks do not attempt to synchronize with changes to
the IRTE. In particular, there is no path that updates the IRTE when
APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU
that has APICv disabled, if the wakeup occurs because of an IRTE
that points to a posted interrupt.
To fix this, always go through the VT-d PI path as long as there are
assigned devices and APICv is available on both the host and the VM side.
Since the relevant condition was copied over three times, take the hint
and factor it into a separate function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even
if APICv is disabled on the vCPU that receives it. In that case, the
interrupt will just cause a vmexit and leave the ON bit set together
with the PIR bit corresponding to the interrupt.
Right now, the interrupt would not be delivered until APICv is re-enabled.
However, fixing this is just a matter of always doing the PIR->IRR
synchronization, even if the vCPU has temporarily disabled APICv.
This is not a problem for performance, or if anything it is an
improvement. First, in the common case where vcpu->arch.apicv_active is
true, one fewer check has to be performed. Second, static_call_cond will
elide the function call if APICv is not present or disabled. Finally,
in the case for AMD hardware we can remove the sync_pir_to_irr callback:
it is only needed for apic_has_interrupt_for_ppr, and that function
already has a fallback for !APICv.
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If APICv is disabled for this vCPU, assigned devices may still attempt to
post interrupts. In that case, we need to cancel the vmentry and deliver
the interrupt with KVM_REQ_EVENT. Extend the existing code that handles
injection of L1 interrupts into L2 to cover this case as well.
vmx_hwapic_irr_update is only called when APICv is active so it would be
confusing to add a check for vcpu->arch.apicv_active in there. Instead,
just use vmx_set_rvi directly in vmx_sync_pir_to_irr.
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Account for the '0' being a default, "let KVM choose" period, when
determining whether or not the recovery worker needs to be awakened in
response to userspace reducing the period. Failure to do so results in
the worker not being awakened properly, e.g. when changing the period
from '0' to any small-ish value.
Fixes: 4dfe4f40d8 ("kvm: x86: mmu: Make NX huge page recovery period configurable")
Cc: stable@vger.kernel.org
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015706.3830341-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding
incorrect interpretations of the nested hypervisor's page tables.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the "flush" param and return values to/from the TDP MMU's helper for
zapping collapsible SPTEs. Because the helper runs with mmu_lock held
for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic
zap handles the necessary remote TLB flush.
Similarly, because mmu_lock is dropped and re-acquired between zapping
legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must
handle remote TLB flushes from the legacy MMU before calling into the TDP
MMU.
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the yield-safe variant of the TDP MMU iterator when handling an
unmapping event from the MMU notifier, as most occurences of the event
allow yielding.
Fixes: e1eed5847b ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015008.3780032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INVLPG operates on guest virtual address, which are represented by
vcpu->arch.walk_mmu. In nested virtualization scenarios,
kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were
emulated by L0 (in practice, it hardly happen) when nested two-dimensional
paging is enabled, the call to ->tlb_flush_gva() would be skipped and
the hardware TLB entry would not be invalidated.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the is an L1 with nNPT in 32bit, the shadow walk starts with
pae_root.
Fixes: a717a780fc ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host)
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 63f5a1909f ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2}
after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls
after first successful KVM_RUN and promissed to make this sequence forbiden
in 5.16. It's time to fulfil the promise.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211122175818.608220-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
TLB entries for vpid02. From L1's perspective, changing L2's VPID is
effectively a TLB flush unless "hardware" has previously cached entries
for the new vpid12. Because KVM tracks only a single vpid12, KVM doesn't
know if the new vpid12 has been used in the past and so must treat it as
a brand new, never been used VPID, i.e. must assume that the new vpid12
represents a TLB flush from L1's perspective.
For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
and thus can't have cached entries in the TLB for vpid12.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 5c614b3583 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.
Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Flush the current VPID when handling KVM_REQ_TLB_FLUSH_GUEST instead of
always flushing vpid01. Any TLB flush that is triggered when L2 is
active is scoped to L2's VPID (if it has one), e.g. if L2 toggles CR4.PGE
and L1 doesn't intercept PGE writes, then KVM's emulation of the TLB
flush needs to be applied to L2's VPID.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synchronize the two calls to kvm_x86_sync_pir_to_irr. The one
in the reenter-guest fast path invoked the callback unconditionally
even if LAPIC is present but disabled. In this case, there are
no interrupts to deliver, and therefore posted interrupts can
be ignored.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes
the accessors to croak very loudly. While a BUG_ON is definitely
_too_ loud and a bug on its own, there is indeed an issue of using
the caches in such a way that they could not have been initialized,
because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init
would not be called.
For the vmcs12_cache, the solution is simply to invoke
kvm_gfn_to_hva_cache_init unconditionally: we already know
that the cache does not match the current VMCS pointer.
For the shadow_vmcs12_cache, there is no similar condition
that checks the VMCS link pointer, so invalidate the cache
on VMXON.
Fixes: cee66664dc ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld")
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since tlb flush has been done for legacy MMU before
kvm_tdp_mmu_zap_collapsible_sptes(), so the parameter flush
should be false for kvm_tdp_mmu_zap_collapsible_sptes().
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <21453a1d2533afb6e59fb6c729af89e771ff2e76.1637140154.git.houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the parameter flush is set, zap_gfn_range() would flush remote tlb
when yield, then tlb flush is not needed outside. So use the return
value of zap_gfn_range() directly instead of OR on it in
kvm_unmap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range().
Fixes: 3039bcc744 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <5e16546e228877a4d974f8c0e448a93d52c7a5a9.1637140154.git.houwenlong93@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit f52447261b ("KVM: irq ack notification") introduced an
ack_notifier() callback in struct kvm_pic and in struct kvm_ioapic
without using them anywhere. Remove those callbacks again.
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-Id: <20211117071617.19504-1-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When recursively clearing out disconnected pts, the range based TLB
flush in handle_removed_tdp_mmu_page uses the wrong starting GFN,
resulting in the flush mostly missing the affected range. Fix this by
using base_gfn for the flush.
In response to feedback from David Matlack on the RFC version of this
patch, also move a few definitions into the for loop in the function to
prevent unintended references to them in the future.
Fixes: a066e61f13 ("KVM: x86/mmu: Factor out handling of removed page tables")
CC: stable@vger.kernel.org
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115211704.2621644-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It doesn't make sense to return the recommended maximum number of
vCPUs which exceeds the maximum possible number of vCPUs.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211116163443.88707-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When processing a hypercall for a guest with protected state, currently
SEV-ES guests, the guest CS segment register can't be checked to
determine if the guest is in 64-bit mode. For an SEV-ES guest, it is
expected that communication between the guest and the hypervisor is
performed to shared memory using the GHCB. In order to use the GHCB, the
guest must have been in long mode, otherwise writes by the guest to the
GHCB would be encrypted and not be able to be comprehended by the
hypervisor.
Create a new helper function, is_64_bit_hypercall(), that assumes the
guest is in 64-bit mode when the guest has protected state, and returns
true, otherwise invoking is_64_bit_mode() to determine the mode. Update
the hypercall related routines to use is_64_bit_hypercall() instead of
is_64_bit_mode().
Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurences of calls to
this helper function for a guest running with protected state.
Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fixes for Xen emulation
* Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache
* Fixes for migration of 32-bit nested guests on 64-bit hypervisor
* Compilation fixes
* More SEV cleanups
Rename cmd_allowed_from_miror() to is_cmd_allowed_from_mirror(), fixing
a typo and making it obvious that the result is a boolean where
false means "not allowed".
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109215101.2211373-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove a fully redundant write to sev->asid during SEV/SEV-ES guest
initialization. The ASID is set a few lines earlier prior to the call to
sev_platform_init(), which doesn't take "sev" as a param, i.e. can't
muck with the ASID barring some truly magical behind-the-scenes code.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109215101.2211373-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if the VM is tagged as SEV-ES but not SEV. KVM relies on SEV and
SEV-ES being set atomically, and guards common flows with "is SEV", i.e.
observing SEV-ES without SEV means KVM has a fatal bug.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109215101.2211373-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set sev_info.active during SEV/SEV-ES activation before calling any code
that can potentially consume sev_info.es_active, e.g. set "active" and
"es_active" as a pair immediately after the initial sanity checks. KVM
generally expects that es_active can be true if and only if active is
true, e.g. sev_asid_new() deliberately avoids sev_es_guest() so that it
doesn't get a false negative. This will allow WARNing in sev_es_guest()
if the VM is tagged as SEV-ES but not SEV.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109215101.2211373-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reject COPY_ENC_CONTEXT_FROM if the destination VM has created vCPUs.
KVM relies on SEV activation to occur before vCPUs are created, e.g. to
set VMCB flags and intercepts correctly.
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Nathan Tempelman <natet@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109215101.2211373-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
And thus another call to kvm_vcpu_map() can die.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-7-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>