Commit Graph

2374 Commits

Author SHA1 Message Date
Paolo Bonzini
95e16b4792 KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed
The PIO scratch buffer is larger than a single page, and therefore
it is not possible to copy it in a single step to vcpu->arch.pio_data.
Bound each call to emulator_pio_in/out to a single page; keep
track of how many I/O operations are left in vcpu->arch.sev_pio_count,
so that the operation can be restarted in the complete_userspace_io
callback.

For OUT, this means that the previous kvm_sev_es_outs implementation
becomes an iterator of the loop, and we can consume the sev_pio_data
buffer before leaving to userspace.

For IN, instead, consuming the buffer and decreasing sev_pio_count
is always done in the complete_userspace_io callback, because that
is when the memcpy is done into sev_pio_data.
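
A rough standalone sketch of the chunking idea (illustrative only, not
the kernel code; the field names mirror the commit description):

  /* Bound each pass to one page; the leftover count is what lets the
   * operation restart from the complete_userspace_io callback. */
  #include <string.h>

  #define PAGE_SIZE 4096

  struct sev_pio_state {
      void *data;            /* like vcpu->arch.sev_pio_data */
      unsigned int size;     /* bytes per I/O operation: 1, 2 or 4 */
      unsigned int count;    /* like vcpu->arch.sev_pio_count */
  };

  /* One OUT pass: copy at most a page's worth of units, advance the
   * cursor, and return how many units are still pending. */
  static unsigned int sev_pio_out_pass(struct sev_pio_state *s, void *port_buf)
  {
      unsigned int n = s->count;

      if (n > PAGE_SIZE / s->size)
          n = PAGE_SIZE / s->size;

      memcpy(port_buf, s->data, n * s->size);
      s->data = (char *)s->data + n * s->size;
      s->count -= n;
      return s->count;
  }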

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:09:13 -04:00
Paolo Bonzini
4fa4b38dae KVM: SEV-ES: keep INS functions together
Make the diff a little nicer when we actually get to fixing
the bug.  No functional change intended.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:08:51 -04:00
Paolo Bonzini
6b5efc930b KVM: x86: remove unnecessary arguments from complete_emulator_pio_in
complete_emulator_pio_in can expect that vcpu->arch.pio has been filled in,
and therefore does not need the size and count arguments.  This makes things
nicer when the function is called directly from a complete_userspace_io
callback.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:08:38 -04:00
Paolo Bonzini
3b27de2718 KVM: x86: split the two parts of emulator_pio_in
emulator_pio_in handles both the case where the data is pending in
vcpu->arch.pio.count, and the case where I/O has to be done via either
an in-kernel device or a userspace exit.  For SEV-ES we would like
to split these, to identify clearly the moment at which the
sev_pio_data is consumed.  To this end, create two different
functions: __emulator_pio_in fills in vcpu->arch.pio.count, while
complete_emulator_pio_in clears it and releases vcpu->arch.pio.data.
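
A minimal standalone model of the two-phase split (illustrative; the
real functions operate on vcpu->arch.pio):

  #include <stdbool.h>
  #include <string.h>

  struct pio_request {
      unsigned int count;       /* non-zero while input data is pending */
      unsigned char data[512];  /* staging buffer, like vcpu->arch.pio_data */
  };

  /* Phase 1, like __emulator_pio_in: do the I/O and mark data pending. */
  static bool pio_in_start(struct pio_request *req, const void *src,
                           unsigned int size, unsigned int count)
  {
      if (size * count > sizeof(req->data))
          return false;
      memcpy(req->data, src, size * count);
      req->count = count;
      return true;
  }

  /* Phase 2, like complete_emulator_pio_in: the single, clearly
   * identified point where pending data is consumed and released. */
  static void pio_in_complete(struct pio_request *req, void *dst,
                              unsigned int size)
  {
      memcpy(dst, req->data, size * req->count);
      req->count = 0;
  }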

Because this patch has to be backported, things are left a bit messy.
kernel_pio() operates on vcpu->arch.pio, which leads to emulator_pio_in()
having two calls to complete_emulator_pio_in().  It will be fixed
in the next release.

While at it, remove the unused void* val argument of emulator_pio_in_out.
The function currently hardcodes vcpu->arch.pio_data as the
source/destination buffer, which sucks but will be fixed after the more
severe SEV-ES buffer overflow.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:08:00 -04:00
Paolo Bonzini
ea724ea420 KVM: SEV-ES: clean up kvm_sev_es_ins/outs
A few very small cleanups to the functions, smushed together because
the patch is already very small like this:

- inline emulator_pio_in_emulated and emulator_pio_out_emulated,
  since we already have the vCPU

- remove the data argument and pull setting vcpu->arch.sev_pio_data into
  the caller

- remove unnecessary clearing of vcpu->arch.pio.count when
  emulation is done by the kernel (and therefore vcpu->arch.pio.count
  is already clear on exit from emulator_pio_in and emulator_pio_out).

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:02:20 -04:00
Paolo Bonzini
0d33b1baeb KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out
Currently emulator_pio_in clears vcpu->arch.pio.count twice if
emulator_pio_in_out performs kernel PIO.  Move the clear into
emulator_pio_out where it is actually necessary.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:02:07 -04:00
Paolo Bonzini
b5998402e3 KVM: SEV-ES: rename guest_ins_data to sev_pio_data
We will be using this field for OUTS emulation as well, in case the
data that is pushed via OUTS spans more than one page.  In that case,
there will be a need to save the data pointer across exits to userspace.

So, change the name to something that refers to any kind of PIO.
Also spell out what it is used for, namely SEV-ES.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 10:01:26 -04:00
Lai Jiangshan
61b05a9fd4 KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest()
kvm_mmu_unload() destroys all the PGD caches.  Use the lighter
kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:44:43 -04:00
Lai Jiangshan
509bfe3d97 KVM: X86: Cache CR3 in prev_roots when PCID is disabled
Commit 21823fbda5 ("KVM: x86: Invalidate all PGDs for the
current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific
PCID; when PCID is disabled, that includes all PGDs in prev_roots, so
the commit made prev_roots totally unused in this case.

Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1
before the said commit:

	(CR4.PCIDE=0, CR4.PGE=1; CR3=cr3_a; the page for the guest
	 RIP is global; cr3_b is cached in prev_roots)

	modify page tables under cr3_b
		the shadow root of cr3_b is unsync in kvm
	INVPCID single context
		the guest expects the TLB is clean for PCID=0
	change CR4.PCIDE 0 -> 1
	switch to cr3_b with PCID=0,NOFLUSH=1
		No sync in kvm, cr3_b is still unsync in kvm
	jump to the page that was modified in step 1
		shadow page tables point to the wrong page

It is a very unlikely case, but it shows that stale prev_roots can be
a problem after CR4.PCIDE changes from 0 to 1.  However, to fix this
case, the commit disabled caching CR3 in prev_roots altogether when PCID
is disabled.  Not all CPUs have PCID; in particular, PCID support
on AMD CPUs is relatively recent.  To restore the prev_roots optimization
for CR4.PCIDE=0, flush the whole MMU (including all prev_roots) when
CR4.PCIDE changes.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:30 -04:00
Lai Jiangshan
e45e9e3998 KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
KVM doesn't know whether any TLB entries for a specific PCID are cached
in the CPU when TDP is enabled.  So it is better to flush all the guest
TLB when invalidating any single PCID context.

The case is very rare or even impossible, since KVM generally doesn't
intercept CR3 writes or INVPCID instructions when TDP is enabled, so the
fix is mostly for the sake of overall robustness.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:30 -04:00
Lai Jiangshan
a91a7c7096 KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE
X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context
doesn't need to be reset.  Only a flush of the entire guest TLB is
required.

It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS
while kvm_mmu_role doesn't use X86_CR4_PGE.  So X86_CR4_PGE is also
removed from KVM_MMU_CR4_ROLE_BITS.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:29 -04:00
Lai Jiangshan
552617382c KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0
X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context
doesn't need to be reset.  Only a flush of the entire guest TLB is
required.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:29 -04:00
Sean Christopherson
9dadfc4a61 KVM: x86: Add vendor name to kvm_x86_ops, use it for error messages
Paul pointed out that the error messages when KVM fails to load are
unhelpful in understanding exactly what went wrong if userspace probes
the "wrong" module.

Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel
and kvm_amd, and use the name in the relevant error messages when KVM
fails to load so that the user knows which module failed to load.

Opportunistically tweak the "disabled by bios" error message to clarify
that _support_ was disabled, not that the module itself was magically
disabled by BIOS.
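
A hedged sketch of the pattern (names other than kvm_intel/kvm_amd are
illustrative, not the actual kvm_x86_ops layout):

  #include <stdio.h>

  struct vendor_ops {
      const char *name;              /* "kvm_intel" or "kvm_amd" */
      int (*check_processor)(void);
  };

  static int kvm_init_common(const struct vendor_ops *ops)
  {
      if (ops->check_processor()) {
          /* The module name makes the failure attributable. */
          fprintf(stderr, "%s: support disabled by bios\n", ops->name);
          return -1;
      }
      return 0;
  }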

Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211018183929.897461-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:28 -04:00
David Stevens
1e76a3ce0d KVM: cleanup allocation of rmaps and page tracking data
Unify the flags for rmaps and page tracking data, using a
single flag in struct kvm_arch and a single loop to go
over all the address spaces and memslots.  This avoids
code duplication between alloc_all_memslots_rmaps and
kvm_page_track_enable_mmu_write_tracking.

Signed-off-by: David Stevens <stevensd@chromium.org>
[This patch is the delta between David's v2 and v3, with conflicts
 fixed and my own commit message. - Paolo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:25 -04:00
Paolo Bonzini
de7cd3f676 KVM: x86: check for interrupts before deciding whether to exit the fast path
The kvm_x86_sync_pir_to_irr callback can sometimes set KVM_REQ_EVENT.
If that happens exactly at the time that an exit is handled as
EXIT_FASTPATH_REENTER_GUEST, vcpu_enter_guest will incorrectly go
through the loop that calls kvm_x86_run, instead of processing
the request promptly.

Fixes: 379a3c8ee4 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-21 03:35:41 -04:00
Thomas Gleixner
1c57572d75 x86/KVM: Convert to fpstate
Convert KVM code to the new register storage mechanism in preparation for
dynamically sized buffers.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211013145322.451439983@linutronix.de
2021-10-20 22:34:14 +02:00
Thomas Gleixner
087df48c29 x86/fpu: Replace KVMs xstate component clearing
In order to prepare for the support of dynamically enabled FPU features,
move the clearing of xstate components to the FPU core code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211013145322.399567049@linutronix.de
2021-10-20 22:26:41 +02:00
Thomas Gleixner
bf5d004707 x86/fpu: Replace KVMs home brewed FPU copy to user
Similar to the copy-from-user function, the FPU core already has this
implemented with all bells and whistles.

Get rid of the duplicated code and use the core functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.244101845@linutronix.de
2021-10-20 22:17:17 +02:00
Thomas Gleixner
ea4d6938d4 x86/fpu: Replace KVMs home brewed FPU copy from user
Copying a user space buffer to the memory buffer is already available in
the FPU core. The copy mechanism in KVM lacks sanity checks and needs to
use cpuid() to look up the offset of each component, while the FPU core has
this information cached.

Make the FPU core variant accessible for KVM and replace the home brewed
mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.134065207@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner
a0ff0611c2 x86/fpu: Move KVMs FPU swapping to FPU core
Swapping the host/guest FPU is directly fiddling with FPU internals which
requires 5 exports. The upcoming support for dynamically enabled states
would need even more.

Implement a swap function in the FPU core code and export that instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.076072399@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner
126fe04018 x86/fpu: Cleanup xstate xcomp_bv initialization
No point in having this duplicated all over the place with needlessly
different defines.

Provide a proper initialization function which initializes user buffers
properly and make KVM use it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.897664678@linutronix.de
2021-10-20 15:27:26 +02:00
Oliver Upton
828ca89628 KVM: x86: Expose TSC offset controls to userspace
To date, VMM-directed TSC synchronization and migration has been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.
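
A sketch of the VMM side, assuming the uapi names KVM_VCPU_TSC_CTRL and
KVM_VCPU_TSC_OFFSET introduced by this series (error handling elided):

  #include <linux/kvm.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Read a vCPU's TSC offset on the source; the destination applies it
   * with KVM_SET_DEVICE_ATTR after migration. */
  static int get_tsc_offset(int vcpu_fd, int64_t *offset)
  {
      struct kvm_device_attr attr = {
          .group = KVM_VCPU_TSC_CTRL,
          .attr  = KVM_VCPU_TSC_OFFSET,
          .addr  = (uint64_t)(uintptr_t)offset,
      };

      return ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
  }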

Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210916181538.968978-8-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:45 -04:00
Oliver Upton
58d4277be9 KVM: x86: Refactor tsc synchronization code
Refactor kvm_synchronize_tsc to make a new function that allows callers
to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly
for the sake of participating in TSC synchronization.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-7-oupton@google.com>
[Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are
 equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by
 Maxim Levitsky. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:45 -04:00
Paolo Bonzini
869b44211a kvm: x86: protect masterclock with a seqcount
Protect the reference point for kvmclock with a seqcount, so that
kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
updates will also run in parallel and not bounce the kvmclock cacheline.

Of the variables that were protected by pvclock_gtod_sync_lock,
nr_vcpus_matched_tsc is different because it is updated outside
pvclock_update_vm_gtod_copy and read inside it.  Therefore, we
need to keep it protected by a spinlock.  In fact it must now
be a raw spinlock, because pvclock_update_vm_gtod_copy, being the
write-side of a seqcount, is non-preemptible.  Since we already
have tsc_write_lock which is a raw spinlock, we can just use
tsc_write_lock as the lock that protects the write-side of the
seqcount.
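
A kernel-style sketch of the resulting scheme (field names follow the
description above; this is illustrative, not the actual KVM code):

  #include <linux/seqlock.h>

  struct masterclock {
      raw_spinlock_t tsc_write_lock;       /* the write-side lock */
      seqcount_raw_spinlock_t pvclock_sc;  /* initialized against it */
      u64 master_kernel_ns;
  };

  static void masterclock_update(struct masterclock *mc, u64 ns)
  {
      unsigned long flags;

      raw_spin_lock_irqsave(&mc->tsc_write_lock, flags);
      write_seqcount_begin(&mc->pvclock_sc);  /* non-preemptible */
      mc->master_kernel_ns = ns;
      write_seqcount_end(&mc->pvclock_sc);
      raw_spin_unlock_irqrestore(&mc->tsc_write_lock, flags);
  }

  /* vCPUs read in parallel, retrying instead of taking a lock. */
  static u64 masterclock_read(struct masterclock *mc)
  {
      unsigned int seq;
      u64 ns;

      do {
          seq = read_seqcount_begin(&mc->pvclock_sc);
          ns = mc->master_kernel_ns;
      } while (read_seqcount_retry(&mc->pvclock_sc, seq));
      return ns;
  }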

Co-developed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-6-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:44 -04:00
Oliver Upton
c68dc1b577 KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
Handling the migration of TSCs correctly is difficult, in part because
Linux does not provide userspace with the ability to retrieve a (TSC,
realtime) clock pair for a single instant in time. In lieu of a more
convenient facility, KVM can report similar information in the kvm_clock
structure.

Provide userspace with a host TSC & realtime pair iff the realtime clock
is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
realtime value, advance the KVM clock by the amount of elapsed time. Do
not step the KVM clock backwards, though, as it is a monotonic
oscillator.
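
A sketch of a userspace reader, relying only on the kvm_clock_data
fields and flags added by this patch:

  #include <linux/kvm.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  static int snapshot_kvmclock(int vm_fd)
  {
      struct kvm_clock_data data = { 0 };

      if (ioctl(vm_fd, KVM_GET_CLOCK, &data) < 0)
          return -1;

      /* Both values were captured at the same instant. */
      if (data.flags & KVM_CLOCK_REALTIME)
          printf("kvmclock %llu ns at realtime %llu ns\n",
                 (unsigned long long)data.clock,
                 (unsigned long long)data.realtime);
      if (data.flags & KVM_CLOCK_HOST_TSC)
          printf("host TSC %llu\n", (unsigned long long)data.host_tsc);
      return 0;
  }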

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210916181538.968978-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:44 -04:00
Paolo Bonzini
a25c78d04c Merge commit 'kvm-pagedata-alloc-fixes' into HEAD 2021-10-18 14:13:37 -04:00
Paolo Bonzini
fa13843d15 KVM: X86: fix lazy allocation of rmaps
If allocation of rmaps fails, but some of the pointers have already been written,
those pointers can be cleaned up when the memslot is freed, or even reused later
for another attempt at allocating the rmaps.  Therefore there is no need to
WARN, as done for example in memslot_rmap_alloc, but the allocation *must* be
skipped lest KVM overwrite the previous pointer and indeed leak memory.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:17 -04:00
David Stevens
deae4a10f1 KVM: x86: only allocate gfn_track when necessary
Avoid allocating the gfn_track arrays if nothing needs them. If there
are no users of the API external to KVM (i.e. no GVT-g), then page
tracking is only needed for shadow page tables. This means that when tdp
is enabled and there are no external users, the gfn_track arrays can be
lazily allocated when the shadow MMU is actually used. This avoids
allocations equal to 0.05% of guest memory when nested virtualization is
not used, if the kernel is compiled without GVT-g.

Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210922045859.2011227-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:58 -04:00
Juergen Gross
78b497f2e6 kvm: use kvfree() in kvm_arch_free_vm()
By switching from kfree() to kvfree() in kvm_arch_free_vm(), Arm64 can
use the common variant. This can be accomplished by adding another
macro __KVM_HAVE_ARCH_VM_FREE, which will be used only by x86 for now.

Further simplification can be achieved by adding __kvm_arch_free_vm()
doing the common part.
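
A sketch of the resulting pattern, closely following the commit text
(the exact kernel plumbing may differ):

  /* Common part, usable by any architecture. */
  static inline void __kvm_arch_free_vm(struct kvm *kvm)
  {
      kvfree(kvm);    /* handles both kmalloc'd and vmalloc'd VMs */
  }

  #ifndef __KVM_HAVE_ARCH_VM_FREE
  static inline void kvm_arch_free_vm(struct kvm *kvm)
  {
      __kvm_arch_free_vm(kvm);
  }
  #endif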

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-Id: <20210903130808.30142-5-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:57 -04:00
Oliver Upton
55c0cefbdb KVM: x86: Fix potential race in KVM_GET_CLOCK
Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock
outside of the pvclock sync lock. This is problematic, as the clock
value written to the user may or may not actually correspond to a stable
TSC.

Fix the race by populating the entire kvm_clock_data structure behind
the pvclock_gtod_sync_lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-4-oupton@google.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:47 -04:00
Paolo Bonzini
45e6c2fac0 KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions
No functional change intended.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:47 -04:00
Paolo Bonzini
6b6fcd2804 kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
Updates to the kvmclock parameters need to do a complicated dance of
KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking
pvclock_gtod_sync_lock.  Place that dance in two functions that can be
called on all of: master clock updates, KVM_SET_CLOCK, and Hyper-V
reenlightenment.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:47 -04:00
Maxim Levitsky
5228eb96a4 KVM: x86: nSVM: implement nested TSC scaling
This was tested by booting a nested guest with TSC=1GHz,
observing the clocks, and doing about 100 cycles of migration.

Note that a QEMU patch is needed to support migration, because
a new MSR needs to be placed in the migration state.

The patch will be sent to the QEMU mailing list soon.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-14-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:46 -04:00
Longpeng(Mike)
515a0c79e7 kvm: irqfd: avoid update unmodified entries of the routing
All of the irqfds are updated when the irq routing is updated,
which is too expensive if there are many irqfds.

However, we can reduce the cost by avoiding unnecessary updates.
For MSI-type irqs on x86, the update can be skipped if the MSI
values have not changed.

VFIO migration benefits from this optimization. The test VM has
128 vcpus and 8 VFs (with 65 vectors enabled), so the VM has more
than 520 irqfds. We measure the cost of vfio_msix_enable (in QEMU,
it sets the routing for each irqfd) for each VF, and the total
cost is significantly reduced:

                Origin (ms)    With this patch (ms)
1st             8              4
2nd             15             5
3rd             22             6
4th             24             6
5th             36             7
6th             44             7
7th             51             8
8th             58             8
Total           258            51

We're also trying to optimize the QEMU part [1], but it's still
worth optimizing KVM to gain more benefit.

[1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04215.html
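
A standalone sketch of the skip-if-unchanged check (illustrative; the
kernel compares the MSI fields of the routing entry):

  #include <stdbool.h>

  struct msi_route {
      unsigned int address_lo;
      unsigned int address_hi;
      unsigned int data;
  };

  /* Only push a new route to an irqfd when the MSI message changed. */
  static bool msi_route_changed(const struct msi_route *old,
                                const struct msi_route *cur)
  {
      return old->address_lo != cur->address_lo ||
             old->address_hi != cur->address_hi ||
             old->data       != cur->data;
  }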

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Message-Id: <20210827080003.2689-1-longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:10 -04:00
Sean Christopherson
25b9784586 KVM: x86: Manually retrieve CPUID.0x1 when getting FMS for RESET/INIT
Manually look for a CPUID.0x1 entry instead of bouncing through
kvm_cpuid() when retrieving the Family-Model-Stepping information for
vCPU RESET/INIT.  This fixes a potential undefined behavior bug due to
kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
a.k.a. the index.

A more minimal fix would be to simply zero "dummy", but the extra work in
kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
holding the CPU's FMS information, not as holding CPUID.0x1.EAX.  KVM's
usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
having yet another way for userspace to provide inconsistent data.
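
A standalone sketch of the manual lookup (the 0x600 fallback is an
assumption about the default FMS value, not taken from this log):

  struct cpuid_entry {
      unsigned int function;
      unsigned int eax;
  };

  /* Scan the userspace-provided entries for leaf 0x1 instead of going
   * through kvm_cpuid() with an uninitialized index. */
  static unsigned int guest_fms(const struct cpuid_entry *e, unsigned int nent)
  {
      unsigned int i;

      for (i = 0; i < nent; i++)
          if (e[i].function == 0x1)
              return e[i].eax;   /* Family-Model-Stepping */
      return 0x600;              /* assumed default when leaf 0x1 is absent */
  }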

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:08 -04:00
Sean Christopherson
62dd57dd67 KVM: x86: WARN on non-zero CRs at RESET to detect improper initalization
WARN if CR0, CR3, or CR4 are non-zero at RESET, which given the current
KVM implementation, really means WARN if they're not zeroed at vCPU
creation.  VMX in particular has several ->set_*() flows that read other
registers to handle side effects, and because those flows are common to
RESET and INIT, KVM subtly relies on emulated/virtualized registers to be
zeroed at vCPU creation in order to do the right thing at RESET.

Use CRs as a sentinel because they are most likely to be written as side
effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0'
to correctly reflect the state of the vCPU's MMU.  CRs are also loaded
and stored from/to the VMCS, and so this adds some level of coverage to verify
that KVM doesn't conflate zero-allocating the VMCS with properly
initializing the VMCS with VMWRITEs.

Note, '0' is somewhat arbitrary, vCPU creation can technically stuff any
value for a register so long as it's coherent with respect to the current
vCPU state.  In practice, '0' works for all registers and is convenient.

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:07 -04:00
Sean Christopherson
583d369b36 KVM: x86: Fold fx_init() into kvm_arch_vcpu_create()
Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(),
dropping the superfluous check on vcpu->arch.guest_fpu that was blindly
and wrongly added by commit ed02b21309 ("KVM: SVM: Guest FPU state
save/restore not needed for SEV-ES guest").

Note, KVM currently allocates and then frees FPU state for SEV-ES guests,
rather than avoid the allocation in the first place.  While that approach
is inarguably inefficient and unnecessary, it's a cleanup for the future.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
e8f65b9bb4 KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create
Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as
XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since
commit a554d207dc ("KVM: X86: Processor States following Reset or INIT").

Back when XCR0 support was added by commit 2acf923e38 ("KVM: VMX:
Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET
and INIT.  Ignoring the fact that calling fx_init() for INIT is obviously
wrong, e.g. FPU state after INIT is not the same as after RESET, setting
XCR0 in fx_init() was correct.

Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU
creation (ignore the terrible name) by commit 0ee6a51725 ("x86/fpu,
kvm: Simplify fx_init()").  Finally, commit 95a0d01eef ("KVM: x86: Move
all vcpu init code into kvm_arch_vcpu_create()") killed off
kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of
guest state during vCPU creation.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
5ebbc470d7 KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create
Drop code to set CR0.ET for the guest during initialization of the guest
FPU.  The code was added as a misguided bug fix by commit 380102c8e4
("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue
where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM
systems.  While init_vmcb() did set CR0.ET, it only did so in the VMCB,
and subtly did not update vcpu->cr0.  Stuffing CR0.ET worked around the
immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB
being out of sync.  That underlying bug was eventually remedied by commit
18fa000ae4 ("KVM: SVM: Reset cr0 properly on vcpu reset").

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
ff8828c84f KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT
Do not blindly mark all registers as available+dirty at RESET/INIT, and
instead rely on writes to registers to go through the proper mutators or
to explicitly mark registers as dirty.  INIT in particular does not blindly
overwrite all registers, e.g. select bits in CR0 are preserved across INIT,
thus marking registers available+dirty without first reading the register
from hardware is incorrect.

In practice this is a benign bug as KVM doesn't let the guest control CR0
bits that are preserved across INIT, and all other true registers are
explicitly written during the RESET/INIT flows.  The PDPTRs and EX_INFO
"registers" are not explicitly written, but accessing those values during
RESET/INIT is nonsensical and would be a KVM bug regardless of register
caching.

Fixes: 66f7b72e11 ("KVM: x86: Make register state after reset conform to specification")
[sean: !!! NOT FOR STABLE !!!]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Sean Christopherson
94c641ba7a KVM: x86: Simplify retrieving the page offset when loading PDTPRs
Replace impressively complex "logic" for computing the page offset from
CR3 when loading PDPTRs.  Unlike other paging modes, the address held in
CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits
11:5 need to be used as the offset from the gfn when reading PDPTRs.

The existing calculation originated in commit 1342d3536d ("[PATCH] KVM:
MMU: Load the pae pdptrs on cr3 change like the processor does"), which
read the PDPTRs from guest memory as individual 8-byte loads.  At the
time, the so called "offset" was the base index of PDPTR0 as a _u64_, not
a byte offset.  Naming aside, the computation was useful and arguably
simplified the overall flow.

Unfortunately, when commit 195aefde9c ("KVM: Add general accessors to
read and write guest memory") added accessors with offsets at byte
granularity, the cleverness of the original code was lost and KVM was
left with convoluted code for a simple operation.
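
The resulting computation is a one-liner; a sketch following the
commit's description (bits 11:5 of CR3, i.e. a 32-byte-aligned offset
within the page):

  #include <stdint.h>

  static uint32_t pdptr_page_offset(uint64_t cr3)
  {
      return (uint32_t)(cr3 & 0xFE0);   /* GENMASK(11, 5) */
  }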

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Sean Christopherson
15cabbc259 KVM: x86: Subsume nested GPA read helper into load_pdptrs()
Open code the call to mmu->translate_gpa() when loading nested PDPTRs and
kill off the existing helper, kvm_read_guest_page_mmu(), to discourage
incorrect use.  Reading guest memory straight from an L2 GPA is extremely
rare (as evidenced by the lack of users), as very few constructs in x86
specify physical addresses, even fewer are virtualized by KVM, and even
fewer yet require emulation of L2 by L0 KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Juergen Gross
a1c42ddedf kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS
KVM_MAX_VCPU_ID does not specify the highest allowed vcpu-id, but the
number of allowed vcpu-ids. This has already led to confusion, so
rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more
clear.

Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-3-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Vitaly Kuznetsov
620b2438ab KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_mask
kvm_make_vcpus_request_mask() already disables preemption so just like
kvm_make_all_cpus_request_except() it can be switched to using
pre-allocated per-cpu cpumasks. This allows for improvements for both
users of the function: in Hyper-V emulation code 'tlb_flush' can now be
dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask()
gets rid of dynamic allocation.

cpumask_available() checks in kvm_make_vcpu_request() and
kvm_kick_many_cpus() can now be dropped as they check for an impossible
condition: kvm_init() makes sure per-cpu masks are allocated.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Vitaly Kuznetsov
381cecc5d7 KVM: Drop 'except' parameter from kvm_make_vcpus_request_mask()
Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for
'except' parameter so it can just be dropped.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Fares Mehanna
e1fc1553cd kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
Intel PMU MSRs are in msrs_to_save_all[], so add AMD PMU MSRs as well
to get consistent behavior between Intel and AMD when using KVM_GET_MSRS,
KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.

We have to add legacy and new MSRs to handle guests running without
X86_FEATURE_PERFCTR_CORE.

Signed-off-by: Fares Mehanna <faresx@amazon.de>
Message-Id: <20210915133951.22389-1-faresx@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 11:30:14 -04:00
Maxim Levitsky
37687c403a KVM: x86: reset pdptrs_from_userspace when exiting smm
When exiting SMM, the PDPTRs are loaded again from guest memory.

This fixes a theoretical bug in which an exit from SMM triggers entry
to the nested guest, which re-uses some of the migration code that
uses this flag as a workaround for legacy userspace.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:16 -04:00
Sean Christopherson
94c245a245 KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entry
Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is
shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU
that's guaranteed to exist).  Using kvm_get_vcpu() to find vCPU0 works,
but it's a rather odd and suboptimal method to check the index of a given
vCPU.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:12 -04:00
Haimin Zhang
eb7511bf91 KVM: x86: Handle SRCU initialization failure during page track init
Check the return of init_srcu_struct(), which can fail due to OOM, when
initializing the page track mechanism.  Lack of checking leads to a NULL
pointer deref found by a modified syzkaller.

Reported-by: TCS Robot <tcs_robot@tencent.com>
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
[Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:09 -04:00
Sean Christopherson
03a6e84069 KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT
Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
RESET, this is a nop, as the vCPU is zero-allocated.  For INIT, the bug has
likely escaped notice because no firmware/kernel puts its page table root
at PA=0, let alone relies on INIT to get the desired CR3 for such page
tables.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210921000303.400537-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:08 -04:00
Sean Christopherson
7117003fe4 KVM: x86: Mark all registers as avail/dirty at vCPU creation
Mark all registers as available and dirty at vCPU creation, as the vCPU has
obviously not been loaded into hardware, let alone been given the chance to
be modified in hardware.  On SVM, reading from "uninitialized" hardware is
a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
hardware does not allow for arbitrary field encoding schemes.

On VMX, backing memory for VMCSes is also zero allocated, but true
initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
architectural specification technically allows CPU implementations to
encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
the inverted value of every field, which would result in VMREAD to a
zero-allocated field returns all ones.

In practice, only the AR_BYTES fields are known to be manipulated by
hardware during VMREAD/VMWRITE; no known hardware or VMM (for nested VMX)
does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
other words, this is technically a bug fix, but practically speaking it's
a glorified nop.

Failure to mark registers as available has been a lurking bug for quite
some time.  The original register caching supported only GPRs (+RIP, which
is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
worked because the two cacheable registers, RIP and RSP, are generally
speaking not read as side effects in other flows.

Arguably, commit aff48baa34 ("KVM: Fetch guest cr3 from hardware on
demand") was the first instance of failure to mark regs available.  While
_just_ marking CR3 available during vCPU creation wouldn't have fixed the
VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
reading GUEST_CR3 when it's available would have avoided VMREAD to a
technically-uninitialized VMCS.

Fixes: aff48baa34 ("KVM: Fetch guest cr3 from hardware on demand")
Fixes: 6de4f3ada4 ("KVM: Cache pdptrs")
Fixes: 6de12732c4 ("KVM: VMX: Optimize vmx_get_rflags()")
Fixes: 2fb92db1ec ("KVM: VMX: Cache vmcs segment fields")
Fixes: bd31fe495d ("KVM: VMX: Add proper cache tracking for CR0")
Fixes: f98c1e7712 ("KVM: VMX: Add proper cache tracking for CR4")
Fixes: 5addc23519 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
Fixes: 8791585837 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210921000303.400537-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:07 -04:00
Zelin Deng
d9130a2dfd KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
When MSR_IA32_TSC_ADJUST is written by the guest due to the TSC ADJUST
feature, especially when there's a big TSC warp (e.g. a new vCPU is
hot-added into a VM which has been up for a long time), tsc_offset is
increased by a large value and then control goes back to the guest. This
causes a system time jump, as tsc_timestamp is not adjusted in the
meantime and pvclock is monotonic.
To fix this, just notify KVM to update the vCPU's guest time before
returning to the guest.

Cc: stable@vger.kernel.org
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 07:07:03 -04:00
Maxim Levitsky
61e5f69ef0 KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
while running.

This change is mostly intended for more robust single stepping
of the guest and it has the following benefits when enabled:

* Resuming from a breakpoint is much more reliable.
  When resuming execution from a breakpoint, with interrupts enabled,
  more often than not, KVM would inject an interrupt and make the CPU
  jump immediately to the interrupt handler and eventually return to
  the breakpoint, to trigger it again.

  From the user point of view it looks like the CPU never executed a
  single instruction and in some cases that can even prevent forward
  progress, for example, when the breakpoint is placed by an automated
  script (e.g lx-symbols), which does something in response to the
  breakpoint and then continues the guest automatically.
  If the script execution takes enough time for another interrupt to
  arrive, the guest will be stuck on the same breakpoint RIP forever.

* Normal single stepping is much more predictable, since it won't
  land the debugger into an interrupt handler.

* RFLAGS.TF has less chance to be leaked to the guest:

  We set that flag behind the guest's back to do single stepping
  but if single step lands us into an interrupt/exception handler
  it will be leaked to the guest in the form of being pushed
  to the stack.
  This doesn't completely eliminate this problem as exceptions
  can still happen, but at least this reduces the chances
  of this happening.
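
A sketch of a debugger-style user enabling the new flag (assuming only
the uapi names from this patch):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int single_step_without_irqs(int vcpu_fd)
  {
      struct kvm_guest_debug dbg = {
          .control = KVM_GUESTDBG_ENABLE |
                     KVM_GUESTDBG_SINGLESTEP |
                     KVM_GUESTDBG_BLOCKIRQ,  /* no interrupt injection */
      };

      return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
  }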

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:37 -04:00
Mingwei Zhang
71f51d2c32 KVM: x86/mmu: Add detailed page size stats
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large pages of 1GB (or larger) are adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.

So remove lpages, since it is easy for user space to aggregate the info.
Instead, provide comprehensive page stats for all sizes from 4K to 512G.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Jing Zhang
f95937ccf5 KVM: stats: Support linear and logarithmic histogram statistics
Add new types of KVM stats: linear and logarithmic histograms.
Histograms are very useful for observing the value distribution
of time- or size-related stats.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:32 -04:00
Maxim Levitsky
06ef813466 KVM: SVM: avoid refreshing avic if its state didn't change
Since AVIC can be inhibited and uninhibited rapidly, it is possible that
we have nothing to do by the time svm_refresh_apicv_exec_ctrl
is called.

Detect and avoid this, which will be useful when we start calling
avic_vcpu_load/avic_vcpu_put when the AVIC inhibition state changes.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:27 -04:00
Maxim Levitsky
b0a1637f64 KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
Currently on SVM, the kvm_request_apicv_update toggles the APICv
memslot without doing any synchronization.

If there is a mismatch between that memslot state and the AVIC state,
on one of the vCPUs, an APIC mmio access can be lost:

For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: access an APIC mmio register.

Since AVIC is still disabled on VCPU1, the access will not be intercepted
by it, and neither will it cause an MMIO fault; rather, it will just be
read/written from/to the dummy page mapped into the
APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

Fix that by adding a lock guarding the AVIC state changes, and carefully
order the operations of kvm_request_apicv_update to avoid this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of guest mode,
but don't yet see the new AVIC state.
Only after (4) can all vCPUs update their AVIC state and resume.
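
A kernel-style sketch of that ordering (lock type and field names are
assumptions for illustration):

  static void request_apicv_update(struct kvm *kvm, bool activate,
                                   unsigned long bit)
  {
      mutex_lock(&kvm->arch.apicv_update_lock);              /* (1) */

      kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);  /* (2) */

      if (activate)                                          /* (3) */
          kvm->arch.apicv_inhibit_reasons &= ~BIT(bit);
      else
          kvm->arch.apicv_inhibit_reasons |= BIT(bit);

      mutex_unlock(&kvm->arch.apicv_update_lock);            /* (4) */
  }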

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:23 -04:00
Maxim Levitsky
36222b117e KVM: x86: don't disable APICv memslot when inhibited
Thanks to the previous patches, it is now possible to keep the APICv
memslot always enabled; it will be invisible to the guest
when APICv is inhibited.

This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:22 -04:00
Peter Xu
4139b1972a KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
Introduce kvm_mmu_slot_lpages() to calculate lpage_info and rmap array size.
The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather
than fetching from the memslot pointer.  Start to use the latter one in
kvm_alloc_memslot_metadata().

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:04:51 -04:00
Sean Christopherson
ad0577c375 KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot()
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
VMX and SVM instructions use asm goto to handle the fault (or in the
case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
flow that invoked it from assembly code.

Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Paolo Bonzini
1ccb6f983a KVM: VMX: Reset DR6 only when KVM_DEBUGREG_WONT_EXIT
Commit efdab99281 ("KVM: x86: fix escape of guest dr6 to the host")
fixed a bug by resetting DR6 unconditionally when the vcpu is scheduled out.

But writing to debug registers is slow, and it can be visible in perf results
sometimes, even if neither the host nor the guest activate breakpoints.

Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case
where DR6 gets the guest value, and it never happens at all on SVM,
the register can be cleared in vmx.c right after reading it.

Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
375e28ffc0 KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT
Commit c77fb5fe6f ("KVM: x86: Allow the guest to run with dirty debug
registers") allows the guest to access DRs without exiting when
KVM_DEBUGREG_WONT_EXIT, and we need to ensure that they are synchronized
on entry to the guest---including DR6, which was not synced before the commit.

But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT,
but also when KVM_DEBUGREG_BP_ENABLED.  The second case is unnecessary
and just leads to one more case that leaks stale DR6 to the host, which
has to be resolved by unconditionally resetting DR6 in kvm_arch_vcpu_put().

Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters
on VMX because SVM always uses the DR6 value from the VMCB.  So move this
line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT.

Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Lai Jiangshan
34e9f86007 KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD
Commit ae561edeb4 ("KVM: x86: DR0-DR3 are not clear on reset") added code to
ensure eff_db are updated when they're modified through non-standard paths.

But there is no reason to also update hardware DRs unless hardware breakpoints
are active or DR exiting is disabled, and in those cases updating hardware is
handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.

KVM_DEBUGREG_RELOAD just causes an unnecessary load of hardware DRs and
is better removed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
319afe6856 KVM: xen: do not use struct gfn_to_hva_cache
gfn_to_hva_cache is not thread-safe, so it is usually used only within
a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
implementation has such a cache in kvm->arch, but it is not really
used except to store the location of the shared info page.  Replace
shinfo_set and shinfo_cache with just the value that is passed via
KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
initialization value is not zero anymore and therefore kvm_xen_init_vm
needs to be introduced.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05 03:31:40 -04:00
Hamza Mahfooz
269e9552d2 KVM: const-ify all relevant uses of struct kvm_memory_slot
As alluded to in commit f36f3f2846 ("KVM: add "new" argument to
kvm_arch_commit_memory_region"), a bunch of other places where struct
kvm_memory_slot is used need to be refactored to preserve the
"const"ness of struct kvm_memory_slot across the board.

Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
[Do not touch body of slot_rmap_walk_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:04:24 -04:00
Sean Christopherson
4c72ab5aa6 KVM: x86: Preserve guest's CR0.CD/NW on INIT
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
defined by both Intel's SDM and AMD's APM.

Note, current versions of Intel's SDM are very poorly written with
respect to INIT behavior.  Table 9-1. "IA-32 and Intel 64 Processor
States Following Power-up, Reset, or INIT" quite clearly lists power-up,
RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1.  But the SDM
then attempts to qualify CD/NW behavior in a footnote:

  2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
     are cleared.

Presumably that footnote is only meant for INIT, as the RESET case and
especially the power-up case are rather non-sensical.  Another footnote
all but confirms that:

  6. Internal caches are invalid after power-up and RESET, but left
     unchanged with an INIT.

Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
else can hack their BIOS to check RESET and power-up :-D).

Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-47-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:02:00 -04:00
Sean Christopherson
265e43530c KVM: SVM: Emulate #INIT in response to triple fault shutdown
Emulate a full #INIT instead of simply initializing the VMCB if the
guest hits a shutdown.  Initializing the VMCB but not other vCPU state,
much of which is mirrored by the VMCB, results in incoherent and broken
vCPU state.

Ideally, KVM would not automatically init anything on shutdown, and
instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
userspace to explicitly INIT or RESET the vCPU.  Even better would be to
add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
(and SMI on Intel CPUs).

But, that ship has sailed, and emulating #INIT is the next best thing as
that has at least some connection with reality since there exist bare
metal platforms that automatically INIT the CPU if it hits shutdown.

Fixes: 46fe4ddd9d ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:59 -04:00
Sean Christopherson
f39e805ee1 KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
common x86.  VMX and SVM now have near-identical sequences, the only
difference being that VMX updates the exception bitmap.  Updating the
bitmap on SVM is unnecessary, but benign.  Unfortunately it can't be left
behind in VMX due to the need to update exception intercepts after the
control registers are set.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-37-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:57 -04:00
Sean Christopherson
908b7d43c0 KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0
Skip the MMU permission_fault() check if paging is disabled when
verifying the cached MMIO GVA is usable.  The check is unnecessary and
can theoretically get a false positive since the MMU doesn't zero out
"permissions" or "pkru_mask" when guest paging is disabled.

The obvious alternative is to zero out all the bitmasks when configuring
nonpaging MMUs, but that's unnecessary work and doesn't align with the
MMU's general approach of doing as little as possible for flows that are
supposed to be unreachable.

This is nearly a nop as the false positive is nothing more than an
insignificant performance blip, and more or less limited to string MMIO
when L1 is running with paging disabled.  KVM doesn't cache MMIO if L2 is
active with nested TDP since the "GVA" is really an L2 GPA.  If L2 is
active without nested TDP, then paging can't be disabled as neither VMX
nor SVM allows entering the guest without paging of some form.

Jumping back to L1 with paging disabled, in that case direct_map is true
and so KVM will use CR2 as a GPA; the only time it doesn't is if the
fault from the emulator doesn't match or emulator_can_use_gpa(), and that
fails only on string MMIO and other instructions with multiple memory
operands.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:54 -04:00
Sean Christopherson
49d8665cc2 KVM: x86: Move EDX initialization at vCPU RESET to common code
Move the EDX initialization at vCPU RESET, which is now identical between
VMX and SVM, into common code.

No functional change intended.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:52 -04:00
Sean Christopherson
df37ed38e6 KVM: x86: Flush the guest's TLB on INIT
Flush the guest's TLB on INIT, as required by Intel's SDM.  Although
AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
that's correct, as the APM also states that the TLB is flushed on "External
initialization of the processor."  Regardless, relying on the guest to be
paranoid is unnecessarily risky, while an unnecessary flush is benign
from a functional perspective and likely has no measurable impact on
guest performance.

Note, as of the April 2021 version of Intel's SDM, it also contradicts
itself with respect to TLB flushing.  The overview of INIT explicitly
calls out the TLBs as being invalidated, while a table later in the same
section says they are unchanged.

  9.1 INITIALIZATION OVERVIEW:
    The major difference is that during an INIT, the internal caches, MSRs,
    MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
    are invalidated as with a hardware reset)

  Table 9-1:

  Register                    Power up    Reset      INIT
  Data and Code Cache, TLBs:  Invalid[6]  Invalid[6] Unchanged

Given Core2's erratum[*] about global TLB entries not being flushed on INIT,
it's safe to assume that the table is simply wrong.

  AZ28. INIT Does Not Clear Global Entries in the TLB
  Problem: INIT may not flush a TLB entry when:
    • The processor is in protected mode with paging enabled and the page global enable
      flag is set (PGE bit of CR4 register)
    • G bit for the page table entry is set
    • TLB entry is present in TLB when INIT occurs
    • Software may encounter unexpected page fault or incorrect address translation due
      to a TLB entry erroneously left in TLB after INIT.

  Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
              bits PG or PE) registers before writing to memory early in BIOS
              code to clear all the global entries from TLB.

  Status: For the steppings affected, see the Summary Tables of Changes.

[*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf

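A minimal sketch of the resulting change, assuming the flush is requested
as part of vCPU RESET/INIT processing in kvm_vcpu_reset() (the request and
helper exist upstream; the exact placement here is illustrative):

  /* Mirror a hardware reset: invalidate the guest's TLB on INIT too. */
  kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
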
Fixes: 6aa8b732ca ("[PATCH] kvm: userspace interface")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:48 -04:00
Maxim Levitsky
df63202fe5 KVM: x86: APICv: drop immediate APICv disablement on current vCPU
The special case of disabling APICv on the current vCPU right away in
kvm_request_apicv_update doesn't bring much benefit over raising
KVM_REQ_APICV_UPDATE on it instead, since this request will be processed
on the next entry to the guest.
(The comment about having another #VMEXIT is wrong.)

It also hides various assumptions that the APICv enable state matches
the APICv inhibit state, as this special case only makes those states
match on the current vCPU.

Previous patches fixed a few such assumptions, so now it should be safe
to drop this special case.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:48 -04:00
Peter Xu
ec1cf69c37 KVM: X86: Add per-vm stat for max rmap list size
Add a new statistic, max_mmu_rmap_size, which stores the maximum rmap
list size seen for the VM.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:37 -04:00
Sean Christopherson
e489a4a6bd KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()
Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run().  This
allows a future patch to allow synchronizing select state for protected
VMs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <889017a8d31cea46472e0c64b234ef5919278ed9.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:37 -04:00
Sean Christopherson
673692735f KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0e8760a26151f47dc47052b25ca8b84fffe0641e.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:36 -04:00
Paolo Bonzini
fa7a549d32 KVM: x86: accept userspace interrupt only if no event is injected
Once an exception has been injected, any side effects related to
the exception (such as setting CR2 or DR6) have taken place.
Therefore, once KVM sets the VM-entry interruption information
field or the AMD EVENTINJ field, the next VM-entry must deliver that
exception.

Pending interrupts are processed after injected exceptions, so
in theory it would not be a problem to use KVM_INTERRUPT when
an injected exception is present.  However, DOSEMU is using
run->ready_for_interrupt_injection to detect interrupt windows
and then using KVM_SET_SREGS/KVM_SET_REGS to inject the
interrupt manually.  For this to work, the interrupt window
must be delayed after the completion of the previous event
injection.

Cc: stable@vger.kernel.org
Reported-by: Stas Sergeev <stsp2@yandex.ru>
Tested-by: Stas Sergeev <stsp2@yandex.ru>
Fixes: 71cc849b70 ("KVM: x86: Fix split-irqchip vs interrupt injection window request")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-30 07:53:02 -04:00
Vitaly Kuznetsov
0a31df6823 KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access
MSR_KVM_ASYNC_PF_ACK MSR is part of interrupt based asynchronous page fault
interface and not the original (deprecated) KVM_FEATURE_ASYNC_PF. This is
stated in Documentation/virt/kvm/msr.rst.

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210722123018.260035-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-26 08:26:53 -04:00
Linus Torvalds
405386b021 Allow loading KVM on 32-bit non-PAE builds again

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Allow loading KVM on 32-bit non-PAE builds again

 - Fixes for host SMIs on AMD

 - Fixes for guest SMIs on AMD

 - Fixes for selftests on s390 and ARM

 - Fix memory leak

 - Enforce no-instrumentation area on vmentry when hardware breakpoints
   are in use.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: selftests: smm_test: Test SMM enter from L2
  KVM: nSVM: Restore nested control upon leaving SMM
  KVM: nSVM: Fix L1 state corruption upon return from SMM
  KVM: nSVM: Introduce svm_copy_vmrun_state()
  KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
  KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
  KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
  KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
  KVM: SVM: add module param to control the #SMI interception
  KVM: SVM: remove INIT intercept handler
  KVM: SVM: #SMI interception must not skip the instruction
  KVM: VMX: Remove vmx_msr_index from vmx.h
  KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
  KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
  kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
  KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
  KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
  KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
  KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
  KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
  ...
2021-07-15 11:56:07 -07:00
Lai Jiangshan
f85d401606 KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run().  Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.

This however violates the non-instrumentable nature around VM entry
and exit; for example, a host data breakpoint set on vcpu->arch.cr2 can
fire inside that non-instrumentable region.

Another issue is consistency.  When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled.  The host tracing tools may see different results depending
on what the guest is doing.

To fix these problems, clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter whether the guest is
using them or not.

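A minimal sketch of the resulting logic, assuming it sits in
vcpu_enter_guest() just before vmentry (hw_breakpoint_active() and
set_debugreg() are existing helpers):

  /* If the host has any breakpoints armed, clear %db7 so that none can
   * fire inside the non-instrumentable vmentry/vmexit region. */
  if (hw_breakpoint_active())
          set_debugreg(0, 7);
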
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-15 10:19:41 -04:00
Sean Christopherson
f0414b078d Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled"
Let KVM load if EFER.NX=0 even if NX is supported, the analysis and
testing (or lack thereof) for the non-PAE host case was garbage.

If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
skips over the entire EFER sequence.  Hopefully that can be changed in
the future to allow KVM to require EFER.NX, but the motivation behind
KVM's requirement isn't yet merged.  Reverting and revisiting the mess
at a later date is by far the safest approach.

This reverts commit 8bbed95d2c.

Fixes: 8bbed95d2c ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210625001853.318148-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-14 12:17:55 -04:00
Linus Torvalds
1423e2660c Fixes and improvements for FPU handling on x86

Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Thomas Gleixner:
 "Fixes and improvements for FPU handling on x86:

   - Prevent sigaltstack out of bounds writes.

     The kernel unconditionally writes the FPU state to the alternate
     stack without checking whether the stack is large enough to
     accommodate it.

     Check the alternate stack size before doing so and in case it's too
     small force a SIGSEGV instead of silently corrupting user space
     data.

   - MINSIGSTKSZ and SIGSTKSZ are constants in signal.h and have never
     been updated despite the fact that the FPU state which is stored on
     the signal stack has grown over time which causes trouble in the
     field when AVX512 is available on a CPU. The kernel does not expose
     the minimum requirements for the alternate stack size depending on
     the available and enabled CPU features.

     ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
     Add it to x86 as well.

   - A major cleanup of the x86 FPU code. The recent discoveries of
     XSTATE related issues unearthed quite some inconsistencies,
     duplicated code and other issues.

     The fine granular overhaul addresses this, makes the code more
     robust and maintainable, which allows to integrate upcoming XSTATE
     related features in sane ways"

* tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again
  x86/fpu/signal: Let xrstor handle the features to init
  x86/fpu/signal: Handle #PF in the direct restore path
  x86/fpu: Return proper error codes from user access functions
  x86/fpu/signal: Split out the direct restore code
  x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing()
  x86/fpu/signal: Sanitize the xstate check on sigframe
  x86/fpu/signal: Remove the legacy alignment check
  x86/fpu/signal: Move initial checks into fpu__restore_sig()
  x86/fpu: Mark init_fpstate __ro_after_init
  x86/pkru: Remove xstate fiddling from write_pkru()
  x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate()
  x86/fpu: Remove PKRU handling from switch_fpu_finish()
  x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
  x86/fpu: Hook up PKRU into ptrace()
  x86/fpu: Add PKRU storage outside of task XSAVE buffer
  x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
  x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
  x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs()
  x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
  ...
2021-07-07 11:12:01 -07:00
Aaron Lewis
19238e75bd kvm: x86: Allow userspace to handle emulation errors
Add a fallback mechanism to the in-kernel instruction emulator that
allows userspace the opportunity to process an instruction the emulator
was unable to.  When the in-kernel instruction emulator fails to process
an instruction it will either inject a #UD into the guest or exit to
userspace with exit reason KVM_INTERNAL_ERROR.  This is because it does
not know how to proceed in an appropriate manner.  This feature lets
userspace get involved to see if it can figure out a better path
forward.

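A hedged userspace sketch of opting in and detecting the new exit
(capability and suberror names as introduced by this series; error
handling elided):

  struct kvm_enable_cap cap = {
          .cap  = KVM_CAP_EXIT_ON_EMULATION_FAILURE,
          .args = { 1 },
  };
  ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

  ioctl(vcpu_fd, KVM_RUN, NULL);
  if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR &&
      run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
          /* Inspect run->internal.data[] and try to handle the failed
           * instruction in userspace instead of killing the guest. */
  }
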
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:48 -04:00
Sean Christopherson
20f632bd00 KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper
Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
a non-nested shadow MMU.  Extract the masks from kvm_post_set_cr{0,4}(),
as the CR0/CR4 update masks must exactly match the mmu_role bits, with
one exception (see below).  The "full" CR0/CR4 will be used by future
commits to initialize the MMU and its role, as opposed to the current
approach of pulling everything from vCPU, which is incorrect for certain
flows, e.g. nested NPT.

CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
but can't be toggled via MOV CR4 while long mode is active.  I.e. LA57
needs to be in the mmu_role, but technically doesn't need to be checked
by kvm_post_set_cr4().  However, the extra check is completely benign as
the hardware restrictions simply mean LA57 will never be _the_ cause of
a MMU reset during MOV CR4.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:39 -04:00
Sean Christopherson
dbc4739b6b KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER
When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
as a u64 in various flows (mostly MMU).  Passing the params as u32s is
functionally ok since all of the affected registers reserve bits 63:32 to
zero (enforced by KVM), but it's technically wrong.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:38 -04:00
Sean Christopherson
63f5a1909f KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.

KVM does not correctly handle changes to paging related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held).  And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.

To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).

Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
would all need to be tracked.

In practice, there is no remotely sane use case for changing any paging
related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:36 -04:00
Sean Christopherson
0aa1837533 KVM: x86: Properly reset MMU context at vCPU RESET/INIT
Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
was set prior to INIT.  Simply re-initializing the current MMU is not
sufficient as the current root HPA may not be usable in the new context.
E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
KVM will fail to switch to the 32-bit pae_root and bomb on the next
VM-Enter due to running with a 64-bit CR3 in 32-bit mode.

This bug was papered over in both VMX and SVM, but still managed to rear
its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
generate the wrong MMU role.

In VMX, the INIT issue is specific to running without unrestricted guest
since unrestricted guest is available if and only if EPT is enabled.
Commit 8668a3c468 ("KVM: VMX: Reset mmu context when entering real
mode") resolved the issue by forcing a reset when entering emulated real
mode.

In SVM, commit ebae871a50 ("kvm: svm: reset mmu on VCPU reset") forced
a MMU reset on every INIT to workaround the flaw in common x86.  Note, at
the time the bug was fixed, the SVM problem was exacerbated by a complete
lack of a CR4 update.

The vendor resets will be reverted in future patches, primarily to aid
bisection in case there are non-INIT flows that rely on the existing VMX
logic.

Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
context reset.

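A minimal sketch of the detection, assuming it lives in kvm_vcpu_reset()
with CR0 sampled before the RESET/INIT state is applied:

  unsigned long old_cr0 = kvm_read_cr0(vcpu);

  /* ... RESET/INIT state is applied; CR0.PG is unconditionally cleared ... */

  if (old_cr0 & X86_CR0_PG)
          kvm_mmu_reset_context(vcpu);
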
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:35 -04:00
Jing Zhang
bc9e9e672d KVM: debugfs: Reuse binary stats descriptors
To remove code duplication, use the binary stats descriptors in the
implementation of the debugfs interface for statistics. This unifies
the definition of statistics for the binary and debugfs interfaces.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:29 -04:00
Jing Zhang
ce55c04945 KVM: stats: Support binary stats retrieval for a VCPU
Add a VCPU ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VCPU stats header,
descriptors and data.
Define VCPU statistics descriptors and header for all architectures.

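A hedged userspace sketch of consuming the new file descriptor (ioctl and
header names per this series; the VM-level variant in the commit below is
used the same way against the VM fd):

  int stats_fd = ioctl(vcpu_fd, KVM_GET_STATS_FD, NULL);
  struct kvm_stats_header hdr;

  pread(stats_fd, &hdr, sizeof(hdr), 0);
  /* hdr.desc_offset and hdr.data_offset locate the static descriptor
   * block and the live data, which can be re-read at any time. */
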
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:19 -04:00
Jing Zhang
fcfe1baedd KVM: stats: Support binary stats retrieval for a VM
Add a VM ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VM stats header,
descriptors and data.
Define VM statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:10 -04:00
Jing Zhang
0193cc908b KVM: stats: Separate generic stats from architecture specific ones
Generic KVM stats are those collected in architecture independent code
or those supported by all architectures; put all generic statistics in
a separate structure.  This ensures that they are defined the same way
in the statistics API which is being added, removing duplication among
different architectures in the declaration of the descriptors.

No functional change intended.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 11:47:56 -04:00
Thomas Gleixner
72a6c08c44 x86/pkru: Remove xstate fiddling from write_pkru()
The PKRU value of a task is stored in task->thread.pkru when the task is
scheduled out. PKRU is restored on schedule-in from there. So keeping the
XSAVE buffer up to date is a pointless exercise.

Remove the xstate fiddling and cleanup all related functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.897372712@linutronix.de
2021-06-23 19:55:51 +02:00
Dave Hansen
784a46618f x86/pkeys: Move read_pkru() and write_pkru()
write_pkru() was originally used just to write to the PKRU register.  It
was mercifully short and sweet and was not out of place in pgtable.h with
some other pkey-related code.

But, later work included a requirement to also modify the task XSAVE
buffer when updating the register.  This really is more related to the
XSAVE architecture than to paging.

Move the read/write_pkru() to asm/pkru.h.  pgtable.h won't miss them.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
2021-06-23 18:52:57 +02:00
Thomas Gleixner
1c61fada30 x86/fpu: Rename copy_kernel_to_fpregs() to restore_fpregs_from_fpstate()
This is not a copy operation; it restores the register state from the
supplied kernel buffer.

No functional changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.716058365@linutronix.de
2021-06-23 18:36:42 +02:00
Thomas Gleixner
ebe7234b08 x86/fpu: Rename copy_fpregs_to_fpstate() to save_fpregs_to_fpstate()
A copy is guaranteed to leave the source intact, which is not the case when
FNSAVE is used as that reinitializes the registers.

Save does not make such guarantees and it matches what this is about,
i.e. to save the state for a later restore.

Rename it to save_fpregs_to_fpstate().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de
2021-06-23 18:26:43 +02:00
Dave Hansen
71ef453355 x86/kvm: Avoid looking up PKRU in XSAVE buffer
PKRU is being removed from the kernel XSAVE/FPU buffers.  This removal
will probably include warnings for code that looks up PKRU in those
buffers.

KVM currently looks up the location of PKRU but doesn't even use the
pointer that it gets back.  Rework the code to avoid calling
get_xsave_addr() except in cases where its result is actually used.

This makes the code more clear and also avoids the inevitable PKRU
warnings.

This is probably a good cleanup and could go upstream independently
of any PKRU rework.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.541037562@linutronix.de
2021-06-23 17:49:47 +02:00
Paolo Bonzini
c3ab0e28a4 Merge branch 'topic/ppc-kvm' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux into HEAD
- Support for the H_RPT_INVALIDATE hypercall

- Conversion of Book3S entry/exit to C

- Bug fixes
2021-06-23 07:30:41 -04:00
Sean Christopherson
8bbed95d2c KVM: x86: WARN and reject loading KVM if NX is supported but not enabled
WARN if NX is reported as supported but not enabled in EFER.  All flavors
of the kernel, including non-PAE 32-bit kernels, set EFER.NX=1 if NX is
supported, even if NX usage is disable via kernel command line.  KVM relies
on NX being enabled if it's supported, e.g. KVM will generate illegal NPT
entries if nx_huge_pages is enabled and NX is supported but not enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210615164535.2146172-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-18 06:24:50 -04:00
Linus Torvalds
fd0aa1a456 Miscellaneous bugfixes. The main interesting one is a NULL pointer dereference
reported by syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM
 flag is cleared").

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Miscellaneous bugfixes.

  The main interesting one is a NULL pointer dereference reported by
  syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM
  flag is cleared")"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: selftests: Fix kvm_check_cap() assertion
  KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU
  KVM: X86: Fix x86_emulator slab cache leak
  KVM: SVM: Call SEV Guest Decommission if ASID binding fails
  KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
  KVM: x86: Fix fall-through warnings for Clang
  KVM: SVM: fix doc warnings
  KVM: selftests: Fix compiling errors when initializing the static structure
  kvm: LAPIC: Restore guard to prevent illegal APIC register access
2021-06-17 13:14:53 -07:00
Ashish Kalra
0dbb112304 KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall
This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.

The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.

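A hedged guest-side sketch (hypercall number and attribute flags per the
uapi header added by this series; the arguments are the GPA, the number of
pages, and the attributes encoding page size and encryption status):

  /* Tell the hypervisor these 4K pages are now shared (decrypted). */
  long ret = kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, gpa, npages,
                            KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
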
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 14:25:39 -04:00
Vitaly Kuznetsov
bca66dbcd2 KVM: x86: Check for pending interrupts when APICv is getting disabled
When APICv is active, interrupt injection doesn't raise a KVM_REQ_EVENT
request (see __apic_accept_irq()) as the required work is done by hardware.
In case KVM_REQ_APICV_UPDATE collides with such injection, the interrupt
may never get delivered.

Currently, the described situation is hardly possible: all
kvm_request_apicv_update() calls normally happen upon VM creation when
no interrupts are pending. We are, however, going to move unconditional
kvm_request_apicv_update() call from kvm_hv_activate_synic() to
synic_update_vector() and without this fix 'hyperv_connections' test from
kvm-unit-tests gets stuck on IPI delivery attempt right after configuring
a SynIC route which triggers APICv disablement.

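A minimal sketch of the fix, assuming it runs right after a vCPU's APICv
state is toggled, so software re-evaluates any interrupt that hardware
would otherwise have delivered:

  if (!vcpu->arch.apicv_active)
          kvm_make_request(KVM_REQ_EVENT, vcpu);
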
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:55 -04:00
Sean Christopherson
c906066288 KVM: x86: Drop pointless @reset_roots from kvm_init_mmu()
Remove the @reset_roots param from kvm_init_mmu(); the one user,
kvm_mmu_reset_context(), has already unloaded the MMU and thus freed and
invalidated all roots.  This also happens to be why the reset_roots=true
paths don't leak roots; they're already invalid.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
e62f1aa8b9 KVM: x86: Defer MMU sync on PCID invalidation
Defer the MMU sync on PCID invalidation so that multiple sync requests in
a single VM-Exit are batched.  This is a very minor optimization as
checking for unsync'd children is quite cheap.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
28f28d453f KVM: x86: Use KVM_REQ_TLB_FLUSH_GUEST to handle INVPCID(ALL) emulation
Use KVM_REQ_TLB_FLUSH_GUEST instead of KVM_REQ_MMU_RELOAD when emulating
INVPCID of all contexts.  In the current code, this is a glorified nop as
TLB_FLUSH_GUEST becomes kvm_mmu_unload(), same as MMU_RELOAD, when TDP
is disabled, which is the only time INVPCID is intercepted+emulated.
In the future, reusing TLB_FLUSH_GUEST will simplify optimizing paths
that emulate a guest TLB flush, e.g. by synchronizing as needed instead
of completely unloading all MMUs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
Sean Christopherson
b512910039 KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:52 -04:00
Sean Christopherson
415b1a0105 KVM: x86: Unconditionally skip MMU sync/TLB flush in MOV CR3's PGD switch
Stop leveraging the MMU sync and TLB flush requested by the fast PGD
switch helper now that kvm_set_cr3() manually handles the necessary sync,
frees, and TLB flush.  This will allow dropping the params from the fast
PGD helpers since nested SVM is now the odd blob out.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:52 -04:00
Sean Christopherson
21823fbda5 KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush
Flush and sync all PGDs for the current/target PCID on MOV CR3 with a
TLB flush, i.e. without PCID_NOFLUSH set.  Paraphrasing Intel's SDM
regarding the behavior of MOV to CR3:

  - If CR4.PCIDE = 0, invalidates all TLB entries associated with PCID
    000H and all entries in all paging-structure caches associated with
    PCID 000H.

  - If CR4.PCIDE = 1 and NOFLUSH=0, invalidates all TLB entries
    associated with the PCID specified in bits 11:0, and all entries in
    all paging-structure caches associated with that PCID. It is not
    required to invalidate entries in the TLBs and paging-structure
    caches that are associated with other PCIDs.

  - If CR4.PCIDE=1 and NOFLUSH=1, is not required to invalidate any TLB
    entries or entries in paging-structure caches.

Extract and reuse the logic for INVPCID(single) which is effectively the
same flow and works even if CR4.PCIDE=0, as the current PCID will be '0'
in that case, thus honoring the requirement of flushing PCID=0.

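A self-contained sketch of the resulting flush decision (constants per
the SDM; the helper names are illustrative, not upstream):

  #include <stdbool.h>
  #include <stdint.h>

  #define CR3_PCID_MASK    0xfffull
  #define CR3_PCID_NOFLUSH (1ull << 63)

  /* Which PCID a MOV to CR3 targets: PCID 0 whenever CR4.PCIDE=0. */
  static uint16_t mov_cr3_target_pcid(uint64_t cr3, bool cr4_pcide)
  {
          return cr4_pcide ? (uint16_t)(cr3 & CR3_PCID_MASK) : 0;
  }

  /* NOFLUSH=1 (meaningful only with CR4.PCIDE=1) waives the invalidation. */
  static bool mov_cr3_requires_flush(uint64_t cr3, bool cr4_pcide)
  {
          return !(cr4_pcide && (cr3 & CR3_PCID_NOFLUSH));
  }
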
Continue passing skip_tlb_flush to kvm_mmu_new_pgd() even though it
_should_ be redundant; the clean up will be done in a future patch.  The
overhead of an unnecessary nop sync is minimal (especially compared to
the actual sync), and the TLB flush is handled via request.  Avoiding
the negligible overhead is not worth the risk of breaking kernels that
backport the fix.

Fixes: 956bf3531f ("kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest")
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:52 -04:00
Sean Christopherson
07ffaf343e KVM: nVMX: Sync all PGDs on nested transition with shadow paging
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2.  kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.

Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.

Note, there is no change to the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT.  When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.

Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which request is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings.  And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().

Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Maxim Levitsky
158a48ecf7 KVM: x86: avoid loading PDPTRs after migration when possible
If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are part of the
migration state and are correctly restored by those ioctls.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Maxim Levitsky
6dba940352 KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
This is a new version of KVM_GET_SREGS / KVM_SET_SREGS.

It has the following changes:
   * Has flags for future extensions.
   * Includes the vCPU's PDPTRs, allowing them to be saved/restored on
     migration.
   * Lacks the obsolete interrupt bitmap (now handled via
     KVM_SET_VCPU_EVENTS).

A new capability, KVM_CAP_SREGS2, is added to signal to userspace that
these ioctls are available.

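A hedged userspace sketch of using the new pair for migration (struct
kvm_sregs2 carries the PDPTRs; error handling elided):

  struct kvm_sregs2 sregs2;

  ioctl(src_vcpu_fd, KVM_GET_SREGS2, &sregs2);  /* PDPTRs in sregs2.pdptrs[] */
  ioctl(dst_vcpu_fd, KVM_SET_SREGS2, &sregs2);  /* no separate PDPTR reload */
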
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:47 -04:00
Sean Christopherson
c7313155bf KVM: x86: Always load PDPTRs on CR3 load for SVM w/o NPT and a PAE guest
Kill off pdptrs_changed() and instead go through the full kvm_set_cr3()
for PAE guests, even if the new CR3 is the same as the current CR3.  For
VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as
unavailable after VM-Exit, i.e. the optimization is dead code except for
SVM without NPT.

In the unlikely scenario that anyone cares about SVM without NPT _and_ a
PAE guest, they've got bigger problems if their guest is loading the same
CR3 so frequently that the performance of kvm_set_cr3() is notable,
especially since KVM's fast PGD switching means reloading the same CR3
does not require a full rebuild.  Given that PAE and PCID are mutually
exclusive, i.e. a sync and flush are guaranteed in any case, the actual
benefits of the pdptrs_changed() optimization are marginal at best.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:46 -04:00
Vitaly Kuznetsov
644f706719 KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_ENFORCE_CPUID
Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
for limiting Hyper-V features to those exposed to the guest in Hyper-V
CPUIDs (0x40000003, 0x40000004, ...).

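A hedged userspace sketch of turning the new capability on for a vCPU
(error handling elided):

  struct kvm_enable_cap cap = {
          .cap  = KVM_CAP_HYPERV_ENFORCE_CPUID,
          .args = { 1 },
  };
  ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
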
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:38 -04:00
Vineeth Pillai
3c86c0d3db KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx
Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Krish Sadhukhan
d5a0483f9f KVM: nVMX: nSVM: Add a new VCPU statistic to show if VCPU is in guest mode
Add the following per-VCPU statistic to KVM debugfs to show if a given
VCPU is in guest mode:

	guest_mode

Also add this as a per-VM statistic to KVM debugfs to show the total number
of VCPUs that are in guest mode in a given VM.

Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-3-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Sean Christopherson
ecc513e5bb KVM: x86: Drop "pre_" from enter/leave_smm() helpers
Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers.  The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Sean Christopherson
0128116550 KVM: x86: Drop .post_leave_smm(), i.e. the manual post-RSM MMU reset
Drop the .post_leave_smm() emulator callback, which at this point is just
a wrapper to kvm_mmu_reset_context().  The manual context reset is
unnecessary, because unlike enter_smm() which calls vendor MSR/CR helpers
directly, em_rsm() bounces through the KVM helpers, e.g. kvm_set_cr4(),
which are responsible for processing side effects.  em_rsm() is already
subtly relying on this behavior as it doesn't manually do
kvm_update_cpuid_runtime(), e.g. to recognize CR4.OSXSAVE changes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Sean Christopherson
1270e647c8 KVM: x86: Rename SMM tracepoint to make it reflect reality
Rename the SMM tracepoint, which handles both entering and exiting SMM,
from kvm_enter_smm to kvm_smm_transition.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Sean Christopherson
0d7ee6f4b5 KVM: x86: Move "entering SMM" tracepoint into kvm_smm_changed()
Invoke the "entering SMM" tracepoint from kvm_smm_changed() instead of
enter_smm(), effectively moving it from before reading vCPU state to
after reading state (but still before writing it to SMRAM!).  The primary
motivation is to consolidate code, but calling the tracepoint from
kvm_smm_changed() also makes its invocation consistent with respect to
SMI and RSM, and with respect to KVM_SET_VCPU_EVENTS (which previously
only invoked the tracepoint when forcing the vCPU out of SMM).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:34 -04:00
Sean Christopherson
dc87275f47 KVM: x86: Move (most) SMM hflags modifications into kvm_smm_changed()
Move the core of SMM hflags modifications into kvm_smm_changed() and use
kvm_smm_changed() in enter_smm().  Clear HF_SMM_INSIDE_NMI_MASK for
leaving SMM but do not set it for entering SMM.  If the vCPU is executing
outside of SMM, the flag should unequivocally be cleared, e.g. this
technically fixes a benign bug where the flag could be left set after
KVM_SET_VCPU_EVENTS, but the reverse is not true as NMI blocking depends
on pre-SMM state or userspace input.

Note, this adds an extra kvm_mmu_reset_context() to enter_smm().  The
extra/early reset isn't strictly necessary, and in a way can never be
necessary since the vCPU/MMU context is in a half-baked state until the
final context reset at the end of the function.  But, enter_smm() is not
a hot path, and exploding on an invalid root_hpa is probably better than
having a stale SMM flag in the MMU role; it's at least no worse.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:34 -04:00
Sean Christopherson
fa75e08bbe KVM: x86: Invoke kvm_smm_changed() immediately after clearing SMM flag
Move RSM emulation's call to kvm_smm_changed() from .post_leave_smm() to
.exiting_smm(), leaving behind the MMU context reset.  The primary
motivation is to allow for future cleanup, but this also fixes a bug of
sorts by queueing KVM_REQ_EVENT even if RSM causes shutdown, e.g. to let
an INIT wake the vCPU from shutdown.  Of course, KVM doesn't properly
emulate a shutdown state, e.g. KVM doesn't block SMIs after shutdown, and
immediately exits to userspace, so the event request is a moot point in
practice.

Moving kvm_smm_changed() also moves the RSM tracepoint.  This isn't
strictly necessary, but will allow consolidating the SMI and RSM
tracepoints in a future commit (by also moving the SMI tracepoint).
Invoking the tracepoint before loading SMRAM state also means the SMBASE
reported in the tracepoint will reflect the state that will be used for
RSM, as opposed to the SMBASE _after_ RSM completes, which is arguably a
good thing if the tracepoint is being used to debug an RSM/SMM issue.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:34 -04:00
Sean Christopherson
edce46548b KVM: x86: Replace .set_hflags() with dedicated .exiting_smm() helper
Replace the .set_hflags() emulator hook with a dedicated .exiting_smm(),
moving the SMM and SMM_INSIDE_NMI flag handling out of the emulator in
the process.  This is a step towards consolidating much of the logic in
kvm_smm_changed(), including the SMM hflags updates.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:34 -04:00
Sean Christopherson
25b17226cd KVM: x86: Emulate triple fault shutdown if RSM emulation fails
Use the recently introduced KVM_REQ_TRIPLE_FAULT to properly emulate
shutdown if RSM from SMM fails.

Note, entering shutdown after clearing the SMM flag and restoring NMI
blocking is architecturally correct with respect to AMD's APM, which KVM
also uses for SMRAM layout and RSM NMI blocking behavior.  The APM says:

  An RSM causes a processor shutdown if an invalid-state condition is
  found in the SMRAM state-save area. Only an external reset, external
  processor-initialization, or non-maskable external interrupt (NMI) can
  cause the processor to leave the shutdown state.

Of note is processor-initialization (INIT) as a valid shutdown wake
event, as INIT is blocked by SMM, implying that entering shutdown also
forces the CPU out of SMM.

For recent Intel CPUs, restoring NMI blocking is technically wrong, but
so is restoring NMI blocking in the first place, and Intel's RSM
"architecture" is such a mess that just about anything is allowed and can
be justified as micro-architectural behavior.

Per the SDM:

  On Pentium 4 and later processors, shutdown will inhibit INTR and A20M
  but will not change any of the other inhibits. On these processors,
  NMIs will be inhibited if no action is taken in the SMI handler to
  uninhibit them (see Section 34.8).

where Section 34.8 says:

  When the processor enters SMM while executing an NMI handler, the
  processor saves the SMRAM state save map but does not save the
  attribute to keep NMI interrupts disabled. Potentially, an NMI could be
  latched (while in SMM or upon exit) and serviced upon exit of SMM even
  though the previous NMI handler has still not completed.

I.e. RSM unconditionally unblocks NMI, but shutdown on RSM does not,
which is in direct contradiction of KVM's behavior.  But, as mentioned
above, KVM follows AMD architecture and restores NMI blocking on RSM, so
that micro-architectural detail is already lost.

And for Pentium era CPUs, SMI# can break shutdown, meaning that at least
some Intel CPUs fully leave SMM when entering shutdown:

  In the shutdown state, Intel processors stop executing instructions
  until a RESET#, INIT# or NMI# is asserted.  While Pentium family
  processors recognize the SMI# signal in shutdown state, P6 family and
  Intel486 processors do not.

In other words, the fact that Intel CPUs have implemented the two
extremes gives KVM carte blanche when it comes to honoring Intel's
architecture for handling shutdown during RSM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-3-seanjc@google.com>
[Return X86EMUL_CONTINUE after triple fault. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov
4651fc56ba KVM: x86: Drop vendor specific functions for APICv/AVIC enablement
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov
fdf513e37a KVM: x86: Use common 'enable_apicv' variable for both APICv and AVIC
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Sergey Senozhatsky
7d62874f69 kvm: x86: implement KVM PM-notifier
Implement PM hibernation/suspend prepare notifiers so that KVM
can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly
suspend VMs.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Message-Id: <20210606021045.14159-2-senozhatsky@chromium.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Jim Mattson
4fe09bcf14 KVM: x86: Add a return code to kvm_apic_accept_events
No functional change intended. At present, the only negative value
returned by kvm_check_nested_events is -EBUSY.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-6-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:31 -04:00
Jim Mattson
a5f6909a71 KVM: x86: Add a return code to inject_pending_event
No functional change intended. At present, 'r' will always be -EBUSY
on a control transfer to the 'out' label.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-5-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:30 -04:00
Jim Mattson
d82ee28195 KVM: x86: Remove guest mode check from kvm_check_nested_events
A survey of the callsites reveals that they all ensure the vCPU is in
guest mode before calling kvm_check_nested_events. Remove this dead
code so that the only negative value this function returns (at the
moment) is -EBUSY.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-2-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:30 -04:00
Ilias Stamatis
1ab9287add KVM: X86: Add vendor callbacks for writing the TSC multiplier
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.

Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
edcfe54058 KVM: X86: Move write_l1_tsc_offset() logic to common code and rename it
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.

Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
83150f2932 KVM: X86: Add functions that calculate the nested TSC fields
When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 (L0->L1) and 12 (L1->L2) together.

The merging is done using the following equations:
  offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
  mult_02 = (mult_01 * mult_12) >> shift_bits

Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.

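A self-contained sketch of the two equations above (the kernel performs
the 128-bit multiply with mul_u64_u64_shr()/mul_s64_u64_shr(), and the
offsets are signed there, which this sketch glosses over):

  #include <stdint.h>

  /* 128-bit multiply then right shift, like the kernel's mul_u64_u64_shr(). */
  static uint64_t mul_shr(uint64_t a, uint64_t b, unsigned int shift)
  {
          return (uint64_t)(((unsigned __int128)a * b) >> shift);
  }

  /* offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12 */
  static uint64_t nested_tsc_offset(uint64_t offset_01, uint64_t offset_12,
                                    uint64_t mult_12, unsigned int shift_bits)
  {
          return mul_shr(offset_01, mult_12, shift_bits) + offset_12;
  }

  /* mult_02 = (mult_01 * mult_12) >> shift_bits */
  static uint64_t nested_tsc_multiplier(uint64_t mult_01, uint64_t mult_12,
                                        unsigned int shift_bits)
  {
          return mul_shr(mult_01, mult_12, shift_bits);
  }
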
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-8-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
fe3eb50418 KVM: X86: Add a ratio parameter to kvm_scale_tsc()
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.
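
A hedged sketch of the parameterized helper and a typical caller (the
l1_tsc_scaling_ratio field is introduced later in this series):

  u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, u64 ratio)
  {
          u64 _tsc = tsc;

          if (ratio != kvm_default_tsc_scaling_ratio)
                  _tsc = mul_u64_u64_shr(tsc, ratio,
                                         kvm_tsc_scaling_ratio_frac_bits);

          return _tsc;
  }

  /* e.g. reading the TSC for user space scales by L1's ratio: */
  tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio);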

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-5-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
9b399dfd4c KVM: X86: Rename kvm_compute_tsc_offset() to kvm_compute_l1_tsc_offset()
All existing code uses kvm_compute_tsc_offset() passing L1 TSC values to
it. Let's document this by renaming it to kvm_compute_l1_tsc_offset().

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-4-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
805d705ff8 KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Ben Gardon
d501f747ef KVM: x86/mmu: Lazily allocate memslot rmaps
If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used (i.e., a nested VM is launched). This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.
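
A hedged sketch of the lazy path (the function names are assumptions
based on this series' description):

  /* On first shadow-MMU use, e.g. nested VM launch, back-fill rmaps
   * for every existing memslot; a no-op once they are allocated. */
  if (!kvm_memslots_have_rmaps(kvm))
          r = alloc_all_memslots_rmaps(kvm);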

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Ben Gardon
e2209710cc KVM: x86/mmu: Skip rmap operations if rmaps not allocated
If only the TDP MMU is being used to manage the memory mappings for a VM,
then many rmap operations can be skipped as they are guaranteed to be
no-ops. This saves the time that would otherwise be spent on the rmap operations.
It also avoids acquiring the MMU lock in write mode for many operations.

This makes it safe to run the VM without rmaps allocated, when only
using the TDP MMU and sets the stage for waiting to allocate the rmaps
until they're needed.
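
One way to picture the guard (a sketch; the predicate name follows the
series but is recalled from memory):

  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
  {
          return kvm->arch.memslots_have_rmaps;
  }

  /* Typical use in an rmap walker: skip the guaranteed no-op. */
  if (!kvm_memslots_have_rmaps(kvm))
          return flush;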

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Ben Gardon
a255740876 KVM: x86/mmu: Add a field to control memslot rmap allocation
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
operation.

No functional change expected.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Ben Gardon
56dd1019c8 KVM: x86/mmu: Factor out allocating memslot rmap
Small refactor to facilitate allocating rmaps for all memslots at once.

No functional change expected.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-3-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:25 -04:00
Ben Gardon
c9b929b3fa KVM: x86/mmu: Deduplicate rmap freeing
Small code deduplication. No functional change expected.

Reviewed-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:25 -04:00
Keqian Zhu
8921291980 KVM: x86: Do not write protect huge page in initially-all-set mode
Currently, when dirty logging is started in initially-all-set mode,
we write protect huge pages to prepare for splitting them into
4K pages, and leave normal pages untouched as the logging will
be enabled lazily as dirty bits are cleared.

However, enabling dirty logging lazily is also feasible for huge pages.
This not only reduces the time needed to start dirty logging, but also
greatly reduces the side effects on the guest when the dirty rate is high.
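
Schematically (a hedged paraphrase, not the literal diff), the
start-of-logging path stops special-casing huge pages:

  /* Initially-all-set: do nothing eagerly; huge pages, like 4K pages,
   * are write-protected lazily as KVM_CLEAR_DIRTY_LOG clears bits. */
  if (READ_ONCE(kvm->manual_dirty_log_protect) & KVM_DIRTY_LOG_INITIALLY_SET)
          return;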

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210429034115.35560-3-zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:25 -04:00
Wanpeng Li
dfdc0a714d KVM: X86: Fix x86_emulator slab cache leak
Commit c9b8b07cde (KVM: x86: Dynamically allocate per-vCPU emulation context)
allocates the per-vCPU emulation context dynamically; however, the
x86_emulator slab cache still exists after the VM is destroyed and the
kvm module is unloaded, as shown below.

grep x86_emulator /proc/slabinfo
x86_emulator          36     36   2672   12    8 : tunables    0    0    0 : slabdata      3      3      0

This patch fixes this slab cache leak by destroying the x86_emulator slab cache
when the kvm module is unloaded.
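
The shape of the fix, as a sketch (the cache variable name is taken
from the message above):

  /* Module teardown, e.g. in kvm_arch_exit(): free the emulator cache. */
  kmem_cache_destroy(x86_emulator_cache);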

Fixes: c9b8b07cde (KVM: x86: Dynamically allocate per-vCPU emulation context)
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-11 11:53:48 -04:00
Sean Christopherson
78fcb2c91a KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
Immediately reset the MMU context when the vCPU's SMM flag is cleared so
that the SMM flag in the MMU role is always synchronized with the vCPU's
flag.  If RSM fails (which isn't correctly emulated), KVM will bail
without calling post_leave_smm() and leave the MMU in a bad state.

The bad MMU role can lead to a NULL pointer dereference when grabbing a
shadow page's rmap for a page fault as the initial lookups for the gfn
will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
an entirely different set of memslots, and so the initial lookup can find
a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).

  general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
  RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
  Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
  RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
  RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
  R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
  R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
  FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
   mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
   __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
   direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
   kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
   kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
   vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
   vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
   vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
   kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
   kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:1069 [inline]
   __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
   do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x440ce9
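
The essence of the fix, as a hedged sketch (flag and helper names are
real kernel symbols; the surrounding context is abbreviated):

  /* When leaving SMM, clear the flag and resynchronize the MMU role
   * immediately, so a later RSM failure cannot leave role.smm stale. */
  vcpu->arch.hflags &= ~HF_SMM_MASK;
  kvm_mmu_reset_context(vcpu);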

Cc: stable@vger.kernel.org
Reported-by: syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com
Fixes: 9ec19493fb ("KVM: x86: clear SMM flags before loading state while leaving SMM")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-10 09:21:12 -04:00
Linus Torvalds
2f673816b2 Bugfixes, including a TLB flush fix that affects processors
without nested page tables.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDAVpQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkOgf9F97eFxAdod3/wbW9EbsUPR5bMTLE
 +R6Hmvw+yCm/W2cycVGdCSh1BEKNuZN/XfHln2cYVfVr6ndog58A4Y0urFAhTROv
 IHs8TCA5biQitoZ716l88ExOitnqJiSmMhGex969+zm1Lb9MQo1KA/zxERlqCi3s
 Pfcxb6I8VbD9LEb6NaQdDgQoslJo1tzhe9gGYAYrpMOZujpj1RPeIOZIfeII0MP/
 g14/JSar8cXc9QJ6zbiKn8HhpmzGJnaIsyFFL2RMIBlKvxsnpOU6VmisLTL9407o
 P246Vq59BM8pdRCVUW9W9hLr2ho8lmi+ZYXASCm+qfn8cLaHyRCqSK56ZQ==
 =nW43
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Bugfixes, including a TLB flush fix that affects processors without
  nested page tables"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: fix previous commit for 32-bit builds
  kvm: avoid speculation-based attacks from out-of-range memslot accesses
  KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
  KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
  selftests: kvm: Add support for customized slot0 memory size
  KVM: selftests: introduce P47V64 for s390x
  KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
  KVM: X86: MMU: Use the correct inherited permissions to get shadow page
  KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer
  KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca821c
2021-06-09 13:09:57 -07:00
Lai Jiangshan
b53e84eed0 KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
When using shadow paging, unload the guest MMU when emulating a guest TLB
flush to ensure all roots are synchronized.  From the guest's perspective,
flushing the TLB ensures any and all modifications to its PTEs will be
recognized by the CPU.

Note, unloading the MMU is overkill, but is done to mirror KVM's existing
handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
can be done to more precisely synchronize roots when servicing a guest
TLB flush.

If TDP is enabled, synchronizing the MMU is unnecessary even if nested
TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
guest-physical mappings; for NPT, guest mappings are always tagged with
an ASID and thus can only be invalidated via the VMCB's ASID control.

This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
It was only recently exposed after Linux guests stopped flushing the
local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac,
"x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
visible in Windows 10 guests.
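
A hedged sketch of the resulting flush path (recalled from the change,
not quoted verbatim):

  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
  {
          ++vcpu->stat.tlb_flush;

          if (!tdp_enabled) {
                  /* Shadow paging: unload so the next MMU load syncs
                   * all roots and flushes the TLB. */
                  kvm_mmu_unload(vcpu);
                  return;
          }

          static_call(kvm_x86_tlb_flush_guest)(vcpu);
  }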

Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: f38a7b7526 ("KVM: X86: support paravirtualized help for TLB shootdowns")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
[sean: massaged comment and changelog]
Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 17:10:21 -04:00
Lai Jiangshan
af3511ff7f KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
In record_steal_time(), st->preempted is read twice, so
trace_kvm_pv_tlb_flush() might output an inconsistent result if
kvm_vcpu_flush_tlb_guest() sees a different st->preempted later.

It is a very trivial problem that causes hardly any actual harm, and it
can be avoided by resetting and reading st->preempted atomically via xchg().
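
A hedged sketch of the xchg()-based variant (field and flag names as
used elsewhere in KVM's steal-time code):

  /* Read and clear st->preempted in one atomic step so the tracepoint
   * and the flush decision observe the same value. */
  u8 st_preempted = xchg(&st->preempted, 0);

  trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
                         st_preempted & KVM_VCPU_FLUSH_TLB);
  if (st_preempted & KVM_VCPU_FLUSH_TLB)
          kvm_vcpu_flush_tlb_guest(vcpu);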

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 13:15:20 -04:00
Linus Torvalds
224478289c ARM fixes:
* Another state update on exit to userspace fix
 
 * Prevent the creation of mixed 32/64 VMs
 
 * Fix regression with irqbypass not restarting the guest on failed connect
 
 * Fix regression with debug register decoding resulting in overlapping access
 
 * Commit exception state on exit to userspace
 
 * Fix the MMU notifier return values
 
 * Add missing 'static' qualifiers in the new host stage-2 code
 
 x86 fixes:
 * fix guest missed wakeup with assigned devices
 
 * fix WARN reported by syzkaller
 
 * do not use BIT() in UAPI headers
 
 * make the kvm_amd.avic parameter bool
 
 PPC fixes:
 * make halt polling heuristics consistent with other architectures
 
 selftests:
 * various fixes
 
 * new performance selftest memslot_perf_test
 
 * test UFFD minor faults in demand_paging_test
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCyF0MUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOHSgf/Q4Hm5e12Bj2xJy6A+iShnrbbT8PW
 hcIIOA7zGWXfjVYcBV7anbj7CcpzfIz0otcRBABa5mkhj+fb3YmPEb0EzCPi4Hru
 zxpcpB2w7W7WtUOIKe2EmaT+4Pk6/iLcfr8UMHMqx460akE9OmIg10QNWai3My/3
 RIOeakSckBI9e/1TQZbxH66dsLwCT0lLco7i7AWHdFxkzUQyoA34HX5pczOCBsO5
 3nXH+/txnRVhqlcyzWLVVGVzFqmpHtBqkIInDOXfUqIoxo/gOhOgF1QdMUEKomxn
 5ZFXlL5IXNtr+7yiI67iHX7CWkGZE9oJ04TgPHn6LR6wRnVvc3JInzcB5Q==
 =ollO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM fixes:

   - Another state update on exit to userspace fix

   - Prevent the creation of mixed 32/64 VMs

   - Fix regression with irqbypass not restarting the guest on failed
     connect

   - Fix regression with debug register decoding resulting in
     overlapping access

   - Commit exception state on exit to userspace

   - Fix the MMU notifier return values

   - Add missing 'static' qualifiers in the new host stage-2 code

  x86 fixes:

   - fix guest missed wakeup with assigned devices

   - fix WARN reported by syzkaller

   - do not use BIT() in UAPI headers

   - make the kvm_amd.avic parameter bool

  PPC fixes:

   - make halt polling heuristics consistent with other architectures

  selftests:

   - various fixes

   - new performance selftest memslot_perf_test

   - test UFFD minor faults in demand_paging_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
  selftests: kvm: fix overlapping addresses in memslot_perf_test
  KVM: X86: Kill off ctxt->ud
  KVM: X86: Fix warning caused by stale emulation context
  KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
  KVM: x86/mmu: Fix comment mentioning skip_4k
  KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
  KVM: x86: add start_assignment hook to kvm_x86_ops
  KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
  selftests: kvm: do only 1 memslot_perf_test run by default
  KVM: X86: Use _BITUL() macro in UAPI headers
  KVM: selftests: add shared hugetlbfs backing source type
  KVM: selftests: allow using UFFD minor faults for demand paging
  KVM: selftests: create alias mappings when using shared memory
  KVM: selftests: add shmem backing source type
  KVM: selftests: refactor vm_mem_backing_src_type flags
  KVM: selftests: allow different backing source types
  KVM: selftests: compute correct demand paging size
  KVM: selftests: simplify setup_demand_paging error handling
  KVM: selftests: Print a message if /dev/kvm is missing
  ...
2021-05-29 06:02:25 -10:00
Wanpeng Li
b35491e66c KVM: X86: Kill off ctxt->ud
ctxt->ud is consumed only by x86_decode_insn(), so we can kill it off by
passing emulation_type to x86_decode_insn() and dropping ctxt->ud
altogether. Tracking that info in ctxt for literally one call is silly.
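
Sketch of the resulting shape (hedged; the decoder-side check is
abbreviated):

  /* Caller side: pass the emulation type straight to the decoder. */
  r = x86_decode_insn(ctxt, insn, insn_len, emulation_type);

  /* Decoder side: the old "if (ctxt->ud)" test becomes roughly: */
  if ((emulation_type & EMULTYPE_TRAP_UD) && !(ctxt->d & EmulateOnUD))
          return EMULATION_FAILED;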

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
2021-05-28 12:59:10 -04:00
Wanpeng Li
da6393cdd8 KVM: X86: Fix warning caused by stale emulation context
Reported by syzkaller:

  WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
  RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
  Call Trace:
   kvm_mmu_page_fault+0x126/0x8f0 [kvm]
   vmx_handle_exit+0x11e/0x680 [kvm_intel]
   vcpu_enter_guest+0xd95/0x1b40 [kvm]
   kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
   kvm_vcpu_ioctl+0x389/0x630 [kvm]
   __x64_sys_ioctl+0x8e/0xd0
   do_syscall_64+0x3c/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Commit 4a1e10d5b5 ("KVM: x86: handle hardware breakpoints during emulation")
added a hardware breakpoint check, plus part of the emulation context
initialization, before the instruction is emulated; the EMULTYPE_NO_DECODE
flag is not set on this path, so the emulation context will not be reused.
Commit c8848cee74 ("KVM: x86: set ctxt->have_exception in x86_decode_insn()")
triggers the warning because it catches the stale emulation context with #UD
set even though we are not in instruction decoding, which should instead
result in EMULATION_FAILED. Fix this by moving the second part of the
emulation context initialization into init_emulate_ctxt(), before the
hardware breakpoint check. ctxt->ud will be dropped by a follow-up patch.

syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000

Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com
Fixes: 4a1e10d5b5 (KVM: x86: handle hardware breakpoints during emulation)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
2021-05-28 12:59:09 -04:00
Yuan Yao
e87e46d5f3 KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
kvm_get_linear_rip() handles the x86/long mode cases well and is more
readable. __kvm_set_rflags() already uses the paired function
kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip set in
kvm_arch_vcpu_ioctl_set_guest_debug(), so change the "CS.BASE + RIP"
code in kvm_arch_vcpu_ioctl_set_guest_debug() and handle_exception_nmi()
to use this helper.
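
For comparison, a sketch of the before/after (kvm_get_linear_rip() is
the real helper; surrounding context is abbreviated):

  /* Before: open-coded linear RIP. */
  unsigned long rip = kvm_rip_read(vcpu) +
                      get_segment_base(vcpu, VCPU_SREG_CS);

  /* After: the helper handles protected and long mode uniformly. */
  unsigned long rip = kvm_get_linear_rip(vcpu);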

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-28 12:57:53 -04:00
Marcelo Tosatti
084071d5e9 KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
KVM_REQ_UNBLOCK will be used to exit a vcpu from
its inner vcpu halt emulation loop.

Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK and switch PowerPC to
an arch-specific request bit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:57:38 -04:00
Marcelo Tosatti
57ab87947a KVM: x86: add start_assignment hook to kvm_x86_ops
Add a start_assignment hook to kvm_x86_ops, which is called when
kvm_arch_start_assignment is done.

The hook is required to update the wakeup vector of a sleeping vCPU
when a device is assigned to the guest.
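
A hedged sketch of the wiring (the hook name is from this patch; the
counter logic mirrors the existing helper):

  void kvm_arch_start_assignment(struct kvm *kvm)
  {
          if (atomic_inc_return(&kvm->arch.assigned_device_count) == 1)
                  static_call_cond(kvm_x86_start_assignment)(kvm);
  }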

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20210525134321.254128742@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:50:13 -04:00