Commit Graph

1463 Commits

Author SHA1 Message Date
Joe Perches
1d804d079a x86: Use bool function return values of true/false not 1/0
Use the normal return values for bool functions

Signed-off-by: Joe Perches <joe@perches.com>
Message-Id: <9f593eb2f43b456851cd73f7ed09654ca58fb570.1427759009.git.joe@perches.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-03-31 18:05:09 +02:00
Nadav Amit
b32a991800 KVM: x86: Remove redundant definitions
Some constants are redfined in emulate.c. Avoid it.

s/SELECTOR_RPL_MASK/SEGMENT_RPL_MASK
s/SELECTOR_TI_MASK/SEGMENT_TI_MASK

No functional change.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427635984-8113-3-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-03-30 16:46:42 +02:00
Jan Kiszka
b3a2a9076d KVM: nVMX: Add support for rdtscp
If the guest CPU is supposed to support rdtscp and the host has rdtscp
enabled in the secondary execution controls, we can also expose this
feature to L1. Just extend nested_vmx_exit_handled to properly route
EXIT_REASON_RDTSCP.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-26 22:33:48 -03:00
Nikolay Nikolaev
e32edf4fd0 KVM: Redesign kvm_io_bus_ API to pass VCPU structure to the callbacks.
This is needed in e.g. ARM vGIC emulation, where the MMIO handling
depends on the VCPU that does the access.

Signed-off-by: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-03-26 21:43:11 +00:00
Radim Krčmář
0790ec172d KVM: nVMX: mask unrestricted_guest if disabled on L0
If EPT was enabled, unrestricted_guest was allowed in L1 regardless of
L0.  L1 triple faulted when running L2 guest that required emulation.

Another side effect was 'WARN_ON_ONCE(vmx->nested.nested_run_pending)'
in L0's dmesg:
  WARNING: CPU: 0 PID: 0 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel] ()

Prevent this scenario by masking SECONDARY_EXEC_UNRESTRICTED_GUEST when
the host doesn't have it enabled.

Fixes: 78051e3b7e ("KVM: nVMX: Disable unrestricted mode if ept=0")
Cc: stable@vger.kernel.org
Tested-By: Kashyap Chamarthy <kchamart@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-17 22:09:17 -03:00
Jan Kiszka
ae1f576707 KVM: nVMX: Do not emulate #UD while in guest mode
While in L2, leave all #UD to L2 and do not try to emulate it. If L1 is
interested in doing this, it reports its interest via the exception
bitmap, and we never get into handle_exception of L0 anyway.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-13 13:44:43 -03:00
Wincy Van
670125bda1 KVM: VMX: Set msr bitmap correctly if vcpu is in guest mode
In commit 3af18d9c5f ("KVM: nVMX: Prepare for using hardware MSR bitmap"),
we are setting MSR_BITMAP in prepare_vmcs02 if we should use hardware. This
is not enough since the field will be modified by following vmx_set_efer.

Fix this by setting vmx_msr_bitmap_nested in vmx_set_msr_bitmap if vcpu is
in guest mode.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-13 09:24:51 -03:00
Joel Schopp
5cb56059c9 kvm: x86: make kvm_emulate_* consistant
Currently kvm_emulate() skips the instruction but kvm_emulate_* sometimes
don't.  The end reult is the caller ends up doing the skip themselves.
Let's make them consistant.

Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10 20:29:15 -03:00
Radim Krčmář
21bc8dc5b7 KVM: VMX: fix build without CONFIG_SMP
'apic' is not defined if !CONFIG_X86_64 && !CONFIG_X86_LOCAL_APIC.
Posted interrupt makes no sense without CONFIG_SMP, and
CONFIG_X86_LOCAL_APIC will be set with it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-23 22:28:48 +01:00
Linus Torvalds
37507717de Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf updates from Ingo Molnar:
 "This series tightens up RDPMC permissions: currently even highly
  sandboxed x86 execution environments (such as seccomp) have permission
  to execute RDPMC, which may leak various perf events / PMU state such
  as timing information and other CPU execution details.

  This 'all is allowed' RDPMC mode is still preserved as the
  (non-default) /sys/devices/cpu/rdpmc=2 setting.  The new default is
  that RDPMC access is only allowed if a perf event is mmap-ed (which is
  needed to correctly interpret RDPMC counter values in any case).

  As a side effect of these changes CR4 handling is cleaned up in the
  x86 code and a shadow copy of the CR4 value is added.

  The extra CR4 manipulation adds ~ <50ns to the context switch cost
  between rdpmc-capable and rdpmc-non-capable mms"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks
  perf/x86: Only allow rdpmc if a perf_event is mapped
  perf: Pass the event to arch_perf_update_userpage()
  perf: Add pmu callbacks to track event mapping and unmapping
  x86: Add a comment clarifying LDT context switching
  x86: Store a per-cpu shadow copy of CR4
  x86: Clean up cr4 manipulation
2015-02-16 14:58:12 -08:00
Radim Krčmář
dab2087def KVM: x86: fix build with !CONFIG_SMP
<asm/apic.h> isn't included directly and without CONFIG_SMP, an option
that automagically pulls it can't be enabled.

Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-10 08:53:18 +01:00
Andy Lutomirski
1e02ce4ccc x86: Store a per-cpu shadow copy of CR4
Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.

To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm.  The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:42 +01:00
Andy Lutomirski
375074cc73 x86: Clean up cr4 manipulation
CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4).  Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.

This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:41 +01:00
Wincy Van
705699a139 KVM: nVMX: Enable nested posted interrupt processing
If vcpu has a interrupt in vmx non-root mode, injecting that interrupt
requires a vmexit.  With posted interrupt processing, the vmexit
is not needed, and interrupts are fully taken care of by hardware.
In nested vmx, this feature avoids much more vmexits than non-nested vmx.

When L1 asks L0 to deliver L1's posted interrupt vector, and the target
VCPU is in non-root mode, we use a physical ipi to deliver POSTED_INTR_NV
to the target vCPU.  Using POSTED_INTR_NV avoids unexpected interrupts
if a concurrent vmexit happens and L1's vector is different with L0's.
The IPI triggers posted interrupt processing in the target physical CPU.

In case the target vCPU was not in guest mode, complete the posted
interrupt delivery on the next entry to L2.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:15:08 +01:00
Wincy Van
608406e290 KVM: nVMX: Enable nested virtual interrupt delivery
With virtual interrupt delivery, the hardware lets KVM use a more
efficient mechanism for interrupt injection. This is an important feature
for nested VMX, because it reduces vmexits substantially and they are
much more expensive with nested virtualization.  This is especially
important for throughput-bound scenarios.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:07:38 +01:00
Wincy Van
82f0dd4b27 KVM: nVMX: Enable nested apic register virtualization
We can reduce apic register virtualization cost with this feature,
it is also a requirement for virtual interrupt delivery and posted
interrupt processing.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:07:03 +01:00
Wincy Van
b9c237bb1d KVM: nVMX: Make nested control MSRs per-cpu
To enable nested apicv support, we need per-cpu vmx
control MSRs:
  1. If in-kernel irqchip is enabled, we can enable nested
     posted interrupt, we should set posted intr bit in
     the nested_vmx_pinbased_ctls_high.
  2. If in-kernel irqchip is disabled, we can not enable
     nested posted interrupt, the posted intr bit
     in the nested_vmx_pinbased_ctls_high will be cleared.

Since there would be different settings about in-kernel
irqchip between VMs, different nested control MSRs
are needed.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:06:51 +01:00
Wincy Van
f2b93280ed KVM: nVMX: Enable nested virtualize x2apic mode
When L2 is using x2apic, we can use virtualize x2apic mode to
gain higher performance, especially in apicv case.

This patch also introduces nested_vmx_check_apicv_controls
for the nested apicv patches.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:06:17 +01:00
Wincy Van
3af18d9c5f KVM: nVMX: Prepare for using hardware MSR bitmap
Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all
of L2's msr access is intercepted by L0.  Features like "virtualize
x2apic mode" require that the MSR bitmap is enabled, or the hardware
will exit and for example not virtualize the x2apic MSRs.  In order to
let L1 use these features, we need to build a merged bitmap that only
not cause a VMEXIT if 1) L1 requires that 2) the bit is not required by
the processor for APIC virtualization.

For now the guests are still run with MSR bitmap disabled, but this
patch already introduces nested_vmx_merge_msr_bitmap for future use.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:02:32 +01:00
Marcelo Tosatti
2e6d015799 KVM: x86: revert "add method to test PIR bitmap vector"
Revert 7c6a98dfa1, given
that testing PIR is not necessary anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-02 18:36:34 +01:00
Paolo Bonzini
ad15a29647 kvm: vmx: fix oops with explicit flexpriority=0 option
A function pointer was not NULLed, causing kvm_vcpu_reload_apic_access_page to
go down the wrong path and OOPS when doing put_page(NULL).

This did not happen on old processors, only when setting the module option
explicitly.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 16:18:49 +01:00
Kai Huang
843e433057 KVM: VMX: Add PML support in VMX
This patch adds PML support in VMX. A new module parameter 'enable_pml' is added
to allow user to enable/disable it manually.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 09:39:54 +01:00
Rickard Strandqvist
0c55d6d931 x86: kvm: vmx: Remove some unused functions
Removes some functions that are not used anywhere:
cpu_has_vmx_eptp_writeback() cpu_has_vmx_eptp_uncacheable()

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-19 11:09:36 +01:00
Paolo Bonzini
ad896af0b5 KVM: x86: mmu: remove argument to kvm_init_shadow_mmu and kvm_init_shadow_ept_mmu
The initialization function in mmu.c can always use walk_mmu, which
is known to be vcpu->arch.mmu.  Only init_kvm_nested_mmu is used to
initialize vcpu->arch.nested_mmu.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:02 +01:00
Marcelo Tosatti
7c6a98dfa1 KVM: x86: add method to test PIR bitmap vector
kvm_x86_ops->test_posted_interrupt() returns true/false depending
whether 'vector' is set.

Next patch makes use of this interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:18 +01:00
Tiejun Chen
b4eef9b36d kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicv
In most cases calling hwapic_isr_update(), we always check if
kvm_apic_vid_enabled() == 1, but actually,
kvm_apic_vid_enabled()
    -> kvm_x86_ops->vm_has_apicv()
        -> vmx_vm_has_apicv() or '0' in svm case
            -> return enable_apicv && irqchip_in_kernel(kvm)

So its a little cost to recall vmx_vm_has_apicv() inside
hwapic_isr_update(), here just NULL out hwapic_isr_update() in
case of !enable_apicv inside hardware_setup() then make all
related stuffs follow this. Note we don't check this under that
condition of irqchip_in_kernel() since we should make sure
definitely any caller don't work  without in-kernel irqchip.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:17 +01:00
Eugene Korenevsky
19d5f10b3a KVM: nVMX: consult PFEC_MASK and PFEC_MATCH when generating #PF VM-exit
When generating #PF VM-exit, check equality:
(PFEC & PFEC_MASK) == PFEC_MATCH
If there is equality, the 14 bit of exception bitmap is used to take decision
about generating #PF VM-exit. If there is inequality, inverted 14 bit is used.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:16 +01:00
Eugene Korenevsky
e9ac033e6b KVM: nVMX: Improve nested msr switch checking
This patch improve checks required by Intel Software Developer Manual.
 - SMM MSRs are not allowed.
 - microcode MSRs are not allowed.
 - check x2apic MSRs only when LAPIC is in x2apic mode.
 - MSR switch areas must be aligned to 16 bytes.
 - address of first and last byte in MSR switch areas should not set any bits
   beyond the processor's physical-address width.

Also it adds warning messages on failures during MSR switch. These messages
are useful for people who debug their VMMs in nVMX.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:15 +01:00
Wincy Van
ff651cb613 KVM: nVMX: Add nested msr load/restore algorithm
Several hypervisors need MSR auto load/restore feature.
We read MSRs from VM-entry MSR load area which specified by L1,
and load them via kvm_set_msr in the nested entry.
When nested exit occurs, we get MSRs via kvm_get_msr, writing
them to L1`s MSR store area. After this, we read MSRs from VM-exit
MSR load area, and load them via kvm_set_msr.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:14 +01:00
Tiejun Chen
baa035227b kvm: x86: vmx: reorder some msr writing
The commit 34a1cd60d1, "x86: vmx: move some vmx setting from
vmx_init() to hardware_setup()", tried to refactor some codes
specific to vmx hardware setting into hardware_setup(), but some
msr writing should depend on our previous setting condition like
enable_apicv, enable_ept and so on.

Reported-by: Jamie Heilman <jamie@audible.transient.net>
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-27 21:52:10 +01:00
Bandan Das
78051e3b7e KVM: nVMX: Disable unrestricted mode if ept=0
If L0 has disabled EPT, don't advertise unrestricted
mode at all since it depends on EPT to run real mode code.

Fixes: 92fbc7b195
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-11 12:26:15 +01:00
Wanpeng Li
81dc01f749 kvm: vmx: add nested virtualization support for xsaves
Add nested virtualization support for xsaves.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:44 +01:00
Wanpeng Li
203000993d kvm: vmx: add MSR logic for XSAVES
Add logic to get/set the XSS model-specific register.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:39 +01:00
Wanpeng Li
f53cd63c2d kvm: x86: handle XSAVES vmcs and vmexit
Initialize the XSS exit bitmap.  It is zero so there should be no XSAVES
or XRSTORS exits.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:33 +01:00
Wanpeng Li
55412b2eda kvm: x86: Add kvm_x86_ops hook that enables XSAVES for guest
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is
available.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:16 +01:00
Tiejun Chen
81ed33e4aa kvm: x86: vmx: cleanup handle_ept_violation
Instead, just use PFERR_{FETCH, PRESENT, WRITE}_MASK
inside handle_ept_violation() for slightly better code.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-18 11:07:53 +01:00
Andy Lutomirski
54b98bff8e x86, kvm, vmx: Don't set LOAD_IA32_EFER when host and guest match
There's nothing to switch if the host and guest values are the same.
I am unable to find evidence that this makes any difference
whatsoever.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[I could see a difference on Nehalem.  From 5 runs:

 userspace exit, guest!=host   12200 11772 12130 12164 12327
 userspace exit, guest=host    11983 11780 11920 11919 12040
 lightweight exit, guest!=host  3214  3220  3238  3218  3337
 lightweight exit, guest=host   3178  3193  3193  3187  3220

 This passes the t-test with 99% confidence for userspace exit,
 98.5% confidence for lightweight exit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-12 16:27:21 +01:00
Andy Lutomirski
f6577a5fa1 x86, kvm, vmx: Always use LOAD_IA32_EFER if available
At least on Sandy Bridge, letting the CPU switch IA32_EFER is much
faster than switching it manually.

I benchmarked this using the vmexit kvm-unit-test (single run, but
GOAL multiplied by 5 to do more iterations):

Test                                  Before      After    Change
cpuid                                   2000       1932    -3.40%
vmcall                                  1914       1817    -5.07%
mov_from_cr8                              13         13     0.00%
mov_to_cr8                                19         19     0.00%
inl_from_pmtimer                       19164      10619   -44.59%
inl_from_qemu                          15662      10302   -34.22%
inl_from_kernel                         3916       3802    -2.91%
outl_to_kernel                          2230       2194    -1.61%
mov_dr                                   172        176     2.33%
ipi                                (skipped)  (skipped)
ipi+halt                           (skipped)  (skipped)
ple-round-robin                           13         13     0.00%
wr_tsc_adjust_msr                       1920       1845    -3.91%
rd_tsc_adjust_msr                       1892       1814    -4.12%
mmio-no-eventfd:pci-mem                16394      11165   -31.90%
mmio-wildcard-eventfd:pci-mem           4607       4645     0.82%
mmio-datamatch-eventfd:pci-mem          4601       4610     0.20%
portio-no-eventfd:pci-io               11507       7942   -30.98%
portio-wildcard-eventfd:pci-io          2239       2225    -0.63%
portio-datamatch-eventfd:pci-io         2250       2234    -0.71%

I haven't explicitly computed the significance of these numbers,
but this isn't subtle.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[The results were reproducible on all of Nehalem, Sandy Bridge and
 Ivy Bridge.  The slowness of manual switching is because writing
 to EFER with WRMSR triggers a TLB flush, even if the only bit you're
 touching is SCE (so the page table format is not affected).  Doing
 the write as part of vmentry/vmexit, instead, does not flush the TLB,
 probably because all processors that have EPT also have VPID. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-12 12:35:02 +01:00
Nadav Amit
82b32774c2 KVM: x86: Breakpoints do not consider CS.base
x86 debug registers hold a linear address. Therefore, breakpoints detection
should consider CS.base, and check whether instruction linear address equals
(CS.base + RIP). This patch introduces a function to evaluate RIP linear
address and uses it for breakpoints detection.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:04 +01:00
Nadav Amit
7305eb5d8c KVM: x86: Clear DR6[0:3] on #DB during handle_dr
DR6[0:3] (previous breakpoint indications) are cleared when #DB is injected
during handle_exception, just as real hardware does.  Similarily, handle_dr
should clear DR6[0:3].

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:03 +01:00
Wei Wang
4114c27d45 KVM: x86: reset RVI upon system reset
A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm,
sometimes the guests run into blue screen during reboot. The problem was that a
guest's RVI was not cleared when it rebooted. This patch has fixed the problem.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Tested-by: Rongrong Liu <rongrongx.liu@intel.com>, Da Chun <ngugc@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:00 +01:00
Paolo Bonzini
a2ae9df7c9 kvm: x86: vmx: avoid returning bool to distinguish success from error
Return a negative error code instead, and WARN() when we should be covering
the entire 2-bit space of vmcs_field_type's return value.  For increased
robustness, add a BUILD_BUG_ON checking the range of vmcs_field_to_offset.

Suggested-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:00 +01:00
Tiejun Chen
34a1cd60d1 kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()
Instead of vmx_init(), actually it would make reasonable sense to do
anything specific to vmx hardware setting in vmx_x86_ops->hardware_setup().

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:43:59 +01:00
Tiejun Chen
f2c7648d91 kvm: x86: vmx: move down hardware_setup() and hardware_unsetup()
Just move this pair of functions down to make sure later we can
add something dependent on others.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:43:59 +01:00
Nadav Amit
16f8a6f979 KVM: vmx: Unavailable DR4/5 is checked before CPL
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
"Priority Among Simultaneous Exceptions and Interrupts".

Note, that this may happen on the first DR access, even if the host does not
sets debug breakpoints. Obviously, it occurs when the host debugs the guest.

This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
The emulator already checks DR4/5 availability in check_dr_read. Nested
virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
exceptions to the guest.

As for SVM, the patch follows the previous logic as much as possible. Anyhow,
it appears the DR interception code might be buggy - even if the DR access
may cause an exception, the instruction is skipped.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:26 +01:00
Nadav Amit
0e8a09969a KVM: x86: Clear DR7.LE during task-switch
DR7.LE should be cleared during task-switch. This feature is poorly documented.
For reference, see:
http://pdos.csail.mit.edu/6.828/2005/readings/i386/s12_02.htm

SDM [17.2.4]:
  This feature is not supported in the P6 family processors, later IA-32
  processors, and Intel 64 processors.

AMD [2:13.1.1.4]:
  This bit is ignored by implementations of the AMD64 architecture.

Intel's formulation could mean that it isn't even zeroed, but current
hardware indeed does not behave like that.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:25 +01:00
Nadav Amit
6bdf06625d KVM: x86: DR7.GD should be cleared upon any #DB exception
Intel SDM 17.2.4 (Debug Control Register (DR7)) says: "The processor clears the
GD flag upon entering to the debug exception handler." This sentence may be
misunderstood as if it happens only on #DB due to debug-register protection,
but it happens regardless to the cause of the #DB.

Fix the behavior to match both real hardware and Bochs.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:23 +01:00
Andy Lutomirski
52ce3c21ae x86,kvm,vmx: Don't trap writes to CR4.TSD
CR4.TSD is guest-owned; don't trap writes to it in VMX guests.  This
avoids a VM exit on context switches into or out of a PR_TSC_SIGSEGV
task.

I think that this fixes an unintentional side-effect of:
    4c38609ac5 KVM: VMX: Make guest cr4 mask more conservative

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:22 +01:00
Paolo Bonzini
a73896cb5b KVM: vmx: defer load of APIC access page address during reset
Most call paths to vmx_vcpu_reset do not hold the SRCU lock.  Defer loading
the APIC access page to the next vmentry.

This avoids the following lockdep splat:

[ INFO: suspicious RCU usage. ]
3.18.0-rc2-test2+ #70 Not tainted
-------------------------------
include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/2371:
 #0:  (&vcpu->mutex){+.+...}, at: [<ffffffffa037d800>] vcpu_load+0x20/0xd0 [kvm]

stack backtrace:
CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 0000000000000001 ffff880209983ca8 ffffffff816f514f 0000000000000000
 ffff8802099b8990 ffff880209983cd8 ffffffff810bd687 00000000000fee00
 ffff880208a2c000 ffff880208a10000 ffff88020ef50040 ffff880209983d08
Call Trace:
 [<ffffffff816f514f>] dump_stack+0x4e/0x71
 [<ffffffff810bd687>] lockdep_rcu_suspicious+0xe7/0x120
 [<ffffffffa037d055>] gfn_to_memslot+0xd5/0xe0 [kvm]
 [<ffffffffa03807d3>] __gfn_to_pfn+0x33/0x60 [kvm]
 [<ffffffffa0380885>] gfn_to_page+0x25/0x90 [kvm]
 [<ffffffffa038aeec>] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
 [<ffffffffa08f0a9c>] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
 [<ffffffffa039ab8e>] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
 [<ffffffffa039ac0c>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
 [<ffffffffa037f7e0>] kvm_vm_ioctl+0x1d0/0x780 [kvm]
 [<ffffffff810bc664>] ? __lock_is_held+0x54/0x80
 [<ffffffff812231f0>] do_vfs_ioctl+0x300/0x520
 [<ffffffff8122ee45>] ? __fget+0x5/0x250
 [<ffffffff8122f0fa>] ? __fget_light+0x2a/0xe0
 [<ffffffff81223491>] SyS_ioctl+0x81/0xa0
 [<ffffffff816fed6d>] system_call_fastpath+0x16/0x1b

Reported-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 38b9917350
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-02 08:37:18 +01:00
Jan Kiszka
282da870f4 KVM: nVMX: Disable preemption while reading from shadow VMCS
In order to access the shadow VMCS, we need to load it. At this point,
vmx->loaded_vmcs->vmcs and the actually loaded one start to differ. If
we now get preempted by Linux, vmx_vcpu_put and, on return, the
vmx_vcpu_load will work against the wrong vmcs. That can cause
copy_shadow_to_vmcs12 to corrupt the vmcs12 state.

Fix the issue by disabling preemption during the copy operation.
copy_vmcs12_to_shadow is safe from this issue as it is executed by
vmx_vcpu_run when preemption is already disabled before vmentry.

This bug is exposed by running Jailhouse within KVM on CPUs with
shadow VMCS support.  Jailhouse never expects an interrupt pending
vmexit, but the bug can cause it if, after copy_shadow_to_vmcs12
is preempted, the active VMCS happens to have the virtual interrupt
pending flag set in the CPU-based execution controls.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-02 07:55:46 +01:00
Michael S. Tsirkin
2bc19dc375 kvm: x86: don't kill guest on unknown exit reason
KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was
triggered by a priveledged application.  Let's not kill the guest: WARN
and inject #UD instead.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:17 +02:00
Petr Matousek
a642fc3050 kvm: vmx: handle invvpid vm exit gracefully
On systems with invvpid instruction support (corresponding bit in
IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid
causes vm exit, which is currently not handled and results in
propagation of unknown exit to userspace.

Fix this by installing an invvpid vm exit handler.

This is CVE-2014-3646.

Cc: stable@vger.kernel.org
Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:17 +02:00
Andy Honig
8b3c3104c3 KVM: x86: Prevent host from panicking on shared MSR writes.
The previous patch blocked invalid writes directly when the MSR
is written.  As a precaution, prevent future similar mistakes by
gracefulling handle GPs caused by writes to shared MSRs.

Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
[Remove parts obsoleted by Nadav's patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:08 +02:00
Nadav Amit
854e8bb1aa KVM: x86: Check non-canonical addresses upon WRMSR
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).

Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD.  To eliminate the differences between the architecutres, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
canonical value before writing instead of injecting a #GP.

Some references from Intel and AMD manuals:

According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."

According to AMD manual instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."

This patch fixes CVE-2014-3610.

Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:08 +02:00
Andy Lutomirski
d974baa398 x86,kvm,vmx: Preserve CR4 across VM entry
CR4 isn't constant; at least the TSD and PCE bits can vary.

TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks
like it's correct.

This adds a branch and a read from cr4 to each vm entry.  Because it is
extremely likely that consecutive entries into the same vcpu will have
the same host cr4 value, this fixes up the vmcs instead of restoring cr4
after the fact.  A subsequent patch will add a kernel-wide cr4 shadow,
reducing the overhead in the common case to just two memory reads and a
branch.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-18 10:09:03 -07:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Tang Chen
c24ae0dcd3 kvm: x86: Unpin and remove kvm_arch->apic_access_page
In order to make the APIC access page migratable, stop pinning it in
memory.

And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page.  When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:01 +02:00
Tang Chen
38b9917350 kvm: vmx: Implement set_apic_access_page_addr
Currently, the APIC access page is pinned by KVM for the entire life
of the guest.  We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.

This patch prepares to handle this for the case where there is no nested
virtualization, or where the nested guest does not have an APIC page of
its own.  All accesses to kvm->arch.apic_access_page are changed to go
through kvm_vcpu_reload_apic_access_page.

If the APIC access page is invalidated when the host is running, we update
the VMCS in the next guest entry.

If it is invalidated when the guest is running, the MMU notifier will force
an exit, after which we will handle everything as in the previous case.

If it is invalidated when a nested guest is running, the request will update
either the VMCS01 or the VMCS02.  Updating the VMCS01 is done at the
next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:01 +02:00
Tiejun Chen
b461966063 kvm: x86: fix two typos in comment
s/drity/dirty and s/vmsc01/vmcs01

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:53 +02:00
Nadav Amit
4566654bb9 KVM: vmx: Inject #GP on invalid PAT CR
Guest which sets the PAT CR to invalid value should get a #GP.  Currently, if
vmx supports loading PAT CR during entry, then the value is not checked.  This
patch makes the required check in that case.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:52 +02:00
Liang Chen
77c3913b74 KVM: x86: directly use kvm_make_request again
A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:51 +02:00
Marcelo Tosatti
bc6134942d KVM: nested VMX: disable perf cpuid reporting
Initilization of L2 guest with -cpu host, on L1 guest with -cpu host
triggers:

(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported

Nested VMX MSR load/store support is not sufficient to
allow perf for L2 guest.

Until properly fixed, trap CPUID and disable function 0xA.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:50 +02:00
Paolo Bonzini
1f755a8275 kvm: Make init_rmode_tss() return 0 on success.
In init_rmode_tss(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_set_tss_addr(), and ret is redundant.

This patch removes the redundant variable, by making init_rmode_tss()
return 0 on success, -errno on failure.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:48 +02:00
Tang Chen
f51770ed46 kvm: Make init_rmode_identity_map() return 0 on success.
In init_rmode_identity_map(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_create_vcpu(), and ret is redundant.

This patch removes the redundant variable, and makes init_rmode_identity_map()
return 0 on success, -errno on failure.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-17 13:10:12 +02:00
Tang Chen
a255d4795f kvm: Remove ept_identity_pagetable from struct kvm_arch.
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.

In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized

Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.

NOTE: In the original code, ept identity pagetable page is pinned in memroy.
      As a result, it cannot be migrated/hot-removed. After this patch, since
      kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
      is no longer pinned in memory. And it can be migrated/hot-removed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-17 13:10:11 +02:00
Tang Chen
73a6d94162 kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.
We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of
apic access page. So use this macro.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-11 11:10:22 +02:00
Radim Krčmář
13a34e067e KVM: remove garbage arg to *hardware_{en,dis}able
In the beggining was on_each_cpu(), which required an unused argument to
kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten.

Remove unnecessary arguments that stem from this.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 16:35:55 +02:00
Paolo Bonzini
48d89b9260 KVM: x86: fix some sparse warnings
Sparse reports the following easily fixed warnings:

   arch/x86/kvm/vmx.c:8795:48: sparse: Using plain integer as NULL pointer
   arch/x86/kvm/vmx.c:2138:5: sparse: symbol vmx_read_l1_tsc was not declared. Should it be static?
   arch/x86/kvm/vmx.c:6151:48: sparse: Using plain integer as NULL pointer
   arch/x86/kvm/vmx.c:8851:6: sparse: symbol vmx_sched_in was not declared. Should it be static?

   arch/x86/kvm/svm.c:2162:5: sparse: symbol svm_read_l1_tsc was not declared. Should it be static?

Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 14:02:49 +02:00
Wanpeng Li
a7c0b07d57 KVM: nVMX: nested TPR shadow/threshold emulation
This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=61411

TPR shadow/threshold feature is important to speed up the Windows guest.
Besides, it is a must feature for certain VMM.

We map virtual APIC page address and TPR threshold from L1 VMCS. If
TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest and L1 interested
in, we inject it into L1 VMM for handling.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[Add PAGE_ALIGNED check, do not write useless virtual APIC page address
 if TPR shadowing is disabled. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 14:02:48 +02:00
Wanpeng Li
a2bcba5035 KVM: nVMX: introduce nested_get_vmcs12_pages
Introduce function nested_get_vmcs12_pages() to check the valid
of nested apic access page and virtual apic page earlier.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 14:02:48 +02:00
Christoph Lameter
89cbc76768 x86: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	__this_cpu_inc(y)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:49 -04:00
Radim Krčmář
7b46268d29 KVM: trace kvm_ple_window grow/shrink
Tracepoint for dynamic PLE window, fired on every potential change.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:23 +02:00
Radim Krčmář
b4a2d31da8 KVM: VMX: dynamise PLE window
Window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption
going on.
We do this with sched_in() because it does not hold rq lock.

There are two new kernel parameters for changing the window:
 ple_window_grow and ple_window_shrink
ple_window_grow affects the window on PLE exit and ple_window_shrink
does it on sched_in;  depending on their value, the window is modifier
like this: (ple_window is kvm_intel's global)

  ple_window_shrink/ |
  ple_window_grow    | PLE exit           | sched_in
  -------------------+--------------------+---------------------
  < 1                |  = ple_window      |  = ple_window
  < ple_window       | *= ple_window_grow | /= ple_window_shrink
  otherwise          | += ple_window_grow | -= ple_window_shrink

A third new parameter, ple_window_max, controls the maximal ple_window;
it is internally rounded down to a closest multiple of ple_window_grow.

VCPU's PLE window is never allowed below ple_window.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:23 +02:00
Radim Krčmář
a7653ecdf3 KVM: VMX: make PLE window per-VCPU
Change PLE window into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.

Brings in a small overhead on every vmentry.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:22 +02:00
Radim Krčmář
ae97a3b818 KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in
specific virtualization technlogies as well.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:22 +02:00
Wanpeng Li
a32e84594d KVM: vmx: fix ept reserved bits for 1-GByte page
EPT misconfig handler in kvm will check which reason lead to EPT
misconfiguration after vmexit. One of the reasons is that an EPT
paging-structure entry is configured with settings reserved for
future functionality. However, the handler can't identify if
paging-structure entry of reserved bits for 1-GByte page are
configured, since PDPTE which point to 1-GByte page will reserve
bits 29:12 instead of bits 7:3 which are reserved for PDPTE that
references an EPT Page Directory. This patch fix it by reserve
bits 29:12 for 1-GByte page.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-20 10:13:40 +02:00
Monam Agarwal
3b63a43f1e arch/x86: Use RCU_INIT_POINTER(x, NULL) in kvm/vmx.c
Here rcu_assign_pointer() is ensuring that the
initialization of a structure is carried out before storing a pointer
to that structure.
So, rcu_assign_pointer(p, NULL) can always safely be converted to
RCU_INIT_POINTER(p, NULL).

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:29 +02:00
Wanpeng Li
4473b570a7 KVM: x86: drop fpu_activate hook
The only user of the fpu_activate hook was dropped in commit
2d04a05bd7 (KVM: x86 emulator: emulate CLTS internally, 2011-04-20).
vmx_fpu_activate and svm_fpu_activate are still called on #NM (and for
Intel CLTS), but never from common code; hence, there's no need for
a hook.

Reviewed-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:28 +02:00
Wanpeng Li
f3380ca5d7 KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
An external interrupt will cause a vmexit with reason "external interrupt"
when L2 is running.  L1 will pick up the interrupt through vmcs12 if
L1 set the ack interrupt bit.  Commit 77b0f5d (KVM: nVMX: Ack and write
vector info to intr_info if L1 asks us to) retrieves the interrupt that
belongs to L1 before vmcs01 is loaded.

This will lead to problems in the next patch, which would write to SVI
of vmcs02 instead of vmcs01 (SVI of vmcs02 doesn't make sense because
L2 runs without APICv).

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Liu, RongrongX <rongrongx.liu@intel.com>
Tested-by: Felipe Reyes <freyes@suse.com>
Fixes: 77b0f5d67f
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[Move tracepoint as well. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-05 14:50:45 +02:00
Chris J Arges
296f047502 KVM: vmx: remove duplicate vmx_mpx_supported() prototype
Remove a prototype which was added by both 93c4adc7af and 36be0b9deb.

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-30 17:43:57 +02:00
Paolo Bonzini
03916db934 Replace NR_VMX_MSR with its definition
Using ARRAY_SIZE directly makes it easier to read the code.  While touching
the code, replace the division by a multiplication in the recently added
BUILD_BUG_ON.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:21:57 +02:00
Nadav Amit
0123be429f KVM: x86: Assertions to check no overrun in MSR lists
Currently there is no check whether shared MSRs list overrun the allocated size
which can results in bugs. In addition there is no check that vmx->guest_msrs
has sufficient space to accommodate all the VMX msrs.  This patch adds the
assertions.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:16:57 +02:00
Nadav Amit
6f43ed01e8 KVM: x86: DR6/7.RTM cannot be written
Haswell and newer Intel CPUs have support for RTM, and in that case DR6.RTM is
not fixed to 1 and DR7.RTM is not fixed to zero. That is not the case in the
current KVM implementation. This bug is apparent only if the MOV-DR instruction
is emulated or the host also debugs the guest.

This patch is a partial fix which enables DR6.RTM and DR7.RTM to be cleared and
set respectively. It also sets DR6.RTM upon every debug exception. Obviously,
it is not a complete fix, as debugging of RTM is still unsupported.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 17:17:52 +02:00
Paolo Bonzini
9a2a05b9ed KVM: nVMX: clean up nested_release_vmcs12 and code around it
Make nested_release_vmcs12 idempotent.

Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 14:29:49 +02:00
Paolo Bonzini
4fa7734c62 KVM: nVMX: fix lifetime issues for vmcs02
free_nested needs the loaded_vmcs to be valid if it is a vmcs02, in
order to detach it from the shadow vmcs.  However, this is not
available anymore after commit 26a865f4aa (KVM: VMX: fix use after
free of vmx->loaded_vmcs, 2014-01-03).

Revert that patch, and fix its problem by forcing a vmcs01 as the
active VMCS before freeing all the nested VMX state.

Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 14:29:49 +02:00
Wanpeng Li
963fee1656 KVM: nVMX: Fix virtual interrupt delivery injection
This patch fix bug reported in https://bugzilla.kernel.org/show_bug.cgi?id=73331,
after the patch http://www.spinics.net/lists/kvm/msg105230.html applied, there is
some progress and the L2 can boot up, however, slowly. The original idea of this
fix vid injection patch is from "Zhang, Yang Z" <yang.z.zhang@intel.com>.

Interrupt which delivered by vid should be injected to L1 by L0 if current is in
L1, or should be injected to L2 by L0 through the old injection way if L1 doesn't
have set External-interrupt exiting bit. The current logic doen't consider these
cases. This patch fix it by vid intr to L1 if current is L1 or L2 through old
injection way if L1 doen't have External-interrupt exiting bit set.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: "Zhang, Yang Z" <yang.z.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-17 14:45:26 +02:00
Paolo Bonzini
37ccdcbe07 KVM: x86: return all bits from get_interrupt_shadow
For the next patch we will need to know the full state of the
interrupt shadow; we will then set KVM_REQ_EVENT when one bit
is cleared.

However, right now get_interrupt_shadow only returns the one
corresponding to the emulated instruction, or an unconditional
0 if the emulated instruction does not have an interrupt shadow.
This is confusing and does not allow us to check for cleared
bits as mentioned above.

Clean the callback up, and modify toggle_interruptibility to
match the comment above the call.  As a small result, the
call to set_interrupt_shadow will be skipped in the common
case where int_shadow == 0 && mask == 0.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:56 +02:00
Paolo Bonzini
98eb2f8b14 KVM: vmx: speed up emulation of invalid guest state
About 25% of the time spent in emulation of invalid guest state
is wasted in checking whether emulation is required for the next
instruction.  However, this almost never changes except when a
segment register (or TR or LDTR) changes, or when there is a mode
transition (i.e. CR0 changes).

In fact, vmx_set_segment and vmx_set_cr0 already modify
vmx->emulation_required (except that the former for some reason
uses |= instead of just an assignment).  So there is no need to
call guest_state_valid in the emulation loop.

Emulation performance test results indicate 1650-2600 cycles
for common instructions, versus 2300-3200 before this patch on
a Sandy Bridge Xeon.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:56 +02:00
Nadav Amit
27e6fb5dae KVM: vmx: vmx instructions handling does not consider cs.l
VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in
64-bit mode.  The current implementation is broken since it does not use the
register operands correctly, and always uses 64-bit for reads and writes.
Moreover, write to memory in vmwrite only considers long-mode, so it ignores
cs.l. This patch fixes this behavior.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:15 +02:00
Nadav Amit
1e32c07955 KVM: vmx: handle_cr ignores 32/64-bit mode
On 32-bit mode only bits [31:0] of the CR should be used for setting the CR
value.  Otherwise, the host may incorrectly assume the value is invalid if bits
[63:32] are not zero.  Moreover, the CR is currently being read twice when CR8
is used.  Last, nested mov-cr exiting is modified to handle the CR value
correctly as well.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:15 +02:00
Nadav Amit
5777392e83 KVM: x86: check DR6/7 high-bits are clear only on long-mode
When the guest sets DR6 and DR7, KVM asserts the high 32-bits are clear, and
otherwise injects a #GP exception. This exception should only be injected only
if running in long-mode.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:14 +02:00
Jan Kiszka
5381417f6a KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM
Many real CPUs get this wrong as well, but ours is totally off: bits 9:1
define the highest index value.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:13 +02:00
Jan Kiszka
2996fca069 KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
Allow L1 to "leak" its debug controls into L2, i.e. permit cleared
VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS. This requires to manually
transfer the state of DR7 and IA32_DEBUGCTLMSR from L1 into L2 as both
run on different VMCS.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:13 +02:00
Jan Kiszka
560b7ee12c KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS
SDM says bits 1, 4-6, 8, 13-16, and 26 have to be set.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:12 +02:00
Jan Kiszka
3dcdf3ec6e KVM: nVMX: Allow to disable CR3 access interception
We already have this control enabled by exposing a broken
MSR_IA32_VMX_PROCBASED_CTLS value. This will properly advertise our
capability once the value is fixed by clearing the right bits in
MSR_IA32_VMX_TRUE_PROCBASED_CTLS. We also have to ensure to test the
right value on L2 entry.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:12 +02:00
Jan Kiszka
3dbcd8da7b KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
We already implemented them but failed to advertise them. Currently they
all return the identical values to the capability MSRs they are
augmenting. So there is no change in exposed features yet.

Drop related comments at this chance that are partially incorrect and
redundant anyway.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:11 +02:00
Fabian Frederick
bc39c4db71 arch/x86/kvm/vmx.c: use PAGE_ALIGNED instead of IS_ALIGNED(PAGE_SIZE
use mm.h definition

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:08 +02:00
Linus Torvalds
b05d59dfce At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM.  Changes include:
 
 - a lot of s390 changes: optimizations, support for migration,
   GDB support and more
 
 - ARM changes are pretty small: support for the PSCI 0.2 hypercall
   interface on both the guest and the host (the latter acked by Catalin)
 
 - initial POWER8 and little-endian host support
 
 - support for running u-boot on embedded POWER targets
 
 - pretty large changes to MIPS too, completing the userspace interface
   and improving the handling of virtualized timer hardware
 
 - for x86, a larger set of changes is scheduled for 3.17.  Still,
   we have a few emulator bugfixes and support for running nested
   fully-virtualized Xen guests (para-virtualized Xen guests have
   always worked).  And some optimizations too.
 
 The only missing architecture here is ia64.  It's not a coincidence
 that support for KVM on ia64 is scheduled for removal in 3.17.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
 BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
 j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
 W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
 0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
 CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
 nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
 TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
 ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
 VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
 3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
 bAknaZpPiOrW11Et1htY
 =j5Od
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next

Pull KVM updates from Paolo Bonzini:
 "At over 200 commits, covering almost all supported architectures, this
  was a pretty active cycle for KVM.  Changes include:

   - a lot of s390 changes: optimizations, support for migration, GDB
     support and more

   - ARM changes are pretty small: support for the PSCI 0.2 hypercall
     interface on both the guest and the host (the latter acked by
     Catalin)

   - initial POWER8 and little-endian host support

   - support for running u-boot on embedded POWER targets

   - pretty large changes to MIPS too, completing the userspace
     interface and improving the handling of virtualized timer hardware

   - for x86, a larger set of changes is scheduled for 3.17.  Still, we
     have a few emulator bugfixes and support for running nested
     fully-virtualized Xen guests (para-virtualized Xen guests have
     always worked).  And some optimizations too.

  The only missing architecture here is ia64.  It's not a coincidence
  that support for KVM on ia64 is scheduled for removal in 3.17"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
  KVM: add missing cleanup_srcu_struct
  KVM: PPC: Book3S PR: Rework SLB switching code
  KVM: PPC: Book3S PR: Use SLB entry 0
  KVM: PPC: Book3S HV: Fix machine check delivery to guest
  KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
  KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
  KVM: PPC: Book3S HV: Fix dirty map for hugepages
  KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
  KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
  KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
  KVM: PPC: Book3S: Add ONE_REG register names that were missed
  KVM: PPC: Add CAP to indicate hcall fixes
  KVM: PPC: MPIC: Reset IRQ source private members
  KVM: PPC: Graciously fail broken LE hypercalls
  PPC: ePAPR: Fix hypercall on LE guest
  KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
  KVM: PPC: BOOK3S: Always use the saved DAR value
  PPC: KVM: Make NX bit available with magic page
  KVM: PPC: Disable NX for old magic page using guests
  KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
  ...
2014-06-04 08:47:12 -07:00
Nadav Amit
1f85411255 KVM: vmx: DR7 masking on task switch emulation is wrong
The DR7 masking which is done on task switch emulation should be in hex format
(clearing the local breakpoints enable bits 0,2,4 and 6).

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22 17:47:18 +02:00
Paolo Bonzini
ae9fedc793 KVM: x86: get CPL from SS.DPL
CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS.  And CS.DPL is also not equal
to the CPL for conforming code segments.

However, SS.DPL *is* always equal to the CPL except for the weird
case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
value in the STAR MSR, but force CPL=3 (Intel instead forces
SS.DPL=SS.RPL=CPL=3).

So this patch:

- modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
the above case with SYSRET is not broken further, and the way
to fix it would be to pass the CPL to userspace and back

- modifies VMX to always return the CPL from SS.DPL (except
forcing it to 0 if we are emulating real mode via vm86 mode;
in vm86 mode all DPLs have to be 3, but real mode does allow
privileged instructions).  It also removes the CPL cache,
which becomes a duplicate of the SS access rights cache.

This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
CR0.PE=1 but before CS has been reloaded.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22 17:47:17 +02:00
Gabriel L. Somlo
87c00572ba kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally
correct (but inefficient) behavior. We do this to prevent misbehaving
guests (e.g. OS X <= 10.7) from crashing after they fail to check for
monitor/mwait availability via cpuid.

Since mwait-based idle loops relying on these nop-emulated instructions
would keep the host CPU pegged at 100%, do NOT advertise their presence
via cpuid, to prevent compliant guests from using them inadvertently.

Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-08 15:40:49 +02:00
Nadav Amit
a4ab9d0cf1 KVM: vmx: handle_dr does not handle RSP correctly
The RSP register is not automatically cached, causing mov DR instruction with
RSP to fail.  Instead the regular register accessing interface should be used.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-07 17:24:59 +02:00
Paolo Bonzini
696dfd95ba KVM: vmx: disable APIC virtualization in nested guests
While running a nested guest, we should disable APIC virtualization
controls (virtualized APIC register accesses, virtual interrupt
delivery and posted interrupts), because we do not expose them to
the nested guest.

Reported-by: Hu Yaohui <loki2441@gmail.com>
Suggested-by: Abel Gordon <abel@stratoscale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-07 13:46:02 +02:00
Bandan Das
4291b58885 KVM: nVMX: move vmclear and vmptrld pre-checks to nested_vmx_check_vmptr
Some checks are common to all, and moreover,
according to the spec, the check for whether any bits
beyond the physical address width are set are also
applicable to all of them

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 19:00:43 +02:00
Bandan Das
96ec146330 KVM: nVMX: fail on invalid vmclear/vmptrld pointer
The spec mandates that if the vmptrld or vmclear
address is equal to the vmxon region pointer, the
instruction should fail with error "VMPTRLD with
VMXON pointer" or "VMCLEAR with VMXON pointer"

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 19:00:37 +02:00
Bandan Das
3573e22cfe KVM: nVMX: additional checks on vmxon region
Currently, the vmxon region isn't used in the nested case.
However, according to the spec, the vmxon instruction performs
additional sanity checks on this region and the associated
pointer. Modify emulated vmxon to better adhere to the spec
requirements

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 19:00:27 +02:00
Bandan Das
19677e32fe KVM: nVMX: rearrange get_vmx_mem_address
Our common function for vmptr checks (in 2/4) needs to fetch
the memory address

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 18:59:57 +02:00
Bandan Das
fe2b201b3b KVM: x86: Check for host supported fields in shadow vmcs
We track shadow vmcs fields through two static lists,
one for read only and another for r/w fields. However, with
addition of new vmcs fields, not all fields may be supported on
all hosts. If so, copy_vmcs12_to_shadow() trying to vmwrite on
unsupported hosts will result in a vmwrite error. For example, commit
36be0b9deb introduced GUEST_BNDCFGS, which is not supported
by all processors. Filter out host unsupported fields before
letting guests use shadow vmcs

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28 11:14:51 +02:00
Bandan Das
e0ba1a6ffc KVM: nVMX: Advertise support for interrupt acknowledgement
Some Type 1 hypervisors such as XEN won't enable VMX without it present

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:41:34 -03:00
Bandan Das
77b0f5d67f KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to
This feature emulates the "Acknowledge interrupt on exit" behavior.
We can safely emulate it for L1 to run L2 even if L0 itself has it
disabled (to run L1).

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:41:33 -03:00
Bandan Das
4b85507860 KVM: nVMX: Don't advertise single context invalidation for invept
For single context invalidation, we fall through to global
invalidation in handle_invept() except for one case - when
the operand supplied by L1 is different from what we have in
vmcs12. However, typically hypervisors will only call invept
for the currently loaded eptp, so the condition will
never be true.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:41:28 -03:00
Huw Davies
fd2a445a94 KVM: VMX: Advance rip to after an ICEBP instruction
When entering an exception after an ICEBP, the saved instruction
pointer should point to after the instruction.

This fixes the bug here: https://bugs.launchpad.net/qemu/+bug/1119686

Signed-off-by: Huw Davies <huw@codeweavers.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:37:43 -03:00
Michael S. Tsirkin
68c3b4d167 KVM: VMX: speed up wildcard MMIO EVENTFD
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data.  The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
    - minimize overhead for old userspace that does not use eventfd with lengtth = 0
    - minimize disruption in other code (since we don't know the length,
      devices on the MMIO bus only get a valid address in write, this
      way we don't need to touch all devices to teach them to handle
      an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17 14:01:43 -03:00
Feng Wu
e1e746b3c5 KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-14 17:50:35 -03:00
Paolo Bonzini
93c4adc7af KVM: x86: handle missing MPX in nested virtualization
When doing nested virtualization, we may be able to read BNDCFGS but
still not be allowed to write to GUEST_BNDCFGS in the VMCS.  Guard
writes to the field with vmx_mpx_supported(), and similarly hide the
MSR from userspace if the processor does not support the field.

We could work around this with the generic MSR save/load machinery,
but there is only a limited number of MSR save/load slots and it is
not really worthwhile to waste one for a scenario that should not
happen except in the nested virtualization case.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-17 12:21:39 +01:00
Paolo Bonzini
36be0b9deb KVM: x86: Add nested virtualization support for MPX
This is simple to do, the "host" BNDCFGS is either 0 or the guest value.
However, both controls have to be present.  We cannot provide MPX if
we only have one of the "load BNDCFGS" or "clear BNDCFGS" controls.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-17 12:21:39 +01:00
Paolo Bonzini
d16c293e4e KVM: nVMX: Allow nested guests to run with dirty debug registers
When preparing the VMCS02, the CPU-based execution controls is computed
by vmx_exec_control.  Turn off DR access exits there, too, if the
KVM_DEBUGREG_WONT_EXIT bit is set in switch_db_regs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:03 +01:00
Paolo Bonzini
81908bf443 KVM: vmx: Allow the guest to run with dirty debug registers
When not running in guest-debug mode (i.e. the guest controls the debug
registers, having to take an exit for each DR access is a waste of time.
If the guest gets into a state where each context switch causes DR to be
saved and restored, this can take away as much as 40% of the execution
time from the guest.

If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we
can let it write freely to the debug registers and reload them on the
next exit.  We still need to exit on the first access, so that the
KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further
accesses to the debug registers will not cause a vmexit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:03 +01:00
Paolo Bonzini
c845f9c646 KVM: vmx: we do rely on loading DR7 on entry
Currently, this works even if the bit is not in "min", because the bit is always
set in MSR_IA32_VMX_ENTRY_CTLS.  Mention it for the sake of documentation, and
to avoid surprises if we later switch to MSR_IA32_VMX_TRUE_ENTRY_CTLS.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:01 +01:00
Jan Kiszka
c9a7953f09 KVM: x86: Remove return code from enable_irq/nmi_window
It's no longer possible to enter enable_irq_window in guest mode when
L1 intercepts external interrupts and we are entering L2. This is now
caught in vcpu_enter_guest. So we can remove the check from the VMX
version of enable_irq_window, thus the need to return an error code from
both enable_irq_window and enable_nmi_window.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:47 +01:00
Jan Kiszka
220c567297 KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
According to SDM 27.2.3, IDT vectoring information will not be valid on
vmexits caused by external NMIs. So we have to avoid creating such
scenarios by delaying EXIT_REASON_EXCEPTION_NMI injection as long as we
have a pending interrupt because that one would be migrated to L1's IDT
vectoring info on nested exit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:46 +01:00
Jan Kiszka
f4124500c2 KVM: nVMX: Fully emulate preemption timer
We cannot rely on the hardware-provided preemption timer support because
we are holding L2 in HLT outside non-root mode. Furthermore, emulating
the preemption will resolve tick rate errata on older Intel CPUs.

The emulation is based on hrtimer which is started on L2 entry, stopped
on L2 exit and evaluated via the new check_nested_events hook. As we no
longer rely on hardware features, we can enable both the preemption
timer support and value saving unconditionally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:45 +01:00
Jan Kiszka
b6b8a1451f KVM: nVMX: Rework interception of IRQs and NMIs
Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:45 +01:00
Paolo Bonzini
ccf9844e5d kvm, vmx: Really fix lazy FPU on nested guest
Commit e504c9098e (kvm, vmx: Fix lazy FPU on nested guest, 2013-11-13)
highlighted a real problem, but the fix was subtly wrong.

nested_read_cr0 is the CR0 as read by L2, but here we want to look at
the CR0 value reflecting L1's setup.  In other words, L2 might think
that TS=0 (so nested_read_cr0 has the bit clear); but if L1 is actually
running it with TS=1, we should inject the fault into L1.

The effective value of CR0 in L2 is contained in vmcs12->guest_cr0, use
it.

Fixes: e504c9098e
Reported-by: Kashyap Chamarty <kchamart@redhat.com>
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Kashyap Chamarty <kchamart@redhat.com>
Tested-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-03 12:49:42 +01:00
Liu, Jinsong
0dd376e709 KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
From 5d5a80cd172ea6fb51786369bcc23356b1e9e956 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:55 +0800
Subject: [PATCH v5 2/3] KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save

Add MSR_IA32_BNDCFGS to msrs_to_save, and corresponding logic
to kvm_get/set_msr().

Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-25 20:17:09 +01:00
Liu, Jinsong
da8999d318 KVM: x86: Intel MPX vmx and msr handle
From caddc009a6d2019034af8f2346b2fd37a81608d0 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:11 +0800
Subject: [PATCH v5 1/3] KVM: x86: Intel MPX vmx and msr handle

This patch handle vmx and msr of Intel MPX feature.

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-24 12:14:00 +01:00
Jan Kiszka
58cb628dbe KVM: x86: Validate guest writes to MSR_IA32_APICBASE
Check for invalid state transitions on guest-initiated updates of
MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is
not supported and all invalid transitions as described in SDM section
10.12.5. It also checks that no reserved bit is set in APICBASE by the
guest.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
[Use cpuid_maxphyaddr instead of guest_cpuid_get_phys_bits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-27 14:39:44 +01:00
Linus Torvalds
7ebd3faa9b First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place.  The most
 interesting part is the ARM guys' virtualized interrupt controller
 overhaul, which lets userspace get/set the state and thus enables
 migration of ARM VMs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS3TVKAAoJEBvWZb6bTYbyIFgP/2cmt4ifCuFMaZv4+G1S8jZU
 uC9ZB/+7vzht/p6zAy+4BxurKbHmSBFkC1OKcxYuy7yB4CQkHabzj4V2vRtqFdwH
 5lExP9qh3kqaVLuhnvxLTmkktR3EW4PFy6OI53l5kRNktOXSuZ0aN6K3V7tCg/X0
 iL7ASo4bJKlxeWcDpmuVrNgAajmZVfXrjKY7robgBQno+yIsgKhRZRBQHjozA6B8
 FpCo/k48RZd/EzIbV/PDDRI4hmmry/lgrO9SKjzq56wSqff2bd/k/KYze4dbAPfd
 Ps60enPTuHmeEjjb4MMMU4EKHVdTQFUMx/xZCmT4xzoh8s4of6RHphXbfE0SUznQ
 dTveyEQAR7E3JNS0k1+3WEX5fWlFesp0hO2NeE0wzUq4TAr9ztgVO9NQ6Si15e7Z
 2HysO0T5Ojtt0lY08/PvS6i48eCAuuBomrejJS8hLW4SUZ5adn+yW4Qo7Fp9JeBR
 l9a3LsVT8BZMtUWrUuFcVhlM4MbzElUPjDbgWhR8UYU/kpfVZOQu8qWgGKR4UWXy
 X7/t9l/tjR99CmfMJBAOzJid+ScSpAfg77BdaKiQrVfVIJmsjEjlO8vUMyj5b1HF
 hPX5wNyJjHAOfridLeHSs4Rdm4a8sk8Az5d4h76pLVz8M4jyTi2v0rO3N4/dU/pu
 x7N8KR5hAj+mLBoM9/Al
 =8sYU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First round of KVM updates for 3.14; PPC parts will come next week.

  Nothing major here, just bugfixes all over the place.  The most
  interesting part is the ARM guys' virtualized interrupt controller
  overhaul, which lets userspace get/set the state and thus enables
  migration of ARM VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  kvm: make KVM_MMU_AUDIT help text more readable
  KVM: s390: Fix memory access error detection
  KVM: nVMX: Update guest activity state field on L2 exits
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: SVM: Fix reading of DR6
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  add support for Hyper-V reference time counter
  KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
  KVM: x86: fix tsc catchup issue with tsc scaling
  KVM: x86: limit PIT timer frequency
  KVM: x86: handle invalid root_hpa everywhere
  kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
  kvm: vfio: silence GCC warning
  KVM: ARM: Remove duplicate include
  arm/arm64: KVM: relax the requirements of VMA alignment for THP
  ...
2014-01-22 21:40:43 -08:00
Jan Kiszka
3edf1e698f KVM: nVMX: Update guest activity state field on L2 exits
Set guest activity state in L1's VMCS according to the VCPUs mp_state.
This ensures we report the correct state in case we L2 executed HLT or
if we put L2 into HLT state and it was now woken up by an event.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:18 +01:00
Jan Kiszka
7af40ad37b KVM: nVMX: Fix nested_run_pending on activity state HLT
When we suspend the guest in HLT state, the nested run is no longer
pending - we emulated it completely. So only set nested_run_pending
after checking the activity state.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:17 +01:00
Jan Kiszka
cae501397a KVM: nVMX: Clean up handling of VMX-related MSRs
This simplifies the code and also stops issuing warning about writing to
unhandled MSRs when VMX is disabled or the Feature Control MSR is
locked - we do handle them all according to the spec.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:16 +01:00
Jan Kiszka
542060ea79 KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
Already used by nested SVM for tracing nested vmexit: kvm_nested_vmexit
marks exits from L2 to L0 while kvm_nested_vmexit_inject marks vmexits
that are reflected to L1.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:15 +01:00
Jan Kiszka
533558bcb6 KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
Instead of fixing up the vmcs12 after the nested vmexit, pass key
parameters already when calling nested_vmx_vmexit. This will help
tracing those vmexits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:14 +01:00
Jan Kiszka
42124925c1 KVM: nVMX: Leave VMX mode on clearing of feature control MSR
When userspace sets MSR_IA32_FEATURE_CONTROL to 0, make sure we leave
root and non-root mode, fully disabling VMX. The register state of the
VCPU is undefined after this step, so userspace has to set it to a
proper state afterward.

This enables to reboot a VM while it is running some hypervisor code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:13 +01:00
Jan Kiszka
8246bf52c7 KVM: VMX: Fix DR6 update on #DB exception
According to the SDM, only bits 0-3 of DR6 "may" be cleared by "certain"
debug exception. So do update them on #DB exception in KVM, but leave
the rest alone, only setting BD and BS in addition to already set bits
in DR6. This also aligns us with kvm_vcpu_check_singlestep.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:12 +01:00
Jan Kiszka
73aaf249ee KVM: SVM: Fix reading of DR6
In contrast to VMX, SVM dose not automatically transfer DR6 into the
VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor
hook to obtain the current value. And as SVM now picks the DR6 state
from its VMCB, we also need a set callback in order to write updates of
DR6 back.

Fixes a regression of 020df0794f.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:10 +01:00
Marcelo Tosatti
26a865f4aa KVM: VMX: fix use after free of vmx->loaded_vmcs
After free_loaded_vmcs executes, the "loaded_vmcs" structure
is kfreed, and now vmx->loaded_vmcs points to a kfreed area.
Subsequent free_loaded_vmcs then attempts to manipulate
vmx->loaded_vmcs.

Switch the order to avoid the problem.

https://bugzilla.redhat.com/show_bug.cgi?id=1047892

Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-08 19:14:08 -02:00
Zhihui Zhang
2f0a6397dd KVM: VMX: check use I/O bitmap first before unconditional I/O exit
According to Table C-1 of Intel SDM 3C, a VM exit happens on an I/O instruction when
"use I/O bitmaps" VM-execution control was 0 _and_ the "unconditional I/O exiting"
VM-execution control was 1. So we can't just check "unconditional I/O exiting" alone.
This patch was improved by suggestion from Jan Kiszka.

Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-08 19:01:40 -02:00
Jan Kiszka
29bf08f12b KVM: nVMX: Unconditionally uninit the MMU on nested vmexit
Three reasons for doing this: 1. arch.walk_mmu points to arch.mmu anyway
in case nested EPT wasn't in use. 2. this aligns VMX with SVM. But 3. is
most important: nested_cpu_has_ept(vmcs12) queries the VMCS page, and if
one guest VCPU manipulates the page of another VCPU in L2, we may be
fooled to skip over the nested_ept_uninit_mmu_context, leaving mmu in
nested state. That can crash the host later on if nested_ept_get_cr3 is
invoked while L1 already left vmxon and nested.current_vmcs12 became
NULL therefore.

Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-02 11:22:14 -02:00
Jan Kiszka
4c4d563b49 KVM: VMX: Do not skip the instruction if handle_dr injects a fault
If kvm_get_dr or kvm_set_dr reports that it raised a fault, we must not
advance the instruction pointer. Otherwise the exception will hit the
wrong instruction.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-20 19:20:39 +01:00
Jan Kiszka
ca3f257ae5 KVM: nVMX: Support direct APIC access from L2
It's a pathological case, but still a valid one: If L1 disables APIC
virtualization and also allows L2 to directly write to the APIC page, we
have to forcibly enable APIC virtualization while in L2 if the in-kernel
APIC is in use.

This allows to run the direct interrupt test case in the vmx unit test
without x2APIC.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-18 10:27:09 +01:00
Jan Kiszka
6dfacadd58 KVM: nVMX: Add support for activity state HLT
We can easily emulate the HLT activity state for L1: If it decides that
L2 shall be halted on entry, just invoke the normal emulation of halt
after switching to L2. We do not depend on specific host features to
provide this, so we can expose the capability unconditionally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-12 10:49:56 +01:00
Gleb Natapov
2961e8764f KVM: VMX: shadow VM_(ENTRY|EXIT)_CONTROLS vmcs field
VM_(ENTRY|EXIT)_CONTROLS vmcs fields are read/written on each guest
entry but most times it can be avoided since values do not changes.
Keep fields copy in memory to avoid unnecessary reads from vmcs.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-12 10:49:49 +01:00
Anthoine Bourgeois
e504c9098e kvm, vmx: Fix lazy FPU on nested guest
If a nested guest does a NM fault but its CR0 doesn't contain the TS
flag (because it was already cleared by the guest with L1 aid) then we
have to activate FPU ourselves in L0 and then continue to L2. If TS flag
is set then we fallback on the previous behavior, forward the fault to
L1 if it asked for.

Signed-off-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-11-13 18:46:54 +01:00
Michael S. Tsirkin
6026620475 kvm/vmx: error message typo fix
mst can't be blamed for lack of switch entries: the
issue is with msrs actually.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-31 11:55:15 +01:00
Alex Williamson
e0f0bbc527 kvm: Create non-coherent DMA registeration
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present.  Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group.  Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:23 +01:00
Alex Williamson
d96eb2c6f4 kvm/x86: Convert iommu_flags to iommu_noncoherent
Default to operating in coherent mode.  This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:13 +01:00
Jan Kiszka
a294c9bbd0 nVMX: Report CPU_BASED_VIRTUAL_NMI_PENDING as supported
If the host supports it, we can and should expose it to the guest as
well, just like we already do with PIN_BASED_VIRTUAL_NMIS.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-28 13:14:38 +01:00
Jan Kiszka
cd2633c59b nVMX: Fix pick-up of uninjected NMIs
__vmx_complete_interrupts stored uninjected NMIs in arch.nmi_injected,
not arch.nmi_pending. So we actually need to check the former field in
vmcs12_save_pending_event. This fixes the eventinj unit test when run
in nested KVM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-28 13:14:24 +01:00
Jan Kiszka
d3134dbf20 KVM: nVMX: Report 2MB EPT pages as supported
As long as the hardware provides us 2MB EPT pages, we can also expose
them to the guest because our shadow EPT code already supports this
feature.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-28 13:14:12 +01:00
Gleb Natapov
13acfd5715 Powerpc KVM work is based on a commit after rc4.
Merging master into next to satisfy the dependencies.

Conflicts:
	arch/arm/kvm/reset.c
2013-10-17 17:41:49 +03:00
Arthur Chunqi Li
7854cbca81 KVM: nVMX: Fully support nested VMX preemption timer
This patch contains the following two changes:
1. Fix the bug in nested preemption timer support. If vmexit L2->L0
with some reasons not emulated by L1, preemption timer value should
be save in such exits.
2. Add support of "Save VMX-preemption timer value" VM-Exit controls
to nVMX.

With this patch, nested VMX preemption timer features are fully
supported.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-10 18:22:54 +02:00
Gleb Natapov
d0d538b9d1 KVM: nVMX: fix shadow on EPT
72f857950f broke shadow on EPT. This patch reverts it and fixes PAE
on nEPT (which reverted commit fixed) in other way.

Shadow on EPT is now broken because while L1 builds shadow page table
for L2 (which is PAE while L2 is in real mode) it never loads L2's
GUEST_PDPTR[0-3].  They do not need to be loaded because without nested
virtualization HW does this during guest entry if EPT is disabled,
but in our case L0 emulates L2's vmentry while EPT is enables, so we
cannot rely on vmcs12->guest_pdptr[0-3] to contain up-to-date values
and need to re-read PDPTEs from L2 memory. This is what kvm_set_cr3()
is doing, but by clearing cache bits during L2 vmentry we drop values
that kvm_set_cr3() read from memory.

So why the same code does not work for PAE on nEPT? kvm_set_cr3()
reads pdptes into vcpu->arch.walk_mmu->pdptrs[]. walk_mmu points to
vcpu->arch.nested_mmu while nested guest is running, but ept_load_pdptrs()
uses vcpu->arch.mmu which contain incorrect values. Fix that by using
walk_mmu in ept_(load|save)_pdptrs.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-10 11:39:57 +02:00
Paolo Bonzini
8a3c1a3347 KVM: mmu: change useless int return types to void
kvm_mmu initialization is mostly filling in function pointers, there is
no way for it to fail.  Clean up unused return values.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:44:02 +03:00
Gleb Natapov
feaf0c7dc4 KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2
If #PF happens during delivery of an exception into L2 and L1 also do
not have the page mapped in its shadow page table then L0 needs to
generate vmexit to L2 with original event in IDT_VECTORING_INFO, but
current code combines both exception and generates #DF instead. Fix that
by providing nVMX specific function to handle page faults during page
table walk that handles this case correctly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:25 +02:00
Gleb Natapov
e011c663b9 KVM: nVMX: Check all exceptions for intercept during delivery to L2
All exceptions should be checked for intercept during delivery to L2,
but we check only #PF currently. Drop nested_run_pending while we are
at it since exception cannot be injected during vmentry anyway.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
[Renamed the nested_vmx_check_exception function. - Paolo]
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:24 +02:00
Gleb Natapov
851eb6677c KVM: nVMX: Do not put exception that caused vmexit to IDT_VECTORING_INFO
If an exception causes vmexit directly it should not be reported in
IDT_VECTORING_INFO during the exit. For that we need to be able to
distinguish between exception that is injected into nested VM and one that
is reinjected because its delivery failed. Fortunately we already have
mechanism to do so for nested SVM, so here we just use correct function
to requeue exceptions and make sure that reinjected exception is not
moved to IDT_VECTORING_INFO during vmexit emulation and not re-checked
for interception during delivery.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:24 +02:00
Gleb Natapov
e0b890d35c KVM: nVMX: Amend nested_run_pending logic
EXIT_REASON_VMLAUNCH/EXIT_REASON_VMRESUME exit does not mean that nested
VM will actually run during next entry. Move setting nested_run_pending
closer to vmentry emulation code and move its clearing close to vmexit to
minimize amount of code that will erroneously run with nested_run_pending
set.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:23 +02:00
Gleb Natapov
bcd1c29495 KVM: VMX: do not check bit 12 of EPT violation exit qualification when undefined
Bit 12 is undefined in any of the following cases:
- If the "NMI exiting" VM-execution control is 1 and the "virtual NMIs"
  VM-execution control is 0.
- If the VM exit sets the valid bit in the IDT-vectoring information field

Signed-off-by: Gleb Natapov <gleb@redhat.com>
[Add parentheses around & within && - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-25 11:38:26 +02:00
Jan Kiszka
92fbc7b195 KVM: nVMX: Enable unrestricted guest mode support
Now that we provide EPT support, there is no reason to torture our
guests by hiding the relieving unrestricted guest mode feature. We just
need to relax CR0 checks for always-on bits as PE and PG can now be
switched off.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:15 +02:00
Jan Kiszka
10ba54a589 KVM: nVMX: Implement support for EFER saving on VM-exit
Implement and advertise VM_EXIT_SAVE_IA32_EFER. L0 traps EFER writes
unconditionally, so we always find the current L2 value in the
architectural state.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:14 +02:00
Jan Kiszka
59ab5a8f44 KVM: nVMX: Do not set identity page map for L2
Fiddling with CR3 for L2 is L1's job. It may set its own, different
identity map or simple leave it alone if unrestricted guest mode is
enabled. This also fixes reading back the current CR3 on L2 exits for
reporting it to L1.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:14 +02:00
Jan Kiszka
9e3e4dbf44 KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
kvm_set_cr0 performs checks on the state transition that may prevent
loading L1's cr0. For now we rely on the hardware to catch invalid
states loaded by L1 into its VMCS. Still, consistency checks on the host
state part of the VMCS on guest entry will have to be improved later on.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:13 +02:00
Gleb Natapov
0be9c7a89f KVM: VMX: set "blocked by NMI" flag if EPT violation happens during IRET from NMI
Set "blocked by NMI" flag if EPT violation happens during IRET from NMI
otherwise NMI can be called recursively causing stack corruption.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-09-17 19:09:47 +03:00
Gleb Natapov
72f857950f KVM: nEPT: reset PDPTR register cache on nested vmentry emulation
After nested vmentry stale cache can be used to reload L2 PDPTR pointers
which will cause L2 guest to fail. Fix it by invalidating cache on nested
vmentry emulation.

https://bugzilla.kernel.org/show_bug.cgi?id=60830

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-17 12:52:42 +03:00
Paolo Bonzini
94452b9e34 KVM: vmx: count exits to userspace during invalid guest emulation
These will happen due to MMIO.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-28 17:13:15 +03:00
Arthur Chunqi Li
c0dfee582e KVM: nVMX: Advertise IA32_PAT in VM exit control
Advertise VM_EXIT_SAVE_IA32_PAT and VM_EXIT_LOAD_IA32_PAT.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:47 +02:00
Jan Kiszka
5743534960 KVM: nVMX: Fix up VM_ENTRY_IA32E_MODE control feature reporting
Do not report that we can enter the guest in 64-bit mode if the host is
32-bit only. This is not supported by KVM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:47 +02:00
Jan Kiszka
ca72d970ff KVM: nEPT: Advertise WB type EPTP
At least WB must be possible.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:46 +02:00
Jan Kiszka
44811c02ed nVMX: Keep arch.pat in sync on L1-L2 switches
When asking vmx to load the PAT MSR for us while switching from L1 to L2
or vice versa, we have to update arch.pat as well as it may later be
used again to load or read out the MSR content.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Tested-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:45 +02:00
Nadav Har'El
f5c4368f85 nEPT: Miscelleneous cleanups
Some trivial code cleanups not really related to nested EPT.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:44 +02:00
Nadav Har'El
2b1be67741 nEPT: Some additional comments
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:44 +02:00
Nadav Har'El
afa61f752b Advertise the support of EPT to the L1 guest, through the appropriate MSR.
This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:43 +02:00
Nadav Har'El
bfd0a56b90 nEPT: Nested INVEPT
If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if last level
of shadow page is unsync not all L1's changes to EPT12 are intercepted,
which means roots need to be synced when L1 calls INVEPT. Global INVEPT
should not be different since roots are synced by kvm_mmu_load() each
time EPTP02 changes.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:42 +02:00
Nadav Har'El
155a97a3d7 nEPT: MMU context for nested EPT
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:41 +02:00
Yang Zhang
25d92081ae nEPT: Add nEPT violation/misconfigration support
Inject nEPT fault to L1 guest. This patch is original from Xinhao.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:40 +02:00
Nadav Har'El
3633cfc3e8 nEPT: Fix cr3 handling in nested exit and entry
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:34 +02:00
Nadav Har'El
8049d651e8 nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
switch the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:34 +02:00
Gleb Natapov
205befd9a5 KVM: nVMX: correctly set tr base on nested vmexit emulation
After commit 21feb4eb64 tr base is zeroed
during vmexit. Set it to L1's HOST_TR_BASE. This should fix
https://bugzilla.kernel.org/show_bug.cgi?id=60679

Reported-by: Yongjie Ren <yongjie.ren@intel.com>
Reviewed-by: Arthur Chunqi Li <yzt356@gmail.com>
Tested-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:32 +02:00
Gleb Natapov
63fbf59f8a nVMX: reset rflags register cache during nested vmentry.
During nested vmentry into vm86 mode a vcpu state is found to be incorrect
because rflags does not have VM flag set since it is read from the cache
and has L1's value instead of L2's. If emulate_invalid_guest_state=1 L0
KVM tries to emulate it, but emulation does not work for nVMX and it
never should happen anyway. Fix that by using vmx_set_rflags() to set
rflags during nested vmentry which takes care of updating register cache.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:04:22 +02:00
Paolo Bonzini
ac0a48c39a KVM: x86: rename EMULATE_DO_MMIO
The next patch will reuse it for other userspace exits than MMIO,
namely debug events.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:01:14 +02:00
Arthur Chunqi Li
21feb4eb64 KVM: nVMX: Set segment infomation of L1 when L2 exits
When L2 exits to L1, segment infomations of L1 are not set correctly.
According to Intel SDM 27.5.2(Loading Host Segment and Descriptor
Table Registers), segment base/limit/access right of L1 should be
set to some designed value when L2 exits to L1. This patch fixes
this.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Reviewed-by: Gleb Natapov <gnatapov@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:31 +02:00
Nadav Har'El
b3897a49e2 KVM: nVMX: Fix read/write to MSR_IA32_FEATURE_CONTROL
Fix read/write to IA32_FEATURE_CONTROL MSR in nested environment.

This patch simulate this MSR in nested_vmx and the default value is
0x0. BIOS should set it to 0x5 before VMXON. After setting the lock
bit, write to it will cause #GP(0).

Another QEMU patch is also needed to handle emulation of reset
and migration. Reset to vCPU should clear this MSR and migration
should reserve value of it.

This patch is based on Nadav's previous commit.
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/88478

Signed-off-by: Nadav Har'El <nyh@math.technion.ac.il>
Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:29 +02:00
Mathias Krause
c2bae89394 KVM: VMX: Use proper types to access const arrays
Use a const pointer type instead of casting away the const qualifier
from const arrays. Keep the pointer array on the stack, nonetheless.
Making it static just increases the object size.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:28 +02:00
Arthur Chunqi Li
a25eb114d5 KVM: nVMX: Set success rflags when emulate VMXON/VMXOFF in nested virt
Set rflags after successfully emulateing VMXON/VMXOFF in VMX.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:27 +02:00
Arthur Chunqi Li
0658fbaad8 KVM: nVMX: Change location of 3 functions in vmx.c
Move nested_vmx_succeed/nested_vmx_failInvalid/nested_vmx_failValid
ahead of handle_vmon to eliminate double declaration in the same
file

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:26 +02:00
Gleb Natapov
03617c188f KVM: VMX: mark unusable segment as nonpresent
Some userspaces do not preserve unusable property. Since usable
segment has to be present according to VMX spec we can use present
property to amend userspace bug by making unusable segment always
nonpresent. vmx_segment_access_rights() already marks nonpresent segment
as unusable.

Cc: stable@vger.kernel.org # 3.9+
Reported-by: Stefan Pietsch <stefan.pietsch@lsexperts.de>
Tested-by: Stefan Pietsch <stefan.pietsch@lsexperts.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-04 14:40:36 +02:00
Linus Torvalds
fe489bf450 KVM fixes for 3.11
On the x86 side, there are some optimizations and documentation updates.
 The big ARM/KVM change for 3.11, support for AArch64, will come through
 Catalin Marinas's tree.  s390 and PPC have misc cleanups and bugfixes.
 
 There is a conflict due to "s390/pgtable: fix ipte notify bit" having
 entered 3.10 through Martin Schwidefsky's s390 tree.  This pull request
 has additional changes on top, so this tree's version is the correct one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
 6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
 UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
 ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
 dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
 +QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
 1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
 qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
 3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
 RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
 LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
 xBohzi49L9FDbpOnTYfz
 =1zpG
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "On the x86 side, there are some optimizations and documentation
  updates.  The big ARM/KVM change for 3.11, support for AArch64, will
  come through Catalin Marinas's tree.  s390 and PPC have misc cleanups
  and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
  KVM: PPC: Ignore PIR writes
  KVM: PPC: Book3S PR: Invalidate SLB entries properly
  KVM: PPC: Book3S PR: Allow guest to use 1TB segments
  KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
  KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
  KVM: PPC: Book3S PR: Fix proto-VSID calculations
  KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
  KVM: Fix RTC interrupt coalescing tracking
  kvm: Add a tracepoint write_tsc_offset
  KVM: MMU: Inform users of mmio generation wraparound
  KVM: MMU: document fast invalidate all mmio sptes
  KVM: MMU: document fast invalidate all pages
  KVM: MMU: document fast page fault
  KVM: MMU: document mmio page fault
  KVM: MMU: document write_flooding_count
  KVM: MMU: document clear_spte_count
  KVM: MMU: drop kvm_mmu_zap_mmio_sptes
  KVM: MMU: init kvm generation close to mmio wrap-around value
  KVM: MMU: add tracepoint for check_mmio_spte
  KVM: MMU: fast invalidate all mmio sptes
  ...
2013-07-03 13:21:40 -07:00
Yoshihiro YUNOMAE
489223edf2 kvm: Add a tracepoint write_tsc_offset
Add a tracepoint write_tsc_offset for tracing TSC offset change.
We want to merge ftrace's trace data of guest OSs and the host OS using
TSC for timestamp in chronological order. We need "TSC offset" values for
each guest when merge those because the TSC value on a guest is always the
host TSC plus guest's TSC offset. If we get the TSC offset values, we can
calculate the host TSC value for each guest events from the TSC offset and
the event TSC value. The host TSC values of the guest events are used when we
want to merge trace data of guests and the host in chronological order.
(Note: the trace_clock of both the host and the guest must be set x86-tsc in
this case)

This tracepoint also records vcpu_id which can be used to merge trace data for
SMP guests. A merge tool will read TSC offset for each vcpu, then the tool
converts guest TSC values to host TSC values for each vcpu.

TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
The latter function is executed when kvm clock is updated. Only host can read
TSC offset value from VMCS, so a host needs to output TSC offset value
when TSC offset is changed.

Since the TSC offset is not often changed, it could be overwritten by other
frequent events while tracing. To avoid that, I recommend to use a special
instance for getting this event:

1. set a instance before booting a guest
 # cd /sys/kernel/debug/tracing/instances
 # mkdir tsc_offset
 # cd tsc_offset
 # echo x86-tsc > trace_clock
 # echo 1 > events/kvm/kvm_write_tsc_offset/enable

2. boot a guest

Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-27 14:20:51 +03:00
Xiao Guangrong
f8f559422b KVM: MMU: fast invalidate all mmio sptes
This patch tries to introduce a very simple and scale way to invalidate
all mmio sptes - it need not walk any shadow pages and hold mmu-lock

KVM maintains a global mmio valid generation-number which is stored in
kvm->memslots.generation and every mmio spte stores the current global
generation-number into his available bits when it is created

When KVM need zap all mmio sptes, it just simply increase the global
generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
then it walks the shadow page table and get the mmio spte. If the
generation-number on the spte does not equal the global generation-number,
it will go to the normal #PF handler to update the mmio spte

Since 19 bits are used to store generation-number on mmio spte, we zap all
mmio sptes when the number is round

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:36 +03:00
Xiao Guangrong
b37fbea6ce KVM: MMU: make return value of mmio page fault handler more readable
Define some meaningful names instead of raw code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:17 +03:00
H. Peter Anvin
1adfa76a95 x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
Bit 1 in the x86 EFLAGS is always set.  Name the macro something that
actually tries to explain what it is all about, rather than being a
tautology.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/n/tip-f10rx5vjjm6tfnt8o1wseb3v@git.kernel.org
2013-06-25 16:25:32 -07:00
Xiao Guangrong
885032b910 KVM: MMU: retain more available bits on mmio spte
Let mmio spte only use bit62 and bit63 on upper 32 bits, then bit 52 ~ bit 61
can be used for other purposes

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-20 23:33:20 +02:00
Gleb Natapov
8d76c49e9f KVM: VMX: fix halt emulation while emulating invalid guest sate
The invalid guest state emulation loop does not check halt_request
which causes 100% cpu loop while guest is in halt and in invalid
state, but more serious issue is that this leaves halt_request set, so
random instruction emulated by vm86 #GP exit can be interpreted
as halt which causes guest hang. Fix both problems by handling
halt_request in emulation loop.

Reported-by: Tomas Papan <tomas.papan@gmail.com>
Tested-by: Tomas Papan <tomas.papan@gmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-09 09:04:56 +03:00
Linus Torvalds
01227a889e Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Gleb Natapov:
 "Highlights of the updates are:

  general:
   - new emulated device API
   - legacy device assignment is now optional
   - irqfd interface is more generic and can be shared between arches

  x86:
   - VMCS shadow support and other nested VMX improvements
   - APIC virtualization and Posted Interrupt hardware support
   - Optimize mmio spte zapping

  ppc:
    - BookE: in-kernel MPIC emulation with irqfd support
    - Book3S: in-kernel XICS emulation (incomplete)
    - Book3S: HV: migration fixes
    - BookE: more debug support preparation
    - BookE: e6500 support

  ARM:
   - reworking of Hyp idmaps

  s390:
   - ioeventfd for virtio-ccw

  And many other bug fixes, cleanups and improvements"

* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  kvm: Add compat_ioctl for device control API
  KVM: x86: Account for failing enable_irq_window for NMI window request
  KVM: PPC: Book3S: Add API for in-kernel XICS emulation
  kvm/ppc/mpic: fix missing unlock in set_base_addr()
  kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
  kvm/ppc/mpic: remove users
  kvm/ppc/mpic: fix mmio region lists when multiple guests used
  kvm/ppc/mpic: remove default routes from documentation
  kvm: KVM_CAP_IOMMU only available with device assignment
  ARM: KVM: iterate over all CPUs for CPU compatibility check
  KVM: ARM: Fix spelling in error message
  ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
  KVM: ARM: Fix API documentation for ONE_REG encoding
  ARM: KVM: promote vfp_host pointer to generic host cpu context
  ARM: KVM: add architecture specific hook for capabilities
  ARM: KVM: perform HYP initilization for hotplugged CPUs
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: rework HYP page table freeing
  ARM: KVM: enforce maximum size for identity mapped code
  ARM: KVM: move to a KVM provided HYP idmap
  ...
2013-05-05 14:47:31 -07:00
Jan Kiszka
03b28f8133 KVM: x86: Account for failing enable_irq_window for NMI window request
With VMX, enable_irq_window can now return -EBUSY, in which case an
immediate exit shall be requested before entering the guest. Account for
this also in enable_nmi_window which uses enable_irq_window in absence
of vnmi support, e.g.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-05-02 22:17:38 -03:00
Jan Kiszka
5a2892ce72 KVM: nVMX: Skip PF interception check when queuing during nested run
While a nested run is pending, vmx_queue_exception is only called to
requeue exceptions that were previously picked up via
vmx_cancel_injection. Therefore, we must not check for PF interception
by L1, possibly causing a bogus nested vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 13:34:39 +03:00
Jan Kiszka
730dca42c1 KVM: x86: Rework request for immediate exit
The VMX implementation of enable_irq_window raised
KVM_REQ_IMMEDIATE_EXIT after we checked it in vcpu_enter_guest. This
caused infinite loops on vmentry. Fix it by letting enable_irq_window
signal the need for an immediate exit via its return value and drop
KVM_REQ_IMMEDIATE_EXIT.

This issue only affects nested VMX scenarios.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 12:44:18 +03:00
Jan Kiszka
cb0c8cda13 KVM: VMX: remove unprintable characters from comment
Slipped in while copy&pasting from the SDM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 08:55:59 +03:00
Jan Kiszka
d1fa0352a1 KVM: nVMX: VM_ENTRY/EXIT_LOAD_IA32_EFER overrides EFER.LMA settings
If we load the complete EFER MSR on entry or exit, EFER.LMA (and LME)
loading is skipped. Their consistency is already checked now before
starting the transition.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 12:53:52 +03:00
Jan Kiszka
384bb78327 KVM: nVMX: Validate EFER values for VM_ENTRY/EXIT_LOAD_IA32_EFER
As we may emulate the loading of EFER on VM-entry and VM-exit, implement
the checks that VMX performs on the guest and host values on vmlaunch/
vmresume. Factor out kvm_valid_efer for this purpose which checks for
set reserved bits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 12:53:42 +03:00
Jan Kiszka
ea8ceb8354 KVM: nVMX: Fix conditions for NMI injection
The logic for checking if interrupts can be injected has to be applied
also on NMIs. The difference is that if NMI interception is on these
events are consumed and blocked by the VM exit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 11:10:49 +03:00
Jan Kiszka
2505dc9fad KVM: VMX: Move vmx_nmi_allowed after vmx_set_nmi_mask
vmx_set_nmi_mask will soon be used by vmx_nmi_allowed. No functional
changes.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 11:10:49 +03:00
Abel Gordon
8a1b9dd000 KVM: nVMX: Enable and disable shadow vmcs functionality
Once L1 loads VMCS12 we enable shadow-vmcs capability and copy all the VMCS12
shadowed fields to the shadow vmcs.  When we release the VMCS12, we also
disable shadow-vmcs capability.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:55 +03:00
Abel Gordon
012f83cb2f KVM: nVMX: Synchronize VMCS12 content with the shadow vmcs
Synchronize between the VMCS12 software controlled structure and the
processor-specific shadow vmcs

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:45 +03:00
Abel Gordon
c3114420d1 KVM: nVMX: Copy VMCS12 to processor-specific shadow vmcs
Introduce a function used to copy fields from the software controlled VMCS12
to the processor-specific shadow vmcs

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:37 +03:00
Abel Gordon
16f5b9034b KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12
Introduce a function used to copy fields from the processor-specific shadow
vmcs to the software controlled VMCS12

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:24 +03:00
Abel Gordon
e7953d7fab KVM: nVMX: Release shadow vmcs
Unmap vmcs12 and release the corresponding shadow vmcs

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:17 +03:00
Abel Gordon
8de4883370 KVM: nVMX: Allocate shadow vmcs
Allocate a shadow vmcs used by the processor to shadow part of the fields
stored in the software defined VMCS12 (let L1 access fields without causing
exits). Note we keep a shadow vmcs only for the current vmcs12.  Once a vmcs12
becomes non-current, its shadow vmcs is released.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:10 +03:00
Abel Gordon
145c28dd19 KVM: nVMX: Fix VMXON emulation
handle_vmon doesn't check if L1 is already in root mode (VMXON
was previously called). This patch adds this missing check and calls
nested_vmx_failValid if VMX is already ON.
We need this check because L0 will allocate the shadow vmcs when L1
executes VMXON and we want to avoid host leaks (due to shadow vmcs
allocation) if L1 executes VMXON repeatedly.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:01 +03:00
Abel Gordon
20b97feaf6 KVM: nVMX: Refactor handle_vmwrite
Refactor existent code so we re-use vmcs12_write_any to copy fields from the
shadow vmcs specified by the link pointer (used by the processor,
implementation-specific) to the VMCS12 software format used by L0 to hold
the fields in L1 memory address space.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:51:44 +03:00
Abel Gordon
4607c2d7a2 KVM: nVMX: Introduce vmread and vmwrite bitmaps
Prepare vmread and vmwrite bitmaps according to a pre-specified list of fields.
These lists are intended to specifiy most frequent accessed fields so we can
minimize the number of fields that are copied from/to the software controlled
VMCS12 format to/from to processor-specific shadow vmcs. The lists were built
measuring the VMCS fields access rate after L2 Ubuntu 12.04 booted when it was
running on top of L1 KVM, also Ubuntu 12.04. Note that during boot there were
additional fields which were frequently modified but they were not added to
these lists because after boot these fields were not longer accessed by L1.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:51:34 +03:00
Abel Gordon
abc4fc58c5 KVM: nVMX: Detect shadow-vmcs capability
Add logic required to detect if shadow-vmcs is supported by the
processor. Introduce a new kernel module parameter to specify if L0 should use
shadow vmcs (or not) to run L1.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:51:21 +03:00
Zhang, Yang Z
6ffbbbbab3 KVM: x86: Fix posted interrupt with CONFIG_SMP=n
->send_IPI_mask is not defined on UP.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-17 23:11:54 -03:00
Gleb Natapov
f13882d84d KVM: VMX: Fix check guest state validity if a guest is in VM86 mode
If guest vcpu is in VM86 mode the vcpu state should be checked as if in
real mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 18:34:19 -03:00
Paolo Bonzini
26539bd0e4 KVM: nVMX: check vmcs12 for valid activity state
KVM does not use the activity state VMCS field, and does not support
it in nested VMX either (the corresponding bits in the misc VMX feature
MSR are zero).  Fail entry if the activity state is set to anything but
"active".

Since the value will always be the same for L1 and L2, we do not need
to read and write the corresponding VMCS field on L1/L2 transitions,
either.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 18:22:14 -03:00
Yang Zhang
5a71785dde KVM: VMX: Use posted interrupt to deliver virtual interrupt
If posted interrupt is avaliable, then uses it to inject virtual
interrupt to guest.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:41 -03:00
Yang Zhang
a20ed54d6e KVM: VMX: Add the deliver posted interrupt algorithm
Only deliver the posted interrupt when target vcpu is running
and there is no previous interrupt pending in pir.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
3d81bc7e96 KVM: Call common update function when ioapic entry changed.
Both TMR and EOI exit bitmap need to be updated when ioapic changed
or vcpu's id/ldr/dfr changed. So use common function instead eoi exit
bitmap specific function.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
01e439be77 KVM: VMX: Check the posted interrupt capability
Detect the posted interrupt feature. If it exists, then set it in vmcs_config.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
a547c6db4d KVM: VMX: Enable acknowledge interupt on vmexit
The "acknowledge interrupt on exit" feature controls processor behavior
for external interrupt acknowledgement. When this control is set, the
processor acknowledges the interrupt controller to acquire the
interrupt vector on VM exit.

After enabling this feature, an interrupt which arrived when target cpu is
running in vmx non-root mode will be handled by vmx handler instead of handler
in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt
table to let real handler to handle it. Further, we will recognize the interrupt
and only delivery the interrupt which not belong to current vcpu through idt table.
The interrupt which belonged to current vcpu will be handled inside vmx handler.
This will reduce the interrupt handle cost of KVM.

Also, interrupt enable logic is changed if this feature is turnning on:
Before this patch, hypervior call local_irq_enable() to enable it directly.
Now IF bit is set on interrupt stack frame, and will be enabled on a return from
interrupt handler if exterrupt interrupt exists. If no external interrupt, still
call local_irq_enable() to enable it.

Refer to Intel SDM volum 3, chapter 33.2.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:39 -03:00
Jan Kiszka
c0d1c770c0 KVM: nVMX: Avoid reading VM_EXIT_INTR_ERROR_CODE needlessly on nested exits
We only need to update vm_exit_intr_error_code if there is a valid exit
interruption information and it comes with a valid error code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:10 +03:00
Jan Kiszka
e8457c67a4 KVM: nVMX: Fix conditions for interrupt injection
If we are entering guest mode, we do not want L0 to interrupt this
vmentry with all its side effects on the vmcs. Therefore, injection
shall be disallowed during L1->L2 transitions, as in the previous
version. However, this check is conceptually independent of
nested_exit_on_intr, so decouple it.

If L1 traps external interrupts, we can kick the guest from L2 to L1,
also just like the previous code worked. But we no longer need to
consider L1's idt_vectoring_info_field. It will always be empty at this
point. Instead, if L2 has pending events, those are now found in the
architectural queues and will, thus, prevent vmx_interrupt_allowed from
being called at all.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:09 +03:00
Jan Kiszka
5f3d579997 KVM: nVMX: Rework event injection and recovery
The basic idea is to always transfer the pending event injection on
vmexit into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1, i.e. if we enter
prepare_vmcs12.

vmcs12_save_pending_events takes care to transfer pending L0 events into
the queue of L1. That is mandatory as L1 may decide to switch the guest
state completely, invalidating or preserving the pending events for
later injection (including on a different node, once we support
migration).

This concept is based on the rule that a pending vmlaunch/vmresume is
not canceled. Otherwise, we would risk to lose injected events or leak
them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
entry of nested_vmx_vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:07 +03:00
Jan Kiszka
3b656cf764 KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1
Check if the interrupt or NMI window exit is for L1 by testing if it has
the corresponding controls enabled. This is required when we allow
direct injection from L0 to L2

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:05 +03:00
Gleb Natapov
991eebf9f8 KVM: VMX: do not try to reexecute failed instruction while emulating invalid guest state
During invalid guest state emulation vcpu cannot enter guest mode to try
to reexecute instruction that emulator failed to emulate, so emulation
will happen again and again.  Prevent that by telling the emulator that
instruction reexecution should not be attempted.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Konrad Rzeszutek Wilk
357d122670 x86, xen, gdt: Remove the pvops variant of store_gdt.
The two use-cases where we needed to store the GDT were during ACPI S3 suspend
and resume. As the patches:
 x86/gdt/i386: store/load GDT for ACPI S3 or hibernation/resume path is not needed
 x86/gdt/64-bit: store/load GDT for ACPI S3 or hibernate/resume path is not needed.

have demonstrated - there are other mechanism by which the GDT is
saved and reloaded during early resume path.

Hence we do not need to worry about the pvops call-chain for saving the
GDT and can and can eliminate it. The other areas where the store_gdt is
used are never going to be hit when running under the pvops platforms.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1365194544-14648-4-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-11 15:40:38 -07:00
Jan Kiszka
a63cb56061 KVM: VMX: Add missing braces to avoid redundant error check
The code was already properly aligned, now also add the braces to avoid
that err is checked even if alloc_apic_access_page didn't run and change
it. Found via Coccinelle by Fengguang Wu.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 12:46:06 +03:00
Yang Zhang
458f212e36 KVM: x86: fix memory leak in vmx_init
Free vmx_msr_bitmap_longmode_x2apic and vmx_msr_bitmap_longmode if
kvm_init() fails.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 10:56:08 +03:00
Jan Kiszka
b8c07d55d0 KVM: nVMX: Check exit control for VM_EXIT_SAVE_IA32_PAT, not entry controls
Obviously a copy&paste mistake: prepare_vmcs12 has to check L1's exit
controls for VM_EXIT_SAVE_IA32_PAT.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-07 14:06:42 +03:00
Paolo Bonzini
04b66839d3 KVM: x86: correctly initialize the CS base on reset
The CS base was initialized to 0 on VMX (wrong, but usually overridden
by userspace before starting) or 0xf0000 on SVM.  The correct value is
0xffff0000, and VMX is able to emulate it now, so use it.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-20 17:34:55 -03:00
Jan Kiszka
4918c6ca68 KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU
Very old user space (namely qemu-kvm before kvm-49) didn't set the TSS
base before running the VCPU. We always warned about this bug, but no
reports about users actually seeing this are known. Time to finally
remove the workaround that effectively prevented to call vmx_vcpu_reset
while already holding the KVM srcu lock.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-18 13:48:15 -03:00
Jan Kiszka
0238ea913c KVM: nVMX: Add preemption timer support
Provided the host has this feature, it's straightforward to offer it to
the guest as well. We just need to load to timer value on L2 entry if
the feature was enabled by L1 and watch out for the corresponding exit
reason.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:01:21 +02:00
Jan Kiszka
c18911a23c KVM: nVMX: Provide EFER.LMA saving support
We will need EFER.LMA saving to provide unrestricted guest mode. All
what is missing for this is picking up EFER.LMA from VM_ENTRY_CONTROLS
on L2->L1 switches. If the host does not support EFER.LMA saving,
no change is performed, otherwise we properly emulate for L1 what the
hardware does for L0. Advertise the support, depending on the host
feature.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:00:55 +02:00
Jan Kiszka
eabeaaccfc KVM: nVMX: Clean up and fix pin-based execution controls
Only interrupt and NMI exiting are mandatory for KVM to work, thus can
be exposed to the guest unconditionally, virtual NMI exiting is
optional. So we must not advertise it unless the host supports it.

Introduce the symbolic constant PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR at
this chance.

Reviewed-by:: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-13 16:14:40 +02:00
Jan Kiszka
66450a21f9 KVM: x86: Rework INIT and SIPI handling
A VCPU sending INIT or SIPI to some other VCPU races for setting the
remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
was overwritten by kvm_emulate_halt and, thus, got lost.

This introduces APIC events for those two signals, keeping them in
kvm_apic until kvm_apic_accept_events is run over the target vcpu
context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there
are pending events, thus if vcpu blocking should end.

The patch comes with the side effect of effectively obsoleting
KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but
immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI.
The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED
state. That also means we no longer exit to user space after receiving a
SIPI event.

Furthermore, we already reset the VCPU on INIT, only fixing up the code
segment later on when SIPI arrives. Moreover, we fix INIT handling for
the BSP: it never enter wait-for-SIPI but directly starts over on INIT.

Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-13 16:08:10 +02:00
Jan Kiszka
57f252f229 KVM: x86: Drop unused return code from VCPU reset callback
Neither vmx nor svm nor the common part may generate an error on
kvm_vcpu_reset. So drop the return code.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-12 13:25:56 +02:00
Ioan Orghici
0fa24ce3f5 kvm: remove cast for kmalloc return value
Signed-off-by: Ioan Orghici<ioan.orghici@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-11 12:03:54 +02:00
Jan Kiszka
1a0d74e664 KVM: nVMX: Fix setting of CR0 and CR4 in guest mode
The logic for calculating the value with which we call kvm_set_cr0/4 was
broken (will definitely be visible with nested unrestricted guest mode
support). Also, we performed the check regarding CR0_ALWAYSON too early
when in guest mode.

What really needs to be done on both CR0 and CR4 is to mask out L1-owned
bits and merge them in from L1's guest_cr0/4. In contrast, arch.cr0/4
and arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and,
thus, are not suited as input.

For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
refuse the update if it fails. To be fully consistent, we implement this
check now also for CR4. For CR4, we move the check into vmx_set_cr4
while we keep it in handle_set_cr0. This is because the CR0 checks for
vmxon vs. guest mode will diverge soon when adding unrestricted guest
mode support.

Finally, we have to set the shadow to the value L2 wanted to write
originally.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 15:48:47 -03:00
Jan Kiszka
33fb20c39e KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS
Properly set those bits to 1 that the spec demands in case bit 55 of
VMX_BASIC is 0 - like in our case.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 15:47:11 -03:00
Jan Kiszka
c4627c72e9 KVM: nVMX: Reset RFLAGS on VM-exit
Ouch, how could this work so well that far? We need to clear RFLAGS to
the reset value as specified by the SDM. Particularly, IF must be off
after VM-exit!

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-05 20:49:17 -03:00
Jan Kiszka
503cd0c50a KVM: nVMX: Fix switching of debug state
First of all, do not blindly overwrite GUEST_DR7 on L2 entry. The host
may have guest debugging enabled. Then properly reset DR7 and DEBUG_CTL
on L2->L1 switch as specified in the SDM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 21:37:28 -03:00
Takuya Yoshikawa
47ae31e257 KVM: set_memory_region: Drop user_alloc from set_memory_region()
Except ia64's stale code, KVM_SET_MEMORY_REGION support, this is only
used for sanity checks in __kvm_set_memory_region() which can easily
be changed to use slot id instead.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Jan Kiszka
3ab66e8a45 KVM: VMX: Pass vcpu to __vmx_complete_interrupts
Cleanup: __vmx_complete_interrupts has no use for the vmx structure.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 10:29:03 +02:00
Jan Kiszka
44ceb9d665 KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12
IDT_VECTORING_INFO_FIELD was already read right after vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 10:20:06 +02:00
Jan Kiszka
957c897e8c KVM: nVMX: Use cached exit reason
No need to re-read what vmx_vcpu_run already picked up for us.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:46:07 +02:00
Jan Kiszka
36c3cc422b KVM: nVMX: Clear segment cache after switching between L1 and L2
Switching the VMCS obviously invalidates what may have been cached about
the guest segments.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:41:09 +02:00
Jan Kiszka
d6851fbeee KVM: nVMX: Advertise PAUSE and WBINVD exiting support
These exits have no preconditions, and we already process the
corresponding reasons in nested_vmx_exit_handled correctly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:33:51 +02:00
Jan Kiszka
733568f9ce KVM: VMX: Make prepare_vmcs12 and load_vmcs12_host_state static
Both are only used locally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:31:15 +02:00
Jan Kiszka
bd31a7f557 KVM: nVMX: Trap unconditionally if msr bitmap access fails
This avoids basing decisions on uninitialized variables, potentially
leaking kernel data to the L1 guest.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-22 00:50:45 -03:00
Jan Kiszka
908a7bdd6a KVM: nVMX: Improve I/O exit handling
This prevents trapping L2 I/O exits if L1 has neither unconditional nor
bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
handling.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-22 00:50:42 -03:00
Jan Kiszka
cbd29cb6e3 KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
We already pass vmcs12 as argument.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-14 10:35:16 +02:00
Yang Zhang
257090f702 KVM: VMX: disable apicv by default
Without Posted Interrupt, current code is broken. Just disable by
default until Posted Interrupt is ready.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-11 10:51:13 +02:00
Gleb Natapov
5037878e22 KVM: VMX: cleanup vmx_set_cr0().
When calculating hw_cr0 teh current code masks bits that should be always
on and re-adds them back immediately after. Cleanup the code by masking
only those bits that should be dropped from hw_cr0. This allow us to
get rid of some defines.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:00:02 -02:00
Dongxiao Xu
c08800a56c KVM: VMX: disable SMEP feature when guest is in non-paging mode
SMEP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMEP needs to be manually
disabled when guest switches to non-paging mode.

We met an issue that, SMP Linux guest with recent kernel (enable
SMEP support, for example, 3.5.3) would crash with triple fault if
setting unrestricted_guest=0. This is because KVM uses an identity
mapping page table to emulate the non-paging mode, where the page
table is set with USER flag. If SMEP is still enabled in this case,
guest will meet unhandlable page fault and then crash.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-05 23:28:07 -02:00
Yang Zhang
c7c9c56ca2 x86, apicv: add virtual interrupt delivery support
Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:19 +02:00
Yang Zhang
8d14695f95 x86, apicv: add virtual x2apic support
basically to benefit from apicv, we need to enable virtualized x2apic mode.
Currently, we only enable it when guest is really using x2apic.

Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
0x800 - 0x8ff: no read intercept for apicv register virtualization,
               except APIC ID and TMCCT which need software's assistance to
               get right value.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:06 +02:00
Yang Zhang
83d4c28693 x86, apicv: add APICv register virtualization support
- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:47:54 +02:00
Gleb Natapov
141687869f KVM: VMX: set vmx->emulation_required only when needed.
If emulate_invalid_guest_state=false vmx->emulation_required is never
actually used, but it ends up to be always set to true since
handle_invalid_guest_state(), the only place it is reset back to
false, is never called. This, besides been not very clean, makes vmexit
and vmentry path to check emulate_invalid_guest_state needlessly.

The patch fixes that by keeping emulation_required coherent with
emulate_invalid_guest_state setting.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:31 -02:00
Gleb Natapov
91b0aa2ca6 KVM: VMX: rename fix_pmode_dataseg to fix_pmode_seg.
The function deals with code segment too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:30 -02:00
Gleb Natapov
25391454e7 KVM: VMX: don't clobber segment AR of unusable segments.
Usability is returned in unusable field, so not need to clobber entire
AR. Callers have to know how to deal with unusable segments already
since if emulate_invalid_guest_state=true AR is not zeroed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:28 -02:00
Gleb Natapov
218e763f45 KVM: VMX: skip vmx->rmode.vm86_active check on cr0 write if unrestricted guest is enabled
vmx->rmode.vm86_active is never true is unrestricted guest is enabled.
Make it more explicit that neither enter_pmode() nor enter_rmode() is
called in this case.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:28 -02:00
Gleb Natapov
286da4156d KVM: VMX: remove hack that disables emulation on vcpu reset/init
There is no reason for it. If state is suitable for vmentry it
will be detected during guest entry and no emulation will happen.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:27 -02:00
Gleb Natapov
c5e97c80b5 KVM: VMX: if unrestricted guest is enabled vcpu state is always valid.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:26 -02:00
Gleb Natapov
2f143240cb KVM: VMX: reset CPL only on CS register write.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:26 -02:00
Gleb Natapov
1f3141e80b KVM: VMX: remove special CPL cache access during transition to real mode.
Since vmx_get_cpl() always returns 0 when VCPU is in real mode it is no
longer needed. Also reset CPL cache to zero during transaction to
protected mode since transaction may happen while CS.selectors & 3 != 0,
but in reality CPL is 0.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:25 -02:00
Marcelo Tosatti
b09408d00f KVM: VMX: fix incorrect cached cpl value with real/v8086 modes
CPL is always 0 when in real mode, and always 3 when virtual 8086 mode.

Using values other than those can cause failures on operations that
check CPL.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-08 17:25:35 -02:00
Gleb Natapov
0ca1b4f4ba KVM: VMX: handle IO when emulation is due to #GP in real mode.
With emulate_invalid_guest_state=0 if a vcpu is in real mode VMX can
enter the vcpu with smaller segment limit than guest configured.  If the
guest tries to access pass this limit it will get #GP at which point
instruction will be emulated with correct segment limit applied. If
during the emulation IO is detected it is not handled correctly. Vcpu
thread should exit to userspace to serve the IO, but it returns to the
guest instead.  Since emulation is not completed till userspace completes
the IO the faulty instruction is re-executed ad infinitum.

The patch fixes that by exiting to userspace if IO happens during
instruction emulation.

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:31 -02:00
Gleb Natapov
d54d07b2ca KVM: VMX: Do not fix segment register during vcpu initialization.
Segment registers will be fixed according to current emulation policy
during switching to real mode for the first time.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:30 -02:00
Gleb Natapov
d99e415275 KVM: VMX: fix emulation of invalid guest state.
Currently when emulation of invalid guest state is enable
(emulate_invalid_guest_state=1) segment registers are still fixed for
entry to vm86 mode some times. Segment register fixing is avoided in
enter_rmode(), but vmx_set_segment() still does it unconditionally.
The patch fixes it.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:29 -02:00
Gleb Natapov
89efbed02c KVM: VMX: make rmode_segment_valid() more strict.
Currently it allows entering vm86 mode if segment limit is greater than
0xffff and db bit is set. Both of those can cause incorrect execution of
instruction by cpu since in vm86 mode limit will be set to 0xffff and db
will be forced to 0.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:28 -02:00
Gleb Natapov
f924d66d27 KVM: VMX: remove unneeded temporary variable from vmx_set_segment()
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:02:00 +02:00
Gleb Natapov
1ecd50a947 KVM: VMX: clean-up vmx_set_segment()
Move all vm86_active logic into one place.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:49 +02:00
Gleb Natapov
39dcfb95de KVM: VMX: remove redundant code from vmx_set_segment()
Segment descriptor's base is fixed by call to fix_rmode_seg(). Not need
to do it twice.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:37 +02:00
Gleb Natapov
beb853ffec KVM: VMX: use fix_rmode_seg() to fix all code/data segments
The code for SS and CS does the same thing fix_rmode_seg() is doing.
Use it instead of hand crafted code.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:18 +02:00
Gleb Natapov
c6ad115348 KVM: VMX: return correct segment limit and flags for CS/SS registers in real mode
VMX without unrestricted mode cannot virtualize real mode, so if
emulate_invalid_guest_state=0 kvm uses vm86 mode to approximate
it. Sometimes, when guest moves from protected mode to real mode, it
leaves segment descriptors in a state not suitable for use by vm86 mode
virtualization, so we keep shadow copy of segment descriptors for internal
use and load fake register to VMCS for guest entry to succeed. Till
now we kept shadow for all segments except SS and CS (for SS and CS we
returned parameters directly from VMCS), but since commit a5625189f6
emulator enforces segment limits in real mode. This causes #GP during move
from protected mode to real mode when emulator fetches first instruction
after moving to real mode since it uses incorrect CS base and limit to
linearize the %rip. Fix by keeping shadow for SS and CS too.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:03 +02:00
Gleb Natapov
0647f4aa8c KVM: VMX: relax check for CS register in rmode_segment_valid()
rmode_segment_valid() checks if segment descriptor can be used to enter
vm86 mode. VMX spec mandates that in vm86 mode CS register will be of
type data, not code. Lets allow guest entry with vm86 mode if the only
problem with CS register is incorrect type. Otherwise entire real mode
will be emulated.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:00:47 +02:00
Gleb Natapov
07f42f5f25 KVM: VMX: cleanup rmode_segment_valid()
Set segment fields explicitly instead of using  binary operations.

No behaviour changes.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:00:36 +02:00
Alex Williamson
f82a8cfe93 KVM: struct kvm_memory_slot.user_alloc -> bool
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:24:38 -02:00
Linus Torvalds
66cdd0ceaf Merge tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
 "Considerable KVM/PPC work, x86 kvmclock vsyscall support,
  IA32_TSC_ADJUST MSR emulation, amongst others."

Fix up trivial conflict in kernel/sched/core.c due to cross-cpu
migration notifier added next to rq migration call-back.

* tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits)
  KVM: emulator: fix real mode segment checks in address linearization
  VMX: remove unneeded enable_unrestricted_guest check
  KVM: VMX: fix DPL during entry to protected mode
  x86/kexec: crash_vmclear_local_vmcss needs __rcu
  kvm: Fix irqfd resampler list walk
  KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
  x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
  KVM: MMU: optimize for set_spte
  KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
  KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
  KVM: PPC: bookehv: Add guest computation mode for irq delivery
  KVM: PPC: Make EPCR a valid field for booke64 and bookehv
  KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
  KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
  KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
  KVM: PPC: e500: Add emulation helper for getting instruction ea
  KVM: PPC: bookehv64: Add support for interrupt handling
  KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
  KVM: PPC: booke: Fix get_tb() compile error on 64-bit
  KVM: PPC: e500: Silence bogus GCC warning in tlb code
  ...
2012-12-13 15:31:08 -08:00
Gleb Natapov
0b26b588d9 VMX: remove unneeded enable_unrestricted_guest check
If enable_unrestricted_guest is true vmx->rmode.vm86_active will
always be false.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:28 -02:00
Gleb Natapov
a4d3326c2d KVM: VMX: fix DPL during entry to protected mode
On CPUs without support for unrestricted guests DPL cannot be smaller
than RPL for data segments during guest entry, but this state can occurs
if a data segment selector changes while vcpu is in real mode to a value
with lowest two bits != 00. Fix that by forcing DPL == RPL on transition
to protected mode.

This is a regression introduced by c865c43de6.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:27 -02:00
Zhang Yanfei
8f536b7697 KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
The vmclear function will be assigned to the callback function pointer
when loading kvm-intel module. And the bitmap indicates whether we
should do VMCLEAR operation in kdump. The bits in the bitmap are
set/unset according to different conditions.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-06 18:26:57 +02:00
Julian Stecklina
66f7b72e11 KVM: x86: Make register state after reset conform to specification
VMX behaves now as SVM wrt to FPU initialization. Code has been moved to
generic code path. General-purpose registers are now cleared on reset and
INIT.  SVM code properly initializes EDX.

Signed-off-by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 18:00:07 +02:00
Zhang Xiantao
2b3c5cbc0d kvm: don't use bit24 for detecting address-specific invalidation capability
Bit24 in VMX_EPT_VPID_CAP_MASI is  not used for address-specific invalidation capability
reporting, so remove it from KVM to avoid conflicts in future.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 16:35:48 +02:00
Zhang Xiantao
0307b7b8c2 kvm: remove unnecessary bit checking for ept violation
Bit 6 in EPT vmexit's exit qualification is not defined in SDM, so remove it.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 16:35:21 +02:00
Will Auld
ba904635d4 KVM: x86: Emulate IA32_TSC_ADJUST MSR
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution control
is enabled as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST +
vmsc tsc_offset for a guest process when it does and rdtsc (with the correct
settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one
of these three locations. The argument against storing it in the actual MSR
is performance. This is likely to be seldom used while the save/restore is
required on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself so that is not an option either.

The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:

The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hyperviser + vmsc tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed the semantics
as seen by the guest do not and hence this will not cause a problem.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:29:30 -02:00
Will Auld
8fe8ab46be KVM: x86: Add code to track call origin for msr assignment
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:26:12 -02:00
Xiao Guangrong
5a560f8b5e KVM: VMX: fix memory order between loading vmcs and clearing vmcs
vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs
does not exist on any vcpu

If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu
list. The list can be corrupted if the cpu prefetch the vmcs's list before
reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before
making vmcs->vcpu == -1 be visible

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-29 21:14:46 -02:00
Xiao Guangrong
e6c7d32172 KVM: VMX: fix invalid cpu passed to smp_call_function_single
In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to
smp_call_function_single, if the target cpu is downing (doing cpu hot remove),
loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single

It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called
in the preemptionable context

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-28 22:04:58 -02:00
Marcelo Tosatti
42897d866b KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:14 -02:00
Marcelo Tosatti
886b470cb1 KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Takashi Iwai
29282fde80 KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()
The commit [ad756a16: KVM: VMX: Implement PCID/INVPCID for guests with
EPT] introduced the unconditional access to SECONDARY_VM_EXEC_CONTROL,
and this triggers kernel warnings like below on old CPUs:

    vmwrite error: reg 401e value a0568000 (err 12)
    Pid: 13649, comm: qemu-kvm Not tainted 3.7.0-rc4-test2+ #154
    Call Trace:
     [<ffffffffa0558d86>] vmwrite_error+0x27/0x29 [kvm_intel]
     [<ffffffffa054e8cb>] vmcs_writel+0x1b/0x20 [kvm_intel]
     [<ffffffffa054f114>] vmx_cpuid_update+0x74/0x170 [kvm_intel]
     [<ffffffffa03629b6>] kvm_vcpu_ioctl_set_cpuid2+0x76/0x90 [kvm]
     [<ffffffffa0341c67>] kvm_arch_vcpu_ioctl+0xc37/0xed0 [kvm]
     [<ffffffff81143f7c>] ? __vunmap+0x9c/0x110
     [<ffffffffa0551489>] ? vmx_vcpu_load+0x39/0x1a0 [kvm_intel]
     [<ffffffffa0340ee2>] ? kvm_arch_vcpu_load+0x52/0x1a0 [kvm]
     [<ffffffffa032dcd4>] ? vcpu_load+0x74/0xd0 [kvm]
     [<ffffffffa032deb0>] kvm_vcpu_ioctl+0x110/0x5e0 [kvm]
     [<ffffffffa032e93d>] ? kvm_dev_ioctl+0x4d/0x4a0 [kvm]
     [<ffffffff8117dc6f>] do_vfs_ioctl+0x8f/0x530
     [<ffffffff81139d76>] ? remove_vma+0x56/0x60
     [<ffffffff8113b708>] ? do_munmap+0x328/0x400
     [<ffffffff81187c8c>] ? fget_light+0x4c/0x100
     [<ffffffff8117e1a1>] sys_ioctl+0x91/0xb0
     [<ffffffff815a942d>] system_call_fastpath+0x1a/0x1f

This patch adds a check for the availability of secondary exec
control to avoid these warnings.

Cc: <stable@vger.kernel.org> [v3.6+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-16 20:25:18 -02:00
Xiao Guangrong
bf4ca23ef5 KVM: VMX: report internal error for MMIO #PF due to delivery event
The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we
can not fix it if it is caused by delivery event. Reporting internal error
for this case

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 16:30:32 +02:00
Xiao Guangrong
b9bf6882c1 KVM: VMX: report internal error for the unhandleable event
VM exits during Event Delivery is really unexpected if it is not caused
by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal
and freeze the guest, the VMM has the chance to check the guest

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 16:30:29 +02:00
Linus Torvalds
ecefbd94b8 KVM updates for the 3.7 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
 WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
 R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
 StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
 VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
 VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
 TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
 rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
 B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
 lb/nbAsXf9UJLgGir4I1
 =kN6V
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights of the changes for this release include support for vfio
  level triggered interrupts, improved big real mode support on older
  Intels, a streamlines guest page table walker, guest APIC speedups,
  PIO optimizations, better overcommit handling, and read-only memory."

* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
  KVM: s390: Fix vcpu_load handling in interrupt code
  KVM: x86: Fix guest debug across vcpu INIT reset
  KVM: Add resampling irqfds for level triggered interrupts
  KVM: optimize apic interrupt delivery
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly
  KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
  ...

Conflicts:
	arch/s390/include/asm/processor.h
	arch/x86/kvm/i8259.c
2012-10-04 09:30:33 -07:00
Linus Torvalds
ac07f5c3cb Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu update from Ingo Molnar:
 "The biggest change is the addition of the non-lazy (eager) FPU saving
  support model and enabling it on CPUs with optimized xsaveopt/xrstor
  FPU state saving instructions.

  There are also various Sparse fixes"

Fix up trivial add-add conflict in arch/x86/kernel/traps.c

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
  x86, fpu: remove cpu_has_xmm check in the fx_finit()
  x86, fpu: make eagerfpu= boot param tri-state
  x86, fpu: enable eagerfpu by default for xsaveopt
  x86, fpu: decouple non-lazy/eager fpu restore from xsave
  x86, fpu: use non-lazy fpu restore for processors supporting xsave
  lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
  x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
  x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
  x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
  x86, fpu: drop_fpu() before restoring new state from sigframe
  x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
  x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
  x86, signal: Cleanup ifdefs and is_ia32, is_x32
2012-10-01 11:10:52 -07:00
Jan Kiszka
c863901075 KVM: x86: Fix guest debug across vcpu INIT reset
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.

Found while trying to stop on start_secondary.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 15:00:07 +02:00
Suresh Siddha
b1a74bf821 x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.

kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().

So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.

Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-09-21 16:59:04 -07:00
Peter Senna Tschudin
4b8073e467 arch/x86: Remove unecessary semicolons
Found by http://coccinelle.lip6.fr/

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Cc: avi@redhat.com
Cc: mtosatti@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: rusty@rustcorp.com.au
Cc: masami.hiramatsu.pt@hitachi.com
Cc: suresh.b.siddha@intel.com
Cc: joerg.roedel@amd.com
Cc: agordeev@redhat.com
Cc: yinghai@kernel.org
Cc: bhelgaas@google.com
Cc: liuj97@gmail.com
Link: http://lkml.kernel.org/r/1347986174-30287-7-git-send-email-peter.senna@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-19 17:32:48 +02:00
Avi Kivity
b188c81f2e KVM: VMX: Make use of asm.h
Use macros for bitness-insensitive register names, instead of
rolling our own.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:04 -03:00
Avi Kivity
83287ea420 KVM: VMX: Make lto-friendly
LTO (link-time optimization) doesn't like local labels to be referred to
from a different function, since the two functions may be built in separate
compilation units.  Use an external variable instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:03 -03:00
Xiao Guangrong
4484141a94 KVM: fix error paths for failed gfn_to_page() calls
This bug was triggered:
[ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34
......
[ 4220.237326] Call Trace:
[ 4220.237361]  [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
[ 4220.237382]  [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm]
[ 4220.237401]  [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm]
[ 4220.237407]  [<ffffffff81145425>] __fput+0x111/0x1ed
[ 4220.237411]  [<ffffffff8114550f>] ____fput+0xe/0x10
[ 4220.237418]  [<ffffffff81063511>] task_work_run+0x5d/0x88
[ 4220.237424]  [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca

The test case:

	printf(fmt, ##args);		\
	exit(-1);} while (0)

static int create_vm(void)
{
	int sys_fd, vm_fd;

	sys_fd = open("/dev/kvm", O_RDWR);
	if (sys_fd < 0)
		die("open /dev/kvm fail.\n");

	vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		die("KVM_CREATE_VM fail.\n");

	return vm_fd;
}

static int create_vcpu(int vm_fd)
{
	int vcpu_fd;

	vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	if (vcpu_fd < 0)
		die("KVM_CREATE_VCPU ioctl.\n");
	printf("Create vcpu.\n");
	return vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
	int vm_fd = (int)(long)arg;

	create_vcpu(vm_fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thread;
	int vm_fd;

	(void)argc;
	(void)argv;

	vm_fd = create_vm();
	pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
	printf("Exit.\n");
	return 0;
}

It caused by release kvm->arch.ept_identity_map_addr which is the
error page.

The parent thread can send KILL signal to the vcpu thread when it was
exiting which stops faulting pages and potentially allocating memory.
So gfn_to_pfn/gfn_to_page may fail at this time

Fixed by checking the page before it is used

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:34:11 +03:00
Ren, Yongjie
4f97704555 KVM: x86: Check INVPCID feature bit in EBX of leaf 7
Checks and operations on the INVPCID feature bit should use EBX
of CPUID leaf 7 instead of ECX.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Yongjie Ren <yongjien.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-09 17:34:01 +03:00
Mathias Krause
772e031899 KVM: VMX: constify lookup tables
We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and
kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:42:09 +03:00
Avi Kivity
a81aba14dc KVM: VMX: Ignore segment G and D bits when considering whether we can virtualize
We will enter the guest with G and D cleared; as real hardware ignores D in
real mode, and G is taken care of by the limit test, we allow more code to
run in vm86 mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
ce56680347 KVM: VMX: Save all segment data in real mode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
1390a28b27 KVM: VMX: Preserve segment limit and access rights in real mode
While this is undocumented, real processors do not reload the segment
limit and access rights when loading a segment register in real mode.
Real programs rely on it so we need to comply with this behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
7263642028 KVM: VMX: Return real real-mode segment data even if emulate_invalid_guest_state=1
emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the
vmcs; we do.  So we need to return the real ones (maintained by vmx_set_segment).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
e2a610d7fc KVM: VMX: Allow vm86 virtualization of big real mode
Usually, big real mode uses large (4GB) segments.  Currently we don't
virtualize this; if any segment has a limit other than 0xffff, we emulate.
But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize
real mode; if an access overruns the segment limit, the guest will #GP, which
we will trap and forward to the emulator.  This results in significantly
faster execution, and less risk of hitting an unemulated instruction.

If the limit is less than 0xffff, we retain the existing behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity
495e116684 KVM: VMX: Allow real mode emulation using vm86 with dpl=0
Real mode is always entered from protected mode with dpl=0.  Since
the dpl doesn't affect execution, and we already override it to 3
in the vmcs (as vmx requires), we can allow execution in that state.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity
c865c43de6 KVM: VMX: Retain limit and attributes when entering protected mode
Real processors don't change segment limits and attributes while in
real mode.  Mimic that behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity
f5f7b2fe3b KVM: VMX: Use kvm_segment to save protected-mode segments when entering realmode
Instead of using struct kvm_save_segment, use struct kvm_segment, which is what
the other APIs use.  This leads to some simplification.

We replace save_rmode_seg() with a call to vmx_save_segment().  Since this depends
on rmode.vm86_active, we move the call to before setting the flag.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity
72fbefec26 KVM: VMX: Fix incorrect lookup of segment S flag in fix_pmode_dataseg()
fix_pmode_dataseg() looks up S in ->base instead of ->ar_bytes.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity
baa7e81e32 KVM: VMX: Separate saving pre-realmode state from setting segments
Commit b246dd5df1 ("KVM: VMX: Fix KVM_SET_SREGS with big real mode
segments") moved fix_rmode_seg() to vmx_set_segment(), so that it is
applied not just on transitions to real mode, but also on KVM_SET_SREGS
(migration).  However fix_rmode_seg() not only munges the vmcs segments,
it also sets up the save area for us to restore when returning to
protected mode or to return in vmx_get_segment().

Move saving the segment into a new function, save_rmode_seg(), and
call it just during the transition.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity
dbcb4e7980 KVM: VMX: Advertize RDTSC exiting to nested guests
All processors that support VMX have that feature, and guests (Xen) depend on
it.  As we already implement it, advertize it to the guest.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 19:08:28 -03:00
Gleb Natapov
2a7921b7a0 KVM: VMX: restore MSR_IA32_DEBUGCTLMSR after VMEXIT
MSR_IA32_DEBUGCTLMSR is zeroed on VMEXIT. Restore it to the correct
value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 19:07:58 -03:00
Xiao Guangrong
32cad84f44 KVM: do not release the error page
After commit a2766325cf, the error page is replaced by the
error code, it need not be released anymore

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:58 +03:00
Avi Kivity
fe56097b23 Merge remote-tracking branch 'upstream' into next
- bring back critical fixes (esp. aa67f6096c)
 - provide an updated base for development

* upstream: (4334 commits)
  missed mnt_drop_write() in do_dentry_open()
  UBIFS: nuke pdflush from comments
  gfs2: nuke pdflush from comments
  drbd: nuke pdflush from comments
  nilfs2: nuke write_super from comments
  hfs: nuke write_super from comments
  vfs: nuke pdflush from comments
  jbd/jbd2: nuke write_super from comments
  btrfs: nuke pdflush from comments
  btrfs: nuke write_super from comments
  ext4: nuke pdflush from comments
  ext4: nuke write_super from comments
  ext3: nuke write_super from comments
  Documentation: fix the VM knobs descritpion WRT pdflush
  Documentation: get rid of write_super
  vfs: kill write_super and sync_supers
  ACPI processor: Fix tick_broadcast_mask online/offline regression
  ACPI: Only count valid srat memory structures
  ACPI: Untangle a return statement for better readability
  Linux 3.6-rc1
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-05 13:25:10 +03:00
Avi Kivity
aa67f6096c KVM: VMX: Fix ds/es corruption on i386 with preemption
Commit b2da15ac26 ("KVM: VMX: Optimize %ds, %es reload") broke i386
in the following scenario:

  vcpu_load
  ...
  vmx_save_host_state
  vmx_vcpu_run
  (ds.rpl, es.rpl cleared by hardware)

  interrupt
    push ds, es  # pushes bad ds, es
    schedule
      vmx_vcpu_put
        vmx_load_host_state
          reload ds, es (with __USER_DS)
    pop ds, es  # of other thread's stack
    iret
  # other thread runs
  interrupt
    push ds, es
    schedule  # back in vcpu thread
    pop ds, es  # now with rpl=0
    iret
  ...
  vcpu_put
  resume_userspace
  iret  # clears ds, es due to mismatched rpl

(instead of resume_userspace, we might return with SYSEXIT and then
take an exception; when the exception IRETs we end up with cleared
ds, es)

Fix by avoiding the optimization on i386 and reloading ds, es on the
lightweight exit path.

Reported-by: Chris Clayron <chris2553@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 20:23:57 -03:00
Guo Chao
0fa0607147 KVM: VMX: Fix typos
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 15:19:04 -03:00
Mao, Junjie
ad756a1603 KVM: VMX: Implement PCID/INVPCID for guests with EPT
This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-12 13:07:34 +03:00
Xiao Guangrong
4f5982a56a KVM: VMX: export PFEC.P bit on ept
Export the present bit of page fault error code, the later patch
will use it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:17 +03:00
Avi Kivity
a27685c33a KVM: VMX: Emulate invalid guest state by default
Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:05 +03:00
Avi Kivity
de5f70e0c6 KVM: VMX: Improve error reporting during invalid guest state emulation
If instruction emulation fails, report it properly to userspace.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
de87dcddc7 KVM: VMX: Stop invalid guest state emulation on pending event
Process the event, possibly injecting an interrupt, before continuing.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
7c068e4558 KVM: VMX: Continue emulating after batch exhausted
If we return early from an invalid guest state emulation loop, make
sure we return to it later if the guest state is still invalid.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
bdea48e305 KVM: VMX: Fix interrupt exit condition during emulation
Checking EFLAGS.IF is incorrect as we might be in interrupt shadow.  If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.

Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:02 +03:00
Avi Kivity
b8405c184b KVM: VMX: Limit iterations with emulator_invalid_guest_state
Otherwise, if the guest ends up looping, we never exit the srcu critical
section, which causes synchronize_srcu() to hang.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
f0495f9b99 KVM: VMX: Relax check on unusable segment
Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1.  Relax the check by testing for
segment not present (a non-present segment cannot be usable).

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
d881e6f6cf KVM: VMX: Return correct CPL during transition to protected mode
In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump.  But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).

Fix by reading CPL from the cache during the transition.  This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Guo Chao
2106a54812 KVM: VMX: code clean for vmx_init()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-03 14:55:30 -03:00
Christoffer Dall
a737f256bf KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.

Functions introduced or modified are:
 - kvm_err(fmt, ...)
 - kvm_info(fmt, ...)
 - kvm_debug(fmt, ...)
 - kvm_pr_unimpl(fmt, ...)
 - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-06 15:24:00 +03:00
Orit Wasserman
b246dd5df1 KVM: VMX: Fix KVM_SET_SREGS with big real mode segments
For example migration between Westmere and Nehelem hosts, caught in big real mode.

The code that fixes the segments for real mode guest was moved from enter_rmode
to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment.

Signed-off-by: Orit Wasserman <owasserm@rehdat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 17:51:46 +03:00
Xudong Hao
3f6d8c8a47 KVM: VMX: Use EPT Access bit in response to memory notifiers
Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:05 +03:00
Xudong Hao
b38f993478 KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP
In EPT page structure entry, Enable EPT A/D bits if processor supported.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:04 +03:00
Xudong Hao
83c3a33122 KVM: VMX: Add parameter to control A/D bits support, default is on
Add kernel parameter to control A/D bits support, it's on by default.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:03 +03:00
Avi Kivity
b2da15ac26 KVM: VMX: Optimize %ds, %es reload
On x86_64, we can defer %ds and %es reload to the heavyweight context switch,
since nothing in the lightweight paths uses the host %ds or %es (they are
ignored by the processor).  Furthermore we can avoid the load if the segments
are null, by letting the hardware load the null segments for us.  This is the
expected case.

On i386, we could avoid the reload entirely, since the entry.S paths take care
of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS.
So we set them to the same values as well.

Saves about 70 cycles out of 1600 (around 4%; noisy measurements).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:03:19 -03:00
Avi Kivity
512d5649e8 KVM: VMX: Fix %ds/%es clobber
The vmx exit code unconditionally restores %ds and %es to __USER_DS.  This
can override the user's values, since %ds and %es are not saved and restored
in x86_64 syscalls.  In practice, this isn't dangerous since nobody uses
segment registers in long mode, least of all programs that use KVM.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:03:19 -03:00
Xiao Guangrong
5f3fbc342f KVM: VMX: unlike vmcs on fail path
fix:

[ 1529.577273] Call Trace:
[ 1529.577289]  [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm]
[ 1529.577302]  [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm]
[ 1529.577311]  [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm]
[ 1529.577315]  [<ffffffff81096ba8>] on_each_cpu+0x44/0x84
[ 1529.577326]  [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm]
[ 1529.577335]  [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm]
[ 1529.577349]  [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm]
[ 1529.577358]  [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-14 11:28:02 +03:00
Marcelo Tosatti
eac0556750 Merge branch 'linus' into queue
Merge reason: development work has dependency on kvm patches merged
upstream.

Conflicts:
	Documentation/feature-removal-schedule.txt

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-19 17:06:26 -03:00
Avi Kivity
2225fd5604 KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context
kvm_set_shared_msr() may not be called in preemptible context,
but vmx_set_msr() does so:

  BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713
  caller is kvm_set_shared_msr+0x32/0xa0 [kvm]
  Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39
  Call Trace:
   [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100
   [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm]
   [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel]
   ...

Making kvm_set_shared_msr() work in preemptible is cleaner, but
it's used in the fast path.  Making two variants is overkill, so
this patch just disables preemption around the call.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-18 23:42:27 -03:00
Josh Triplett
e9bda3b3d0 KVM: VMX: Auto-load on CPUs with VMX
Enable x86 feature-based autoloading for the kvm-intel module on CPUs
with X86_FEATURE_VMX.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 14:03:12 +03:00
Marcelo Tosatti
7a4f5ad051 KVM: VMX: vmx_set_cr0 expects kvm->srcu locked
vmx_set_cr0 is called from vcpu run context, therefore it expects
kvm->srcu to be held (for setting up the real-mode TSS).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-05 19:04:09 +03:00
Linus Torvalds
2e7580b0e7 Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Avi Kivity:
 "Changes include timekeeping improvements, support for assigning host
  PCI devices that share interrupt lines, s390 user-controlled guests, a
  large ppc update, and random fixes."

This is with the sign-off's fixed, hopefully next merge window we won't
have rebased commits.

* 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
  KVM: Convert intx_mask_lock to spin lock
  KVM: x86: fix kvm_write_tsc() TSC matching thinko
  x86: kvmclock: abstract save/restore sched_clock_state
  KVM: nVMX: Fix erroneous exception bitmap check
  KVM: Ignore the writes to MSR_K7_HWCR(3)
  KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
  KVM: PMU: add proper support for fixed counter 2
  KVM: PMU: Fix raw event check
  KVM: PMU: warn when pin control is set in eventsel msr
  KVM: VMX: Fix delayed load of shared MSRs
  KVM: use correct tlbs dirty type in cmpxchg
  KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
  KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
  KVM: x86 emulator: Allow PM/VM86 switch during task switch
  KVM: SVM: Fix CPL updates
  KVM: x86 emulator: VM86 segments must have DPL 3
  KVM: x86 emulator: Fix task switch privilege checks
  arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice
  KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
  KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  ...
2012-03-28 14:35:31 -07:00
Nadav Har'El
9587190107 KVM: nVMX: Fix erroneous exception bitmap check
The code which checks whether to inject a pagefault to L1 or L2 (in
nested VMX) was wrong, incorrect in how it checked the PF_VECTOR bit.
Thanks to Dan Carpenter for spotting this.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:14:23 +02:00
Avi Kivity
9ee73970c0 KVM: VMX: Fix delayed load of shared MSRs
Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs
and in the CPU registers, but vmx_set_msr() only updated memory. Prior
to 46199f33c2, this didn't matter, since we called vmx_load_host_state(),
which scheduled a vmx_save_host_state(), which re-synchronized the CPU
state, but now we don't, so the CPU state will not be synchronized until
the next exit to host userspace.  This mostly affects nested vmx workloads,
which play with these MSRs a lot.

Fix by loading the MSR eagerly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:11:55 +02:00
Kevin Wolf
7f3d35fddd KVM: x86 emulator: Fix task switch privilege checks
Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this take gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

[avi: kill kvm-kmod remnants]

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:26 +02:00
Raghavendra K T
10166744b8 KVM: VMX: remove yield_on_hlt
yield_on_hlt was introduced for CPU bandwidth capping. Now it is
redundant with CFS hardlimit.

yield_on_hlt also complicates the scenario in paravirtual environment,
that needs to trap halt. for e.g. paravirtualized ticket spinlocks.

Acked-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:11 +02:00
Marcelo Tosatti
f1e2b26003 KVM: Allow adjust_tsc_offset to be in host or guest cycles
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:07 +02:00
Zachary Amsden
cc578287e3 KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is the half the
threshold for NTP adjustment, allowing for some hardware variation.

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:09:35 +02:00
Linus Torvalds
1361b83a13 i387: Split up <asm/i387.h> into exported and internal interfaces
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained.  Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:54 -08:00
Linus Torvalds
f94edacf99 i387: move TS_USEDFPU flag from thread_info to task_struct
This moves the bit that indicates whether a thread has ownership of the
FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
(called 'has_fpu') in task_struct->thread.has_fpu.

This fixes two independent bugs at the same time:

 - changing 'thread_info->status' from the scheduler causes nasty
   problems for the other users of that variable, since it is defined to
   be thread-synchronous (that's what the "TS_" part of the naming was
   supposed to indicate).

   So perfectly valid code could (and did) do

	ti->status |= TS_RESTORE_SIGMASK;

   and the compiler was free to do that as separate load, or and store
   instructions.  Which can cause problems with preemption, since a task
   switch could happen in between, and change the TS_USEDFPU bit. The
   change to TS_USEDFPU would be overwritten by the final store.

   In practice, this seldom happened, though, because the 'status' field
   was seldom used more than once, so gcc would generally tend to
   generate code that used a read-modify-write instruction and thus
   happened to avoid this problem - RMW instructions are naturally low
   fat and preemption-safe.

 - On x86-32, the current_thread_info() pointer would, during interrupts
   and softirqs, point to a *copy* of the real thread_info, because
   x86-32 uses %esp to calculate the thread_info address, and thus the
   separate irq (and softirq) stacks would cause these kinds of odd
   thread_info copy aliases.

   This is normally not a problem, since interrupts aren't supposed to
   look at thread information anyway (what thread is running at
   interrupt time really isn't very well-defined), but it confused the
   heck out of irq_fpu_usable() and the code that tried to squirrel
   away the FPU state.

   (It also caused untold confusion for us poor kernel developers).

It also turns out that using 'task_struct' is actually much more natural
for most of the call sites that care about the FPU state, since they
tend to work with the task struct for other reasons anyway (ie
scheduling).  And the FPU data that we are going to save/restore is
found there too.

Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
the %esp issue.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-and-tested-by: Raphael Prevost <raphael@buro.asia>
Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-18 10:19:41 -08:00
Linus Torvalds
6d59d7a9f5 i387: don't ever touch TS_USEDFPU directly, use helper functions
This creates three helper functions that do the TS_USEDFPU accesses, and
makes everybody that used to do it by hand use those helpers instead.

In addition, there's a couple of helper functions for the "change both
CR0.TS and TS_USEDFPU at the same time" case, and the places that do
that together have been changed to use those.  That means that we have
fewer random places that open-code this situation.

The intent is partly to clarify the code without actually changing any
semantics yet (since we clearly still have some hard to reproduce bug in
this area), but also to make it much easier to use another approach
entirely to caching the CR0.TS bit for software accesses.

Right now we use a bit in the thread-info 'status' variable (this patch
does not change that), but we might want to make it a full field of its
own or even make it a per-cpu variable.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 13:33:12 -08:00
Rusty Russell
476bc0015b module_param: make bool parameters really bool (arch)
module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-13 09:32:18 +10:30
Avi Kivity
fee84b079d KVM: VMX: Intercept RDPMC
Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:38 +02:00
Avi Kivity
00b27a3efb KVM: Move cpuid code to new file
The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:21:49 +02:00
Xiao Guangrong
28a37544fb KVM: introduce id_to_memslot function
Introduce id_to_memslot to get memslot by slot id

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:39 +02:00
Gleb Natapov
46199f33c2 KVM: VMX: remove unneeded vmx_load_host_state() calls.
vmx_load_host_state() does not handle msrs switching (except
MSR_KERNEL_GS_BASE) since commit 26bb0981b3. Remove call to it
where it is no longer make sense.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:22 +02:00
Nadav Har'El
51cfe38ea5 KVM: nVMX: Fix warning-causing idt-vectoring-info behavior
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:45 +02:00
Nadav Har'El
d6185f20a0 KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:43 +02:00
Gleb Natapov
e7fc6f93b4 KVM: VMX: Check for automatic switch msr table overflow
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17 16:28:09 +02:00
Gleb Natapov
d7cd97964b KVM: VMX: Add support for guest/host-only profiling
Support guest/host-only profiling by switch perf msrs on
a guest entry if needed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17 16:28:00 +02:00
Gleb Natapov
8bf00a5299 KVM: VMX: add support for switching of PERF_GLOBAL_CTRL
Some cpus have special support for switching PERF_GLOBAL_CTRL msr.
Add logic to detect if such support exists and works properly and extend
msr switching code to use it if available. Also extend number of generic
msr switching entries to 8.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17 16:27:54 +02:00
Jan Kiszka
bd80158aff KVM: Clean up and extend rate-limited output
The use of printk_ratelimit is discouraged, replace it with
pr*_ratelimited or __ratelimit. While at it, convert remaining
guest-triggerable printks to rate-limited variants.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:43 +03:00
Jan Kiszka
1e2b1dd797 KVM: x86: Move kvm_trace_exit into atomic vmexit section
This avoids that events causing the vmexit are recorded before the
actual exit reason.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:41 +03:00
Kevin Tian
58fbbf26eb KVM: APIC: avoid instruction emulation for EOI writes
Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
<Based on earlier work from>:
Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:52:17 +03:00
Nadav Har'El
27fc51b21c KVM: nVMX: Fix nested VMX TSC emulation
This patch fixes two corner cases in nested (L2) handling of TSC-related
issues:

1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then this should set L1's TSC value itself - not
offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).

2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
the vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Nadav Har'El
d5c1785d2f KVM: L1 TSC handling
KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().

[avi: moved callback to kvm_x86_ops tsc block]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Zachary Amsdem <zamsden@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Julia Lawall
cf3ace79c0 KVM: VMX: trivial: use BUG_ON
Use BUG_ON(x) rather than if(x) BUG();

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);

@@ identifier x; @@
-if (!x) BUG();
+BUG_ON(!x);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:18:01 +03:00
Stefan Hajnoczi
0d460ffc09 KVM: Use __print_symbolic() for vmexit tracepoints
The vmexit tracepoints format the exit_reason to make it human-readable.
Since the exit_reason depends on the instruction set (vmx or svm),
formatting is handled with ftrace_print_symbols_seq() by referring to
the appropriate exit reason table.

However, the ftrace_print_symbols_seq() function is not meant to be used
directly in tracepoints since it does not export the formatting table
which userspace tools like trace-cmd and perf use to format traces.

In practice perf dies when formatting vmexit-related events and
trace-cmd falls back to printing the numeric value (with extra
formatting code in the kvm plugin to paper over this limitation).  Other
userspace consumers of vmexit-related tracepoints would be in similar
trouble.

To avoid significant changes to the kvm_exit tracepoint, this patch
moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
selects the right table with __print_symbolic() depending on the
instruction set.  Note that __print_symbolic() is designed for exporting
the formatting table to userspace and allows trace-cmd and perf to work.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Xiao Guangrong
ce88decffd KVM: MMU: mmio page fault support
The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
is caused again, we can quickly identify it and emulate it directly

Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
can be reduced by this feature, and also avoid walking guest page table for
soft mmu.

[jan: fix operator precedence issue]

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:40 +03:00
Xiao Guangrong
c37079586f KVM: MMU: remove bypass_guest_pf
The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1.  It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:33 +03:00
Nadav Har'El
509c75ea19 KVM: nVMX: Fix bug preventing more than two levels of nesting
The nested VMX feature is supposed to fully emulate VMX for the guest. This
(theoretically) not only allows it to run its own guests, but also also
to further emulate VMX for its own guests, and allow arbitrarily deep nesting.

This patch fixes a bug (discovered by Kevin Tian) in handling a VMLAUNCH
by L2, which prevented deeper nesting.

Deeper nesting now works (I only actually tested L3), but is currently
*absurdly* slow, to the point of being unusable.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 13:16:11 +03:00
Jan Kiszka
2e4ce7f574 KVM: VMX: Silence warning on 32-bit hosts
a is unused now on CONFIG_X86_32.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 13:16:08 +03:00
Nadav Har'El
2844d84905 KVM: nVMX: Miscellenous small corrections
Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:19 +03:00
Nadav Har'El
7b8050f570 KVM: nVMX: Add VMX to list of supported cpuid features
If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:19 +03:00
Nadav Har'El
7991825b85 KVM: nVMX: Additional TSC-offset handling
In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
Additionally, we also need to modify vmx_adjust_tsc_offset: The semantics
of this function is that the TSC of all guests on this vcpu, L1 and possibly
several L2s, need to be adjusted. To do this, we need to adjust vmcs01's
tsc_offset (this offset will also apply to each L2s we enter). We can't set
vmcs01 now, so we have to remember this adjustment and apply it when we
later exit to L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:19 +03:00
Nadav Har'El
36cf24e01e KVM: nVMX: Further fixes for lazy FPU loading
KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the above patch,
and that new code in introduced in previous patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:18 +03:00
Nadav Har'El
eeadf9e755 KVM: nVMX: Handling of CR0 and CR4 modifying instructions
When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for corect handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:18 +03:00
Nadav Har'El
66c78ae40c KVM: nVMX: Correct handling of idt vectoring info
This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while delivering an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info case is discovered after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., in the beginning of
vmx_vcpu_run, which is the first time we know for sure if we're staying in
L2.

Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
vmx_complete_interrupts() code which queues the idt_vectoring_info for
injection on next entry - because such injection would not be appropriate
if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
and related fields in vmcs12 (which is a convenient place to save these
fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
potentially exiting to L1 to inject an event requested by user space), if
we find ourselves in L1 we don't need to do anything with those values
we saved (as explained above). But if we find that we're in L2, or rather
*still* at L2 (it's not nested_run_pending, meaning that this is the first
round of L2 running after L1 having just launched it), we need to inject
the event saved in those fields - by writing the appropriate VMCS fields.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:18 +03:00
Nadav Har'El
0b6ac343fc KVM: nVMX: Correct handling of exception injection
Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:17 +03:00
Nadav Har'El
b6f1250edb KVM: nVMX: Correct handling of interrupt injection
The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running.

Because of this code's relative un-obviousness, I include here a longer-than-
usual justification for what it does - much longer than the code itself ;-)

To understand how to correctly emulate interrupt injection while L2 is
running, let's look first at what we need to emulate: How would things look
like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
an interrupt, we had hardware delivering an interrupt?

Now we have L1 running on bare metal with a guest L2, and the hardware
generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and
VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
happens now is this: The processor exits from L2 to L1, with an external-
interrupt exit reason but without an interrupt vector. L1 runs, with
interrupts disabled, and it doesn't yet know what the interrupt was. Soon
after, it enables interrupts and only at that moment, it gets the interrupt
from the processor. when L1 is KVM, Linux handles this interrupt.

Now we need exactly the same thing to happen when that L1->L2 system runs
on top of L0, instead of real hardware. This is how we do this:

When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
external-interrupt exit reason (with an invalid interrupt vector), and run L1.
Just like in the bare metal case, it likely can't deliver the interrupt to
L1 now because L1 is running with interrupts disabled, in which case it turns
on the interrupt window when running L1 after the exit. L1 will soon enable
interrupts, and at that point L0 will gain control again and inject the
interrupt to L1.

Finally, there is an extra complication in the code: when nested_run_pending,
we cannot return to L1 now, and must launch L2. We need to remember the
interrupt we wanted to inject (and not clear it now), and do it on the
next exit.

The above explanation shows that the relative strangeness of the nested
interrupt injection code in this patch, and the extra interrupt-window
exit incurred, are in fact necessary for accurate emulation, and are not
just an unoptimized implementation.

Let's revisit now the two assumptions made above:

If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
does, by the way), things are simple: L0 may inject the interrupt directly
to the L2 guest - using the normal code path that injects to any guest.
We support this case in the code below.

If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT, things look very different from the
description above: L1 expects to see an exit from L2 with the interrupt vector
already filled in the exit information, and does not expect to be interrupted
again with this interrupt. The current code does not (yet) support this case,
so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on
by L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:17 +03:00
Nadav Har'El
644d711aa0 KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit
This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; If it did, we exit to L1. But if it didn't it means that
it is we (L0) that wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). Nested_run_pending is especially intended to avoid switching
to L1 in the injection decision-point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:16 +03:00
Nadav Har'El
7c1779384a KVM: nVMX: vmcs12 checks on nested entry
This patch adds a bunch of tests of the validity of the vmcs12 fields,
according to what the VMX spec and our implementation allows. If fields
we cannot (or don't want to) honor are discovered, an entry failure is
emulated.

According to the spec, there are two types of entry failures: If the problem
was in vmcs12's host state or control fields, the VMLAUNCH instruction simply
fails. But a problem is found in the guest state, the behavior is more
similar to that of an exit.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:16 +03:00
Nadav Har'El
4704d0befb KVM: nVMX: Exiting from L2 to L1
This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in a separate patch below.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:16 +03:00
Nadav Har'El
99e65e805d KVM: nVMX: No need for handle_vmx_insn function any more
Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmread, vmresume,
vmwrite, vmon, vmoff), was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (and emulate
the respective VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:15 +03:00
Nadav Har'El
cd232ad02f KVM: nVMX: Implement VMLAUNCH and VMRESUME
Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

This patch does not include some of the necessary validity checks on
vmcs12 fields before the entry. These will appear in a separate patch
below.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:15 +03:00
Nadav Har'El
fe3ef05c75 KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12
This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
own guests).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:14 +03:00
Nadav Har'El
bf8179a011 KVM: nVMX: Move control field setup to functions
Move some of the control field setup to common functions. These functions will
also be needed for running L2 guests - L0's desires (expressed in these
functions) will be appropriately merged with L1's desires.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:14 +03:00
Nadav Har'El
a3a8ff8ebf KVM: nVMX: Move host-state field setup to a function
Move the setting of constant host-state fields (fields that do not change
throughout the life of the guest) from vmx_vcpu_setup to a new common function
vmx_set_constant_host_state(). This function will also be used to set the
host state when running L2 guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:14 +03:00
Nadav Har'El
49f705c532 KVM: nVMX: Implement VMREAD and VMWRITE
Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs12 structure introduced in a previous patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:14 +03:00
Nadav Har'El
6a4d755060 KVM: nVMX: Implement VMPTRST
This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:13 +03:00
Nadav Har'El
63846663ea KVM: nVMX: Implement VMPTRLD
This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:12 +03:00
Nadav Har'El
27d6c86521 KVM: nVMX: Implement VMCLEAR
This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:12 +03:00
Nadav Har'El
0140caea3b KVM: nVMX: Success/failure of VMX instructions.
VMX instructions specify success or failure by setting certain RFLAGS bits.
This patch contains common functions to do this, and they will be used in
the following patches which emulate the various VMX instructions.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:12 +03:00
Nadav Har'El
22bd035868 KVM: nVMX: Add VMCS fields to the vmcs12
In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:11 +03:00
Nadav Har'El
ff2f6fe961 KVM: nVMX: Introduce vmcs02: VMCS used to run L2
We saw in a previous patch that L1 controls its L2 guest with a vcms12.
L0 needs to create a real VMCS for running L2. We call that "vmcs02".
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
be reused even when L1 runs multiple L2 guests. However, in future versions
we'll probably want to add an optimization where vmcs02 fields that rarely
change will not be set each time. For that, we may want to keep around several
vmcs02s of L2 guests that have recently run, so that potentially we could run
these L2s again more quickly because less vmwrites to vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
reused to enter any L2 guest. In the future, when prepare_vmcs02() is
optimized not to set all fields every time, VMCS02_POOL_SIZE should be
increased.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:11 +03:00
Nadav Har'El
064aea7747 KVM: nVMX: Decoding memory operands of VMX instructions
This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor)

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:11 +03:00
Nadav Har'El
b87a51ae28 KVM: nVMX: Implement reading and writing of VMX MSRs
When the guest can use VMX instructions (when the "nested" module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:11 +03:00
Nadav Har'El
a9d30f33dd KVM: nVMX: Introduce vmcs12: a VMCS structure for L1
An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guest. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:10 +03:00
Nadav Har'El
5e1746d620 KVM: nVMX: Allow setting the VMXE bit in CR4
This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function
now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() will throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:10 +03:00
Nadav Har'El
ec378aeef9 KVM: nVMX: Implement VMXON and VMXOFF
This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:09 +03:00
Nadav Har'El
801d342432 KVM: nVMX: Add "nested" module option to kvm_intel
This patch adds to kvm_intel a module option "nested". This option controls
whether the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:09 +03:00
Nadav Har'El
d462b81923 KVM: VMX: Keep list of loaded VMCSs, instead of vcpus
In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may be have been last loaded on a different cpu.

So instead of linking the vcpus, we link the VMCSs, using a new structure
loaded_vmcs. This structure contains the VMCS, and the information pertaining
to its loading on a specific cpu (namely, the cpu number, and whether it
was already launched on this cpu once). In nested we will also use the same
structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
currently active VMCS.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Acked-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 11:45:08 +03:00
Avi Kivity
96304217a7 KVM: VMX: always_inline VMREADs
vmcs_readl() and friends are really short, but gcc thinks they are long because of
the out-of-line exception handlers.  Mark them always_inline to clear the
misunderstanding.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:01 +03:00
Avi Kivity
5e520e6278 KVM: VMX: Move VMREAD cleanup to exception handler
We clean up a failed VMREAD by clearing the output register.  Do
it in the exception handler instead of unconditionally.  This is
worthwhile since there are more than a hundred call sites.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:00 +03:00
Marcelo Tosatti
5233dd51ec KVM: VMX: do not overwrite uptodate vcpu->arch.cr3 on KVM_SET_SREGS
Only decache guest CR3 value if vcpu->arch.cr3 is stale.
Fixes loadvm with live guest.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Markus Schade <markus.schade@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-06-19 19:23:13 +03:00
Avi Kivity
2fb92db1ec KVM: VMX: Cache vmcs segment fields
Since the emulator now checks segment limits and access rights, it
generates a lot more accesses to the vmcs segment fields.  Undo some
of the performance hit by cacheing those fields in a read-only cache
(the entire cache is invalidated on any write, or on guest exit).

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:47:45 -04:00
Avi Kivity
0a434bb2bf KVM: VMX: Avoid reading %rip unnecessarily when handling exceptions
Avoids a VMREAD.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:40:01 -04:00
Serge E. Hallyn
71f9833bb1 KVM: fix push of wrong eip when doing softint
When doing a soft int, we need to bump eip before pushing it to
the stack.  Otherwise we'll do the int a second time.

[apw@canonical.com: merged eip update as per Jan's recommendation.]
Signed-off-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:09 -04:00
Jan Kiszka
be6d05cfdf KVM: VMX: Ensure that vmx_create_vcpu always returns proper error
In case certain allocations fail, vmx_create_vcpu may return 0 as error
instead of a negative value encoded via ERR_PTR. This causes a NULL
pointer dereferencing later on in kvm_vm_ioctl_vcpu_create.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-05-11 07:57:08 -04:00
Joerg Roedel
857e40999e KVM: X86: Delegate tsc-offset calculation to architecture code
With TSC scaling in SVM the tsc-offset needs to be
calculated differently. This patch propagates this
calculation into the architecture specific modules so that
this complexity can be handled there.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:05 -04:00
Joerg Roedel
4051b18801 KVM: X86: Implement call-back to propagate virtual_tsc_khz
This patch implements a call-back into the architecture code
to allow the propagation of changes to the virtual tsc_khz
of the vcpu.
On SVM it updates the tsc_ratio variable, on VMX it does
nothing.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:05 -04:00
Joerg Roedel
8a76d7f25f KVM: x86: Add x86 callback for intercept check
This patch adds a callback into kvm_x86_ops so that svm and
vmx code can do intercept checks on emulated instructions.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:01 -04:00
Avi Kivity
654f06fc65 KVM: VMX: simplify NMI mask management
Use vmx_set_nmi_mask() instead of open-coding management of
the hardware bit and the software hint (nmi_known_unmasked).

There's a slight change of behaviour when running without
hardware virtual NMI support - we now clear the NMI mask if
NMI delivery faulted in that case as well.  This improves
emulation accuracy.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:57 -04:00
Avi Kivity
8878647585 KVM: VMX: Use cached VM_EXIT_INTR_INFO in handle_exception
vmx_complete_atomic_exit() cached it for us, so we can use it here.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:56 -04:00
Avi Kivity
c5ca8e572c KVM: VMX: Don't VMREAD VM_EXIT_INTR_INFO unconditionally
Only read it if we're going to use it later.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:56 -04:00
Avi Kivity
00eba012d5 KVM: VMX: Refactor vmx_complete_atomic_exit()
Move the exit reason checks to the front of the function, for early
exit in the common case.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:56 -04:00
Avi Kivity
f9902069c4 KVM: VMX: Qualify check for host NMI
Check for the exit reason first; this allows us, later,
to avoid a VMREAD for VM_EXIT_INTR_INFO_FIELD.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:56 -04:00
Avi Kivity
9d58b93192 KVM: VMX: Avoid vmx_recover_nmi_blocking() when unneeded
When we haven't injected an interrupt, we don't need to recover
the nmi blocking state (since the guest can't set it by itself).
This allows us to avoid a VMREAD later on.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:56 -04:00
Avi Kivity
69c7302890 KVM: VMX: Cache cpl
We may read the cpl quite often in the same vmexit (instruction privilege
check, memory access checks for instruction and operands), so we gain
a bit if we cache the value.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:54 -04:00
Avi Kivity
f4c63e5d5a KVM: VMX: Optimize vmx_get_cpl()
In long mode, vm86 mode is disallowed, so we need not check for
it.  Reading rflags.vm may require a VMREAD, so it is expensive.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:54 -04:00
Avi Kivity
6de12732c4 KVM: VMX: Optimize vmx_get_rflags()
If called several times within the same exit, return cached results.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:54 -04:00
Avi Kivity
f6e7847589 KVM: Use kvm_get_rflags() and kvm_set_rflags() instead of the raw versions
Some rflags bits are owned by the host, not guest, so we need to use
kvm_get_rflags() to strip those bits away or kvm_set_rflags() to add them
back.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:54 -04:00
Gleb Natapov
776e58ea3d KVM: unbreak userspace that does not sets tss address
Commit 6440e5967bc broke old userspaces that do not set tss address
before entering vcpu. Unbreak it by setting tss address to a safe
value on the first vcpu entry. New userspaces should set tss address,
so print warning in case it doesn't.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:35 -03:00
Xiao Guangrong
40dcaa9f69 KVM: fix rcu usage in init_rmode_* functions
fix:
[ 3494.671786] stack backtrace:
[ 3494.671789] Pid: 10527, comm: qemu-system-x86 Not tainted 2.6.38-rc6+ #23
[ 3494.671790] Call Trace:
[ 3494.671796]  [] ? lockdep_rcu_dereference+0x9d/0xa5
[ 3494.671826]  [] ? kvm_memslots+0x6b/0x73 [kvm]
[ 3494.671834]  [] ? gfn_to_memslot+0x16/0x4f [kvm]
[ 3494.671843]  [] ? gfn_to_hva+0x16/0x27 [kvm]
[ 3494.671851]  [] ? kvm_write_guest_page+0x31/0x83 [kvm]
[ 3494.671861]  [] ? kvm_clear_guest_page+0x1a/0x1c [kvm]
[ 3494.671867]  [] ? vmx_set_tss_addr+0x83/0x122 [kvm_intel]

and:
[ 8328.789599] stack backtrace:
[ 8328.789601] Pid: 18736, comm: qemu-system-x86 Not tainted 2.6.38-rc6+ #23
[ 8328.789603] Call Trace:
[ 8328.789609]  [] ? lockdep_rcu_dereference+0x9d/0xa5
[ 8328.789621]  [] ? kvm_memslots+0x6b/0x73 [kvm]
[ 8328.789628]  [] ? gfn_to_memslot+0x16/0x4f [kvm]
[ 8328.789635]  [] ? gfn_to_hva+0x16/0x27 [kvm]
[ 8328.789643]  [] ? kvm_write_guest_page+0x31/0x83 [kvm]
[ 8328.789699]  [] ? kvm_clear_guest_page+0x1a/0x1c [kvm]
[ 8328.789713]  [] ? vmx_create_vcpu+0x316/0x3c8 [kvm_intel]

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:34 -03:00
Takuya Yoshikawa
afc20184b7 KVM: x86: Remove useless regs_page pointer from kvm_lapic
Access to this page is mostly done through the regs member which holds
the address to this page.  The exceptions are in vmx_vcpu_reset() and
kvm_free_lapic() and these both can easily be converted to using regs.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:33 -03:00
Gleb Natapov
93ea5388ea KVM: VMX: Initialize vm86 TSS only once.
Currently vm86 task is initialized on each real mode entry and vcpu
reset. Initialization is done by zeroing TSS and updating relevant
fields. But since all vcpus are using the same TSS there is a race where
one vcpu may use TSS while other vcpu is initializing it, so the vcpu
that uses TSS will see wrong TSS content and will behave incorrectly.
Fix that by initializing TSS only once.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:31 -03:00
Gleb Natapov
a8ba6c2622 KVM: VMX: update live TR selector if it changes in real mode
When rmode.vm86 is active TR descriptor is updated with vm86 task values,
but selector is left intact. vmx_set_segment() makes sure that if TR
register is written into while vm86 is active the new values are saved
for use after vm86 is deactivated, but since selector is not updated on
vm86 activation/deactivation new value is lost. Fix this by writing new
selector into vmcs immediately.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:31 -03:00
Lai Jiangshan
a3b5ba49a8 KVM: VMX: add the __noclone attribute to vmx_vcpu_run
The changelog of 104f226 said "adds the __noclone attribute",
but it was missing in its patch. I think it is still needed.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:31 -03:00
Joseph Cihula
23f3e99132 KVM: VMX: fix detection of BIOS disabling VMX
This patch fixes the logic used to detect whether BIOS has disabled VMX, for
the case where VMX is enabled only under SMX, but tboot is not active.

Signed-off-by:  Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:30 -03:00
Avi Kivity
40712faeb8 KVM: VMX: Avoid atomic operation in vmx_vcpu_run
Instead of exchanging the guest and host rcx, have separate storage
for each.  This allows us to avoid using the xchg instruction, which
is is a little slower than normal operations.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:26 -03:00
Avi Kivity
1c696d0e1b KVM: VMX: Simplify saving guest rcx in vmx_vcpu_run
Change

  push top-of-stack
  pop guest-rcx
  pop dummy

to

  pop guest-rcx

which is the same thing, only simpler.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:25 -03:00
Rik van Riel
00c25bce02 KVM: VMX: increase ple_gap default to 128
On some CPUs, a ple_gap of 41 is simply insufficient to ever trigger
PLE exits, even with the minimalistic PLE test from kvm-unit-tests.

http://git.kernel.org/?p=virt/kvm/kvm-unit-tests.git;a=commitdiff;h=eda71b28fa122203e316483b35f37aaacd42f545

For example, the Xeon X5670 CPU needs a ple_gap of at least 48 in
order to get pause loop exits:

# modprobe kvm_intel ple_gap=47
# taskset 1 /usr/local/bin/qemu-system-x86_64 \
  -device testdev,chardev=log -chardev stdio,id=log \
  -kernel x86/vmexit.flat -append ple-round-robin -smp 2
VNC server running on `::1:5900'
enabling apic
enabling apic
ple-round-robin 58298446
# rmmod kvm_intel
# modprobe kvm_intel ple_gap=48
# taskset 1 /usr/local/bin/qemu-system-x86_64 \
   -device testdev,chardev=log -chardev stdio,id=log \
   -kernel x86/vmexit.flat -append ple-round-robin -smp 2
VNC server running on `::1:5900'
enabling apic
enabling apic
ple-round-robin 36616

Increase the ple_gap to 128 to be on the safe side.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Zhai, Edwin <edwin.zhai@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:25 -03:00
Avi Kivity
a917949935 KVM: VMX: Avoid leaking fake realmode state to userspace
When emulating real mode, we fake some state:

 - tr.base points to a fake vm86 tss
 - segment registers are made to conform to vm86 restrictions

change vmx_get_segment() not to expose this fake state to userspace;
instead, return the original state.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:25 -03:00
Avi Kivity
d0ba64f9b4 KVM: VMX: Save and restore tr selector across mode switches
When emulating real mode we play with tr hidden state, but leave
tr.selector alone.  That works well, except for save/restore, since
loading TR writes it to the hidden state in vmx->rmode.

Fix by also saving and restoring the tr selector; this makes things
more consistent and allows migration to work during the early
boot stages of Windows XP.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:25 -03:00
Gleb Natapov
444e863d13 KVM: VMX: when entering real mode align segment base to 16 bytes
VMX checks that base is equal segment shifted 4 bits left. Otherwise
guest entry fails.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:20 +02:00
Avi Kivity
aff48baa34 KVM: Fetch guest cr3 from hardware on demand
Instead of syncing the guest cr3 every exit, which is expensince on vmx
with ept enabled, sync it only on demand.

[sheng: fix incorrect cr3 seen by Windows XP]

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:16 +02:00
Avi Kivity
9f8fe5043f KVM: Replace reads of vcpu->arch.cr3 by an accessor
This allows us to keep cr3 in the VMCS, later on.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:15 +02:00
Avi Kivity
16d8f72f70 KVM: VMX: Correct asm constraint in vmcs_load()/vmcs_clear()
'error' is byte sized, so use a byte register constraint.

Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:12 +02:00
Avi Kivity
110312c84b KVM: VMX: Optimize atomic EFER load
When NX is enabled on the host but not on the guest, we use the entry/exit
msr load facility, which is slow.  Optimize it to use entry/exit efer load,
which is ~1200 cycles faster.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:09 +02:00
Andre Przywara
dc25e89e07 KVM: SVM: copy instruction bytes from VMCB
In case of a nested page fault or an intercepted #PF newer SVM
implementations provide a copy of the faulting instruction bytes
in the VMCB.
Use these bytes to feed the instruction emulator and avoid the costly
guest instruction fetch in this case.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:07 +02:00
Andre Przywara
51d8b66199 KVM: cleanup emulate_instruction
emulate_instruction had many callers, but only one used all
parameters. One parameter was unused, another one is now
hidden by a wrapper function (required for a future addition
anyway), so most callers use now a shorter parameter list.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:00 +02:00
Andre Przywara
db8fcefaa7 KVM: move complete_insn_gp() into x86.c
move the complete_insn_gp() helper function out of the VMX part
into the generic x86 part to make it usable by SVM.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:59 +02:00
Andre Przywara
eea1cff9ab KVM: x86: fix CR8 handling
The handling of CR8 writes in KVM is currently somewhat cumbersome.
This patch makes it look like the other CR register handlers
and fixes a possible issue in VMX, where the RIP would be incremented
despite an injected #GP.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:58 +02:00
Anthony Liguori
443381a828 KVM: VMX: add module parameter to avoid trapping HLT instructions (v5)
In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:46 +02:00
Avi Kivity
a295673aba KVM: VMX: Return 0 from a failed VMREAD
If we execute VMREAD during reboot we'll just skip over it.  Instead of
returning garbage, return 0, which has a much smaller chance of confusing
the code.  Otherwise we risk a flood of debug printk()s which block the
reboot process if a serial console or netconsole is enabled.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:20 +02:00
Avi Kivity
586f960796 KVM: Add instruction-set-specific exit qualifications to kvm_exit trace
The exit reason alone is insufficient to understand exactly why an exit
occured; add ISA-specific trace parameters for additional information.

Because fetching these parameters is expensive on vmx, and because these
parameters are fetched even if tracing is disabled, we fetch the
parameters via a callback instead of as traditional trace arguments.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:29:41 +02:00
Avi Kivity
aa17911e3c KVM: Record instruction set in kvm_exit tracepoint
exit_reason's meaning depend on the instruction set; record it so a trace
taken on one machine can be interpreted on another.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:29:40 +02:00
Avi Kivity
104f226bfd KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()
cea15c2 ("KVM: Move KVM context switch into own function") split vmx_vcpu_run()
to prevent multiple copies of the context switch from being generated (causing
problems due to a label).  This patch folds them back together again and adds
the __noclone attribute to prevent the label from being duplicated.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:29:37 +02:00
Shane Wang
f9335afea5 KVM: VMX: Inform user about INTEL_TXT dependency
Inform user to either disable TXT in the BIOS or do TXT launch
with tboot before enabling KVM since some BIOSes do not set
FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX bit when TXT is enabled.

Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:31 +02:00
Avi Kivity
30bd0c4c6c KVM: VMX: Disallow NMI while blocked by STI
While not mandated by the spec, Linux relies on NMI being blocked by an
IF-enabling STI.  VMX also refuses to enter a guest in this state, at
least on some implementations.

Disallow NMI while blocked by STI by checking for the condition, and
requesting an interrupt window exit if it occurs.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:04 +02:00
Gleb Natapov
ec25d5e66e KVM: handle exit due to INVD in VMX
Currently the exit is unhandled, so guest halts with error if it tries
to execute INVD instruction. Call into emulator when INVD instruction
is executed by a guest instead. This instruction is not needed by ordinary
guests, but firmware (like OpenBIOS) use it and fail.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:28:53 +02:00
Marcelo Tosatti
ff1fcb9ebd KVM: VMX: remove setting of shadow_base_ptes for EPT
The EPT present/writable bits use the same position as normal
pagetable bits.

Since direct_map passes ACC_ALL to mmu_set_spte, thus always setting
the writable bit on sptes, use the generic PT_PRESENT shadow_base_pte.

Also pass present/writable error code information from EPT violation
to generic pagefault handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:23:37 +02:00
Andi Kleen
f56f536956 KVM: Move KVM context switch into own function
gcc 4.5 with some special options is able to duplicate the VMX
context switch asm in vmx_vcpu_run(). This results in a compile error
because the inline asm sequence uses an on local label. The non local
label is needed because other code wants to set up the return address.

This patch moves the asm code into an own function and marks
that explicitely noinline to avoid this problem.

Better would be probably to just move it into an .S file.

The diff looks worse than the change really is, it's all just
code movement and no logic change.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:26 +02:00
Joerg Roedel
24d1b15f72 KVM: SVM: Do not report xsave in supported cpuid
To support xsave properly for the guest the SVM module need
software support for it. As long as this is not present do
not report the xsave as supported feature in cpuid.
As a side-effect this patch moves the bit() helper function
into the x86.h file so that it can be used in svm.c too.

KVM-Stable-Tag.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-12-08 17:28:37 +02:00
Avi Kivity
c8770e7ba6 KVM: VMX: Fix host userspace gsbase corruption
We now use load_gs_index() to load gs safely; unfortunately this also
changes MSR_KERNEL_GS_BASE, which we managed separately.  This resulted
in confusion and breakage running 32-bit host userspace on a 64-bit kernel.

Fix by
- saving guest MSR_KERNEL_GS_BASE before we we reload the host's gs
- doing the host save/load unconditionally, instead of only when in guest
  long mode

Things can be cleaned up further, but this is the minmal fix for now.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-11-17 19:48:05 -02:00
Avi Kivity
0a77fe4c18 KVM: Correct ordering of ldt reload wrt fs/gs reload
If fs or gs refer to the ldt, they must be reloaded after the ldt.  Reorder
the code to that effect.

Userspace code that uses the ldt with kvm is nonexistent, so this doesn't fix
a user-visible bug.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-11-17 19:47:59 -02:00
Nicolas Kaiser
9611c18777 KVM: fix typo in copyright notice
Fix typo in copyright notice.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:14 +02:00
Jan Kiszka
07d6f555d5 KVM: VMX: Add AX to list of registers clobbered by guest switch
By chance this caused no harm so far. We overwrite AX during switch
to/from guest context, so we must declare this.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:53:07 +02:00
Avi Kivity
49e9d557f9 KVM: VMX: Respect interrupt window in big real mode
If an interrupt is pending, we need to stop emulation so we
can inject it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:02 +02:00
Mohammed Gamal
a92601bb70 KVM: VMX: Emulated real mode interrupt injection
Replace the inject-as-software-interrupt hack we currently have with
emulated injection.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:01 +02:00
Avi Kivity
625831a3f4 KVM: VMX: Move fixup_rmode_irq() to avoid forward declaration
No code changes.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:54 +02:00
Avi Kivity
b463a6f744 KVM: Non-atomic interrupt injection
Change the interrupt injection code to work from preemptible, interrupts
enabled context.  This works by adding a ->cancel_injection() operation
that undoes an injection in case we were not able to actually enter the guest
(this condition could never happen with atomic injection).

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:54 +02:00
Avi Kivity
83422e17c1 KVM: VMX: Parameterize vmx_complete_interrupts() for both exit and entry
Currently vmx_complete_interrupts() can decode event information from vmx
exit fields into the generic kvm event queues.  Make it able to decode
the information from the entry fields as well by parametrizing it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:52 +02:00
Avi Kivity
537b37e267 KVM: VMX: Move real-mode interrupt injection fixup to vmx_complete_interrupts()
This allows reuse of vmx_complete_interrupts() for cancelling injections.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:51 +02:00
Avi Kivity
51aa01d13d KVM: VMX: Split up vmx_complete_interrupts()
vmx_complete_interrupts() does too much, split it up:
 - vmx_vcpu_run() gets the "cache important vmcs fields" part
 - a new vmx_complete_atomic_exit() gets the parts that must be done atomically
 - a new vmx_recover_nmi_blocking() does what its name says
 - vmx_complete_interrupts() retains the event injection recovery code

This helps in reducing the work done in atomic context.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:51 +02:00
Avi Kivity
3842d135ff KVM: Check for pending events before attempting injection
Instead of blindly attempting to inject an event before each guest entry,
check for a possible event first in vcpu->requests.  Sites that can trigger
event injection are modified to set KVM_REQ_EVENT:

- interrupt, nmi window opening
- ppr updates
- i8259 output changes
- local apic irr changes
- rflags updates
- gif flag set
- event set on exit

This improves non-injecting entry performance, and sets the stage for
non-atomic injection.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:50 +02:00
Joerg Roedel
ff03a073e7 KVM: MMU: Add kvm_mmu parameter to load_pdptrs function
This function need to be able to load the pdptrs from any
mmu context currently in use. So change this function to
take an kvm_mmu parameter to fit these needs.
As a side effect this patch also moves the cached pdptrs
from vcpu_arch into the kvm_mmu struct.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:41 +02:00
Joerg Roedel
1c97f0a04c KVM: X86: Introduce a tdp_set_cr3 function
This patch introduces a special set_tdp_cr3 function pointer
in kvm_x86_ops which is only used for tpd enabled mmu
contexts. This allows to remove some hacks from svm code.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:30 +02:00
Zachary Amsden
e48672fa25 KVM: x86: Unify TSC logic
Move the TSC control logic from the vendor backends into x86.c
by adding adjust_tsc_offset to x86 ops.  Now all TSC decisions
can be done in one place.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:23 +02:00
Zachary Amsden
99e3e30aee KVM: x86: Move TSC offset writes to common code
Also, ensure that the storing of the offset and the reading of the TSC
are never preempted by taking a spinlock.  While the lock is overkill
now, it is useful later in this patch series.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:22 +02:00
Zachary Amsden
f4e1b3c8bd KVM: x86: Convert TSC writes to TSC offset writes
Change svm / vmx to be the same internally and write TSC offset
instead of bare TSC in helper functions.  Isolated as a single
patch to contain code movement.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:22 +02:00
Zachary Amsden
ae38436b78 KVM: x86: Drop vm_init_tsc
This is used only by the VMX code, and is not done properly;
if the TSC is indeed backwards, it is out of sync, and will
need proper handling in the logic at each and every CPU change.
For now, drop this test during init as misguided.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:21 +02:00
Avi Kivity
d359192fea KVM: VMX: Use host_gdt variable wherever we need the host gdt
Now that we have the host gdt conveniently stored in a variable, make use
of it instead of querying the cpu.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:50:01 +02:00
Avi Kivity
9581d442b9 KVM: Fix fs/gs reload oops with invalid ldt
kvm reloads the host's fs and gs blindly, however the underlying segment
descriptors may be invalid due to the user modifying the ldt after loading
them.

Fix by using the safe accessors (loadsegment() and load_gs_index()) instead
of home grown unsafe versions.

This is CVE-2010-3698.

KVM-Stable-Tag.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-19 14:21:45 -02:00
Linus Torvalds
d9a73c0016 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  um, x86: Cast to (u64 *) inside set_64bit()
  x86-32, asm: Directly access per-cpu GDT
  x86-64, asm: Directly access per-cpu IST
  x86, asm: Merge cmpxchg_486_u64() and cmpxchg8b_emu()
  x86, asm: Move cmpxchg emulation code to arch/x86/lib
  x86, asm: Clean up and simplify <asm/cmpxchg.h>
  x86, asm: Clean up and simplify set_64bit()
  x86: Add memory modify constraints to xchg() and cmpxchg()
  x86-64: Simplify loading initial_gs
  x86: Use symbolic MSR names
  x86: Remove redundant K6 MSRs
2010-08-06 10:07:34 -07:00
Avi Kivity
3444d7da18 KVM: VMX: Fix host GDT.LIMIT corruption
vmx does not restore GDT.LIMIT to the host value, instead it sets it to 64KB.
This means host userspace can learn a few bits of host memory.

Fix by reloading GDTR when we load other host state.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-02 08:10:18 +03:00
Xiao Guangrong
dd180b3e90 KVM: VMX: fix tlb flush with invalid root
Commit 341d9b535b6c simplify reload logic while entry guest mode, it
can avoid unnecessary sync-root if KVM_REQ_MMU_RELOAD and
KVM_REQ_MMU_SYNC both set.

But, it cause a issue that when we handle 'KVM_REQ_TLB_FLUSH', the
root is invalid, it is triggered during my test:

Kernel BUG at ffffffffa00212b8 [verbose debug info unavailable]
......

Fixed by directly return if the root is not ready.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-02 06:40:16 +03:00
Sheng Yang
f5f48ee15c KVM: VMX: Execute WBINVD to keep data consistency with assigned devices
Some guest device driver may leverage the "Non-Snoop" I/O, and explicitly
WBINVD or CLFLUSH to a RAM space. Since migration may occur before WBINVD or
CLFLUSH, we need to maintain data consistency either by:
1: flushing cache (wbinvd) when the guest is scheduled out if there is no
wbinvd exit, or
2: execute wbinvd on all dirty physical CPUs when guest wbinvd exits.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:47:21 +03:00
Avi Kivity
a8eeb04a44 KVM: Add mini-API for vcpu->requests
Makes it a little more readable and hackable.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:47:05 +03:00
Avi Kivity
2390218b6a KVM: Fix mov cr3 #GP at wrong instruction
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:35 +03:00
Avi Kivity
a83b29c6ad KVM: Fix mov cr4 #GP at wrong instruction
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:34 +03:00
Avi Kivity
49a9b07edc KVM: Fix mov cr0 #GP at wrong instruction
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:32 +03:00
Dexuan Cui
2acf923e38 KVM: VMX: Enable XSAVE/XRSTOR for guest
This patch enable guest to use XSAVE/XRSTOR instructions.

We assume that host_xcr0 would use all possible bits that OS supported.

And we loaded xcr0 in the same way we handled fpu - do it as late as we can.

Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:31 +03:00
Avi Kivity
f495c6e5e8 KVM: VMX: Fix incorrect rcu deref in rmode_tss_base()
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:30 +03:00
Xiao Guangrong
4b9d3a0451 KVM: VMX: fix rcu usage warning in init_rmode()
fix:

[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
include/linux/kvm_host.h:258 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1
1 lock held by qemu-system-x86/3796:
 #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffa0217fd8>] vcpu_load+0x1a/0x66 [kvm]

stack backtrace:
Pid: 3796, comm: qemu-system-x86 Not tainted 2.6.34 #25
Call Trace:
 [<ffffffff81070ed1>] lockdep_rcu_dereference+0x9d/0xa5
 [<ffffffffa0214fdf>] gfn_to_memslot_unaliased+0x65/0xa0 [kvm]
 [<ffffffffa0216139>] gfn_to_hva+0x22/0x4c [kvm]
 [<ffffffffa0216217>] kvm_write_guest_page+0x2a/0x7f [kvm]
 [<ffffffffa0216286>] kvm_clear_guest_page+0x1a/0x1c [kvm]
 [<ffffffffa0278239>] init_rmode+0x3b/0x180 [kvm_intel]
 [<ffffffffa02786ce>] vmx_set_cr0+0x350/0x4d3 [kvm_intel]
 [<ffffffffa02274ff>] kvm_arch_vcpu_ioctl_set_sregs+0x122/0x31a [kvm]
 [<ffffffffa021859c>] kvm_vcpu_ioctl+0x578/0xa3d [kvm]
 [<ffffffff8106624c>] ? cpu_clock+0x2d/0x40
 [<ffffffff810f7d86>] ? fget_light+0x244/0x28e
 [<ffffffff810709b9>] ? trace_hardirqs_off_caller+0x1f/0x10e
 [<ffffffff8110501b>] vfs_ioctl+0x32/0xa6
 [<ffffffff81105597>] do_vfs_ioctl+0x47f/0x4b8
 [<ffffffff813ae654>] ? sub_preempt_count+0xa3/0xb7
 [<ffffffff810f7da8>] ? fget_light+0x266/0x28e
 [<ffffffff810f7c53>] ? fget_light+0x111/0x28e
 [<ffffffff81105617>] sys_ioctl+0x47/0x6a
 [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:23 +03:00
Gui Jianfeng
1760dd4939 KVM: VMX: rename vpid_sync_vcpu_all() to vpid_sync_vcpu_single()
The name "pid_sync_vcpu_all" isn't appropriate since it just affect
a single vpid, so rename it to vpid_sync_vcpu_single().

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:16 +03:00
Gui Jianfeng
b9d762fa79 KVM: VMX: Add all-context INVVPID type support
Add all-context INVVPID type support.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:04 +03:00
Gui Jianfeng
518c8aee5c KVM: VMX: Make sure single type invvpid is supported before issuing invvpid instruction
According to SDM, we need check whether single-context INVVPID type is supported
before issuing invvpid instruction.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:39:26 +03:00
Sheng Yang
4bc9b98281 KVM: VMX: Enforce EPT pagetable level checking
We only support 4 levels EPT pagetable now.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:39:25 +03:00
Mohammed Gamal
5120702e73 KVM: VMX: Properly return error to userspace on vmentry failure
The vmexit handler returns KVM_EXIT_UNKNOWN since there is no handler
for vmentry failures. This intercepts vmentry failures and returns
KVM_FAIL_ENTRY to userspace instead.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:39:24 +03:00
Jan Kiszka
10ab25cd6b KVM: x86: Propagate fpu_alloc errors
Memory allocation may fail. Propagate such errors.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:39:22 +03:00
Avi Kivity
221d059d15 KVM: Update Red Hat copyrights
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:51 +03:00
Dongxiao Xu
4610c9cc6d KVM: VMX: VMXON/VMXOFF usage changes
SDM suggests VMXON should be called before VMPTRLD, and VMXOFF
should be called after doing VMCLEAR.

Therefore in vmm coexistence case, we should firstly call VMXON
before any VMCS operation, and then call VMXOFF after the
operation is done.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:43 +03:00
Dongxiao Xu
b923e62e4d KVM: VMX: VMCLEAR/VMPTRLD usage changes
Originally VMCLEAR/VMPTRLD is called on vcpu migration. To
support hosted VMM coexistance, VMCLEAR is executed on vcpu
schedule out, and VMPTRLD is executed on vcpu schedule in.
This could also eliminate the IPI when doing VMCLEAR.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:42 +03:00
Dongxiao Xu
92fe13be74 KVM: VMX: Some minor changes to code structure
Do some preparations for vmm coexistence support.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:42 +03:00
Dongxiao Xu
7725b89414 KVM: VMX: Define new functions to wrapper direct call of asm code
Define vmcs_load() and kvm_cpu_vmxon() to avoid direct call of asm
code. Also move VMXE bit operation out of kvm_cpu_vmxoff().

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:41 +03:00
Gleb Natapov
6d77dbfc88 KVM: inject #UD if instruction emulation fails and exit to userspace
Do not kill VM when instruction emulation fails. Inject #UD and report
failure to userspace instead. Userspace may choose to reenter guest if
vcpu is in userspace (cpl == 3) in which case guest OS will kill
offending process and continue running.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:40 +03:00
Avi Kivity
1c11e71357 KVM: VMX: Avoid writing HOST_CR0 every entry
cr0.ts may change between entries, so we copy cr0 to HOST_CR0 before each
entry.  That is slow, so instead, set HOST_CR0 to have TS set unconditionally
(which is a safe value), and issue a clts() just before exiting vcpu context
if the task indeed owns the fpu.

Saves ~50 cycles/exit.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:28 +03:00
Avi Kivity
c332c83ae7 KVM: VMX: Simplify vmx_get_nmi_mask()
!! is not needed due to the cast to bool.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:27 +03:00
Brian Gerst
8c06585d64 x86: Remove redundant K6 MSRs
MSR_K6_EFER is unused, and MSR_K6_STAR is redundant with MSR_STAR.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1279371808-24804-1-git-send-email-brgerst@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-07-21 21:23:05 -07:00
Avi Kivity
da38f43859 KVM: VMX: Fix host MSR_KERNEL_GS_BASE corruption
enter_lmode() and exit_lmode() modify the guest's EFER.LMA before calling
vmx_set_efer().  However, the latter function depends on the value of EFER.LMA
to determine whether MSR_KERNEL_GS_BASE needs reloading, via
vmx_load_host_state().  With EFER.LMA changing under its feet, it took the
wrong choice and corrupted userspace's %gs.

This causes 32-on-64 host userspace to fault.

Fix not touching EFER.LMA; instead ask vmx_set_efer() to change it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-07-06 11:41:31 +03:00
Linus Torvalds
98edb6ca41 Merge branch 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (269 commits)
  KVM: x86: Add missing locking to arch specific vcpu ioctls
  KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls
  KVM: MMU: Segregate shadow pages with different cr0.wp
  KVM: x86: Check LMA bit before set_efer
  KVM: Don't allow lmsw to clear cr0.pe
  KVM: Add cpuid.txt file
  KVM: x86: Tell the guest we'll warn it about tsc stability
  x86, paravirt: don't compute pvclock adjustments if we trust the tsc
  x86: KVM guest: Try using new kvm clock msrs
  KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
  KVM: x86: add new KVMCLOCK cpuid feature
  KVM: x86: change msr numbers for kvmclock
  x86, paravirt: Add a global synchronization point for pvclock
  x86, paravirt: Enable pvclock flags in vcpu_time_info structure
  KVM: x86: Inject #GP with the right rip on efer writes
  KVM: SVM: Don't allow nested guest to VMMCALL into host
  KVM: x86: Fix exception reinjection forced to true
  KVM: Fix wallclock version writing race
  KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
  KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
  ...
2010-05-21 17:16:21 -07:00
Shane Wang
cafd66595d KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
Per document, for feature control MSR:

  Bit 1 enables VMXON in SMX operation. If the bit is clear, execution
        of VMXON in SMX operation causes a general-protection exception.
  Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution
        of VMXON outside SMX operation causes a general-protection exception.

This patch is to enable this kind of check with SMX for VMXON in KVM.

Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:34 +03:00
Avi Kivity
84ad33ef5d KVM: VMX: Atomically switch efer if EPT && !EFER.NX
When EPT is enabled, we cannot emulate EFER.NX=0 through the shadow page
tables.  This causes accesses through ptes with bit 63 set to succeed instead
of failing a reserved bit check.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:32 +03:00
Avi Kivity
61d2ef2ce3 KVM: VMX: Add facility to atomically switch MSRs on guest entry/exit
Some guest msr values cannot be used on the host (for example. EFER.NX=0),
so we need to switch them atomically during guest entry or exit.

Add a facility to program the vmx msr autoload registers accordingly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:31 +03:00
Avi Kivity
0ee75bead8 KVM: Let vcpu structure alignment be determined at runtime
vmx and svm vcpus have different contents and therefore may have different
alignmment requirements.  Let each specify its required alignment.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:29 +03:00
Linus Torvalds
4d7b4ac22f Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
  perf tools: Add mode to build without newt support
  perf symbols: symbol inconsistency message should be done only at verbose=1
  perf tui: Add explicit -lslang option
  perf options: Type check all the remaining OPT_ variants
  perf options: Type check OPT_BOOLEAN and fix the offenders
  perf options: Check v type in OPT_U?INTEGER
  perf options: Introduce OPT_UINTEGER
  perf tui: Add workaround for slang < 2.1.4
  perf record: Fix bug mismatch with -c option definition
  perf options: Introduce OPT_U64
  perf tui: Add help window to show key associations
  perf tui: Make <- exit menus too
  perf newt: Add single key shortcuts for zoom into DSO and threads
  perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
  perf newt: Fix the 'A'/'a' shortcut for annotate
  perf newt: Make <- exit the ui_browser
  x86, perf: P4 PMU - fix counters management logic
  perf newt: Make <- zoom out filters
  perf report: Report number of events, not samples
  perf hist: Clarify events_stats fields usage
  ...

Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
2010-05-18 08:19:03 -07:00
Joerg Roedel
ce7ddec4bb KVM: x86: Allow marking an exception as reinjected
This patch adds logic to kvm/x86 which allows to mark an
injected exception as reinjected. This allows to remove an
ugly hack from svm_complete_interrupts that prevented
exceptions from being reinjected at all in the nested case.
The hack was necessary because an reinjected exception into
the nested guest could cause a nested vmexit emulation. But
reinjected exceptions must not intercept. The downside of
the hack is that a exception that in injected could get
lost.
This patch fixes the problem and puts the code for it into
generic x86 files because. Nested-VMX will likely have the
same problem and could reuse the code.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:26 +03:00
Joerg Roedel
d4330ef2fb KVM: x86: Add callback to let modules decide over some supported cpuid bits
This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decission about some supported cpuid bits to the
architecture modules.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:23 +03:00
Lai Jiangshan
cdbecfc398 KVM: VMX: free vpid when fail to create vcpu
Fix bug of the exception path, free allocated vpid when fail
to create vcpu.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:10 +03:00
Lai Jiangshan
90d83dc3d4 KVM: use the correct RCU API for PROVE_RCU=y
The RCU/SRCU API have already changed for proving RCU usage.

I got the following dmesg when PROVE_RCU=y because we used incorrect API.
This patch coverts rcu_deference() to srcu_dereference() or family API.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
arch/x86/kvm/mmu.c:3020 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by qemu-system-x86/8550:
 #0:  (&kvm->slots_lock){+.+.+.}, at: [<ffffffffa011a6ac>] kvm_set_memory_region+0x29/0x50 [kvm]
 #1:  (&(&kvm->mmu_lock)->rlock){+.+...}, at: [<ffffffffa012262d>] kvm_arch_commit_memory_region+0xa6/0xe2 [kvm]

stack backtrace:
Pid: 8550, comm: qemu-system-x86 Not tainted 2.6.34-rc4-tip-01028-g939eab1 #27
Call Trace:
 [<ffffffff8106c59e>] lockdep_rcu_dereference+0xaa/0xb3
 [<ffffffffa012f6c1>] kvm_mmu_calculate_mmu_pages+0x44/0x7d [kvm]
 [<ffffffffa012263e>] kvm_arch_commit_memory_region+0xb7/0xe2 [kvm]
 [<ffffffffa011a5d7>] __kvm_set_memory_region+0x636/0x6e2 [kvm]
 [<ffffffffa011a6ba>] kvm_set_memory_region+0x37/0x50 [kvm]
 [<ffffffffa015e956>] vmx_set_tss_addr+0x46/0x5a [kvm_intel]
 [<ffffffffa0126592>] kvm_arch_vm_ioctl+0x17a/0xcf8 [kvm]
 [<ffffffff810a8692>] ? unlock_page+0x27/0x2c
 [<ffffffff810bf879>] ? __do_fault+0x3a9/0x3e1
 [<ffffffffa011b12f>] kvm_vm_ioctl+0x364/0x38d [kvm]
 [<ffffffff81060cfa>] ? up_read+0x23/0x3d
 [<ffffffff810f3587>] vfs_ioctl+0x32/0xa6
 [<ffffffff810f3b19>] do_vfs_ioctl+0x495/0x4db
 [<ffffffff810e6b2f>] ? fget_light+0xc2/0x241
 [<ffffffff810e416c>] ? do_sys_open+0x104/0x116
 [<ffffffff81382d6d>] ? retint_swapgs+0xe/0x13
 [<ffffffff810f3ba6>] sys_ioctl+0x47/0x6a
 [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:01 +03:00
Avi Kivity
9beeaa2d68 Merge branch 'perf'
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:17:58 +03:00
Gleb Natapov
acb5451789 KVM: prevent spurious exit to userspace during task switch emulation.
If kvm_task_switch() fails code exits to userspace without specifying
exit reason, so the previous exit reason is reused by userspace. Fix
this by specifying exit reason correctly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:55 +03:00
Jan Kiszka
e269fb2189 KVM: x86: Push potential exception error code on task switches
When a fault triggers a task switch, the error code, if existent, has to
be pushed on the new task's stack. Implement the missing bits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:46 +03:00
Gleb Natapov
020df0794f KVM: move DR register access handling into generic code
Currently both SVM and VMX have their own DR handling code. Move it to
x86.c.

Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:39 +03:00
Gleb Natapov
cf8f70bfe3 KVM: x86 emulator: fix in/out emulation.
in/out emulation is broken now. The breakage is different depending
on where IO device resides. If it is in userspace emulator reports
emulation failure since it incorrectly interprets kvm_emulate_pio()
return value. If IO device is in the kernel emulation of 'in' will do
nothing since kvm_emulate_pio() stores result directly into vcpu
registers, so emulator will overwrite result of emulation during
commit of shadowed register.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:25 +03:00
Gui Jianfeng
3129994458 KVM: VMX: change to use bool return values
Make use of bool as return values, and remove some useless
bool value converting. Thanks Avi to point this out.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:51 +03:00
Wei Yongjun
ec68798c8f KVM: x86: Use native_store_idt() instead of kvm_get_idt()
This patch use generic linux function native_store_idt()
instead of kvm_get_idt(), and also removed the useless
function kvm_get_idt().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:28 +03:00
Avi Kivity
5bfd8b5455 KVM: Move kvm_exit tracepoint rip reading inside tracepoint
Reading rip is expensive on vmx, so move it inside the tracepoint so we only
incur the cost if tracing is enabled.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:25 +03:00
Jan Kiszka
f8c5fae166 KVM: VMX: blocked-by-sti must not defer NMI injections
As the processor may not consider GUEST_INTR_STATE_STI as a reason for
blocking NMI, it could return immediately with EXIT_REASON_NMI_WINDOW
when we asked for it. But as we consider this state as NMI-blocking, we
can run into an endless loop.

Resolve this by allowing NMI injection if just GUEST_INTR_STATE_STI is
active (originally suggested by Gleb). Intel confirmed that this is
safe, the processor will never complain about NMI injection in this
state.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
KVM-Stable-Tag
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-13 01:31:37 -03:00
Gleb Natapov
2d49ec72d3 KVM: move segment_base() into vmx.c
segment_base() is used only by vmx so move it there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:39 +03:00
Gleb Natapov
d6ab1ed446 KVM: Drop kvm_get_gdt() in favor of generic linux function
Linux now has native_store_gdt() to do the same. Use it instead of
kvm local version.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:32 +03:00
Jan Kiszka
48005f64d0 KVM: x86: Save&restore interrupt shadow mask
The interrupt shadow created by STI or MOV-SS-like operations is part of
the VCPU state and must be preserved across migration. Transfer it in
the spare padding field of kvm_vcpu_events.interrupt.

As a side effect we now have to make vmx_set_interrupt_shadow robust
against both shadow types being set. Give MOV SS a higher priority and
skip STI in that case to avoid that VMX throws a fault on next entry.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:38:28 +03:00
Gleb Natapov
89a27f4d0e KVM: use desc_ptr struct instead of kvm private descriptor_table
x86 arch defines desc_ptr for idt/gdt pointers, no need to define
another structure in kvm code.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:27:28 +03:00
Ingo Molnar
70bce3ba77 Merge branch 'linus' into perf/core
Merge reason: merge the latest fixes, update to latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:10:30 +02:00
Avi Kivity
78ac8b47c5 KVM: VMX: Save/restore rflags.vm correctly in real mode
Currently we set eflags.vm unconditionally when entering real mode emulation
through virtual-8086 mode, and clear it unconditionally when we enter protected
mode.  The means that the following sequence

  KVM_SET_REGS  (rflags.vm=1)
  KVM_SET_SREGS (cr0.pe=1)

Ends up with rflags.vm clear due to KVM_SET_SREGS triggering enter_pmode().

Fix by shadowing rflags.vm (and rflags.iopl) correctly while in real mode:
reads and writes to those bits access a shadow register instead of the actual
register.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-04-20 12:59:31 +03:00
Zhang, Yanmin
ff9d07a0e7 KVM: Implement perf callbacks for guest sampling
Below patch implements the perf_guest_info_callbacks on kvm.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-19 12:36:50 +03:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Jan Kiszka
c573cd2293 KVM: VMX: Update instruction length on intercepted BP
We intercept #BP while in guest debugging mode. As VM exits due to
intercepted exceptions do not necessarily come with valid
idt_vectoring, we have to update event_exit_inst_len explicitly in such
cases. At least in the absence of migration, this ensures that
re-injections of #BP will find and use the correct instruction length.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: stable@kernel.org (2.6.32, 2.6.33)
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:14 -03:00
Sheng Yang
a19a6d1131 KVM: VMX: Rename VMX_EPT_IGMT_BIT to VMX_EPT_IPAT_BIT
Following the new SDM. Now the bit is named "Ignore PAT memory type".

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:09 -03:00
Julia Lawall
c45b4fd416 KVM: VMX: Remove redundant test in vmx_set_efer()
msr was tested above, so the second test is not needed.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r@
expression *x;
expression e;
identifier l;
@@

if (x == NULL || ...) {
    ... when forall
    return ...; }
... when != goto l;
    when != x = e
    when != &x
*x == NULL
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:07 -03:00
Avi Kivity
ebcbab4c03 KVM: VMX: Wire up .fpu_activate() callback
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:07 -03:00
Gui Jianfeng
6d3e435e70 KVM: VMX: Remove redundant check in vm_need_virtualize_apic_accesses()
flexpriority_enabled implies cpu_has_vmx_virtualize_apic_accesses() returning
true, so we don't need this check here.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Avi Kivity
59200273c4 KVM: Trace failed msr reads and writes
Record failed msrs reads and writes, and the fact that they failed as well.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Avi Kivity
81231c698a KVM: VMX: Pass cr0.mp through to the guest when the fpu is active
When cr0.mp is clear, the guest doesn't expect a #NM in response to
a WAIT instruction.  Because we always keep cr0.mp set, it will get
a #NM, and potentially be confused.

Fix by keeping cr0.mp set only when the fpu is inactive, and passing
it through when inactive.

Reported-by: Lorenzo Martignoni <martignlo@gmail.com>
Analyzed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Avi Kivity
f6801dff23 KVM: Rename vcpu->shadow_efer to efer
None of the other registers have the shadow_ prefix.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
3eeb3288bc KVM: Add a helper for checking if the guest is in protected mode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
6b52d18605 KVM: Activate fpu on clts
Assume that if the guest executes clts, it knows what it's doing, and load the
guest fpu to prevent an #NM exception.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Jan Kiszka
fd7373cce7 KVM: VMX: Clean up DR6 emulation
As we trap all debug register accesses, we do not need to switch real
DR6 at all. Clean up update_exception_bitmap at this chance, too.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:02 -03:00
Jan Kiszka
138ac8d88f KVM: VMX: Fix emulation of DR4 and DR5
Make sure DR4 and DR5 are aliased to DR6 and DR7, respectively, if
CR4.DE is not set.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Jan Kiszka
f248341529 KVM: VMX: Fix exceptions of mov to dr
Injecting GP without an error code is a bad idea (causes unhandled guest
exits). Moreover, we must not skip the instruction if we injected an
exception.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Sheng Yang
7062dcaa36 KVM: VMX: Remove emulation failure report
As Avi noted:

>There are two problems with the kernel failure report.  First, it
>doesn't report enough data - registers, surrounding instructions, etc.
>that are needed to explain what is going on.  Second, it can flood
>dmesg, which is a pretty bad thing to do.

So we remove the emulation failure report in handle_invalid_guest_state(),
and would inspected the guest using userspace tool in the future.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Avi Kivity
edcafe3c5a KVM: VMX: Give the guest ownership of cr0.ts when the fpu is active
If the guest fpu is loaded, there is nothing interesing about cr0.ts; let
the guest play with it as it will.  This makes context switches between fpu
intensive guest processes faster, as we won't trap the clts and cr0 write
instructions.

[marcelo: fix cr0 read shadow update on fpu deactivation; kills F8 install]

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
02daab21d9 KVM: Lazify fpu activation and deactivation
Defer fpu deactivation as much as possible - if the guest fpu is loaded, keep
it loaded until the next heavyweight exit (where we are forced to unload it).
This reduces unnecessary exits.

We also defer fpu activation on clts; while clts signals the intent to use the
fpu, we can't be sure the guest will actually use it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
e8467fda83 KVM: VMX: Allow the guest to own some cr0 bits
We will use this later to give the guest ownership of cr0.ts.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
4d4ec08745 KVM: Replace read accesses of vcpu->arch.cr0 by an accessor
Since we'd like to allow the guest to own a few bits of cr0 at times, we need
to know when we access those bits.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
a1f83a74fe KVM: VMX: trace clts and lmsw instructions as cr accesses
clts writes cr0.ts; lmsw writes cr0[0:15] - record that in ftrace.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Sheng Yang
878403b788 KVM: VMX: Enable EPT 1GB page support
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:46 -03:00
Sheng Yang
17cc393596 KVM: x86: Rename gb_page_enable() to get_lpage_level() in kvm_x86_ops
Then the callback can provide the maximum supported large page level, which
is more flexible.

Also move the gb page support into x86_64 specific.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:46 -03:00
Avi Kivity
f4c9e87c83 KVM: Fill out ftrace exit reason strings
Some exit reasons missed their strings; fill out the table.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
79fac95ecf KVM: convert slots_lock to a mutex
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
f656ce0185 KVM: switch vcpu context to use SRCU
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
bc6678a33d KVM: introduce kvm->srcu and convert kvm_set_memory_region to SRCU update
Use two steps for memslot deletion: mark the slot invalid (which stops
instantiation of new shadow pages for that slot, but allows destruction),
then instantiate the new empty slot.

Also simplifies kvm_handle_hva locking.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:44 -03:00
Marcelo Tosatti
46a26bf557 KVM: modify memslots layout in struct kvm
Have a pointer to an allocated region inside struct kvm.

[alex: fix ppc book 3s]

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:43 -03:00
Sheng Yang
4e47c7a6d7 KVM: VMX: Add instruction rdtscp support for guest
Before enabling, execution of "rdtscp" in guest would result in #UD.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Sheng Yang
0e85188049 KVM: Add cpuid_update() callback to kvm_x86_ops
Sometime, we need to adjust some state in order to reflect guest CPUID
setting, e.g. if we don't expose rdtscp to guest, we won't want to enable
it on hardware. cpuid_update() is introduced for this purpose.

Also export kvm_find_cpuid_entry() for later use.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Sheng Yang
8a7e3f01e6 KVM: VMX: Remove redundant variable
It's no longer necessary.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Avi Kivity
bc23008b61 KVM: VMX: Fold ept_update_paging_mode_cr4() into its caller
ept_update_paging_mode_cr4() accesses vcpu->arch.cr4 directly, which usually
needs to be accessed via kvm_read_cr4().  In this case, we can't, since cr4
is in the process of being updated.  Instead of adding inane comments, fold
the function into its caller (vmx_set_cr4), so it can use the not-yet-committed
cr4 directly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Avi Kivity
ce03e4f21a KVM: VMX: When using ept, allow the guest to own cr4.pge
We make no use of cr4.pge if ept is enabled, but the guest does (to flush
global mappings, as with vmap()), so give the guest ownership of this bit.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
4c38609ac5 KVM: VMX: Make guest cr4 mask more conservative
Instead of specifying the bits which we want to trap on, specify the bits
which we allow the guest to change transparently.  This is safer wrt future
changes to cr4.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
fc78f51938 KVM: Add accessor for reading cr4 (or some bits of cr4)
Some bits of cr4 can be owned by the guest on vmx, so when we read them,
we copy them to the vcpu structure.  In preparation for making the set of
guest-owned bits dynamic, use helpers to access these bits so we don't need
to know where the bit resides.

No changes to svm since all bits are host-owned there.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
cdc0e24456 KVM: VMX: Move some cr[04] related constants to vmx.c
They have no place in common code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Sheng Yang
59708670b6 KVM: VMX: Trap and invalid MWAIT/MONITOR instruction
We don't support these instructions, but guest can execute them even if the
feature('monitor') haven't been exposed in CPUID. So we would trap and inject
a #UD if guest try this way.

Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
d5696725b2 KVM: VMX: Fix comparison of guest efer with stale host value
update_transition_efer() masks out some efer bits when deciding whether
to switch the msr during guest entry; for example, NX is emulated using the
mmu so we don't need to disable it, and LMA/LME are handled by the hardware.

However, with shared msrs, the comparison is made against a stale value;
at the time of the guest switch we may be running with another guest's efer.

Fix by deferring the mask/compare to the actual point of guest entry.

Noted by Marcelo.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:34:20 +02:00
Sheng Yang
046d87103a KVM: VMX: Disable unrestricted guest when EPT disabled
Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest
supported processors.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:25 +02:00
Jan Kiszka
3cfc3092f4 KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
This new IOCTL exports all yet user-invisible states related to
exceptions, interrupts, and NMIs. Together with appropriate user space
changes, this fixes sporadic problems of vmsave/restore, live migration
and system reset.

[avi: future-proof abi by adding a flags field]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:25 +02:00
Avi Kivity
65ac726404 KVM: VMX: Report unexpected simultaneous exceptions as internal errors
These happen when we trap an exception when another exception is being
delivered; we only expect these with MCEs and page faults.  If something
unexpected happens, things probably went south and we're better off reporting
an internal error and freezing.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:24 +02:00
Avi Kivity
a9c7399d6c KVM: Allow internal errors reported to userspace to carry extra data
Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available.  Add extra data to internal error
reporting so we can expose it to the debugger.  Extra data is specific
to the suberror.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:24 +02:00
Avi Kivity
92c0d90015 KVM: VMX: Remove vmx->msr_offset_efer
This variable is used to communicate between a caller and a callee; switch
to a function argument instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:23 +02:00
Marcelo Tosatti
7c93be44a4 KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from
outside guest context. Similarly pdptrs are updated via load_pdptrs.

Let kvm_set_cr3 perform the update, removing it from the vcpu_run
fast path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:22 +02:00
Avi Kivity
26bb0981b3 KVM: VMX: Use shared msr infrastructure
Instead of reloading syscall MSRs on every preemption, use the new shared
msr infrastructure to reload them at the last possible minute (just before
exit to userspace).

Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be
reloaded as well).

[jan: fix slot index missing indirection]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:22 +02:00
Avi Kivity
44ea2b1758 KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area
Currently MSR_KERNEL_GS_BASE is saved and restored as part of the
guest/host msr reloading.  Since we wish to lazy-restore all the other
msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using
the common code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Eduardo Habkost
fa40052ca0 KVM: VMX: Use macros instead of hex value on cr0 initialization
This should have no effect, it is just to make the code clearer.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Marcelo Tosatti
9fb41ba896 KVM: VMX: fix handle_pause declaration
There's no kvm_run argument anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:18 +02:00
Zhai, Edwin
4b8d54f972 KVM: VMX: Add support for Pause-Loop Exiting
New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution
control fields:
PLE_Gap    - upper bound on the amount of time between two successive
             executions of PAUSE in a loop.
PLE_Window - upper bound on the amount of time a guest is allowed to execute in
             a PAUSE loop

If the time, between this execution of PAUSE and previous one, exceeds the
PLE_Gap, processor consider this PAUSE belongs to a new loop.
Otherwise, processor determins the the total execution time of this loop(since
1st PAUSE in this loop), and triggers a VM exit if total time exceeds the
PLE_Window.
* Refer SDM volume 3b section 21.6.13 & 22.1.3.

Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one VP
is sched-out after hold a spinlock, then other VPs for same lock are sched-in
to waste the CPU time.

Our tests indicate that most spinlocks are held for less than 212 cycles.
Performance tests show that with 2X LP over-commitment we can get +2% perf
improvement for kernel build(Even more perf gain with more LPs).

Signed-off-by: Zhai Edwin <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:17 +02:00
Jan Kiszka
355be0b930 KVM: x86: Refactor guest debug IOCTL handling
Much of so far vendor-specific code for setting up guest debug can
actually be handled by the generic code. This also fixes a minor deficit
in the SVM part /wrt processing KVM_GUESTDBG_ENABLE.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:14 +02:00
Zachary Amsden
3230bb4707 KVM: Fix hotplug of CPUs
Both VMX and SVM require per-cpu memory allocation, which is done at module
init time, for only online cpus.

Backend was not allocating enough structure for all possible CPUs, so
new CPUs coming online could not be hardware enabled.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:13 +02:00
Alexander Graf
10474ae894 KVM: Activate Virtualization On Demand
X86 CPUs need to have some magic happening to enable the virtualization
extensions on them. This magic can result in unpleasant results for
users, like blocking other VMMs from working (vmx) or using invalid TLB
entries (svm).

Currently KVM activates virtualization when the respective kernel module
is loaded. This blocks us from autoloading KVM modules without breaking
other VMMs.

To circumvent this problem at least a bit, this patch introduces on
demand activation of virtualization. This means, that instead
virtualization is enabled on creation of the first virtual machine
and disabled on destruction of the last one.

So using this, KVM can be easily autoloaded, while keeping other
hypervisors usable.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:10 +02:00
Mohammed Gamal
80ced186d1 KVM: VMX: Enhance invalid guest state emulation
- Change returned handle_invalid_guest_state() to return relevant exit codes
- Move triggering the emulation from vmx_vcpu_run() to vmx_handle_exit()
- Return to userspace instead of repeatedly trying to emulate instructions that have already failed

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:09 +02:00
Avi Kivity
851ba6922a KVM: Don't pass kvm_run arguments
They're just copies of vcpu->run, which is readily accessible.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Marcelo Tosatti
eb5109e311 KVM: VMX: flush TLB with INVEPT on cpu migration
It is possible that stale EPTP-tagged mappings are used, if a
vcpu migrates to a different pcpu.

Set KVM_REQ_TLB_FLUSH in vmx_vcpu_load, when switching pcpus, which
will invalidate both VPID and EPT mappings on the next vm-entry.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:24 +02:00
Avi Kivity
0a79b00952 KVM: VMX: Check cpl before emulating debug register access
Debug registers may only be accessed from cpl 0.  Unfortunately, vmx will
code to emulate the instruction even though it was issued from guest
userspace, possibly leading to an unexpected trap later.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:10 +03:00
Gleb Natapov
542423b0dd KVM: VMX: call vmx_load_host_state() only if msr is cached
No need to call it before each kvm_(set|get)_msr_common()

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:07 +03:00
Avi Kivity
e8a48342e9 KVM: VMX: Conditionally reload debug register 6
Only reload debug register 6 if we're running with the guest's
debug registers.  Saves around 150 cycles from the guest lightweight
exit path.

dr6 contains a couple of bits that are updated on #DB, so intercept
that unconditionally and update those bits then.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:06 +03:00
Gleb Natapov
5fff7d270b KVM: VMX: Fix cr8 exiting control clobbering by EPT
Don't call adjust_vmx_controls() two times for the same control.
It restores options that were dropped earlier.  This loses us the cr8
exit control, which causes a massive performance regression Windows x64.

Cc: stable@kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:57 +03:00
Sheng Yang
95eb84a758 KVM: VMX: Fix EPT with WP bit change during paging
QNX update WP bit when paging enabled, which is not covered yet. This one fix
QNX boot with EPT.

Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:53 +03:00
Avi Kivity
345dcaa8fd KVM: VMX: Adjust rflags if in real mode emulation
We set rflags.vm86 when virtualizing real mode to do through vm8086 mode;
so we need to take it out again when reading rflags.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:43 +03:00
Roel Kluin
3a34a8810b KVM: fix EFER read buffer overflow
Check whether index is within bounds before grabbing the element.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Avi Kivity
eab4b8aa34 KVM: VMX: Optimize vmx_get_cpl()
Instead of calling vmx_get_segment() (which reads a whole bunch of
vmcs fields), read only the cs selector which contains the cpl.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Joerg Roedel
344f414fa0 KVM: report 1GB page support to userspace
If userspace knows that the kernel part supports 1GB pages it can enable
the corresponding cpuid bit so that guests actually use GB pages.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Jan Kiszka
7f582ab6d8 KVM: VMX: Avoid to return ENOTSUPP to userland
Choose some allowed error values for the cases VMX returned ENOTSUPP so
far as these values could be returned by the KVM_RUN IOCTL.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:17 +03:00
Sheng Yang
b927a3cec0 KVM: VMX: Introduce KVM_SET_IDENTITY_MAP_ADDR ioctl
Now KVM allow guest to modify guest's physical address of EPT's identity mapping page.

(change from v1, discard unnecessary check, change ioctl to accept parameter
address rather than value)

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:16 +03:00
Marcelo Tosatti
229456fc34 KVM: convert custom marker based tracing to event traces
This allows use of the powerful ftrace infrastructure.

See Documentation/trace/ for usage information.

[avi, stephen: various build fixes]
[sheng: fix control register breakage]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Avi Kivity
d3edefc003 KVM: VMX: Only reload guest cr2 if different from host cr2
cr2 changes only rarely, and writing it is expensive.  Avoid the costly cr2
writes by checking if it does not already hold the desired value.

Shaves 70 cycles off the vmexit latency.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Marcelo Tosatti
54dee9933e KVM: VMX: conditionally disable 2M pages
Disable usage of 2M pages if VMX_EPT_2MB_PAGE_BIT (bit 16) is clear
in MSR_IA32_VMX_EPT_VPID_CAP and EPT is enabled.

[avi: s/largepages_disabled/largepages_enabled/ to avoid negative logic]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
68f89400bc KVM: VMX: EPT misconfiguration handler
Handler for EPT misconfiguration which checks for valid state
in the shadow pagetables, printing the spte on each level.

The separate WARN_ONs are useful for kerneloops.org.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
e799794e02 KVM: VMX: more MSR_IA32_VMX_EPT_VPID_CAP capability bits
Required for EPT misconfiguration handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:55 +03:00
Andre Przywara
71db602322 KVM: Move performance counter MSR access interception to generic x86 path
The performance counter MSRs are different for AMD and Intel CPUs and they
are chosen mainly by the CPUID vendor string. This patch catches writes to
all addresses (regardless of VMX/SVM path) and handles them in the generic
MSR handler routine. Writing a 0 into the event select register is something
we perfectly emulate ;-), so don't print out a warning to dmesg in this
case.
This fixes booting a 64bit Windows guest with an AMD CPUID on an Intel host.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Gleb Natapov
c5af89b68a KVM: Introduce kvm_vcpu_is_bsp() function.
Use it instead of open code "vcpu_id zero is BSP" assumption.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:51 +03:00
Avi Kivity
7ffd92c53c KVM: VMX: Move rmode structure to vmx-specific code
rmode is only used in vmx, so move it to vmx.c

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:50 +03:00
Nitin A Kamble
3a624e29c7 KVM: VMX: Support Unrestricted Guest feature
"Unrestricted Guest" feature is added in the VMX specification.
Intel Westmere and onwards processors will support this feature.

    It allows kvm guests to run real mode and unpaged mode
code natively in the VMX mode when EPT is turned on. With the
unrestricted guest there is no need to emulate the guest real mode code
in the vm86 container or in the emulator. Also the guest big real mode
code works like native.

  The attached patch enhances KVM to use the unrestricted guest feature
if available on the processor. It also adds a new kernel/module
parameter to disable the unrestricted guest feature at the boot time.

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:49 +03:00
Avi Kivity
596ae89565 KVM: VMX: Fix reporting of unhandled EPT violations
Instead of returning -ENOTSUPP, exit normally but indicate the hardware
exit reason.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Avi Kivity
6de4f3ada4 KVM: Cache pdptrs
Instead of reloading the pdptrs on every entry and exit (vmcs writes on vmx,
guest memory access on svm) extract them on demand.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Avi Kivity
8f5d549f02 KVM: VMX: Simplify pdptr and cr3 management
Instead of reading the PDPTRs from memory after every exit (which is slow
and wrong, as the PDPTRs are stored on the cpu), sync the PDPTRs from
memory to the VMCS before entry, and from the VMCS to memory after exit.
Do the same for cr3.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Avi Kivity
2d84e993a8 KVM: VMX: Avoid duplicate ept tlb flush when setting cr3
vmx_set_cr3() will call vmx_tlb_flush(), which will flush the ept context.
So there is no need to call ept_sync_context() explicitly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Gleb Natapov
787ff73637 KVM: Drop interrupt shadow when single stepping should be done only on VMX
The problem exists only on VMX. Also currently we skip this step if
there is pending exception. The patch fixes this too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:41 +03:00
Jaswinder Singh Rajput
af24a4e4ae KVM: Replace MSR_IA32_TIME_STAMP_COUNTER with MSR_IA32_TSC of msr-index.h
Use standard msr-index.h's MSR declaration.

MSR_IA32_TSC is better than MSR_IA32_TIME_STAMP_COUNTER as it also solves
80 column issue.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:38 +03:00
Gleb Natapov
ae0bb3e011 KVM: VMX: Properly handle software interrupt re-injection in real mode
When reinjecting a software interrupt or exception, use the correct
instruction length provided by the hardware instead of a hardcoded 1.

Fixes problems running the suse 9.1 livecd boot loader.

Problem introduced by commit f0a3602c20 ("KVM: Move interrupt injection
logic to x86.c").

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:38 +03:00
Jan Kiszka
263799a361 KVM: VMX: Fix locking imbalance on emulation failure
We have to disable preemption and IRQs on every exit from
handle_invalid_guest_state, otherwise we generate at least a
preempt_disable imbalance.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:59:45 +03:00
Jan Kiszka
34f0c1ad27 KVM: VMX: Fix locking order in handle_invalid_guest_state
Release and re-acquire preemption and IRQ lock in the same order as
vcpu_enter_guest does.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:59:44 +03:00
Avi Kivity
e3c7cb6ad7 KVM: VMX: Handle vmx instruction vmexits
IF a guest tries to use vmx instructions, inject a #UD to let it know the
instruction is not implemented, rather than crashing.

This prevents guest userspace from crashing the guest kernel.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-28 14:10:31 +03:00
Mel Gorman
6484eb3e2a page allocator: do not check NUMA node ID when the caller knows the node is valid
Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node".  However, a number of the callers in
fast paths know for a fact their node is valid.  To avoid a comparison and
branch, this patch adds alloc_pages_exact_node() that only checks the nid
with VM_BUG_ON().  Callers that know their node is valid are then
converted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:32 -07:00
Andi Kleen
a0861c02a9 KVM: Add VT-x machine check support
VT-x needs an explicit MC vector intercept to handle machine checks in the
hyper visor.

It also has a special option to catch machine checks that happen
during VT entry.

Do these interceptions and forward them to the Linux machine check
handler. Make it always look like user space is interrupted because
the machine check handler treats kernel/user space differently.

Thanks to Jiang Yunhong for help and testing.

Cc: stable@kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 12:27:08 +03:00
Nitin A Kamble
56b237e31a KVM: VMX: Rename rmode.active to rmode.vm86_active
That way the interpretation of rmode.active becomes more clear with
unrestricted guest code.

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:49:00 +03:00
Gleb Natapov
20f65983e3 KVM: Move "exit due to NMI" handling into vmx_complete_interrupts()
To save us one reading of VM_EXIT_INTR_INFO.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:59 +03:00
Gleb Natapov
66fd3f7f90 KVM: Do not re-execute INTn instruction.
Re-inject event instead. This is what Intel suggest. Also use correct
instruction length when re-injecting soft fault/interrupt.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:58 +03:00
Gleb Natapov
3298b75c88 KVM: Unprotect a page if #PF happens during NMI injection.
It is done for exception and interrupt already.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:57 +03:00
Glauber Costa
2809f5d2c4 KVM: Replace ->drop_interrupt_shadow() by ->set_interrupt_shadow()
This patch replaces drop_interrupt_shadow with the more
general set_interrupt_shadow, that can either drop or raise
it, depending on its parameter.  It also adds ->get_interrupt_shadow()
for future use.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:54 +03:00
Sheng Yang
522c68c441 KVM: Enable snooping control for supported hardware
Memory aliases with different memory type is a problem for guest. For the guest
without assigned device, the memory type of guest memory would always been the
same as host(WB); but for the assigned device, some part of memory may be used
as DMA and then set to uncacheable memory type(UC/WC), which would be a conflict of
host memory type then be a potential issue.

Snooping control can guarantee the cache correctness of memory go through the
DMA engine of VT-d.

[avi: fix build on ia64]

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:50 +03:00
Sheng Yang
4b12f0de33 KVM: Replace get_mt_mask_shift with get_mt_mask
Shadow_mt_mask is out of date, now it have only been used as a flag to indicate
if TDP enabled. Get rid of it and use tdp_enabled instead.

Also put memory type logical in kvm_x86_ops->get_mt_mask().

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:49 +03:00
Gleb Natapov
14d0bc1f7c KVM: Get rid of get_irq() callback
It just returns pending IRQ vector from the queue for VMX/SVM.
Get IRQ directly from the queue before migration and put it back
after.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:49 +03:00
Gleb Natapov
95ba827313 KVM: SVM: Add NMI injection support
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:48 +03:00
Gleb Natapov
c4282df98a KVM: Get rid of arch.interrupt_window_open & arch.nmi_window_open
They are recalculated before each use anyway.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:48 +03:00
Gleb Natapov
0a5fff1923 KVM: Do not report TPR write to userspace if new value bigger or equal to a previous one.
Saves many exits to userspace in a case of IRQ chip in userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:47 +03:00
Gleb Natapov
1d6ed0cb95 KVM: Remove inject_pending_vectors() callback
It is the same as inject_pending_irq() for VMX/SVM now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:47 +03:00
Gleb Natapov
1cb948ae86 KVM: Remove exception_injected() callback.
It always return false for VMX/SVM now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:46 +03:00
Gleb Natapov
1f21e79aac KVM: VMX: Cleanup vmx_intr_assist()
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
863e8e658e KVM: VMX: Consolidate userspace and kernel interrupt injection for VMX
Use the same callback to inject irq/nmi events no matter what irqchip is
in use. Only from VMX for now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
8061823a25 KVM: Make kvm_cpu_(has|get)_interrupt() work for userspace irqchip too
At the vector level, kernel and userspace irqchip are fairly similar.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
64a7ec0668 KVM: Fix unneeded instruction skipping during task switching.
There is no need to skip instruction if the reason for a task switch
is a task gate in IDT and access to it is caused by an external even.
The problem  is currently solved only for VMX since there is no reliable
way to skip an instruction in SVM. We should emulate it instead.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Gleb Natapov
8843419048 KVM: VMX: Do not zero idt_vectoring_info in vmx_complete_interrupts().
We will need it later in task_switch().
Code in handle_exception() is dead. is_external_interrupt(vect_info)
will always be false since idt_vectoring_info is zeroed in
vmx_complete_interrupts().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
37b96e9880 KVM: VMX: Rewrite vmx_complete_interrupt()'s twisted maze of if() statements
...with a more straightforward switch().

Also fix a bug when NMI could be dropped on exit. Although this should
never happen in practice, since NMIs can only be injected, never triggered
internally by the guest like exceptions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
7b4a25cb29 KVM: VMX: Fix handling of a fault during NMI unblocked due to IRET
Bit 12 is undefined in any of the following cases:
 If the VM exit sets the valid bit in the IDT-vectoring information field.
 If the VM exit is due to a double fault.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
93ba03c2e2 KVM: VMX: Fix feature testing
The testing of feature is too early now, before vmcs_config complete initialization.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
045471563d KVM: VMX: Clean up Flex Priority related
And clean paranthes on returns.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
f9c617f611 KVM: VMX: Correct wrong vmcs field sizes
EXIT_QUALIFICATION and GUEST_LINEAR_ADDRESS are natural width, not 64-bit.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Avi Kivity
7d433b9f94 KVM: VMX: Make flexpriority module parameter reflect hardware capability
If the hardware does not support flexpriority, zero the module parameter.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Gleb Natapov
78646121e9 KVM: Fix interrupt unhalting a vcpu when it shouldn't
kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking
if interrupt window is actually opened.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:33 +03:00
Avi Kivity
089d034e0c KVM: VMX: Fold vm_need_ept() into callers
Trivial.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
575ff2dcb2 KVM: VMX: Zero ept module parameter if ept is not present
Allows reading back hardware capability.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
919818abc2 KVM: VMX: Zero the vpid module parameter if vpid is not supported
This allows reading back how the hardware is configured.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
4462d21a61 KVM: VMX: Annotate module parameters as __read_mostly
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
736caefe15 KVM: VMX: Simplify module parameter names
Instead of 'enable_vpid=1', use a simple 'vpid=1'.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
6062d012ed KVM: VMX: Rename kvm_handle_exit() to vmx_handle_exit()
It is a static vmx-specific function.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
c1f8bc04c6 KVM: VMX: Make module parameters readable
Useful to see how the module was loaded.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Gleb Natapov
fe4c7b1914 KVM: reuse (pop|push)_irq from svm.c in vmx.c
The prioritized bit vector manipulation functions are useful in both vmx and
svm.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
5897297bc2 KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE
Windows 2008 accesses this MSR often on context switch intensive workloads;
since we run in guest context with the guest MSR value loaded (so swapgs can
work correctly), we can simply disable interception of rdmsr/wrmsr for this
MSR.

A complication occurs since in legacy mode, we run with the host MSR value
loaded. In this case we enable interception.  This means we need two MSR
bitmaps, one for legacy mode and one for long mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:21 +03:00
Avi Kivity
3e7c73e9b1 KVM: VMX: Don't use highmem pages for the msr and pio bitmaps
Highmem pages are a pain, and saving three lowmem pages on i386 isn't worth
the extra code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:21 +03:00
Avi Kivity
16175a796d KVM: VMX: Don't allow uninhibited access to EFER on i386
vmx_set_msr() does not allow i386 guests to touch EFER, but they can still
do so through the default: label in the switch.  If they set EFER_LME, they
can oops the host.

Fix by having EFER access through the normal channel (which will check for
EFER_LME) even on i386.

Reported-and-tested-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:15 +02:00
Amit Shah
401d10dee0 KVM: VMX: Update necessary state when guest enters long mode
setup_msrs() should be called when entering long mode to save the
shadow state for the 64-bit guest state.

Using vmx_set_efer() in enter_lmode() removes some duplicated code
and also ensures we call setup_msrs(). We can safely pass the value
of shadow_efer to vmx_set_efer() as no other bits in the efer change
while enabling long mode (guest first sets EFER.LME, then sets CR0.PG
which causes a vmexit where we activate long mode).

With this fix, is_long_mode() can check for EFER.LMA set instead of
EFER.LME and 5e23049e86dd298b72e206b420513dbc3a240cd9 can be reverted.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:13 +02:00
Sheng Yang
49cd7d2238 KVM: VMX: Use kvm_mmu_page_fault() handle EPT violation mmio
Removed duplicated code.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:09 +02:00
Jan Kiszka
34c33d163f KVM: Drop unused evaluations from string pio handlers
Looks like neither the direction nor the rep prefix are used anymore.
Drop related evaluations from SVM's and VMX's I/O exit handlers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:08 +02:00
Avi Kivity
8b3079a5c0 KVM: VMX: When emulating on invalid vmx state, don't return to userspace unnecessarily
If we aren't doing mmio there's no need to exit to userspace (which will
just be confused).

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
10f32d84c7 KVM: VMX: Prevent exit handler from running if emulating due to invalid state
If we've just emulated an instruction, we won't have any valid exit
reason and associated information.

Fix by moving the clearing of the emulation_required flag to the exit handler.
This way the exit handler can notice that we've been emulating and abort
early.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
9fd4a3b7a4 KVM: VMX: don't clobber segment AR if emulating invalid state
The ususable bit is important for determining state validity; don't
clobber it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
1872a3f411 KVM: VMX: Fix guest state validity checks
The vmx guest state validity checks are full of bugs.  Make them
conform to the manual.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:59 +02:00
Marcelo Tosatti
53f658b3c3 KVM: VMX: initialize TSC offset relative to vm creation time
VMX initializes the TSC offset for each vcpu at different times, and
also reinitializes it for vcpus other than 0 on APIC SIPI message.

This bug causes the TSC's to appear unsynchronized in the guest, even if
the host is good.

Older Linux kernels don't handle the situation very well, so
gettimeofday is likely to go backwards in time:

http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831

Fix it by initializating the offset of each vcpu relative to vm creation
time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
APIC MP init path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:53 +02:00
Jan Kiszka
ae675ef01c KVM: x86: Wire-up hardware breakpoints for guest debugging
Add the remaining bits to make use of debug registers also for guest
debugging, thus enabling the use of hardware breakpoints and
watchpoints.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:50 +02:00
Jan Kiszka
42dbaa5a05 KVM: x86: Virtualize debug registers
So far KVM only had basic x86 debug register support, once introduced to
realize guest debugging that way. The guest itself was not able to use
those registers.

This patch now adds (almost) full support for guest self-debugging via
hardware registers. It refactors the code, moving generic parts out of
SVM (VMX was already cleaned up by the KVM_SET_GUEST_DEBUG patches), and
it ensures that the registers are properly switched between host and
guest.

This patch also prepares debug register usage by the host. The latter
will (once wired-up by the following patch) allow for hardware
breakpoints/watchpoints in guest code. If this is enabled, the guest
will only see faked debug registers without functionality, but with
content reflecting the guest's modifications.

Tested on Intel only, but SVM /should/ work as well, but who knows...

Known limitations: Trapping on tss switch won't work - most probably on
Intel.

Credits also go to Joerg Roedel - I used his once posted debugging
series as platform for this patch.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:49 +02:00
Jan Kiszka
55934c0bd3 KVM: VMX: Allow single-stepping when uninterruptible
When single-stepping over STI and MOV SS, we must clear the
corresponding interruptibility bits in the guest state. Otherwise
vmentry fails as it then expects bit 14 (BS) in pending debug exceptions
being set, but that's not correct for the guest debugging case.

Note that clearing those bits is safe as we check for interruptibility
based on the original state and do not inject interrupts or NMIs if
guest interruptibility was blocked.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:49 +02:00
Jan Kiszka
d0bfb940ec KVM: New guest debug interface
This rips out the support for KVM_DEBUG_GUEST and introduces a new IOCTL
instead: KVM_SET_GUEST_DEBUG. The IOCTL payload consists of a generic
part, controlling the "main switch" and the single-step feature. The
arch specific part adds an x86 interface for intercepting both types of
debug exceptions separately and re-injecting them when the host was not
interested. Moveover, the foundation for guest debugging via debug
registers is layed.

To signal breakpoint events properly back to userland, an arch-specific
data block is now returned along KVM_EXIT_DEBUG. For x86, the arch block
contains the PC, the debug exception, and relevant debug registers to
tell debug events properly apart.

The availability of this new interface is signaled by
KVM_CAP_SET_GUEST_DEBUG. Empty stubs for not yet supported archs are
provided.

Note that both SVM and VTX are supported, but only the latter was tested
yet. Based on the experience with all those VTX corner case, I would be
fairly surprised if SVM will work out of the box.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:49 +02:00
Jan Kiszka
8ab2d2e231 KVM: VMX: Support for injecting software exceptions
VMX differentiates between processor and software generated exceptions
when injecting them into the guest. Extend vmx_queue_exception
accordingly (and refactor related constants) so that we can use this
service reliably for the new guest debugging framework.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:48 +02:00
Avi Kivity
516a1a7e9d KVM: VMX: Flush volatile msrs before emulating rdmsr
Some msrs (notable MSR_KERNEL_GS_BASE) are held in the processor registers
and need to be flushed to the vcpu struture before they can be read.

This fixes cygwin longjmp() failure on Windows x64.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:39 +02:00
Marcelo Tosatti
b682b814e3 KVM: x86: fix LAPIC pending count calculation
Simplify LAPIC TMCCT calculation by using hrtimer provided
function to query remaining time until expiration.

Fixes host hang with nested ESX.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:38 +02:00
Sheng Yang
2aaf69dcee KVM: MMU: Map device MMIO as UC in EPT
Software are not allow to access device MMIO using cacheable memory type, the
patch limit MMIO region with UC and WC(guest can select WC using PAT and
PCD/PWT).

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:37 +02:00
Jan Kiszka
4531220b71 KVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI
There is no point in doing the ready_for_nmi_injection/
request_nmi_window dance with user space. First, we don't do this for
in-kernel irqchip anyway, while the code path is the same as for user
space irqchip mode. And second, there is nothing to loose if a pending
NMI is overwritten by another one (in contrast to IRQs where we have to
save the number). Actually, there is even the risk of raising spurious
NMIs this way because the reason for the held-back NMI might already be
handled while processing the first one.

Therefore this patch creates a simplified user space NMI injection
interface, exporting it under KVM_CAP_USER_NMI and dropping the old
KVM_CAP_NMI capability. And this time we also take care to provide the
interface only on archs supporting NMIs via KVM (right now only x86).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:47 +02:00
Jan Kiszka
264ff01d55 KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip
As with the kernel irqchip, don't allow an NMI to stomp over an already
injected IRQ; instead wait for the IRQ injection to be completed.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:47 +02:00
Hannes Eder
efff9e538f KVM: VMX: fix sparse warning
Impact: make global function static

  arch/x86/kvm/vmx.c:134:3: warning: symbol 'vmx_capability' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:06 +02:00
Avi Kivity
df203ec9a7 KVM: VMX: Conditionally request interrupt window after injecting irq
If we're injecting an interrupt, and another one is pending, request
an interrupt window notification so we don't have excess latency on the
second interrupt.

This shouldn't happen in practice since an EOI will be issued, giving a second
chance to request an interrupt window, but...

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:00 +02:00
Eduardo Habkost
710ff4a855 KVM: VMX: extract kvm_cpu_vmxoff() from hardware_disable()
Along with some comments on why it is different from the core cpu_vmxoff()
function.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:29 +02:00
Eduardo Habkost
6210e37b12 KVM: VMX: move cpu_has_kvm_support() to an inline on asm/virtext.h
It will be used by core code on kdump and reboot, to disable
vmx if needed.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:28 +02:00
Eduardo Habkost
13673a90f1 KVM: VMX: move vmx.h to include/asm
vmx.h will be used by core code that is independent of KVM, so I am
moving it outside the arch/x86/kvm directory.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:27 +02:00
Guillaume Thouvenin
1d5a4d9b92 KVM: VMX: Handle mmio emulation when guest state is invalid
If emulate_invalid_guest_state is enabled, the emulator is called
when guest state is invalid.  Until now, we reported an mmio failure
when emulate_instruction() returned EMULATE_DO_MMIO.  This patch adds
the case where emulate_instruction() failed and an MMIO emulation
is needed.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:48 +02:00
Guillaume Thouvenin
e93f36bcfa KVM: allow emulator to adjust rip for emulated pio instructions
If we call the emulator we shouldn't call skip_emulated_instruction()
in the first place, since the emulator already computes the next rip
for us. Thus we move ->skip_emulated_instruction() out of
kvm_emulate_pio() and into handle_io() (and the svm equivalent). We
also replaced "return 0" by "break" in the "do_io:" case because now
the shadow register state needs to be committed. Otherwise eip will never
be updated.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:48 +02:00
Sheng Yang
6fe639792c KVM: VMX: Move private memory slot position
PCI device assignment would map guest MMIO spaces as separate slot, so it is
possible that the device has more than 2 MMIO spaces and overwrite current
private memslot.

The patch move private memory slot to the top of userspace visible memory slots.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:46 +02:00
Sheng Yang
64d4d52175 KVM: Enable MTRR for EPT
The effective memory type of EPT is the mixture of MSR_IA32_CR_PAT and memory
type field of EPT entry.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:45 +02:00
Sheng Yang
468d472f3f KVM: VMX: Add PAT support for EPT
GUEST_PAT support is a new feature introduced by Intel Core i7 architecture.
With this, cpu would save/load guest and host PAT automatically, for EPT memory
type in guest depends on MSR_IA32_CR_PAT.

Also add save/restore for MSR_IA32_CR_PAT.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:44 +02:00
Jan Kiszka
3b86cd9967 KVM: VMX: work around lacking VNMI support
Older VMX supporting CPUs do not provide the "Virtual NMI" feature for
tracking the NMI-blocked state after injecting such events. For now
KVM is unable to inject NMIs on those CPUs.

Derived from Sheng Yang's suggestion to use the IRQ window notification
for detecting the end of NMI handlers, this patch implements virtual
NMI support without impact on the host's ability to receive real NMIs.
The downside is that the given approach requires some heuristics that
can cause NMI nesting in vary rare corner cases.

The approach works as follows:
 - inject NMI and set a software-based NMI-blocked flag
 - arm the IRQ window start notification whenever an NMI window is
   requested
 - if the guest exits due to an opening IRQ window, clear the emulated
   NMI-blocked flag
 - if the guest net execution time with NMI-blocked but without an IRQ
   window exceeds 1 second, force NMI-blocked reset and inject anyway

This approach covers most practical scenarios:
 - succeeding NMIs are seperated by at least one open IRQ window
 - the guest may spin with IRQs disabled (e.g. due to a bug), but
   leaving the NMI handler takes much less time than one second
 - the guest does not rely on strict ordering or timing of NMIs
   (would be problematic in virtualized environments anyway)

Successfully tested with the 'nmi n' monitor command, the kgdbts
testsuite on smp guests (additional patches required to add debug
register support to kvm) + the kernel's nmi_watchdog=1, and a Siemens-
specific board emulation (+ guest) that comes with its own NMI
watchdog mechanism.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:43 +02:00
Jan Kiszka
487b391d6e KVM: VMX: Provide support for user space injected NMIs
This patch adds the required bits to the VMX side for user space
injected NMIs. As with the preexisting in-kernel irqchip support, the
CPU must provide the "virtual NMI" feature for proper tracking of the
NMI blocking state.

Based on the original patch by Sheng Yang.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:43 +02:00
Jan Kiszka
66a5a347c2 KVM: VMX: fix real-mode NMI support
Fix NMI injection in real-mode with the same pattern we perform IRQ
injection.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:41 +02:00
Jan Kiszka
f460ee43e2 KVM: VMX: refactor IRQ and NMI window enabling
do_interrupt_requests and vmx_intr_assist go different way for
achieving the same: enabling the nmi/irq window start notification.
Unify their code over enable_{irq|nmi}_window, get rid of a redundant
call to enable_intr_window instead of direct enable_nmi_window
invocation and unroll enable_intr_window for both in-kernel and user
space irq injection accordingly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:41 +02:00
Jan Kiszka
33f089ca5a KVM: VMX: refactor/fix IRQ and NMI injectability determination
There are currently two ways in VMX to check if an IRQ or NMI can be
injected:
 - vmx_{nmi|irq}_enabled and
 - vcpu.arch.{nmi|interrupt}_window_open.
Even worse, one test (at the end of vmx_vcpu_run) uses an inconsistent,
likely incorrect logic.

This patch consolidates and unifies the tests over
{nmi|interrupt}_window_open as cache + vmx_update_window_states
for updating the cache content.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
60637aacfd KVM: VMX: Support for NMI task gates
Properly set GUEST_INTR_STATE_NMI and reset nmi_injected when a
task-switch vmexit happened due to a task gate being used for handling
NMIs. Also avoid the false warning about valid vectoring info in
kvm_handle_exit.

Based on original patch by Gleb Natapov.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
e4a41889ec KVM: VMX: Use INTR_TYPE_NMI_INTR instead of magic value
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
a26bf12afb KVM: VMX: include all IRQ window exits in statistics
irq_window_exits only tracks IRQ window exits due to user space
requests, nmi_window_exits include all exits. The latter makes more
sense, so let's adjust irq_window_exits accounting.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:39 +02:00
Avi Kivity
bd2b3ca768 KVM: VMX: Fix interrupt loss during race with NMI
If an interrupt cannot be injected for some reason (say, page fault
when fetching the IDT descriptor), the interrupt is marked for
reinjection.  However, if an NMI is queued at this time, the NMI
will be injected instead and the NMI will be lost.

Fix by deferring the NMI injection until the interrupt has been
injected successfully.

Analyzed by Jan Kiszka.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-23 14:52:29 +02:00
Sheng Yang
928d4bf747 KVM: VMX: Set IGMT bit in EPT entry
There is a potential issue that, when guest using pagetable without vmexit when
EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
which would be inconsistent with host side and would cause host MCE due to
inconsistent cache attribute.

The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
memory type to protect host (notice that all memory mapped by KVM should be WB).

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-11 21:00:37 +02:00
Marcelo Tosatti
83dbc83a0d KVM: VMX: enable invlpg exiting if EPT is disabled
Manually disabling EPT via module option fails to re-enable INVLPG
exiting.

Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 14:25:31 +02:00
Marcelo Tosatti
a7052897b3 KVM: x86: trap invlpg
With pages out of sync invlpg needs to be trapped. For now simply nuke
the entry.

Untested on AMD.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:21 +02:00
Marcelo Tosatti
4c2155ce81 KVM: switch to get_user_pages_fast
Convert gfn_to_pfn to use get_user_pages_fast, which can do lockless
pagetable lookups on x86. Kernel compilation on 4-way guest is 3.7%
faster on VMX.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:06 +02:00
Sheng Yang
9ea542facb KVM: VMX: Rename IA32_FEATURE_CONTROL bits
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:16:14 +02:00
Jan Kiszka
4b92fe0c9d KVM: VMX: Cleanup stalled INTR_INFO read
Commit 1c0f4f5011829dac96347b5f84ba37c2252e1e08 left a useless access
of VM_ENTRY_INTR_INFO_FIELD in vmx_intr_assist behind. Clean this up.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:26 +02:00
Avi Kivity
fa89a81766 KVM: Add statistics for guest irq injections
These can help show whether a guest is making progress or not.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:25 +02:00
Avi Kivity
a16b20da87 KVM: VMX: Change segment dpl at reset to 3
This is more emulation friendly, if not 100% correct.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:21 +02:00
Avi Kivity
5706be0daf KVM: VMX: Change cs reset state to be a data segment
Real mode cs is a data segment, not a code segment.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:21 +02:00
Mohammed Gamal
a89a8fb93b KVM: VMX: Modify mode switching and vmentry functions
This patch modifies mode switching and vmentry function in order to
drive invalid guest state emulation.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:20 +02:00
Mohammed Gamal
ea953ef0ca KVM: VMX: Add invalid guest state handler
This adds the invalid guest state handler function which invokes the x86
emulator until getting the guest to a VMX-friendly state.

[avi: leave atomic context if scheduling]
[guillaume: return to atomic context correctly]

Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:20 +02:00
Mohammed Gamal
04fa4d3211 KVM: VMX: Add module parameter and emulation flag.
The patch adds the module parameter required to enable emulating invalid
guest state, as well as the emulation_required flag used to drive
emulation whenever needed.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:20 +02:00
Mohammed Gamal
648dfaa7df KVM: VMX: Add Guest State Validity Checks
This patch adds functions to check whether guest state is VMX compliant.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:20 +02:00
Avi Kivity
ecfc79c700 KVM: VMX: Use interrupt queue for !irqchip_in_kernel
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:19 +02:00
Sheng Yang
464d17c8b7 KVM: VMX: Clean up magic number 0x66 in init_rmode_tss
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:19 +02:00
Avi Kivity
313dbd49dc KVM: VMX: Avoid vmwrite(HOST_RSP) when possible
Usually HOST_RSP retains its value across guest entries.  Take advantage
of this and avoid a vmwrite() when this is so.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:15 +02:00
Avi Kivity
c801949ddf KVM: VMX: Unify register save/restore across 32 and 64 bit hosts
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:14 +02:00
Jan Kiszka
77ab6db0a1 KVM: VMX: Reinject real mode exception
As we execute real mode guests in VM86 mode, exception have to be
reinjected appropriately when the guest triggered them. For this purpose
the patch adopts the real-mode injection pattern used in vmx_inject_irq
to vmx_queue_exception, additionally taking care that the IP is set
correctly for #BP exceptions. Furthermore it extends
handle_rmode_exception to reinject all those exceptions that can be
raised in real mode.

This fixes the execution of himem.exe from FreeDOS and also makes its
debug.com work properly.

Note that guest debugging in real mode is broken now. This has to be
fixed by the scheduled debugging infrastructure rework (will be done
once base patches for QEMU have been accepted).

Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:14 +02:00
Jan Kiszka
19bd8afdc4 KVM: Consolidate XX_VECTOR defines
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:14 +02:00
Mohammed Gamal
60bd83a125 KVM: VMX: Remove redundant check in handle_rmode_exception
Since checking for vcpu->arch.rmode.active is already done whenever we
call handle_rmode_exception(), checking it inside the function is redundant.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:13 +02:00
Avi Kivity
f7d9238f5d KVM: VMX: Move interrupt post-processing to vmx_complete_interrupts()
Instead of looking at failed injections in the vm entry path, move
processing to the exit path in vmx_complete_interrupts().  This simplifes
the logic and removes any state that is hidden in vmx registers.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:13 +02:00
Avi Kivity
35920a3569 KVM: VMX: Fix pending exception processing
The vmx code assumes that IDT-Vectoring can only be set when an exception
is injected due to the exception in question.  That's not true, however:
if the exception is injected correctly, and later another exception occurs
but its delivery is blocked due to a fault, then we will incorrectly assume
the first exception was not delivered.

Fix by unconditionally dequeuing the pending exception, and requeuing it
(or the second exception) if we see it in the IDT-Vectoring field.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:13 +02:00
Avi Kivity
668f612fa0 KVM: VMX: Move nmi injection failure processing to vm exit path
Instead of processing nmi injection failure in the vm entry path, move
it to the vm exit path (vm_complete_interrupts()).  This separates nmi
injection from nmi post-processing, and moves the nmi state from the VT
state into vcpu state (new variable nmi_injected specifying an injection
in progress).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:13 +02:00
Avi Kivity
cf393f7566 KVM: Move NMI IRET fault processing to new vmx_complete_interrupts()
Currently most interrupt exit processing is handled on the entry path,
which is confusing.  Move the NMI IRET fault processing to a new function,
vmx_complete_interrupts(), which is called on the vmexit path.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:12 +02:00
Marcelo Tosatti
5fdbf9765b KVM: x86: accessors for guest registers
As suggested by Avi, introduce accessors to read/write guest registers.
This simplifies the ->cache_regs/->decache_regs interface, and improves
register caching which is important for VMX, where the cost of
vmcs_read/vmcs_write is significant.

[avi: fix warnings]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:13:57 +02:00
Sheng Yang
ca60dfbb69 KVM: VMX: Rename misnamed msr bits
MSR_IA32_FEATURE_LOCKED is just a bit in fact, which shouldn't be prefixed with
MSR_.  So is MSR_IA32_FEATURE_VMXON_ENABLED.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:13:57 +02:00
Sheng Yang
534e38b447 KVM: VMX: Always return old for clear_flush_young() when using EPT
As well as discard fake accessed bit and dirty bit of EPT.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-09-11 11:48:19 +03:00
Sheng Yang
5fdbcb9dd1 KVM: VMX: Fix undefined beaviour of EPT after reload kvm-intel.ko
As well as move set base/mask ptes to vmx_init().

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-27 11:34:10 +03:00
Sheng Yang
5ec5726a16 KVM: VMX: Fix bypass_guest_pf enabling when disable EPT in module parameter
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-27 11:34:10 +03:00
Avi Kivity
577bdc4966 KVM: Avoid instruction emulation when event delivery is pending
When an event (such as an interrupt) is injected, and the stack is
shadowed (and therefore write protected), the guest will exit.  The
current code will see that the stack is shadowed and emulate a few
instructions, each time postponing the injection.  Eventually the
injection may succeed, but at that time the guest may be unwilling
to accept the interrupt (for example, the TPR may have changed).

This occurs every once in a while during a Windows 2008 boot.

Fix by unshadowing the fault address if the fault was due to an event
injection.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-27 11:34:10 +03:00
Avi Kivity
d6e88aec07 KVM: Prefix some x86 low level function with kvm_, to avoid namespace issues
Fixes compilation with CONFIG_VMI enabled.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:39 +03:00
Sheng Yang
4e1096d27f KVM: VMX: Add ept_sync_context in flush_tlb
Fix a potention issue caused by kvm_mmu_slot_remove_write_access(). The
old behavior don't sync EPT TLB with modified EPT entry, which result
in inconsistent content of EPT TLB and EPT table.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:38 +03:00
Chris Lalancette
efa67e0d1f KVM: VMX: Fake emulate Intel perfctr MSRs
Older linux guests (in this case, 2.6.9) can attempt to
access the performance counter MSRs without a fixup section, and injecting
a GPF kills the guest.  Work around by allowing the guest to write those MSRs.

Tested by me on RHEL-4 i386 and x86_64 guests, as well as F-9 guests.

Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:36 +03:00
Sheng Yang
65267ea1b3 KVM: VMX: Fix a wrong usage of vmcs_config
The function ept_update_paging_mode_cr0() write to
CPU_BASED_VM_EXEC_CONTROL based on vmcs_config.cpu_based_exec_ctrl. That's
wrong because the variable may not consistent with the content in the
CPU_BASE_VM_EXEC_CONTROL MSR.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:36 +03:00
Sheng Yang
f08864b42a KVM: VMX: Enable NMI with in-kernel irqchip
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:26 +03:00
Avi Kivity
7cc8883074 KVM: Remove decache_vcpus_on_cpu() and related callbacks
Obsoleted by the vmx-specific per-cpu list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:25 +03:00
Avi Kivity
543e424366 KVM: VMX: Add list of potentially locally cached vcpus
VMX hardware can cache the contents of a vcpu's vmcs.  This cache needs
to be flushed when migrating a vcpu to another cpu, or (which is the case
that interests us here) when disabling hardware virtualization on a cpu.

The current implementation of decaching iterates over the list of all vcpus,
picks the ones that are potentially cached on the cpu that is being offlined,
and flushes the cache.  The problem is that it uses mutex_trylock() to gain
exclusive access to the vcpu, which fires off a (benign) warning about using
the mutex in an interrupt context.

To avoid this, and to make things generally nicer, add a new per-cpu list
of potentially cached vcus.  This makes the decaching code much simpler.  The
list is vmx-specific since other hardware doesn't have this issue.

[andrea: fix crash on suspend/resume]

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:42:24 +03:00
Avi Kivity
4ecac3fd6d KVM: Handle virtualization instruction #UD faults during reboot
KVM turns off hardware virtualization extensions during reboot, in order
to disassociate the memory used by the virtualization extensions from the
processor, and in order to have the system in a consistent state.
Unfortunately virtual machines may still be running while this goes on,
and once virtualization extensions are turned off, any virtulization
instruction will #UD on execution.

Fix by adding an exception handler to virtualization instructions; if we get
an exception during reboot, we simply spin waiting for the reset to complete.
If it's a true exception, BUG() so we can have our stack trace.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:41:43 +03:00
Avi Kivity
7682f2d0dd KVM: VMX: Trivial vmcs_write64() code simplification
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:40:50 +03:00
Joerg Roedel
c7bf23babc KVM: VMX: move APIC_ACCESS trace entry to generic code
This patch moves the trace entry for APIC accesses from the VMX code to the
generic lapic code. This way APIC accesses from SVM will also be traced.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:40:47 +03:00
Harvey Harrison
8b2cf73cc1 KVM: add statics were possible, function definition in lapic.h
Noticed by sparse:
arch/x86/kvm/vmx.c:1583:6: warning: symbol 'vmx_disable_intercept_for_msr' was not declared. Should it be static?
arch/x86/kvm/x86.c:3406:5: warning: symbol 'kvm_task_switch_16' was not declared. Should it be static?
arch/x86/kvm/x86.c:3429:5: warning: symbol 'kvm_task_switch_32' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1968:6: warning: symbol 'kvm_mmu_remove_one_alloc_mmu_page' was not declared. Should it be static?
arch/x86/kvm/mmu.c:2014:6: warning: symbol 'mmu_destroy_caches' was not declared. Should it be static?
arch/x86/kvm/lapic.c:862:5: warning: symbol 'kvm_lapic_get_base' was not declared. Should it be static?
arch/x86/kvm/i8254.c:94:5: warning: symbol 'pit_get_gate' was not declared. Should it be static?
arch/x86/kvm/i8254.c:196:5: warning: symbol '__pit_timer_fn' was not declared. Should it be static?
arch/x86/kvm/i8254.c:561:6: warning: symbol '__inject_pit_timer_intr' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-20 12:40:46 +03:00
Jens Axboe
15c8b6c1aa on_each_cpu(): kill unused 'retry' parameter
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:38 +02:00
Jens Axboe
8691e5a8f6 smp_call_function: get rid of the unused nonatomic/retry argument
It's never used and the comments refer to nonatomic and retry
interchangably. So get rid of it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:35 +02:00
Avi Kivity
a9b21b6229 KVM: VMX: Fix host msr corruption with preemption enabled
Switching msrs can occur either synchronously as a result of calls to
the msr management functions (usually in response to the guest touching
virtualized msrs), or asynchronously when preempting a kvm thread that has
guest state loaded.  If we're unlucky enough to have the two at the same
time, host msrs are corrupted and the machine goes kaput on the next syscall.

Most easily triggered by Windows Server 2008, as it does a lot of msr
switching during bootup.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-06-24 12:26:17 +03:00
Eli Collins
e693d71b46 KVM: VMX: Clear CR4.VMXE in hardware_disable
Clear CR4.VMXE in hardware_disable. There's no reason to leave it set
after doing a VMXOFF.

VMware Workstation 6.5 checks CR4.VMXE as a proxy for whether the CPU is
in VMX mode, so leaving VMXE set means we'll refuse to power on. With this
change the user can power on after unloading the kvm-intel module. I
tested on kvm-67 and kvm-69.

Signed-off-by: Eli Collins <ecollins@vmware.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-06-06 21:30:20 +03:00
Marcelo Tosatti
2f5997140f KVM: migrate PIT timer
Migrate the PIT timer to the physical CPU which vcpu0 is scheduled on,
similarly to what is done for the LAPIC timers, otherwise PIT interrupts
will be delayed until an unrelated event causes an exit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-06-06 21:25:51 +03:00
Sheng Yang
1439442c7b KVM: VMX: Enable EPT feature for KVM
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-05-04 14:44:42 +03:00
Sheng Yang
b7ebfb0509 KVM: VMX: Prepare an identity page table for EPT in real mode
[aliguory: plug leak]

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-05-04 14:44:41 +03:00
Sheng Yang
67253af52e KVM: Add kvm_x86_ops get_tdp_level()
The function get_tdp_level() provided the number of tdp level for EPT and
NPT rather than the NPT specific macro.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-05-04 14:44:34 +03:00
Sheng Yang
d56f546db9 KVM: VMX: EPT Feature Detection
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-05-04 12:26:38 +03:00
Feng (Eric) Liu
2714d1d3d6 KVM: Add trace markers
Trace markers allow userspace to trace execution of a virtual machine
in order to monitor its performance.

Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:01:19 +03:00
Marcelo Tosatti
3200f405a1 KVM: MMU: unify slots_lock usage
Unify slots_lock acquision around vcpu_run(). This is simpler and less
error-prone.

Also fix some callsites that were not grabbing the lock properly.

[avi: drop slots_lock while in guest mode to avoid holding the lock
      for indefinite periods]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:52 +03:00
Sheng Yang
25c5f225be KVM: VMX: Enable MSR Bitmap feature
MSR Bitmap controls whether the accessing of an MSR causes VM Exit.
Eliminating exits on automatically saved and restored MSRs yields a
small performance gain.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:52 +03:00
Izik Eidus
37817f2982 KVM: x86: hardware task switching support
This emulates the x86 hardware task switch mechanism in software, as it is
unsupported by either vmx or svm.  It allows operating systems which use it,
like freedos, to run as kvm guests.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:39 +03:00
Izik Eidus
2e4d265349 KVM: x86: add functions to get the cpl of vcpu
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:38 +03:00
Avi Kivity
4c9fc8ef50 KVM: VMX: Add module option to disable flexpriority
Useful for debugging.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:37 +03:00
Avi Kivity
019960ae99 KVM: VMX: Don't adjust tsc offset forward
Most Intel hosts have a stable tsc, and playing with the offset only
reduces accuracy.  By limiting tsc offset adjustment only to forward updates,
we effectively disable tsc offset adjustment on these hosts.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:27 +03:00
Harvey Harrison
b8688d51bb KVM: replace remaining __FUNCTION__ occurances
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:27 +03:00
Avi Kivity
2d3ad1f40c KVM: Prefix control register accessors with kvm_ to avoid namespace pollution
Names like 'set_cr3()' look dangerously close to affecting the host.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:26 +03:00
Avi Kivity
a5f61300c4 KVM: Use x86's segment descriptor struct instead of private definition
The x86 desc_struct unification allows us to remove segment_descriptor.h.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:24 +03:00
Ryan Harper
2e11384c2c KVM: VMX: fix typo in VMX header define
Looking at Intel Volume 3b, page 148, table 20-11 and noticed
that the field name is 'Deliver' not 'Deliever'.  Attached patch changes
the define name and its user in vmx.c

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:21 +03:00