Commit Graph

1738 Commits

Author SHA1 Message Date
Avi Kivity
fbc5d139bb KVM: MMU: Do not instantiate nontrapping spte on unsync page
The update_pte() path currently uses a nontrapping spte when a nonpresent
(or nonaccessed) gpte is written.  This is fine since at present it is only
used on sync pages.  However, on an unsync page this will cause an endless
fault loop as the guest is under no obligation to invlpg a gpte that
transitions from nonpresent to present.

Needed for the next patch which reinstates update_pte() on invlpg.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:42 +03:00
Avi Kivity
4a5f48f666 KVM: Don't follow an atomic operation by a non-atomic one
Currently emulated atomic operations are immediately followed by a non-atomic
operation, so that kvm_mmu_pte_write() can be invoked.  This updates the mmu
but undoes the whole point of doing things atomically.

Fix by only performing the atomic operation and the mmu update, and avoiding
the non-atomic write.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:40 +03:00
Avi Kivity
daea3e73cb KVM: Make locked operations truly atomic
Once upon a time, locked operations were emulated while holding the mmu mutex.
Since mmu pages were write protected, it was safe to emulate the writes in
a non-atomic manner, since there could be no other writer, either in the
guest or in the kernel.

These days emulation takes place without holding the mmu spinlock, so the
write could be preempted by an unshadowing event, which exposes the page
to writes by the guest.  This may cause corruption of guest page tables.

Fix by using an atomic cmpxchg for these operations.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:39 +03:00
Avi Kivity
72016f3a42 KVM: MMU: Consolidate two guest pte reads in kvm_mmu_pte_write()
kvm_mmu_pte_write() reads guest ptes in two different occasions, both to
allow a 32-bit pae guest to update a pte with 4-byte writes.  Consolidate
these into a single read, which also allows us to consolidate another read
from an invlpg speculating a gpte into the shadow page table.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:37 +03:00
Wei Yongjun
160d2f6c0c KVM: x86: fix the error of ioctl KVM_IRQ_LINE if no irq chip
If no irq chip in kernel, ioctl KVM_IRQ_LINE will return -EFAULT.
But I see in other place such as KVM_[GET|SET]IRQCHIP, -ENXIO is
return. So this patch used -ENXIO instead of -EFAULT.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:31 +03:00
Wei Yongjun
ec68798c8f KVM: x86: Use native_store_idt() instead of kvm_get_idt()
This patch use generic linux function native_store_idt()
instead of kvm_get_idt(), and also removed the useless
function kvm_get_idt().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:28 +03:00
Avi Kivity
5c1c85d08d KVM: Trace exception injection
Often an exception can help point out where things start to go wrong.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:27 +03:00
Avi Kivity
5bfd8b5455 KVM: Move kvm_exit tracepoint rip reading inside tracepoint
Reading rip is expensive on vmx, so move it inside the tracepoint so we only
incur the cost if tracing is enabled.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:25 +03:00
Minchan Kim
d4f64b6cad KVM: remove redundant initialization of page->private
The prep_new_page() in page allocator calls set_page_private(page, 0).
So we don't need to reinitialize private of page.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Avi Kivity<avi@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:24 +03:00
Xiao Guangrong
2ed152afc7 KVM: cleanup kvm trace
This patch does:

 - no need call tracepoint_synchronize_unregister() when kvm module
   is unloaded since ftrace can handle it

 - cleanup ftrace's macro

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:22 +03:00
Gleb Natapov
835e6b8047 KVM: x86 emulator mark VMMCALL and LMSW as privileged
LMSW is present in both group tables. It was marked privileged only in
one of them. Intel analog of VMMCALL is already marked privileged.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:18 +03:00
Joerg Roedel
f71385383f KVM: SVM: Ignore lower 12 bit of nested msrpm_pa
These bits are ignored by the hardware too. Implement this
for nested svm too.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:16 +03:00
Joerg Roedel
ce2ac085ff KVM; SVM: Add correct handling of nested iopm
This patch adds the correct handling of the nested io
permission bitmap. Old behavior was to not lookup the port
in the iopm but only reinject an io intercept to the guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:15 +03:00
Joerg Roedel
0d6b35378e KVM: SVM: Use svm_msrpm_offset in nested_svm_exit_handled_msr
There is a generic function now to calculate msrpm offsets.
Use that function in nested_svm_exit_handled_msr() remove
the duplicate logic (which had a bug anyway).

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:13 +03:00
Joerg Roedel
323c3d809b KVM: SVM: Optimize nested svm msrpm merging
This patch optimizes the way the msrpm of the host and the
guest are merged. The old code merged the 2 msrpm pages
completly. This code needed to touch 24kb of memory for that
operation. The optimized variant this patch introduces
merges only the parts where the host msrpm may contain zero
bits. This reduces the amount of memory which is touched to
48 bytes.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:12 +03:00
Joerg Roedel
ac72a9b733 KVM: SVM: Introduce direct access msr list
This patch introduces a list with all msrs a guest might
have direct access to and changes the svm_vcpu_init_msrpm
function to use this list.
It also adds a check to set_msr_interception which triggers
a warning if a developer changes a msr intercept that is not
in the list.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:10 +03:00
Joerg Roedel
455716fa94 KVM: SVM: Move msrpm offset calculation to seperate function
The algorithm to find the offset in the msrpm for a given
msr is needed at other places too. Move that logic to its
own function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:08 +03:00
Joerg Roedel
d24778265a KVM: SVM: Return correct values in nested_svm_exit_handled_msr
The nested_svm_exit_handled_msr() returned an bool which is
a bug. I worked by accident because the exected integer
return values match with the true and false values. This
patch changes the return value to int and let the function
return the correct values.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:07 +03:00
Andrea Gelmini
0fc5c3a54d KVM: arch/x86/kvm/kvm_timer.h checkpatch cleanup
arch/x86/kvm/kvm_timer.h:13: ERROR: code indent should use tabs where possible

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:14:42 +03:00
Jan Kiszka
f8c5fae166 KVM: VMX: blocked-by-sti must not defer NMI injections
As the processor may not consider GUEST_INTR_STATE_STI as a reason for
blocking NMI, it could return immediately with EXIT_REASON_NMI_WINDOW
when we asked for it. But as we consider this state as NMI-blocking, we
can run into an endless loop.

Resolve this by allowing NMI injection if just GUEST_INTR_STATE_STI is
active (originally suggested by Gleb). Intel confirmed that this is
safe, the processor will never complain about NMI injection in this
state.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
KVM-Stable-Tag
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-13 01:31:37 -03:00
Dongxiao Xu
fe19c5a46b KVM: x86: Call vcpu_load and vcpu_put in cpuid_update
cpuid_update may operate VMCS, so vcpu_load() and vcpu_put()
should be called to ensure correctness.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-13 01:31:02 -03:00
Joerg Roedel
061e2fd168 KVM: SVM: Fix wrong intercept masks on 32 bit
This patch makes KVM on 32 bit SVM working again by
correcting the masks used for iret interception. With the
wrong masks the upper 32 bits of the intercepts are masked
out which leaves vmrun unintercepted. This is not legal on
svm and the vmrun fails.
Bug was introduced by commits 95ba827313 and 3cfc3092.

Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-13 01:24:08 -03:00
Gleb Natapov
ea79849d4c KVM: x86 emulator: Implement jmp far opcode ff/5
Implement jmp far opcode ff/5. It is used by multiboot loader.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:45 +03:00
Gleb Natapov
e35b7b9c9e KVM: x86 emulator: Add decoding of 16bit second in memory argument
Add decoding of Ep type of argument used by callf/jmpf.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:42 +03:00
Gleb Natapov
2d49ec72d3 KVM: move segment_base() into vmx.c
segment_base() is used only by vmx so move it there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:39 +03:00
Gleb Natapov
254d4d48a5 KVM: fix segment_base() error checking
fix segment_base() to properly check for null segment selector and
avoid accessing NULL pointer if ldt selector in null.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:35 +03:00
Gleb Natapov
d6ab1ed446 KVM: Drop kvm_get_gdt() in favor of generic linux function
Linux now has native_store_gdt() to do the same. Use it instead of
kvm local version.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:32 +03:00
Joerg Roedel
197717d581 KVM: SVM: Clear exit_info for injected INTR exits
When injecting an vmexit.intr into the nested hypervisor
there might be leftover values in the exit_info fields.
Clear them to not confuse nested hypervisors.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:26 +03:00
Joerg Roedel
7f5d8b5600 KVM: SVM: Handle nested selective_cr0 intercept correctly
If we have the following situation with nested svm:

1. Host KVM intercepts cr0 writes
2. Guest hypervisor intercepts only selective cr0 writes

Then we get an cr0 write intercept which is handled on the
host. But that intercepts may actually be a selective cr0
intercept for the guest. This patch checks for this
condition and injects a selective cr0 intercept if needed.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:23 +03:00
Joerg Roedel
b44ea385d8 KVM: x86: Don't set arch.cr0 in kvm_set_cr0
The vcpu->arch.cr0 variable is already set in the
architecture specific set_cr0 callbacks. There is no need to
set it in the common code.
This allows the architecture code to keep the old arch.cr0
value if it wants. This is required for nested svm to decide
if a selective_cr0 exit needs to be injected.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:20 +03:00
Joerg Roedel
82494028df KVM: SVM: Ignore write of hwcr.ignne
Hyper-V as a guest wants to write this bit. This patch
ignores it.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:17 +03:00
Joerg Roedel
4a810181c8 KVM: SVM: Implement emulation of vm_cr msr
This patch implements the emulation of the vm_cr msr for
nested svm.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:14 +03:00
Joerg Roedel
2e554e8d67 KVM: SVM: Add kvm_nested_intercepts tracepoint
This patch adds a tracepoint to get information about the
most important intercept bitmasks from the nested vmcb.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:10 +03:00
Joerg Roedel
ecf1405df2 KVM: SVM: Restore tracing of nested vmcb address
A recent change broke tracing of the nested vmcb address. It
was reported as 0 all the time. This patch fixes it.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:07 +03:00
Joerg Roedel
887f500ca1 KVM: SVM: Check for nested intercepts on NMI injection
This patch implements the NMI intercept checking for nested
svm.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:04 +03:00
Joerg Roedel
0e5cbe368b KVM: SVM: Reset MMU on nested_svm_vmrun for NPT too
Without resetting the MMU the gva_to_pga function will not
work reliably when the vcpu is running in nested context.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:53:01 +03:00
Joerg Roedel
e02317153e KVM: SVM: Coding style cleanup
This patch removes whitespace errors, fixes comment formats
and most of checkpatch warnings. Now vim does not show
c-space-errors anymore.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:52:58 +03:00
Jan Kiszka
83bf0002c9 KVM: x86: Preserve injected TF across emulation
Call directly into the vendor services for getting/setting rflags in
emulate_instruction to ensure injected TF survives the emulation.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:52:55 +03:00
Jan Kiszka
c310bac5a2 KVM: x86: Drop RF manipulation for guest single-stepping
RF is not required for injecting TF as the latter will trigger only
after an instruction execution anyway. So do not touch RF when arming or
disarming guest single-step mode.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:52:51 +03:00
Jan Kiszka
66b7138f91 KVM: SVM: Emulate nRIP feature when reinjecting INT3
When in guest debugging mode, we have to reinject those #BP software
exceptions that are caused by guest-injected INT3. As older AMD
processors do not support the required nRIP VMCB field, try to emulate
it by moving RIP past the instruction on exception injection. Fix it up
again in case the injection failed and we were able to catch this. This
does not work for unintercepted faults, but it is better than doing
nothing.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:00:43 +03:00
Jan Kiszka
f92653eeb4 KVM: x86: Add kvm_is_linear_rip
Based on Gleb's suggestion: Add a helper kvm_is_linear_rip that matches
a given linear RIP against the current one. Use this for guest
single-stepping, more users will follow.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:00:40 +03:00
Jan Kiszka
116a4752c8 KVM: SVM: Move svm_queue_exception
Move svm_queue_exception past skip_emulated_instruction to allow calling
it later on.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 13:00:37 +03:00
Jan Kiszka
50a085bdd4 KVM: x86: Kick VCPU outside PIC lock again
This restores the deferred VCPU kicking before 956f97cf. We need this
over -rt as wake_up* requires non-atomic context in this configuration.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:39:28 +03:00
Jan Kiszka
a1efbe77c1 KVM: x86: Add support for saving&restoring debug registers
So far user space was not able to save and restore debug registers for
migration or after reset. Plug this hole.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:39:10 +03:00
Jan Kiszka
48005f64d0 KVM: x86: Save&restore interrupt shadow mask
The interrupt shadow created by STI or MOV-SS-like operations is part of
the VCPU state and must be preserved across migration. Transfer it in
the spare padding field of kvm_vcpu_events.interrupt.

As a side effect we now have to make vmx_set_interrupt_shadow robust
against both shadow types being set. Give MOV SS a higher priority and
skip STI in that case to avoid that VMX throws a fault on next entry.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:38:28 +03:00
Jan Kiszka
03b82a30ea KVM: x86: Do not return soft events in vcpu_events
To avoid that user space migrates a pending software exception or
interrupt, mask them out on KVM_GET_VCPU_EVENTS. Without this, user
space would try to reinject them, and we would have to reconstruct the
proper instruction length for VMX event injection. Now the pending event
will be reinjected via executing the triggering instruction again.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:38:14 +03:00
Joerg Roedel
8fe546547c KVM: SVM: Fix wrong interrupt injection in enable_irq_windows
The nested_svm_intr() function does not execute the vmexit
anymore. Therefore we may still be in the nested state after
that function ran. This patch changes the nested_svm_intr()
function to return wether the irq window could be enabled.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:38:11 +03:00
Gleb Natapov
112592da0d KVM: drop unneeded kvm_run check in emulate_instruction()
vcpu->run is initialized on vcpu creation and can never be NULL
here.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:38:08 +03:00
Joerg Roedel
052ce6211c KVM: SVM: Remove newlines from nested trace points
The tracing infrastructure adds its own newlines. Remove
them from the trace point printk format strings.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:31 +03:00
Joerg Roedel
66a562f7e2 KVM: SVM: Make lazy FPU switching work with nested svm
The new lazy fpu switching code may disable cr0 intercepts
when running nested. This is a bug because the nested
hypervisor may still want to intercept cr0 which will break
in this situation. This patch fixes this issue and makes
lazy fpu switching working with nested svm.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:28 +03:00
Joerg Roedel
06fc777269 KVM: SVM: Activate nested state only when guest state is complete
Certain functions called during the emulated world switch
behave differently when the vcpu is running nested. This is
not the expected behavior during a world switch emulation.
This patch ensures that the nested state is activated only
if the vcpu is completly in nested state.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:25 +03:00
Joerg Roedel
88ab24adc7 KVM: SVM: Don't sync nested cr8 to lapic and back
This patch makes syncing of the guest tpr to the lapic
conditional on !nested. Otherwise a nested guest using the
TPR could freeze the guest.
Another important change this patch introduces is that the
cr8 intercept bits are no longer ORed at vmrun emulation if
the guest sets VINTR_MASKING in its VMCB. The reason is that
nested cr8 accesses need alway be handled by the nested
hypervisor because they change the shadow version of the
tpr.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:22 +03:00
Joerg Roedel
4c7da8cb43 KVM: SVM: Fix nested msr intercept handling
The nested_svm_exit_handled_msr() function maps only one
page of the guests msr permission bitmap. This patch changes
the code to use kvm_read_guest to fix the bug.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:19 +03:00
Joerg Roedel
6c3bd3d766 KVM: SVM: Annotate nested_svm_map with might_sleep()
The nested_svm_map() function can sleep and must not be
called from atomic context. So annotate that function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:16 +03:00
Joerg Roedel
cdbbdc1210 KVM: SVM: Sync all control registers on nested vmexit
Currently the vmexit emulation does not sync control
registers were the access is typically intercepted by the
nested hypervisor. But we can not count on that intercepts
to sync these registers too and make the code
architecturally more correct.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:13 +03:00
Joerg Roedel
b8e88bc8ff KVM: SVM: Fix schedule-while-atomic on nested exception handling
Move the actual vmexit routine out of code that runs with
irqs and preemption disabled.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:10 +03:00
Joerg Roedel
7597f129d8 KVM: SVM: Don't use kmap_atomic in nested_svm_map
Use of kmap_atomic disables preemption but if we run in
shadow-shadow mode the vmrun emulation executes kvm_set_cr3
which might sleep or fault. So use kmap instead for
nested_svm_map.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:34:07 +03:00
Takuya Yoshikawa
0e4176a15f KVM: x86 emulator: Fix x86_emulate_insn() not to use the variable rc for non-X86EMUL values
This patch makes non-X86EMUL_* family functions not to use
the variable rc.

Be sure that this changes nothing but makes the purpose of
the variable rc clearer.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:27:50 +03:00
Takuya Yoshikawa
1b30eaa846 KVM: x86 emulator: X86EMUL macro replacements: x86_emulate_insn() and its helpers
This patch just replaces integer values used inside
x86_emulate_insn() and its helper functions to X86EMUL_*.

The purpose of this is to make it clear what will happen
when the variable rc is compared to X86EMUL_* at the end
of x86_emulate_insn().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:27:46 +03:00
Takuya Yoshikawa
3e2815e9fa KVM: x86 emulator: X86EMUL macro replacements: from do_fetch_insn_byte() to x86_decode_insn()
This patch just replaces the integer values used inside x86's
decode functions to X86EMUL_*.

By this patch, it becomes clearer that we are using X86EMUL_*
value propagated from ops->read_std() in do_fetch_insn_byte().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:27:43 +03:00
Gleb Natapov
1161624f15 KVM: inject #UD in 64bit mode from instruction that are not valid there
Some instruction are obsolete in a long mode. Inject #UD.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:27:40 +03:00
Gleb Natapov
89a27f4d0e KVM: use desc_ptr struct instead of kvm private descriptor_table
x86 arch defines desc_ptr for idt/gdt pointers, no need to define
another structure in kvm code.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-25 12:27:28 +03:00
Ingo Molnar
70bce3ba77 Merge branch 'linus' into perf/core
Merge reason: merge the latest fixes, update to latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:10:30 +02:00
Jan Kiszka
e8861cfe2c KVM: x86: Fix TSS size check for 16-bit tasks
A 16-bit TSS is only 44 bytes long. So make sure to test for the correct
size on task switch.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-21 13:51:42 +03:00
Takuya Yoshikawa
87bf6e7de1 KVM: fix the handling of dirty bitmaps to avoid overflows
Int is not long enough to store the size of a dirty bitmap.

This patch fixes this problem with the introduction of a wrapper
function to calculate the sizes of dirty bitmaps.

Note: in mark_page_dirty(), we have to consider the fact that
  __set_bit() takes the offset as int, not long.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-04-20 13:06:55 +03:00
Xiao Guangrong
77662e0028 KVM: MMU: fix kvm_mmu_zap_page() and its calling path
This patch fix:

- calculate zapped page number properly in mmu_zap_unsync_children()
- calculate freeed page number properly kvm_mmu_change_mmu_pages()
- if zapped children page it shoud restart hlist walking

KVM-Stable-Tag.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-04-20 12:59:32 +03:00
Avi Kivity
78ac8b47c5 KVM: VMX: Save/restore rflags.vm correctly in real mode
Currently we set eflags.vm unconditionally when entering real mode emulation
through virtual-8086 mode, and clear it unconditionally when we enter protected
mode.  The means that the following sequence

  KVM_SET_REGS  (rflags.vm=1)
  KVM_SET_SREGS (cr0.pe=1)

Ends up with rflags.vm clear due to KVM_SET_SREGS triggering enter_pmode().

Fix by shadowing rflags.vm (and rflags.iopl) correctly while in real mode:
reads and writes to those bits access a shadow register instead of the actual
register.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-04-20 12:59:31 +03:00
Andre Przywara
114be429c8 KVM: allow bit 10 to be cleared in MSR_IA32_MC4_CTL
There is a quirk for AMD K8 CPUs in many Linux kernels (see
arch/x86/kernel/cpu/mcheck/mce.c:__mcheck_cpu_apply_quirks()) that
clears bit 10 in that MCE related MSR. KVM can only cope with all
zeros or all ones, so it will inject a #GP into the guest, which
will let it panic.
So lets add a quirk to the quirk and ignore this single cleared bit.
This fixes -cpu kvm64 on all machines and -cpu host on K8 machines
with some guest Linux kernels.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-20 12:59:31 +03:00
Avi Kivity
d6a23895aa KVM: Don't spam kernel log when injecting exceptions due to bad cr writes
These are guest-triggerable.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-20 12:55:05 +03:00
Takuya Yoshikawa
b7af404338 KVM: SVM: Fix memory leaks that happen when svm_create_vcpu() fails
svm_create_vcpu() does not free the pages allocated during the creation
when it fails to complete the allocations. This patch fixes it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-20 12:55:04 +03:00
Gleb Natapov
7567cae105 KVM: take srcu lock before call to complete_pio()
complete_pio() may use slot table which is protected by srcu.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-20 12:55:04 +03:00
Zhang, Yanmin
dcf46b9443 perf & kvm: Clean up some of the guest profiling callback API details
Fix some build bug and programming style issues:

 - use valid C
 - fix up various style details

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sheng Yang <sheng@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: oerg Roedel <joro@8bytes.org>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Zachary Amsden <zamsden@redhat.com>
Cc: zhiteng.huang@intel.com
Cc: tim.c.chen@intel.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
LKML-Reference: <1271729638.2078.624.camel@ymzhang.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-20 08:08:28 +02:00
Zhang, Yanmin
ff9d07a0e7 KVM: Implement perf callbacks for guest sampling
Below patch implements the perf_guest_info_callbacks on kvm.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-19 12:36:50 +03:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Linus Torvalds
c812a51d11 Merge branch 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (145 commits)
  KVM: x86: Add KVM_CAP_X86_ROBUST_SINGLESTEP
  KVM: VMX: Update instruction length on intercepted BP
  KVM: Fix emulate_sys[call, enter, exit]()'s fault handling
  KVM: Fix segment descriptor loading
  KVM: Fix load_guest_segment_descriptor() to inject page fault
  KVM: x86 emulator: Forbid modifying CS segment register by mov instruction
  KVM: Convert kvm->requests_lock to raw_spinlock_t
  KVM: Convert i8254/i8259 locks to raw_spinlocks
  KVM: x86 emulator: disallow opcode 82 in 64-bit mode
  KVM: x86 emulator: code style cleanup
  KVM: Plan obsolescence of kernel allocated slots, paravirt mmu
  KVM: x86 emulator: Add LOCK prefix validity checking
  KVM: x86 emulator: Check CPL level during privilege instruction emulation
  KVM: x86 emulator: Fix popf emulation
  KVM: x86 emulator: Check IOPL level during io instruction emulation
  KVM: x86 emulator: fix memory access during x86 emulation
  KVM: x86 emulator: Add Virtual-8086 mode of emulation
  KVM: x86 emulator: Add group9 instruction decoding
  KVM: x86 emulator: Add group8 instruction decoding
  KVM: do not store wqh in irqfd
  ...

Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-05 13:12:34 -08:00
Jan Kiszka
d2be1651b7 KVM: x86: Add KVM_CAP_X86_ROBUST_SINGLESTEP
This marks the guest single-step API improvement of 94fe45da and
91586a3b with a capability flag to allow reliable detection by user
space.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: stable@kernel.org (2.6.33)
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:14 -03:00
Jan Kiszka
c573cd2293 KVM: VMX: Update instruction length on intercepted BP
We intercept #BP while in guest debugging mode. As VM exits due to
intercepted exceptions do not necessarily come with valid
idt_vectoring, we have to update event_exit_inst_len explicitly in such
cases. At least in the absence of migration, this ensures that
re-injections of #BP will find and use the correct instruction length.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: stable@kernel.org (2.6.32, 2.6.33)
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:14 -03:00
Takuya Yoshikawa
e54cfa97a9 KVM: Fix emulate_sys[call, enter, exit]()'s fault handling
This patch fixes emulate_syscall(), emulate_sysenter() and
emulate_sysexit() to handle injected faults properly.

Even though original code injects faults in these functions,
we cannot handle these unless we use the different return
value from the UNHANDLEABLE case. So this patch use X86EMUL_*
codes instead of -1 and 0 and makes x86_emulate_insn() to
handle these propagated faults.

Be sure that, in x86_emulate_insn(), goto cannot_emulate and
goto done with rc equals X86EMUL_UNHANDLEABLE have same effect.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:14 -03:00
Gleb Natapov
c697518a86 KVM: Fix segment descriptor loading
Add proper error and permission checking. This patch also change task
switching code to load segment selectors before segment descriptors, like
SDM requires, otherwise permission checking during segment descriptor
loading will be incorrect.

Cc: stable@kernel.org (2.6.33, 2.6.32)
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:14 -03:00
Takuya Yoshikawa
6f550484a1 KVM: Fix load_guest_segment_descriptor() to inject page fault
This patch injects page fault when reading descriptor in
load_guest_segment_descriptor() fails with FAULT.

Effects of this injection: This function is used by
kvm_load_segment_descriptor() which is necessary for the
following instructions:

 - mov seg,r/m16
 - jmp far
 - pop ?s

This patch makes it possible to emulate the page faults
generated by these instructions. But be sure that unless
we change the kvm_load_segment_descriptor()'s ret value
propagation this patch has no effect.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:13 -03:00
Gleb Natapov
8b9f44140b KVM: x86 emulator: Forbid modifying CS segment register by mov instruction
Inject #UD if guest attempts to do so. This is in accordance to Intel
SDM.

Cc: stable@kernel.org (2.6.33, 2.6.32)
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:13 -03:00
Thomas Gleixner
fa8273e954 KVM: Convert i8254/i8259 locks to raw_spinlocks
The i8254/i8259 locks need to be real spinlocks on preempt-rt. Convert
them to raw_spinlock. No change for !RT kernels.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:12 -03:00
Gleb Natapov
e424e19183 KVM: x86 emulator: disallow opcode 82 in 64-bit mode
Instructions with opcode 82 are not valid in 64 bit mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:12 -03:00
Wei Yongjun
1d327eac3c KVM: x86 emulator: code style cleanup
Just remove redundant semicolon.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:12 -03:00
Gleb Natapov
d380a5e402 KVM: x86 emulator: Add LOCK prefix validity checking
Instructions which are not allowed to have LOCK prefix should
generate #UD if one is used.

[avi: fold opcode 82 fix from another patch]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:11 -03:00
Gleb Natapov
e92805ac12 KVM: x86 emulator: Check CPL level during privilege instruction emulation
Add CPL checking in case emulator is tricked into emulating
privilege instruction from userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:11 -03:00
Gleb Natapov
d4c6a1549c KVM: x86 emulator: Fix popf emulation
POPF behaves differently depending on current CPU mode. Emulate correct
logic to prevent guest from changing flags that it can't change otherwise.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:11 -03:00
Gleb Natapov
f850e2e603 KVM: x86 emulator: Check IOPL level during io instruction emulation
Make emulator check that vcpu is allowed to execute IN, INS, OUT,
OUTS, CLI, STI.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:11 -03:00
Gleb Natapov
1871c6020d KVM: x86 emulator: fix memory access during x86 emulation
Currently when x86 emulator needs to access memory, page walk is done with
broadest permission possible, so if emulated instruction was executed
by userspace process it can still access kernel memory. Fix that by
providing correct memory access to page walker during emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:11 -03:00
Gleb Natapov
a004475567 KVM: x86 emulator: Add Virtual-8086 mode of emulation
For some instructions CPU behaves differently for real-mode and
virtual 8086. Let emulator know which mode cpu is in, so it will
not poke into vcpu state directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:11 -03:00
Gleb Natapov
60a29d4ea4 KVM: x86 emulator: Add group9 instruction decoding
Use groups mechanism to decode 0F C7 instructions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:10 -03:00
Gleb Natapov
2db2c2eb62 KVM: x86 emulator: Add group8 instruction decoding
Use groups mechanism to decode 0F BA instructions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:10 -03:00
Wei Yongjun
72bb2fcd23 KVM: cleanup the failure path of KVM_CREATE_IRQCHIP ioctrl
If we fail to init ioapic device or the fail to setup the default irq
routing, the device register by kvm_create_pic() and kvm_ioapic_init()
remain unregister. This patch fixed to do this.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:10 -03:00
Wei Yongjun
d225f53b76 KVM: PIT: unregister kvm irq notifier if fail to create pit
If fail to create pit, we should unregister kvm irq notifier
which register in kvm_create_pit().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:09 -03:00
Sheng Yang
a19a6d1131 KVM: VMX: Rename VMX_EPT_IGMT_BIT to VMX_EPT_IPAT_BIT
Following the new SDM. Now the bit is named "Ignore PAT memory type".

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:09 -03:00
Avi Kivity
90bb6fc556 KVM: MMU: Add tracepoint for guest page aging
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:09 -03:00
Takuya Yoshikawa
1976d2d2c9 KVM: Remove redundant reading of rax on OUT instructions
kvm_emulate_pio() and complete_pio() both read out the
RAX register value and copy it to a place into which
the value read out from the port will be copied later.

This patch removes this redundancy.

/*** snippet from arch/x86/kvm/x86.c ***/
int complete_pio(struct kvm_vcpu *vcpu)
{
	...
	if (!io->string) {
		if (io->in) {
			val = kvm_register_read(vcpu, VCPU_REGS_RAX);
			memcpy(&val, vcpu->arch.pio_data, io->size);
			kvm_register_write(vcpu, VCPU_REGS_RAX, val);
		}
	...

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:09 -03:00
Rik van Riel
6316e1c8c6 KVM: VMX: emulate accessed bit for EPT
Currently KVM pretends that pages with EPT mappings never got
accessed.  This has some side effects in the VM, like swapping
out actively used guest pages and needlessly breaking up actively
used hugepages.

We can avoid those very costly side effects by emulating the
accessed bit for EPT PTEs, which should only be slightly costly
because pages pass through page_referenced infrequently.

TLB flushing is taken care of by kvm_mmu_notifier_clear_flush_young().

This seems to help prevent KVM guests from being swapped out when
they should not on my system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:08 -03:00
Joerg Roedel
8f0b1ab6fb KVM: Introduce kvm_host_page_size
This patch introduces a generic function to find out the
host page size for a given gfn. This function is needed by
the kvm iommu code. This patch also simplifies the x86
host_mapping_level function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:08 -03:00
Julia Lawall
c45b4fd416 KVM: VMX: Remove redundant test in vmx_set_efer()
msr was tested above, so the second test is not needed.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r@
expression *x;
expression e;
identifier l;
@@

if (x == NULL || ...) {
    ... when forall
    return ...; }
... when != goto l;
    when != x = e
    when != &x
*x == NULL
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:07 -03:00
Avi Kivity
ebcbab4c03 KVM: VMX: Wire up .fpu_activate() callback
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:07 -03:00
Takuya Yoshikawa
7edcface95 KVM: fix kvm_fix_hypercall() to return X86EMUL_*
This patch fixes kvm_fix_hypercall() to propagate X86EMUL_*
info generated by emulator_write_emulated() to its callers:
suggested by Marcelo.

The effect of this is x86_emulate_insn() will begin to handle
the page faults which occur in emulator_write_emulated():
this should be OK because emulator_write_emulated_onepage()
always injects page fault when emulator_write_emulated()
returns X86EMUL_PROPAGATE_FAULT.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:07 -03:00
Takuya Yoshikawa
c125c60732 KVM: fix load_guest_segment_descriptor() to return X86EMUL_*
This patch fixes load_guest_segment_descriptor() to return
X86EMUL_PROPAGATE_FAULT when it tries to access the descriptor
table beyond the limit of it: suggested by Marcelo.

I have checked current callers of this helper function,
  - kvm_load_segment_descriptor()
  - kvm_task_switch()
and confirmed that this patch will change nothing in the
upper layers if we do not change the handling of this
return value from load_guest_segment_descriptor().

Next step: Although fixing the kvm_task_switch() to handle the
propagated faults properly seems difficult, and maybe not worth
it because TSS is not used commonly these days, we can fix
kvm_load_segment_descriptor(). By doing so, the injected #GP
becomes possible to be handled by the guest. The only problem
for this is how to differentiate this fault from the page faults
generated by kvm_read_guest_virt(). We may have to split this
function to achive this goal.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:07 -03:00
Zhai, Edwin
ab9f4ecbb6 KVM: enable PCI multiple-segments for pass-through device
Enable optional parameter (default 0) - PCI segment (or domain) besides
BDF, when assigning PCI device to guest.

Signed-off-by: Zhai Edwin <edwin.zhai@intel.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Gui Jianfeng
6d3e435e70 KVM: VMX: Remove redundant check in vm_need_virtualize_apic_accesses()
flexpriority_enabled implies cpu_has_vmx_virtualize_apic_accesses() returning
true, so we don't need this check here.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Avi Kivity
59200273c4 KVM: Trace failed msr reads and writes
Record failed msrs reads and writes, and the fact that they failed as well.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Avi Kivity
6e7d152967 KVM: Fix msr trace
- data is 64 bits wide, not unsigned long
- rw is confusingly named

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Gleb Natapov
e01c242614 KVM: mark segments accessed on HW task switch
On HW task switch newly loaded segments should me marked as accessed.

Reported-by: Lorenzo Martignoni <martignlo@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Avi Kivity
81231c698a KVM: VMX: Pass cr0.mp through to the guest when the fpu is active
When cr0.mp is clear, the guest doesn't expect a #NM in response to
a WAIT instruction.  Because we always keep cr0.mp set, it will get
a #NM, and potentially be confused.

Fix by keeping cr0.mp set only when the fpu is inactive, and passing
it through when inactive.

Reported-by: Lorenzo Martignoni <martignlo@gmail.com>
Analyzed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Wei Yongjun
d7fa6ab217 KVM: MMU: Remove some useless code from alloc_mmu_pages()
If we fail to alloc page for vcpu->arch.mmu.pae_root, call to
free_mmu_pages() is unnecessary, which just do free the page
malloc for vcpu->arch.mmu.pae_root.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:05 -03:00
Avi Kivity
0c04851c0c KVM: trace guest fpu loads and unloads
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:05 -03:00
Avi Kivity
8ae0991276 KVM: Optimize kvm_read_cr[04]_bits()
'mask' is always a constant, so we can check whether it includes a bit that
might be owned by the guest very cheaply, and avoid the decache call.  Saves
a few hundred bytes of module text.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:05 -03:00
Avi Kivity
f6801dff23 KVM: Rename vcpu->shadow_efer to efer
None of the other registers have the shadow_ prefix.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
836a1b3c34 KVM: Move cr0/cr4/efer related helpers to x86.h
They have more general scope than the mmu.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
3eeb3288bc KVM: Add a helper for checking if the guest is in protected mode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
6b52d18605 KVM: Activate fpu on clts
Assume that if the guest executes clts, it knows what it's doing, and load the
guest fpu to prevent an #NM exception.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
e5bb40251a KVM: Drop kvm_{load,put}_guest_fpu() exports
Not used anymore.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:04 -03:00
Avi Kivity
2608d7a12f KVM: Allow kvm_load_guest_fpu() even when !vcpu->fpu_active
This allows accessing the guest fpu from the instruction emulator, as well as
being symmetric with kvm_put_guest_fpu().

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:03 -03:00
Gleb Natapov
ab344828eb KVM: x86: fix checking of cr0 validity
Move to/from Control Registers chapter of Intel SDM says.  "Reserved bits
in CR0 remain clear after any load of those registers; attempts to set
them have no impact". Control Register chapter says "Bits 63:32 of CR0 are
reserved and must be written with zeros. Writing a nonzero value to any
of the upper 32 bits results in a general-protection exception, #GP(0)."

This patch tries to implement this twisted logic.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reported-by: Lorenzo Martignoni <martignlo@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:03 -03:00
Jan Kiszka
727f5a23e2 KVM: SVM: Trap all debug register accesses
To enable proper debug register emulation under all conditions, trap
access to all DR0..7. This may be optimized later on.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:02 -03:00
Jan Kiszka
c76de350c8 KVM: SVM: Clean up and enhance mov dr emulation
Enhance mov dr instruction emulation used by SVM so that it properly
handles dr4/5: alias to dr6/7 if cr4.de is cleared. Otherwise return
EMULATE_FAIL which will let our only possible caller in that scenario,
ud_interception, re-inject UD.

We do not need to inject faults, SVM does this for us (exceptions take
precedence over instruction interceptions). For the same reason, the
value overflow checks can be removed.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:02 -03:00
Jan Kiszka
fd7373cce7 KVM: VMX: Clean up DR6 emulation
As we trap all debug register accesses, we do not need to switch real
DR6 at all. Clean up update_exception_bitmap at this chance, too.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:02 -03:00
Jan Kiszka
138ac8d88f KVM: VMX: Fix emulation of DR4 and DR5
Make sure DR4 and DR5 are aliased to DR6 and DR7, respectively, if
CR4.DE is not set.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Jan Kiszka
f248341529 KVM: VMX: Fix exceptions of mov to dr
Injecting GP without an error code is a bad idea (causes unhandled guest
exits). Moreover, we must not skip the instruction if we injected an
exception.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Takuya Yoshikawa
b60d513c32 KVM: x86: Use macros for x86_emulate_ops to avoid future mistakes
The return values from x86_emulate_ops are defined
in kvm_emulate.h as macros X86EMUL_*.

But in emulate.c, we are comparing the return values
from these ops with 0 to check if they're X86EMUL_CONTINUE
or not: X86EMUL_CONTINUE is defined as 0 now.

To avoid possible mistakes in the future, this patch
substitutes "X86EMUL_CONTINUE" for "0" that are being
compared with the return values from x86_emulate_ops.

  We think that there are more places we should use these
  macros, but the meanings of rc values in x86_emulate_insn()
  were not so clear at a glance. If we use proper macros in
  this function, we would be able to follow the flow of each
  emulation more easily and, maybe, more securely.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Marcelo Tosatti
6474920477 KVM: fix cleanup_srcu_struct on vm destruction
cleanup_srcu_struct on VM destruction remains broken:

BUG: unable to handle kernel paging request at ffffffffffffffff
IP: [<ffffffff802533d2>] srcu_read_lock+0x16/0x21
RIP: 0010:[<ffffffff802533d2>]  [<ffffffff802533d2>] srcu_read_lock+0x16/0x21
Call Trace:
 [<ffffffffa05354c4>] kvm_arch_vcpu_uninit+0x1b/0x48 [kvm]
 [<ffffffffa05339c6>] kvm_vcpu_uninit+0x9/0x15 [kvm]
 [<ffffffffa0569f7d>] vmx_free_vcpu+0x7f/0x8f [kvm_intel]
 [<ffffffffa05357b5>] kvm_arch_destroy_vm+0x78/0x111 [kvm]
 [<ffffffffa053315b>] kvm_put_kvm+0xd4/0xfe [kvm]

Move it to kvm_arch_destroy_vm.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
2010-03-01 12:36:01 -03:00
Gleb Natapov
ccd469362e KVM: fix Hyper-V hypercall warnings and wrong mask value
Fix compilation warnings and wrong mask value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Sheng Yang
7062dcaa36 KVM: VMX: Remove emulation failure report
As Avi noted:

>There are two problems with the kernel failure report.  First, it
>doesn't report enough data - registers, surrounding instructions, etc.
>that are needed to explain what is going on.  Second, it can flood
>dmesg, which is a pretty bad thing to do.

So we remove the emulation failure report in handle_invalid_guest_state(),
and would inspected the guest using userspace tool in the future.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:01 -03:00
Takuya Yoshikawa
8dae444529 KVM: rename is_writeble_pte() to is_writable_pte()
There are two spellings of "writable" in
arch/x86/kvm/mmu.c and paging_tmpl.h .

This patch renames is_writeble_pte() to is_writable_pte()
and makes grepping easy.

  New name is consistent with the definition of itself:
  return pte & PT_WRITABLE_MASK;

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:00 -03:00
Gleb Natapov
c25bc1638a KVM: Implement NotifyLongSpinWait HYPER-V hypercall
Windows issues this hypercall after guest was spinning on a spinlock
for too many iterations.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Vadim Rozenfeld <vrozenfe@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:00 -03:00
Gleb Natapov
10388a0716 KVM: Add HYPER-V apic access MSRs
Implement HYPER-V apic MSRs. Spec defines three MSRs that speed-up
access to EOI/TPR/ICR apic registers for PV guests.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Vadim Rozenfeld <vrozenfe@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:00 -03:00
Gleb Natapov
55cd8e5a4e KVM: Implement bare minimum of HYPER-V MSRs
Minimum HYPER-V implementation should have GUEST_OS_ID, HYPERCALL and
VP_INDEX MSRs.

[avi: fix build on i386]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Vadim Rozenfeld <vrozenfe@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:57 -03:00
Avi Kivity
4610c83cdc KVM: SVM: Lazy fpu with npt
Now that we can allow the guest to play with cr0 when the fpu is loaded,
we can enable lazy fpu when npt is in use.

Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:51 -03:00
Avi Kivity
d225157bc6 KVM: SVM: Selective cr0 intercept
If two conditions apply:
 - no bits outside TS and EM differ between the host and guest cr0
 - the fpu is active

then we can activate the selective cr0 write intercept and drop the
unconditional cr0 read and write intercept, and allow the guest to run
with the host fpu state.  This reduces cr0 exits due to guest fpu management
while the guest fpu is loaded.

Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:51 -03:00
Avi Kivity
888f9f3e0c KVM: SVM: Restore unconditional cr0 intercept under npt
Currently we don't intercept cr0 at all when npt is enabled.  This improves
performance but requires us to activate the fpu at all times.

Remove this behaviour in preparation for adding selective cr0 intercepts.

Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:51 -03:00
Avi Kivity
bff7827479 KVM: SVM: Initialize fpu_active in init_vmcb()
init_vmcb() sets up the intercepts as if the fpu is active, so initialize it
there.  This avoids an INIT from setting up intercepts inconsistent with
fpu_active.

Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:51 -03:00
Avi Kivity
f9a48e6a18 KVM: Set cr0.et when the guest writes cr0
Follow the hardware.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:51 -03:00
Avi Kivity
edcafe3c5a KVM: VMX: Give the guest ownership of cr0.ts when the fpu is active
If the guest fpu is loaded, there is nothing interesing about cr0.ts; let
the guest play with it as it will.  This makes context switches between fpu
intensive guest processes faster, as we won't trap the clts and cr0 write
instructions.

[marcelo: fix cr0 read shadow update on fpu deactivation; kills F8 install]

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
02daab21d9 KVM: Lazify fpu activation and deactivation
Defer fpu deactivation as much as possible - if the guest fpu is loaded, keep
it loaded until the next heavyweight exit (where we are forced to unload it).
This reduces unnecessary exits.

We also defer fpu activation on clts; while clts signals the intent to use the
fpu, we can't be sure the guest will actually use it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
e8467fda83 KVM: VMX: Allow the guest to own some cr0 bits
We will use this later to give the guest ownership of cr0.ts.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
4d4ec08745 KVM: Replace read accesses of vcpu->arch.cr0 by an accessor
Since we'd like to allow the guest to own a few bits of cr0 at times, we need
to know when we access those bits.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Avi Kivity
a1f83a74fe KVM: VMX: trace clts and lmsw instructions as cr accesses
clts writes cr0.ts; lmsw writes cr0[0:15] - record that in ftrace.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Sheng Yang
878403b788 KVM: VMX: Enable EPT 1GB page support
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:46 -03:00
Sheng Yang
17cc393596 KVM: x86: Rename gb_page_enable() to get_lpage_level() in kvm_x86_ops
Then the callback can provide the maximum supported large page level, which
is more flexible.

Also move the gb page support into x86_64 specific.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:46 -03:00
Sheng Yang
c9c5417455 KVM: x86: Moving PT_*_LEVEL to mmu.h
We can use them in x86.c and vmx.c now...

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:46 -03:00
Avi Kivity
f4c9e87c83 KVM: Fill out ftrace exit reason strings
Some exit reasons missed their strings; fill out the table.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
79fac95ecf KVM: convert slots_lock to a mutex
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
f656ce0185 KVM: switch vcpu context to use SRCU
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
e93f8a0f82 KVM: convert io_bus to SRCU
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
a983fb2387 KVM: x86: switch kvm_set_memory_alias to SRCU update
Using a similar two-step procedure as for memslots.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:45 -03:00
Marcelo Tosatti
b050b015ab KVM: use SRCU for dirty log
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:44 -03:00
Marcelo Tosatti
bc6678a33d KVM: introduce kvm->srcu and convert kvm_set_memory_region to SRCU update
Use two steps for memslot deletion: mark the slot invalid (which stops
instantiation of new shadow pages for that slot, but allows destruction),
then instantiate the new empty slot.

Also simplifies kvm_handle_hva locking.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:44 -03:00
Marcelo Tosatti
f7784b8ec9 KVM: split kvm_arch_set_memory_region into prepare and commit
Required for SRCU convertion later.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:44 -03:00
Marcelo Tosatti
fef9cce0eb KVM: modify alias layout in x86s struct kvm_arch
Have a pointer to an allocated region inside x86's kvm_arch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:43 -03:00
Marcelo Tosatti
46a26bf557 KVM: modify memslots layout in struct kvm
Have a pointer to an allocated region inside struct kvm.

[alex: fix ppc book 3s]

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:43 -03:00
Avi Kivity
50eb2a3cd0 KVM: Add KVM_MMIO kconfig item
s390 doesn't have mmio, this will simplify ifdefing it out.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:41 -03:00
Joerg Roedel
953899b659 KVM: SVM: Adjust tsc_offset only if tsc_unstable
The tsc_offset adjustment in svm_vcpu_load is executed
unconditionally even if Linux considers the host tsc as
stable. This causes a Linux guest detecting an unstable tsc
in any case.
This patch removes the tsc_offset adjustment if the host tsc
is stable. The guest will now get the benefit of a stable
tsc too.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:41 -03:00
Sheng Yang
4e47c7a6d7 KVM: VMX: Add instruction rdtscp support for guest
Before enabling, execution of "rdtscp" in guest would result in #UD.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Sheng Yang
0e85188049 KVM: Add cpuid_update() callback to kvm_x86_ops
Sometime, we need to adjust some state in order to reflect guest CPUID
setting, e.g. if we don't expose rdtscp to guest, we won't want to enable
it on hardware. cpuid_update() is introduced for this purpose.

Also export kvm_find_cpuid_entry() for later use.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Sheng Yang
2bf78fa7b9 KVM: Extended shared_msr_global to per CPU
shared_msr_global saved host value of relevant MSRs, but it have an
assumption that all MSRs it tracked shared the value across the different
CPUs. It's not true with some MSRs, e.g. MSR_TSC_AUX.

Extend it to per CPU to provide the support of MSR_TSC_AUX, and more
alike MSRs.

Notice now the shared_msr_global still have one assumption: it can only deal
with the MSRs that won't change in host after KVM module loaded.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Sheng Yang
8a7e3f01e6 KVM: VMX: Remove redundant variable
It's no longer necessary.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Avi Kivity
bc23008b61 KVM: VMX: Fold ept_update_paging_mode_cr4() into its caller
ept_update_paging_mode_cr4() accesses vcpu->arch.cr4 directly, which usually
needs to be accessed via kvm_read_cr4().  In this case, we can't, since cr4
is in the process of being updated.  Instead of adding inane comments, fold
the function into its caller (vmx_set_cr4), so it can use the not-yet-committed
cr4 directly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:40 -03:00
Avi Kivity
ce03e4f21a KVM: VMX: When using ept, allow the guest to own cr4.pge
We make no use of cr4.pge if ept is enabled, but the guest does (to flush
global mappings, as with vmap()), so give the guest ownership of this bit.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
4c38609ac5 KVM: VMX: Make guest cr4 mask more conservative
Instead of specifying the bits which we want to trap on, specify the bits
which we allow the guest to change transparently.  This is safer wrt future
changes to cr4.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
fc78f51938 KVM: Add accessor for reading cr4 (or some bits of cr4)
Some bits of cr4 can be owned by the guest on vmx, so when we read them,
we copy them to the vcpu structure.  In preparation for making the set of
guest-owned bits dynamic, use helpers to access these bits so we don't need
to know where the bit resides.

No changes to svm since all bits are host-owned there.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
cdc0e24456 KVM: VMX: Move some cr[04] related constants to vmx.c
They have no place in common code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Sheng Yang
59708670b6 KVM: VMX: Trap and invalid MWAIT/MONITOR instruction
We don't support these instructions, but guest can execute them even if the
feature('monitor') haven't been exposed in CPUID. So we would trap and inject
a #UD if guest try this way.

Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Avi Kivity
186a3e526a KVM: MMU: Report spte not found in rmap before BUG()
In the past we've had errors of single-bit in the other two cases; the
printk() may confirm it for the third case (many->many).

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:39 -03:00
Marcelo Tosatti
cb84b55f6c KVM: x86: raise TSS exception for NULL CS and SS segments
Windows 2003 uses task switch to triple fault and reboot (the other
exception being reserved pdptrs bits).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:38 -03:00
Eddie Dong
3fd28fce76 KVM: x86: make double/triple fault promotion generic to all exceptions
Move Double-Fault generation logic out of page fault
exception generating function to cover more generic case.

Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:35:38 -03:00
David S. Miller
2bb4646fce Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-16 22:09:29 -08:00
Marcelo Tosatti
ee73f656a6 KVM: PIT: control word is write-only
PIT control word (address 0x43) is write-only, reads are undefined.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-02-09 19:20:15 +02:00
Jason Wang
923de3cf5b kvmclock: count total_sleep_time when updating guest clock
Current kvm wallclock does not consider the total_sleep_time which could cause
wrong wallclock in guest after host suspend/resume. This patch solve
this issue by counting total_sleep_time to get the correct host boot time.

Cc: stable@kernel.org
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-02-09 19:20:15 +02:00
Wei Yongjun
443c39bc9e KVM: x86: Fix leak of free lapic date in kvm_arch_vcpu_init()
In function kvm_arch_vcpu_init(), if the memory malloc for
vcpu->arch.mce_banks is fail, it does not free the memory
of lapic date. This patch fixed it.

Cc: stable@kernel.org
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-01-25 12:26:40 -02:00
Wei Yongjun
36cb93fd6b KVM: x86: Fix probable memory leak of vcpu->arch.mce_banks
vcpu->arch.mce_banks is malloc in kvm_arch_vcpu_init(), but
never free in any place, this may cause memory leak. So this
patch fixed to free it in kvm_arch_vcpu_uninit().

Cc: stable@kernel.org
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-01-25 12:26:40 -02:00
Marcelo Tosatti
a6085fbaf6 KVM: MMU: bail out pagewalk on kvm_read_guest error
Exit the guest pagetable walk loop if reading gpte failed. Otherwise its
possible to enter an endless loop processing the previous present pte.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-01-25 12:26:38 -02:00
Sheng Yang
82b7005f0e KVM: x86: Fix host_mapping_level()
When found a error hva, should not return PAGE_SIZE but the level...

Also clean up the coding style of the following loop.

Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-01-25 12:26:37 -02:00
Avi Kivity
a5d36f82c4 KVM: Fix race between APIC TMR and IRR
When we queue an interrupt to the local apic, we set the IRR before the TMR.
The vcpu can pick up the IRR and inject the interrupt before setting the TMR,
and perhaps even EOI it, causing incorrect behaviour.

The race is really insignificant since it can only occur on the first
interrupt (usually following interrupts will not change TMR), but it's better
closed than open.

Fixed by reordering setting the TMR vs IRR.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-01-25 12:26:36 -02:00
David S. Miller
51c24aaaca Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-01-23 00:31:06 -08:00
Michael S. Tsirkin
3a4d5c94e9 vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device.  Backend is also configured by userspace, including vlan/mac
etc.

Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item.  The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.

(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)

Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-15 01:43:29 -08:00
Jan Kiszka
dab4b911a5 KVM: x86: Extend KVM_SET_VCPU_EVENTS with selective updates
User space may not want to overwrite asynchronously changing VCPU event
states on write-back. So allow to skip nmi.pending and sipi_vector by
setting corresponding bits in the flags field of kvm_vcpu_events.

[avi: advertise the bits in KVM_GET_VCPU_EVENTS]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-27 13:36:33 -02:00
Marcelo Tosatti
6e24a6eff4 KVM: LAPIC: make sure IRR bitmap is scanned after vm load
The vcpus are initialized with irr_pending set to false, but
loading the LAPIC registers with pending IRR fails to reset
the irr_pending variable.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-27 13:36:31 -02:00
Marcelo Tosatti
fb341f572d KVM: MMU: remove prefault from invlpg handler
The invlpg prefault optimization breaks Windows 2008 R2 occasionally.

The visible effect is that the invlpg handler instantiates a pte which
is, microseconds later, written with a different gfn by another vcpu.

The OS could have other mechanisms to prevent a present translation from
being used, which the hypervisor is unaware of.

While the documentation states that the cpu is at liberty to prefetch tlb
entries, it looks like this is not heeded, so remove tlb prefetch from
invlpg.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-27 13:36:30 -02:00
Linus Torvalds
d0316554d3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
  m68k: rename global variable vmalloc_end to m68k_vmalloc_end
  percpu: add missing per_cpu_ptr_to_phys() definition for UP
  percpu: Fix kdump failure if booted with percpu_alloc=page
  percpu: make misc percpu symbols unique
  percpu: make percpu symbols in ia64 unique
  percpu: make percpu symbols in powerpc unique
  percpu: make percpu symbols in x86 unique
  percpu: make percpu symbols in xen unique
  percpu: make percpu symbols in cpufreq unique
  percpu: make percpu symbols in oprofile unique
  percpu: make percpu symbols in tracer unique
  percpu: make percpu symbols under kernel/ and mm/ unique
  percpu: remove some sparse warnings
  percpu: make alloc_percpu() handle array types
  vmalloc: fix use of non-existent percpu variable in put_cpu_var()
  this_cpu: Use this_cpu_xx in trace_functions_graph.c
  this_cpu: Use this_cpu_xx for ftrace
  this_cpu: Use this_cpu_xx in nmi handling
  this_cpu: Use this_cpu operations in RCU
  this_cpu: Use this_cpu ops for VM statistics
  ...

Fix up trivial (famous last words) global per-cpu naming conflicts in
	arch/x86/kvm/svm.c
	mm/slab.c
2009-12-14 09:58:24 -08:00
Joe Perches
a78d9626f4 x86: i8254.c: Add pr_fmt(fmt)
- Add pr_fmt(fmt) "pit: " fmt
 - Strip pit: prefixes from pr_debug

Signed-off-by: Joe Perches <joe@perches.com>
LKML-Reference: <bbd4de532f18bb7c11f64ba20d224c08291cb126.1260383912.git.joe@perches.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-10 08:57:50 +01:00
Linus Torvalds
ed9216c171 Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits)
  KVM: VMX: Fix comparison of guest efer with stale host value
  KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c
  KVM: Drop user return notifier when disabling virtualization on a cpu
  KVM: VMX: Disable unrestricted guest when EPT disabled
  KVM: x86 emulator: limit instructions to 15 bytes
  KVM: s390: Make psw available on all exits, not just a subset
  KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
  KVM: VMX: Report unexpected simultaneous exceptions as internal errors
  KVM: Allow internal errors reported to userspace to carry extra data
  KVM: Reorder IOCTLs in main kvm.h
  KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
  KVM: only clear irq_source_id if irqchip is present
  KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
  KVM: x86: disallow multiple KVM_CREATE_IRQCHIP
  KVM: VMX: Remove vmx->msr_offset_efer
  KVM: MMU: update invlpg handler comment
  KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
  KVM: remove duplicated task_switch check
  KVM: powerpc: Fix BUILD_BUG_ON condition
  KVM: VMX: Use shared msr infrastructure
  ...

Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile
2009-12-08 08:02:38 -08:00
Avi Kivity
d5696725b2 KVM: VMX: Fix comparison of guest efer with stale host value
update_transition_efer() masks out some efer bits when deciding whether
to switch the msr during guest entry; for example, NX is emulated using the
mmu so we don't need to disable it, and LMA/LME are handled by the hardware.

However, with shared msrs, the comparison is made against a stale value;
at the time of the guest switch we may be running with another guest's efer.

Fix by deferring the mask/compare to the actual point of guest entry.

Noted by Marcelo.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:34:20 +02:00
Avi Kivity
3548bab501 KVM: Drop user return notifier when disabling virtualization on a cpu
This way, we don't leave a dangling notifier on cpu hotunplug or module
unload.  In particular, module unload leaves the notifier pointing into
freed memory.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:26 +02:00
Sheng Yang
046d87103a KVM: VMX: Disable unrestricted guest when EPT disabled
Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest
supported processors.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:25 +02:00
Avi Kivity
eb3c79e64a KVM: x86 emulator: limit instructions to 15 bytes
While we are never normally passed an instruction that exceeds 15 bytes,
smp games can cause us to attempt to interpret one, which will cause
large latencies in non-preempt hosts.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:25 +02:00
Jan Kiszka
3cfc3092f4 KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
This new IOCTL exports all yet user-invisible states related to
exceptions, interrupts, and NMIs. Together with appropriate user space
changes, this fixes sporadic problems of vmsave/restore, live migration
and system reset.

[avi: future-proof abi by adding a flags field]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:25 +02:00
Avi Kivity
65ac726404 KVM: VMX: Report unexpected simultaneous exceptions as internal errors
These happen when we trap an exception when another exception is being
delivered; we only expect these with MCEs and page faults.  If something
unexpected happens, things probably went south and we're better off reporting
an internal error and freezing.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:24 +02:00
Avi Kivity
a9c7399d6c KVM: Allow internal errors reported to userspace to carry extra data
Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available.  Add extra data to internal error
reporting so we can expose it to the debugger.  Extra data is specific
to the suberror.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:24 +02:00
Jan Kiszka
4f926bf291 KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
Decouple KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP from
KVM_GUESTDBG_ENABLE, their are actually orthogonal. At this chance,
avoid triggering the WARN_ON in kvm_queue_exception if there is already
an exception pending and reject such invalid requests.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:24 +02:00
Marcelo Tosatti
2204ae3c96 KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
Otherwise kvm might attempt to dereference a NULL pointer.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:23 +02:00
Marcelo Tosatti
3ddea128ad KVM: x86: disallow multiple KVM_CREATE_IRQCHIP
Otherwise kvm will leak memory on multiple KVM_CREATE_IRQCHIP.
Also serialize multiple accesses with kvm->lock.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:23 +02:00
Avi Kivity
92c0d90015 KVM: VMX: Remove vmx->msr_offset_efer
This variable is used to communicate between a caller and a callee; switch
to a function argument instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:23 +02:00
Marcelo Tosatti
5f5c35aad5 KVM: MMU: update invlpg handler comment
Large page translations are always synchronized (either in level 3
or level 2), so its not necessary to properly deal with them
in the invlpg handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:23 +02:00
Marcelo Tosatti
7c93be44a4 KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from
outside guest context. Similarly pdptrs are updated via load_pdptrs.

Let kvm_set_cr3 perform the update, removing it from the vcpu_run
fast path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:22 +02:00
Gleb Natapov
1655e3a3dc KVM: remove duplicated task_switch check
Probably introduced by a bad merge.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:22 +02:00
Avi Kivity
26bb0981b3 KVM: VMX: Use shared msr infrastructure
Instead of reloading syscall MSRs on every preemption, use the new shared
msr infrastructure to reload them at the last possible minute (just before
exit to userspace).

Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be
reloaded as well).

[jan: fix slot index missing indirection]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:22 +02:00
Avi Kivity
18863bdd60 KVM: x86 shared msr infrastructure
The various syscall-related MSRs are fairly expensive to switch.  Currently
we switch them on every vcpu preemption, which is far too often:

- if we're switching to a kernel thread (idle task, threaded interrupt,
  kernel-mode virtio server (vhost-net), for example) and back, then
  there's no need to switch those MSRs since kernel threasd won't
  be exiting to userspace.

- if we're switching to another guest running an identical OS, most likely
  those MSRs will have the same value, so there's little point in reloading
  them.

- if we're running the same OS on the guest and host, the MSRs will have
  identical values and reloading is unnecessary.

This patch uses the new user return notifiers to implement last-minute
switching, and checks the msr values to avoid unnecessary reloading.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Avi Kivity
44ea2b1758 KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area
Currently MSR_KERNEL_GS_BASE is saved and restored as part of the
guest/host msr reloading.  Since we wish to lazy-restore all the other
msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using
the common code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Eduardo Habkost
3ce672d484 KVM: SVM: init_vmcb(): remove redundant save->cr0 initialization
The svm_set_cr0() call will initialize save->cr0 properly even when npt is
enabled, clearing the NW and CD bits as expected, so we don't need to
initialize it manually for npt_enabled anymore.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Eduardo Habkost
18fa000ae4 KVM: SVM: Reset cr0 properly on vcpu reset
svm_vcpu_reset() was not properly resetting the contents of the guest-visible
cr0 register, causing the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=525699

Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine
with paging enabled, making the vcpu get a pagefault exception while trying to
run it.

Instead of setting vmcb->save.cr0 directly, the new code just resets
kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on
vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0().

kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure
kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Eduardo Habkost
fa40052ca0 KVM: VMX: Use macros instead of hex value on cr0 initialization
This should have no effect, it is just to make the code clearer.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:21 +02:00
Glauber Costa
afbcf7ab8d KVM: allow userspace to adjust kvmclock offset
When we migrate a kvm guest that uses pvclock between two hosts, we may
suffer a large skew. This is because there can be significant differences
between the monotonic clock of the hosts involved. When a new host with
a much larger monotonic time starts running the guest, the view of time
will be significantly impacted.

Situation is much worse when we do the opposite, and migrate to a host with
a smaller monotonic clock.

This proposed ioctl will allow userspace to inform us what is the monotonic
clock value in the source host, so we can keep the time skew short, and
more importantly, never goes backwards. Userspace may also need to trigger
the current data, since from the first migration onwards, it won't be
reflected by a simple call to clock_gettime() anymore.

[marcelo: future-proof abi with a flags field]
[jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]

Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:19 +02:00
Jan Kiszka
6be7d3062b KVM: SVM: Cleanup NMI singlestep
Push the NMI-related singlestep variable into vcpu_svm. It's dealing
with an AMD-specific deficit, nothing generic for x86.

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

 arch/x86/include/asm/kvm_host.h |    1 -
 arch/x86/kvm/svm.c              |   12 +++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:19 +02:00
Jan Kiszka
94fe45da48 KVM: x86: Fix guest single-stepping while interruptible
Commit 705c5323 opened the doors of hell by unconditionally injecting
single-step flags as long as guest_debug signaled this. This doesn't
work when the guest branches into some interrupt or exception handler
and triggers a vmexit with flag reloading.

Fix it by saving cs:rip when user space requests single-stepping and
restricting the trace flag injection to this guest code position.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:19 +02:00
Ed Swierk
ffde22ac53 KVM: Xen PV-on-HVM guest support
Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

[jan: fix i386 build warning]
[avi: future proof abi with a flags field]

Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:18 +02:00
Jan Kiszka
94c30d9ca6 KVM: x86: Drop unneeded CONFIG_HAS_IOMEM check
This (broken) check dates back to the days when this code was shared
across architectures. x86 has IOMEM, so drop it.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:18 +02:00
Marcelo Tosatti
9fb41ba896 KVM: VMX: fix handle_pause declaration
There's no kvm_run argument anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:18 +02:00
Zachary Amsden
6b7d7e762b KVM: x86: Harden against cpufreq
If cpufreq can't determine the CPU khz, or cpufreq is not compiled in,
we should fallback to the measured TSC khz.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:18 +02:00
Mark Langsdorf
565d0998ec KVM: SVM: Support Pause Filter in AMD processors
New AMD processors (Family 0x10 models 8+) support the Pause
Filter Feature.  This feature creates a new field in the VMCB
called Pause Filter Count.  If Pause Filter Count is greater
than 0 and intercepting PAUSEs is enabled, the processor will
increment an internal counter when a PAUSE instruction occurs
instead of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID
bit.

On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 3-5% when combined
with a scheduler algorithm thati caused the VCPU to
sleep for a brief period. Further performance improvement
may be possible with a more sophisticated yield algorithm.

Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:17 +02:00
Zhai, Edwin
4b8d54f972 KVM: VMX: Add support for Pause-Loop Exiting
New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution
control fields:
PLE_Gap    - upper bound on the amount of time between two successive
             executions of PAUSE in a loop.
PLE_Window - upper bound on the amount of time a guest is allowed to execute in
             a PAUSE loop

If the time, between this execution of PAUSE and previous one, exceeds the
PLE_Gap, processor consider this PAUSE belongs to a new loop.
Otherwise, processor determins the the total execution time of this loop(since
1st PAUSE in this loop), and triggers a VM exit if total time exceeds the
PLE_Window.
* Refer SDM volume 3b section 21.6.13 & 22.1.3.

Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one VP
is sched-out after hold a spinlock, then other VPs for same lock are sched-in
to waste the CPU time.

Our tests indicate that most spinlocks are held for less than 212 cycles.
Performance tests show that with 2X LP over-commitment we can get +2% perf
improvement for kernel build(Even more perf gain with more LPs).

Signed-off-by: Zhai Edwin <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:17 +02:00
Joerg Roedel
d36f19e9ec KVM: SVM: Remove nsvm_printk debugging code
With all important informations now delivered through
tracepoints we can savely remove the nsvm_printk debugging
code for nested svm.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:17 +02:00
Joerg Roedel
532a46b989 KVM: SVM: Add tracepoint for skinit instruction
This patch adds a tracepoint for the event that the guest
executed the SKINIT instruction. This information is
important because SKINIT is an SVM extenstion not yet
implemented by nested SVM and we may need this information
for debugging hypervisors that do not yet run on nested SVM.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:16 +02:00
Joerg Roedel
ec1ff79084 KVM: SVM: Add tracepoint for invlpga instruction
This patch adds a tracepoint for the event that the guest
executed the INVLPGA instruction.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:16 +02:00
Joerg Roedel
236649de33 KVM: SVM: Add tracepoint for #vmexit because intr pending
This patch adds a special tracepoint for the event that a
nested #vmexit is injected because kvm wants to inject an
interrupt into the guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:16 +02:00
Joerg Roedel
17897f3668 KVM: SVM: Add tracepoint for injected #vmexit
This patch adds a tracepoint for a nested #vmexit that gets
re-injected to the guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:15 +02:00
Joerg Roedel
d8cabddf7e KVM: SVM: Add tracepoint for nested #vmexit
This patch adds a tracepoint for every #vmexit we get from a
nested guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:15 +02:00
Joerg Roedel
0ac406de8f KVM: SVM: Add tracepoint for nested vmrun
This patch adds a dedicated kvm tracepoint for a nested
vmrun.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:15 +02:00
Joerg Roedel
cd3ff653ae KVM: SVM: Move INTR vmexit out of atomic code
The nested SVM code emulates a #vmexit caused by a request
to open the irq window right in the request function. This
is a bug because the request function runs with preemption
and interrupts disabled but the #vmexit emulation might
sleep. This can cause a schedule()-while-atomic bug and is
fixed with this patch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:15 +02:00
Alexander Graf
8d23c46624 KVM: SVM: Notify nested hypervisor of lost event injections
If event_inj is valid on a #vmexit the host CPU would write
the contents to exit_int_info, so the hypervisor knows that
the event wasn't injected.

We don't do this in nested SVM by now which is a bug and
fixed by this patch.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:14 +02:00
Glauber Costa
e3267cbbbf KVM: x86: include pvclock MSRs in msrs_to_save
For a while now, we are issuing a rdmsr instruction to find out which
msrs in our save list are really supported by the underlying machine.
However, it fails to account for kvm-specific msrs, such as the pvclock
ones.

This patch moves then to the beginning of the list, and skip testing them.

Cc: stable@kernel.org
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:14 +02:00
Jan Kiszka
91586a3b7d KVM: x86: Rework guest single-step flag injection and filtering
Push TF and RF injection and filtering on guest single-stepping into the
vender get/set_rflags callbacks. This makes the whole mechanism more
robust wrt user space IOCTL order and instruction emulations.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:14 +02:00
Marcelo Tosatti
a68a6a7282 KVM: x86: disable paravirt mmu reporting
Disable paravirt MMU capability reporting, so that new (or rebooted)
guests switch to native operation.

Paravirt MMU is a burden to maintain and does not bring significant
advantages compared to shadow anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:14 +02:00
Jan Kiszka
355be0b930 KVM: x86: Refactor guest debug IOCTL handling
Much of so far vendor-specific code for setting up guest debug can
actually be handled by the generic code. This also fixes a minor deficit
in the SVM part /wrt processing KVM_GUESTDBG_ENABLE.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:14 +02:00
Juan Quintela
201d945bcf KVM: remove pre_task_link setting in save_state_to_tss16
Now, also remove pre_task_link setting in save_state_to_tss16.

  commit b237ac37a1
  Author: Gleb Natapov <gleb@redhat.com>
  Date:   Mon Mar 30 16:03:24 2009 +0300

    KVM: Fix task switch back link handling.

CC: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:13 +02:00
Zachary Amsden
3230bb4707 KVM: Fix hotplug of CPUs
Both VMX and SVM require per-cpu memory allocation, which is done at module
init time, for only online cpus.

Backend was not allocating enough structure for all possible CPUs, so
new CPUs coming online could not be hardware enabled.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:13 +02:00
Zachary Amsden
e6732a5af9 KVM: Fix printk name error in svm.c
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:13 +02:00
Zachary Amsden
0cca790753 KVM: Kill the confusing tsc_ref_khz and ref_freq variables
They are globals, not clearly protected by any ordering or locking, and
vulnerable to various startup races.

Instead, for variable TSC machines, register the cpufreq notifier and get
the TSC frequency directly from the cpufreq machinery.  Not only is it
always right, it is also perfectly accurate, as no error prone measurement
is required.

On such machines, when a new CPU online is brought online, it isn't clear what
frequency it will start with, and it may not correspond to the reference, thus
in hardware_enable we clear the cpu_tsc_khz variable to zero and make sure
it is set before running on a VCPU.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:12 +02:00
Zachary Amsden
b820cc0ca2 KVM: Separate timer intialization into an indepedent function
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:12 +02:00
Joerg Roedel
e935d48e1b KVM: SVM: Remove remaining occurences of rdtscll
This patch replaces them with native_read_tsc() which can
also be used in expressions and saves a variable on the
stack in this case.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:12 +02:00
Joerg Roedel
33527ad7e1 KVM: SVM: don't copy exit_int_info on nested vmrun
The exit_int_info field is only written by the hardware and
never read. So it does not need to be copied on a vmrun
emulation.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:11 +02:00
Joerg Roedel
7fcdb5103d KVM: SVM: reorganize svm_interrupt_allowed
This patch reorganizes the logic in svm_interrupt_allowed to
make it better to read. This is important because the logic
is a lot more complicated with Nested SVM.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:11 +02:00
Huang Weiyi
bfc33beaed KVM: remove duplicated #include
Remove duplicated #include('s) in
  arch/x86/kvm/lapic.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:10 +02:00
Alexander Graf
10474ae894 KVM: Activate Virtualization On Demand
X86 CPUs need to have some magic happening to enable the virtualization
extensions on them. This magic can result in unpleasant results for
users, like blocking other VMMs from working (vmx) or using invalid TLB
entries (svm).

Currently KVM activates virtualization when the respective kernel module
is loaded. This blocks us from autoloading KVM modules without breaking
other VMMs.

To circumvent this problem at least a bit, this patch introduces on
demand activation of virtualization. This means, that instead
virtualization is enabled on creation of the first virtual machine
and disabled on destruction of the last one.

So using this, KVM can be easily autoloaded, while keeping other
hypervisors usable.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:10 +02:00
Marcelo Tosatti
e8b3433a5c KVM: SVM: remove needless mmap_sem acquision from nested_svm_map
nested_svm_map unnecessarily takes mmap_sem around gfn_to_page, since
gfn_to_page / get_user_pages are responsible for it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:10 +02:00
Mohammed Gamal
80ced186d1 KVM: VMX: Enhance invalid guest state emulation
- Change returned handle_invalid_guest_state() to return relevant exit codes
- Move triggering the emulation from vmx_vcpu_run() to vmx_handle_exit()
- Return to userspace instead of repeatedly trying to emulate instructions that have already failed

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03 09:32:09 +02:00
Mohammed Gamal
abcf14b560 KVM: x86 emulator: Add pusha and popa instructions
This adds pusha and popa instructions (opcodes 0x60-0x61), this enables booting
MINIX with invalid guest state emulation on.

[marcelo: remove unused variable]

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:09 +02:00
Mohammed Gamal
94677e61fd KVM: x86 emulator: Add missing decoder flags for 'or' instructions
Add missing decoder flags for or instructions (0xc-0xd).

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:09 +02:00
Avi Kivity
bfd99ff5d4 KVM: Move assigned device code to own file
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:09 +02:00
Avi Kivity
367e1319b2 KVM: Return -ENOTTY on unrecognized ioctls
Not the incorrect -EINVAL.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:08 +02:00
Gleb Natapov
680b3648ba KVM: Drop kvm->irq_lock lock from irq injection path
The only thing it protects now is interrupt injection into lapic and
this can work lockless. Even now with kvm->irq_lock in place access
to lapic is not entirely serialized since vcpu access doesn't take
kvm->irq_lock.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:08 +02:00
Gleb Natapov
eba0226bdf KVM: Move IO APIC to its own lock
The allows removal of irq_lock from the injection path.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:08 +02:00
Gleb Natapov
1a6e4a8c27 KVM: Move irq sharing information to irqchip level
This removes assumptions that max GSIs is smaller than number of pins.
Sharing is tracked on pin level not GSI level.

[avi: no PIC on ia64]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Gleb Natapov
79c727d437 KVM: Call pic_clear_isr() on pic reset to reuse logic there
Also move call of ack notifiers after pic state change.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Avi Kivity
851ba6922a KVM: Don't pass kvm_run arguments
They're just copies of vcpu->run, which is readily accessible.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Mohammed Gamal
d8769fedd4 KVM: x86 emulator: Introduce No64 decode option
Introduces a new decode option "No64", which is used for instructions that are
invalid in long mode.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:05 +02:00
Mohammed Gamal
0934ac9d13 KVM: x86 emulator: Add 'push/pop sreg' instructions
[avi: avoid buffer overflow]

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:05 +02:00
Ingo Molnar
96200591a3 Merge branch 'tracing/hw-breakpoints' into perf/core
Conflicts:
	arch/x86/kernel/kprobes.c
	kernel/trace/Makefile

Merge reason: hw-breakpoints perf integration is looking
              good in testing and in reviews, plus conflicts
              are mounting up - so merge & resolve.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-21 14:07:23 +01:00
Frederic Weisbecker
59d8eb53ea hw-breakpoints: Wrap in the KVM breakpoint active state check
Wrap in the cpu dr7 check that tells if we have active
breakpoints that need to be restored in the cpu.

This wrapper makes the check more self-explainable and also
reusable for any further other uses.

Reported-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
2009-11-10 11:23:43 +01:00
Frederic Weisbecker
24f1e32c60 hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events
This patch rebase the implementation of the breakpoints API on top of
perf events instances.

Each breakpoints are now perf events that handle the
register scheduling, thread/cpu attachment, etc..

The new layering is now made as follows:

       ptrace       kgdb      ftrace   perf syscall
          \          |          /         /
           \         |         /         /
                                        /
            Core breakpoint API        /
                                      /
                     |               /
                     |              /

              Breakpoints perf events

                     |
                     |

               Breakpoints PMU ---- Debug Register constraints handling
                                    (Part of core breakpoint API)
                     |
                     |

             Hardware debug registers

Reasons of this rewrite:

- Use the centralized/optimized pmu registers scheduling,
  implying an easier arch integration
- More powerful register handling: perf attributes (pinned/flexible
  events, exclusive/non-exclusive, tunable period, etc...)

Impact:

- New perf ABI: the hardware breakpoints counters
- Ptrace breakpoints setting remains tricky and still needs some per
  thread breakpoints references.

Todo (in the order):

- Support breakpoints perf counter events for perf tools (ie: implement
  perf_bpcounter_event())
- Support from perf tools

Changes in v2:

- Follow the perf "event " rename
- The ptrace regression have been fixed (ptrace breakpoint perf events
  weren't released when a task ended)
- Drop the struct hw_breakpoint and store generic fields in
  perf_event_attr.
- Separate core and arch specific headers, drop
  asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
- Use new generic len/type for breakpoint
- Handle off case: when breakpoints api is not supported by an arch

Changes in v3:

- Fix broken CONFIG_KVM, we need to propagate the breakpoint api
  changes to kvm when we exit the guest and restore the bp registers
  to the host.

Changes in v4:

- Drop the hw_breakpoint_restore() stub as it is only used by KVM
- EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
  module
- Restore the breakpoints unconditionally on kvm guest exit:
  TIF_DEBUG_THREAD doesn't anymore cover every cases of running
  breakpoints and vcpu->arch.switch_db_regs might not always be
  set when the guest used debug registers.
  (Waiting for a reliable optimization)

Changes in v5:

- Split-up the asm-generic/hw-breakpoint.h moving to
  linux/hw_breakpoint.h into a separate patch
- Optimize the breakpoints restoring while switching from kvm guest
  to host. We only want to restore the state if we have active
  breakpoints to the host, otherwise we don't care about messed-up
  address registers.
- Add asm/hw_breakpoint.h to Kbuild
- Fix bad breakpoint type in trace_selftest.c

Changes in v6:

- Fix wrong header inclusion in trace.h (triggered a build
  error with CONFIG_FTRACE_SELFTEST

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
2009-11-08 15:34:42 +01:00
Gleb Natapov
abb3911965 KVM: get_tss_base_addr() should return a gpa_t
If TSS we are switching to resides in high memory task switch will fail
since address will be truncated. Windows2k3 does this sometimes when
running with more then 4G

Cc: stable@kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-11-04 12:42:36 -02:00
Jan Kiszka
a9e38c3e01 KVM: x86: Catch potential overrun in MCE setup
We only allocate memory for 32 MCE banks (KVM_MAX_MCE_BANKS) but we
allow user space to fill up to 255 on setup (mcg_cap & 0xff), corrupting
kernel memory. Catch these overflows.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-11-04 12:42:35 -02:00
Tejun Heo
0fe1e00954 percpu: make percpu symbols in x86 unique
This patch updates percpu related symbols in x86 such that percpu
symbols are unique and don't clash with local symbols.  This serves
two purposes of decreasing the possibility of global percpu symbol
collision and allowing dropping per_cpu__ prefix from percpu symbols.

* arch/x86/kernel/cpu/common.c: rename local variable to avoid collision

* arch/x86/kvm/svm.c: s/svm_data/sd/ for local variables to avoid collision

* arch/x86/kernel/cpu/cpu_debug.c: s/cpu_arr/cpud_arr/
  				   s/priv_arr/cpud_priv_arr/
				   s/cpu_priv_count/cpud_priv_count/

* arch/x86/kernel/cpu/intel_cacheinfo.c: s/cpuid4_info/ici_cpuid4_info/
  					 s/cache_kobject/ici_cache_kobject/
					 s/index_kobject/ici_index_kobject/

* arch/x86/kernel/ds.c: s/cpu_context/cpu_ds_context/

Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: (kvm) Avi Kivity <avi@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: x86@kernel.org
2009-10-29 22:34:14 +09:00
Frederic Weisbecker
0f8f86c7bd Merge commit 'perf/core' into perf/hw-breakpoint
Conflicts:
	kernel/Makefile
	kernel/trace/Makefile
	kernel/trace/trace.h
	samples/Makefile

Merge reason: We need to be uptodate with the perf events development
branch because we plan to rewrite the breakpoints API on top of
perf events.
2009-10-18 01:12:33 +02:00
Frederik Deweerdt
8a8365c560 KVM: MMU: fix pointer cast
On a 32 bits compile, commit 3da0dd433d
introduced the following warnings:

arch/x86/kvm/mmu.c: In function ‘kvm_set_pte_rmapp’:
arch/x86/kvm/mmu.c:770: warning: cast to pointer from integer of different size
arch/x86/kvm/mmu.c: In function ‘kvm_set_spte_hva’:
arch/x86/kvm/mmu.c:849: warning: cast from pointer to integer of different size

The following patch uses 'unsigned long' instead of u64 to match the
pointer size on both arches.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-16 12:30:26 -03:00
Marcelo Tosatti
ace1546487 KVM: use proper hrtimer function to retrieve expiration time
hrtimer->base can be temporarily NULL due to racing hrtimer_start.
See switch_hrtimer_base/lock_hrtimer_base.

Use hrtimer_get_remaining which is robust against it.

CC: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-10-16 12:30:25 -03:00
Izik Eidus
3da0dd433d KVM: add support for change_pte mmu notifiers
this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.

[marcelo: cast pfn assignment to u64]

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 17:04:53 +02:00
Izik Eidus
1403283acc KVM: MMU: add SPTE_HOST_WRITEABLE flag to the shadow ptes
this flag notify that the host physical page we are pointing to from
the spte is write protected, and therefore we cant change its access
to be write unless we run get_user_pages(write = 1).

(this is needed for change_pte support in kvm)

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 17:04:50 +02:00
Izik Eidus
acb66dd051 KVM: MMU: dont hold pagecount reference for mapped sptes pages
When using mmu notifiers, we are allowed to remove the page count
reference tooken by get_user_pages to a specific page that is mapped
inside the shadow page tables.

This is needed so we can balance the pagecount against mapcount
checking.

(Right now kvm increase the pagecount and does not increase the
mapcount when mapping page into shadow page table entry,
so when comparing pagecount against mapcount, you have no
reliable result.)

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 17:04:48 +02:00
Avi Kivity
6a54435560 KVM: Prevent overflow in KVM_GET_SUPPORTED_CPUID
The number of entries is multiplied by the entry size, which can
overflow on 32-bit hosts.  Bound the entry count instead.

Reported-by: David Wagner <daw@cs.berkeley.edu>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-10-04 17:04:16 +02:00
Marcelo Tosatti
eb5109e311 KVM: VMX: flush TLB with INVEPT on cpu migration
It is possible that stale EPTP-tagged mappings are used, if a
vcpu migrates to a different pcpu.

Set KVM_REQ_TLB_FLUSH in vmx_vcpu_load, when switching pcpus, which
will invalidate both VPID and EPT mappings on the next vm-entry.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:24 +02:00
Aurelien Jarno
b2d83cfa3f KVM: fix LAPIC timer period overflow
Don't overflow when computing the 64-bit period from 32-bit registers.

Fixes sourceforge bug #2826486.

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:23 +02:00
Joerg Roedel
20824f30bb KVM: SVM: Handle tsc in svm_get_msr/svm_set_msr correctly
When running nested we need to touch the l1 guests
tsc_offset. Otherwise changes will be lost or a wrong value
be read.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:23 +02:00
Joerg Roedel
77b1ab1732 KVM: SVM: Fix tsc offset adjustment when running nested
When svm_vcpu_load is called while the vcpu is running in
guest mode the tsc adjustment made there is lost on the next
emulated #vmexit. This causes the tsc running backwards in
the guest. This patch fixes the issue by also adjusting the
tsc_offset in the emulated hsave area so that it will not
get lost.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:22 +02:00
Ingo Molnar
dca2d6ac09 Merge branch 'linus' into tracing/hw-breakpoints
Conflicts:
	arch/x86/kernel/process_64.c

Semantic conflict fixed in:
	arch/x86/kvm/x86.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-15 12:18:15 +02:00
Linus Torvalds
69def9f05d Merge branch 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (202 commits)
  MAINTAINERS: update KVM entry
  KVM: correct error-handling code
  KVM: fix compile warnings on s390
  KVM: VMX: Check cpl before emulating debug register access
  KVM: fix misreporting of coalesced interrupts by kvm tracer
  KVM: x86: drop duplicate kvm_flush_remote_tlb calls
  KVM: VMX: call vmx_load_host_state() only if msr is cached
  KVM: VMX: Conditionally reload debug register 6
  KVM: Use thread debug register storage instead of kvm specific data
  KVM guest: do not batch pte updates from interrupt context
  KVM: Fix coalesced interrupt reporting in IOAPIC
  KVM guest: fix bogus wallclock physical address calculation
  KVM: VMX: Fix cr8 exiting control clobbering by EPT
  KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp
  KVM: Document KVM_CAP_IRQCHIP
  KVM: Protect update_cr8_intercept() when running without an apic
  KVM: VMX: Fix EPT with WP bit change during paging
  KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors
  KVM: x86 emulator: Add adc and sbb missing decoder flags
  KVM: Add missing #include
  ...
2009-09-14 17:43:43 -07:00
Avi Kivity
0a79b00952 KVM: VMX: Check cpl before emulating debug register access
Debug registers may only be accessed from cpl 0.  Unfortunately, vmx will
code to emulate the instruction even though it was issued from guest
userspace, possibly leading to an unexpected trap later.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:10 +03:00
Gleb Natapov
4da748960a KVM: fix misreporting of coalesced interrupts by kvm tracer
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:09 +03:00
Marcelo Tosatti
e3904e6ed0 KVM: x86: drop duplicate kvm_flush_remote_tlb calls
kvm_mmu_slot_remove_write_access already calls it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 18:11:08 +03:00
Gleb Natapov
542423b0dd KVM: VMX: call vmx_load_host_state() only if msr is cached
No need to call it before each kvm_(set|get)_msr_common()

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:07 +03:00
Avi Kivity
e8a48342e9 KVM: VMX: Conditionally reload debug register 6
Only reload debug register 6 if we're running with the guest's
debug registers.  Saves around 150 cycles from the guest lightweight
exit path.

dr6 contains a couple of bits that are updated on #DB, so intercept
that unconditionally and update those bits then.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:06 +03:00
Avi Kivity
3d53c27d05 KVM: Use thread debug register storage instead of kvm specific data
Instead of saving the debug registers from the processor to a kvm data
structure, rely in the debug registers stored in the thread structure.
This allows us not to save dr6 and dr7.

Reduces lightweight vmexit cost by 350 cycles, or 11 percent.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 18:11:04 +03:00
Gleb Natapov
5fff7d270b KVM: VMX: Fix cr8 exiting control clobbering by EPT
Don't call adjust_vmx_controls() two times for the same control.
It restores options that were dropped earlier.  This loses us the cr8
exit control, which causes a massive performance regression Windows x64.

Cc: stable@kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:57 +03:00
Avi Kivity
60f24784a9 KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp
We know no pages are protected, so we can short-circuit the whole thing
(including fairly nasty guest memory accesses).

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:56 +03:00
Avi Kivity
88c808fd42 KVM: Protect update_cr8_intercept() when running without an apic
update_cr8_intercept() can be triggered from userspace while there
is no apic present.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:54 +03:00
Sheng Yang
95eb84a758 KVM: VMX: Fix EPT with WP bit change during paging
QNX update WP bit when paging enabled, which is not covered yet. This one fix
QNX boot with EPT.

Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:53 +03:00
Mikhail Ershov
d9048d3278 KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors
Segment descriptors tables can be placed on two non-contiguous pages.
This patch makes reading segment descriptors by linear address.

Signed-off-by: Mikhail Ershov <Mike.Ershov@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:52 +03:00
Mohammed Gamal
7bdb588827 KVM: x86 emulator: Add adc and sbb missing decoder flags
Add missing decoder flags for adc and sbb instructions
(opcodes 0x14-0x15, 0x1c-0x1d)

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:51 +03:00
Avi Kivity
56e8231841 KVM: Rename x86_emulate.c to emulate.c
We're in arch/x86, what could we possibly be emulating?

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:45 +03:00
Anthony Liguori
c0c7c04b87 KVM: When switching to a vm8086 task, load segments as 16-bit
According to 16.2.5 in the SDM, eflags.vm in the tss is consulted before loading
and new segments.  If eflags.vm == 1, then the segments are treated as 16-bit
segments.  The LDTR and TR are not normally available in vm86 mode so if they
happen to somehow get loaded, they need to be treated as 32-bit segments.

This fixes an invalid vmentry failure in a custom OS that was happening after
a task switch into vm8086 mode.  Since the segments were being mistakenly
treated as 32-bit, we loaded garbage state.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:44 +03:00
Avi Kivity
345dcaa8fd KVM: VMX: Adjust rflags if in real mode emulation
We set rflags.vm86 when virtualizing real mode to do through vm8086 mode;
so we need to take it out again when reading rflags.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:43 +03:00
Avi Kivity
52c7847d12 KVM: SVM: Drop tlb flush workaround in npt
It is no longer possible to reproduce the problem any more, so presumably
it has been fixed.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:40 +03:00
Gleb Natapov
cb142eb743 KVM: Update cr8 intercept when APIC TPR is changed by userspace
Since on vcpu entry we do it only if apic is enabled we should do
it when TPR is changed while apic is disabled. This happens when windows
resets HW without setting TPR to zero.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:39 +03:00
Joerg Roedel
4b6e4dca70 KVM: SVM: enable nested svm by default
Nested SVM is (in my experience) stable enough to be enabled by
default. So omit the requirement to pass a module parameter.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:38 +03:00
Joerg Roedel
108768de55 KVM: SVM: check for nested VINTR flag in svm_interrupt_allowed
Not checking for this flag breaks any nested hypervisor that does not
set VINTR. So fix it with this patch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:37 +03:00
Joerg Roedel
26666957a5 KVM: SVM: move nested_svm_intr main logic out of if-clause
This patch removes one indentation level from nested_svm_intr and
makes the logic more readable.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:36 +03:00
Joerg Roedel
cda0ffdd86 KVM: SVM: remove unnecessary is_nested check from svm_cpu_run
This check is not necessary. We have to sync the vcpu->arch.cr2 always
back to the VMCB. This patch remove the is_nested check.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:34 +03:00
Joerg Roedel
410e4d573d KVM: SVM: move special nested exit handling to separate function
This patch moves the handling for special nested vmexits like #pf to a
separate function. This makes the kvm_override parameter obsolete and
makes the code more readable.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:33 +03:00
Joerg Roedel
1f8da47805 KVM: SVM: handle errors in vmrun emulation path appropriatly
If nested svm fails to load the msrpm the vmrun succeeds with the old
msrpm which is not correct. This patch changes the logic to roll back
to host mode in case the msrpm cannot be loaded.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:32 +03:00
Joerg Roedel
ea8e064fe2 KVM: SVM: remove nested_svm_do and helper functions
This function is not longer required. So remove it.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:31 +03:00
Joerg Roedel
9738b2c97d KVM: SVM: clean up nested vmrun path
This patch removes the usage of nested_svm_do from the vmrun emulation
path.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:29 +03:00
Joerg Roedel
9966bf6872 KVM: SVM: clean up nestec vmload/vmsave paths
This patch removes the usage of nested_svm_do from the vmload and
vmsave emulation code paths.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:28 +03:00
Joerg Roedel
3d62d9aa98 KVM: SVM: clean up nested_svm_exit_handled_msr
This patch changes nested svm to call nested_svm_exit_handled_msr
directly and not through nested_svm_do.

[alex: fix oops due to nested kmap_atomics]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:45:43 +03:00
Joerg Roedel
34f80cfad5 KVM: SVM: get rid of nested_svm_vmexit_real
This patch is the starting point of removing nested_svm_do from the
nested svm code. The nested_svm_do function basically maps two guest
physical pages to host virtual addresses and calls a passed function
on it. This function pointer code flow is hard to read and not the
best technical solution here.
As a side effect this patch indroduces the nested_svm_[un]map helper
functions.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:26 +03:00
Joerg Roedel
0295ad7de8 KVM: SVM: simplify nested_svm_check_exception
Makes the code of this function more readable by removing on
indentation level for the core logic.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:25 +03:00
Joerg Roedel
9c4e40b994 KVM: SVM: do nested vmexit in nested_svm_exit_handled
If this function returns true a nested vmexit is required. Move that
vmexit into the nested_svm_exit_handled function. This also simplifies
the handling of nested #pf intercepts in this function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:25 +03:00
Joerg Roedel
4c2161aed5 KVM: SVM: consolidate nested_svm_exit_handled
When caching guest intercepts there is no need anymore for the
nested_svm_exit_handled_real function. So move its code into
nested_svm_exit_handled.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
aad42c641c KVM: SVM: cache nested intercepts
When the nested intercepts are cached we don't need to call
get_user_pages and/or map the nested vmcb on every nested #vmexit to
check who will handle the intercept.
Further this patch aligns the emulated svm behavior better to real
hardware.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
e6aa9abd73 KVM: SVM: move nested svm state into seperate struct
This makes it more clear for which purpose these members in the vcpu_svm
exist.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
a5c3832dfe KVM: SVM: complete interrupts after handling nested exits
The interrupt completion code must run after nested exits are handled
because not injected interrupts or exceptions may be handled by the l1
guest first.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
0460a979b4 KVM: SVM: copy only necessary parts of the control area on vmrun/vmexit
The vmcb control area contains more then 800 bytes of reserved fields
which are unnecessarily copied. Fix this by introducing a copy
function which only copies the relevant part and saves time.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Joerg Roedel
defbba5660 KVM: SVM: optimize nested vmrun
Only copy the necessary parts of the vmcb save area on vmrun and save
precious time.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Joerg Roedel
33740e4009 KVM: SVM: optimize nested #vmexit
It is more efficient to copy only the relevant parts of the vmcb back to
the nested vmcb when we emulate an vmexit.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Joerg Roedel
2af9194d1b KVM: SVM: add helper functions for global interrupt flag
This patch makes the code easier to read when it comes to setting,
clearing and checking the status of the virtualized global
interrupt flag for the VCPU.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Gleb Natapov
88ba63c265 KVM: Replace pic_lock()/pic_unlock() with direct call to spinlock functions
They are not doing anything else now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:22 +03:00
Gleb Natapov
938396a234 KVM: Call ack notifiers from PIC when guest OS acks an IRQ.
Currently they are called when irq vector is been delivered.  Calling ack
notifiers at this point is wrong.  Device assignment ack notifier enables
host interrupts, but guest not yet had a chance to clear interrupt
condition in a device.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:22 +03:00
Gleb Natapov
956f97cf66 KVM: Call kvm_vcpu_kick() inside pic spinlock
d5ecfdd25 moved it out because back than it was impossible to
call it inside spinlock. This restriction no longer exists.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Roel Kluin
3a34a8810b KVM: fix EFER read buffer overflow
Check whether index is within bounds before grabbing the element.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Amit Shah
1f3ee616dd KVM: ignore reads to perfctr msrs
We ignore writes to the perfctr msrs. Ignore reads as well.

Kaspersky antivirus crashes Windows guests if it can't read
these MSRs.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Avi Kivity
eab4b8aa34 KVM: VMX: Optimize vmx_get_cpl()
Instead of calling vmx_get_segment() (which reads a whole bunch of
vmcs fields), read only the cs selector which contains the cpl.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Jan Kiszka
07708c4af1 KVM: x86: Disallow hypercalls for guest callers in rings > 0
So far unprivileged guest callers running in ring 3 can issue, e.g., MMU
hypercalls. Normally, such callers cannot provide any hand-crafted MMU
command structure as it has to be passed by its physical address, but
they can still crash the guest kernel by passing random addresses.

To close the hole, this patch considers hypercalls valid only if issued
from guest ring 0. This may still be relaxed on a per-hypercall base in
the future once required.

Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:20 +03:00
Marcelo Tosatti
b90c062c65 KVM: MMU: fix bogus alloc_mmu_pages assignment
Remove the bogus n_free_mmu_pages assignment from alloc_mmu_pages.

It breaks accounting of mmu pages, since n_free_mmu_pages is modified
but the real number of pages remains the same.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:20 +03:00
Izik Eidus
3b80fffe2b KVM: MMU: make __kvm_mmu_free_some_pages handle empty list
First check if the list is empty before attempting to look at list
entries.

Cc: stable@kernel.org
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:20 +03:00
Bartlomiej Zolnierkiewicz
95fb4eb698 KVM: remove superfluous NULL pointer check in kvm_inject_pit_timer_irqs()
This takes care of the following entries from Dan's list:

arch/x86/kvm/i8254.c +714 kvm_inject_pit_timer_irqs(6) warning: variable derefenced in initializer 'vcpu'
arch/x86/kvm/i8254.c +714 kvm_inject_pit_timer_irqs(6) warning: variable derefenced before check 'vcpu'

Reported-by: Dan Carpenter <error27@gmail.com>
Cc: corbet@lwn.net
Cc: eteo@redhat.com
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Joerg Roedel
344f414fa0 KVM: report 1GB page support to userspace
If userspace knows that the kernel part supports 1GB pages it can enable
the corresponding cpuid bit so that guests actually use GB pages.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Joerg Roedel
7e4e4056f7 KVM: MMU: shadow support for 1gb pages
This patch adds support for shadow paging to the 1gb page table code in KVM.
With this code the guest can use 1gb pages even if the host does not support
them.

[ Marcelo: fix shadow page collision on pmd level if a guest 1gb page is mapped
           with 4kb ptes on host level ]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Joerg Roedel
e04da980c3 KVM: MMU: make page walker aware of mapping levels
The page walker may be used with nested paging too when accessing mmio
areas.  Make it support the additional page-level too.

[ Marcelo: fix reserved bit check for 1gb pte ]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Joerg Roedel
852e3c19ac KVM: MMU: make direct mapping paths aware of mapping levels
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Joerg Roedel
d25797b24c KVM: MMU: rename is_largepage_backed to mapping_level
With the new name and the corresponding backend changes this function
can now support multiple hugepage sizes.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Joerg Roedel
44ad9944f1 KVM: MMU: make rmap code aware of mapping levels
This patch removes the largepage parameter from the rmap_add function.
Together with rmap_remove this function now uses the role.level field to
find determine if the page is a huge page.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Marcelo Tosatti
1444885a04 KVM: limit lapic periodic timer frequency
Otherwise its possible to starve the host by programming lapic timer
with a very high frequency.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:17 +03:00
Mikhail Ershov
5f0269f5d7 KVM: Align cr8 threshold when userspace changes cr8
Commit f0a3602c20 ("KVM: Move interrupt injection logic to x86.c") does not
update the cr8 intercept if the lapic is disabled, so when userspace updates
cr8, the cr8 threshold control is not updated and we are left with illegal
control fields.

Fix by explicitly resetting the cr8 threshold.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:17 +03:00
Jan Kiszka
7f582ab6d8 KVM: VMX: Avoid to return ENOTSUPP to userland
Choose some allowed error values for the cases VMX returned ENOTSUPP so
far as these values could be returned by the KVM_RUN IOCTL.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:17 +03:00
Gleb Natapov
84fde248fe KVM: PIT: Unregister ack notifier callback when freeing
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:16 +03:00
Sheng Yang
b927a3cec0 KVM: VMX: Introduce KVM_SET_IDENTITY_MAP_ADDR ioctl
Now KVM allow guest to modify guest's physical address of EPT's identity mapping page.

(change from v1, discard unnecessary check, change ioctl to accept parameter
address rather than value)

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:16 +03:00
Akinobu Mita
b792c344df KVM: x86: use kvm_get_gdt() and kvm_read_ldt()
Use kvm_get_gdt() and kvm_read_ldt() to reduce inline assembly code.

Cc: Avi Kivity <avi@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:15 +03:00
Akinobu Mita
46a359e715 KVM: x86: use get_desc_base() and get_desc_limit()
Use get_desc_base() and get_desc_limit() to get the base address and
limit in desc_struct.

Cc: Avi Kivity <avi@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:15 +03:00
Marcelo Tosatti
6a1ac77110 KVM: MMU: fix missing locking in alloc_mmu_pages
n_requested_mmu_pages/n_free_mmu_pages are used by
kvm_mmu_change_mmu_pages to calculate the number of pages to zap.

alloc_mmu_pages, called from the vcpu initialization path, modifies this
variables without proper locking, which can result in a negative value
in kvm_mmu_change_mmu_pages (say, with cpu hotplug).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:14 +03:00
Sheng Yang
3662cb1cd6 KVM: Discard unnecessary kvm_mmu_flush_tlb() in kvm_mmu_load()
set_cr3() should already cover the TLB flushing.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:14 +03:00
Gleb Natapov
4088bb3cae KVM: silence lapic kernel messages that can be triggered by a guest
Some Linux versions (f8) try to read EOI register that is write only.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:14 +03:00
Gleb Natapov
a1b37100d9 KVM: Reduce runnability interface with arch support code
Remove kvm_cpu_has_interrupt() and kvm_arch_interrupt_allowed() from
interface between general code and arch code. kvm_arch_vcpu_runnable()
checks for interrupts instead.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:13 +03:00
Gleb Natapov
b59bb7bdf0 KVM: Move exception handling to the same place as other events
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:13 +03:00
Joerg Roedel
a205bc19f0 KVM: MMU: Fix MMU_DEBUG compile breakage
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:13 +03:00
Gregory Haskins
d34e6b175e KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest.  Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.

Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.

However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc).  For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asychronously and
return as quickly as possible.  All the sychronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unecessary
overhead for signalling.  This adds additional computational load on the
system, as well as latency to the signalling path.

Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd.  This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and perhipheral components.

To test this theory, we built a test-harness called "doorbell".  This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled.  It supports signalling
from either an eventfd, or an ioctl().

We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl().  The other is direct via
ioeventfd.

You can download this test harness here:

ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2

The measured results are as follows:

qemu-mmio:       110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio:  367300 iops, 2.72us rtt

I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy.  However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:

qemu-pio:      153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt

these are just for fun, for now, until I can gather more data.

Here is a graph for your convenience:

http://developer.novell.com/wiki/images/7/76/Iofd-chart.png

The conclusion to draw is that we save about 4us by skipping the userspace
hop.

--------------------

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:12 +03:00
Gregory Haskins
090b7aff27 KVM: make io_bus interface more robust
Today kvm_io_bus_regsiter_dev() returns void and will internally BUG_ON
if it fails.  We want to create dynamic MMIO/PIO entries driven from
userspace later in the series, so we need to enhance the code to be more
robust with the following changes:

   1) Add a return value to the registration function
   2) Fix up all the callsites to check the return code, handle any
      failures, and percolate the error up to the caller.
   3) Add an unregister function that collapses holes in the array

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:12 +03:00
Beth Kon
e9f4275732 KVM: PIT support for HPET legacy mode
When kvm is in hpet_legacy_mode, the hpet is providing the timer
interrupt and the pit should not be. So in legacy mode, the pit timer
is destroyed, but the *state* of the pit is maintained. So if kvm or
the guest tries to modify the state of the pit, this modification is
accepted, *except* that the timer isn't actually started. When we exit
hpet_legacy_mode, the current state of the pit (which is up to date
since we've been accepting modifications) is used to restart the pit
timer.

The saved_mode code in kvm_pit_load_count temporarily changes mode to
0xff in order to destroy the timer, but then restores the actual
value, again maintaining "current" state of the pit for possible later
reenablement.

[avi: add some reserved storage in the ioctl; make SET_PIT2 IOW]
[marcelo: fix memory corruption due to reserved storage]

Signed-off-by: Beth Kon <eak@us.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:12 +03:00
Gleb Natapov
0d1de2d901 KVM: Always report x2apic as supported feature
We emulate x2apic in software, so host support is not required.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:11 +03:00
Gleb Natapov
c7f0f24b1f KVM: No need to kick cpu if not in a guest mode
This will save a couple of IPIs.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:11 +03:00
Gleb Natapov
1000ff8d89 KVM: Add trace points in irqchip code
Add tracepoint in msi/ioapic/pic set_irq() functions,
in IPI sending and in the point where IRQ is placed into
apic's IRR.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:11 +03:00
Andre Przywara
f7c6d14003 KVM: fix MMIO_CONF_BASE MSR access
Some Windows versions check whether the BIOS has setup MMI/O for
config space accesses on AMD Fam10h CPUs, we say "no" by returning 0 on
reads and only allow disabling of MMI/O CfgSpace setup by igoring "0" writes.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:10 +03:00
Avi Kivity
f691fe1da7 KVM: Trace shadow page lifecycle
Create, sync, unsync, zap.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:10 +03:00
Avi Kivity
0742017159 KVM: MMU: Trace guest pagetable walker
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:09 +03:00
Jan Kiszka
dc7e795e3d Revert "KVM: x86: check for cr3 validity in ioctl_set_sregs"
This reverts commit 6c20e1442bb1c62914bb85b7f4a38973d2a423ba.

To my understanding, it became obsolete with the advent of the more
robust check in mmu_alloc_roots (89da4ff17f). Moreover, it prevents
the conceptually safe pattern

 1. set sregs
 2. register mem-slots
 3. run vcpu

by setting a sticky triple fault during step 1.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:09 +03:00
Andre Przywara
6098ca939e KVM: handle AMD microcode MSR
Windows 7 tries to update the CPU's microcode on some processors,
so we ignore the MSR write here. The patchlevel register is already handled
(returning 0), because the MSR number is the same as Intel's.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:08 +03:00
Sheng Yang
756975bbfd KVM: Fix apic_mmio_write return for unaligned write
Some in-famous OS do unaligned writing for APIC MMIO, and the return value
has been missed in recent change, then the OS hangs.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:08 +03:00
Gleb Natapov
0105d1a526 KVM: x2apic interface to lapic
This patch implements MSR interface to local apic as defines by x2apic
Intel specification.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:08 +03:00
Gleb Natapov
fc61b800f9 KVM: Add Directed EOI support to APIC emulation
Directed EOI is specified by x2APIC, but is available even when lapic is
in xAPIC mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:07 +03:00
Avi Kivity
cb24772140 KVM: Trace apic registers using their symbolic names
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:07 +03:00
Avi Kivity
aec51dc4f1 KVM: Trace mmio
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:07 +03:00
Andre Przywara
c323c0e5f0 KVM: Ignore PCI ECS I/O enablement
Linux guests will try to enable access to the extended PCI config space
via the I/O ports 0xCF8/0xCFC on AMD Fam10h CPU. Since we (currently?)
don't use ECS, simply ignore write and read attempts.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:06 +03:00
Michael S. Tsirkin
bda9020e24 KVM: remove in_range from io devices
This changes bus accesses to use high-level kvm_io_bus_read/kvm_io_bus_write
functions. in_range now becomes unused so it is removed from device ops in
favor of read/write callbacks performing range checks internally.

This allows aliasing (mostly for in-kernel virtio), as well as better error
handling by making it possible to pass errors up to userspace.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:05 +03:00
Michael S. Tsirkin
6c47469453 KVM: convert bus to slots_lock
Use slots_lock to protect device list on the bus.  slots_lock is already
taken for read everywhere, so we only need to take it for write when
registering devices.  This is in preparation to removing in_range and
kvm->lock around it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:05 +03:00
Michael S. Tsirkin
108b56690f KVM: switch pit creation to slots_lock
switch pit creation to slots_lock. slots_lock is already taken for read
everywhere, so we only need to take it for write when creating pit.
This is in preparation to removing in_range and kvm->lock around it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:05 +03:00
Marcelo Tosatti
2023a29cbe KVM: remove old KVMTRACE support code
Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:03 +03:00
Andre Przywara
ed85c06853 KVM: introduce module parameter for ignoring unknown MSRs accesses
KVM will inject a #GP into the guest if that tries to access unhandled
MSRs. This will crash many guests. Although it would be the correct
way to actually handle these MSRs, we introduce a runtime switchable
module param called "ignore_msrs" (defaults to 0). If this is Y, unknown
MSR reads will return 0, while MSR writes are simply dropped. In both cases
we print a message to dmesg to inform the user about that.

You can change the behaviour at any time by saying:

 # echo 1 > /sys/modules/kvm/parameters/ignore_msrs

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:03 +03:00
Andre Przywara
1fdbd48c24 KVM: ignore reads from AMDs C1E enabled MSR
If the Linux kernel detects an C1E capable AMD processor (K8 RevF and
higher), it will access a certain MSR on every attempt to go to halt.
Explicitly handle this read and return 0 to let KVM run a Linux guest
with the native AMD host CPU propagated to the guest.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:03 +03:00
Andre Przywara
8f1589d95e KVM: ignore AMDs HWCR register access to set the FFDIS bit
Linux tries to disable the flush filter on all AMD K8 CPUs. Since KVM
does not handle the needed MSR, the injected #GP will panic the Linux
kernel. Ignore setting of the HWCR.FFDIS bit in this MSR to let Linux
boot with an AMD K8 family guest CPU.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:02 +03:00
Marcelo Tosatti
894a9c5543 KVM: x86: missing locking in PIT/IRQCHIP/SET_BSP_CPU ioctl paths
Correct missing locking in a few places in x86's vm_ioctl handling path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:02 +03:00
Joerg Roedel
ec04b2604c KVM: Prepare memslot data structures for multiple hugepage sizes
[avi: fix build on non-x86]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:02 +03:00
Andre Przywara
4668f05078 KVM: x86 emulator: Add sysexit emulation
Handle #UD intercept of the sysexit instruction in 64bit mode returning to
32bit compat mode on an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:01 +03:00
Andre Przywara
8c60435261 KVM: x86 emulator: Add sysenter emulation
Handle #UD intercept of the sysenter instruction in 32bit compat mode on
an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:01 +03:00
Andre Przywara
e66bb2ccdc KVM: x86 emulator: add syscall emulation
Handle #UD intercept of the syscall instruction in 32bit compat mode on
an Intel host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual. Save the RIP and EFLAGS to the correct registers.

[avi: fix build on i386 due to missing R11]

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:00 +03:00
Andre Przywara
e99f050712 KVM: x86 emulator: Prepare for emulation of syscall instructions
Add the flags needed for syscall, sysenter and sysexit to the opcode table.
Catch (but for now ignore) the opcodes in the emulation switch/case.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:00 +03:00
Andre Przywara
b1d861431e KVM: x86 emulator: Add missing EFLAGS bit definitions
Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:00 +03:00
Andre Przywara
0cb5762ed2 KVM: Allow emulation of syscalls instructions on #UD
Add the opcodes for syscall, sysenter and sysexit to the list of instructions
handled by the undefined opcode handler.

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Marcelo Tosatti
229456fc34 KVM: convert custom marker based tracing to event traces
This allows use of the powerful ftrace infrastructure.

See Documentation/trace/ for usage information.

[avi, stephen: various build fixes]
[sheng: fix control register breakage]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Alexander Graf
219b65dcf6 KVM: SVM: Improve nested interrupt injection
While trying to get Hyper-V running, I realized that the interrupt injection
mechanisms that are in place right now are not 100% correct.

This patch makes nested SVM's interrupt injection behave more like on a
real machine.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Alexander Graf
ff092385e8 KVM: SVM: Implement INVLPGA
SVM adds another way to do INVLPG by ASID which Hyper-V makes use of,
so let's implement it!

For now we just do the same thing invlpg does, as asid switching
means we flush the mmu anyways. That might change one day though.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:58 +03:00
Alexander Graf
3c5d0a44b0 KVM: Implement MSRs used by Hyper-V
Hyper-V uses some MSRs, some of which are actually reserved for BIOS usage.

But let's be nice today and have it its way, because otherwise it fails
terribly.

[jaswinder: fix build for linux-next changes]

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:58 +03:00
Avi Kivity
b3dbf89e67 KVM: SVM: Don't save/restore host cr2
The host never reads cr2 in process context, so are free to clobber it.  The
vmx code does this, so we can safely remove the save/restore code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:58 +03:00
Avi Kivity
d3edefc003 KVM: VMX: Only reload guest cr2 if different from host cr2
cr2 changes only rarely, and writing it is expensive.  Avoid the costly cr2
writes by checking if it does not already hold the desired value.

Shaves 70 cycles off the vmexit latency.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Jan Kiszka
681405bfc7 KVM: Drop useless atomic test from timer function
The current code tries to optimize the setting of
KVM_REQ_PENDING_TIMER but used atomic_inc_and_test - which always
returns true unless pending had the invalid value of -1 on entry. This
patch drops the test part preserving the original semantic but
expressing it less confusingly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Jan Kiszka
f7104db26a KVM: Fix racy event propagation in timer
Minor issue that likely had no practical relevance: the kvm timer
function so far incremented the pending counter and then may reset it
again to 1 in case reinjection was disabled. This opened a small racy
window with the corresponding VCPU loop that may have happened to run
on another (real) CPU and already consumed the value.

Fix it by skipping the incrementation in case pending is already > 0.
This opens a different race windows, but may only rarely cause lost
events in case we do not care about them anyway (!reinject).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Gleb Natapov
33e4c68656 KVM: Optimize searching for highest IRR
Most of the time IRR is empty, so instead of scanning the whole IRR on
each VM entry keep a variable that tells us if IRR is not empty. IRR
will have to be scanned twice on each IRQ delivery, but this is much
more rare than VM entry.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Gleb Natapov
6edf14d8d0 KVM: Replace pending exception by PF if it happens serially
Replace previous exception with a new one in a hope that instruction
re-execution will regenerate lost exception.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
54dee9933e KVM: VMX: conditionally disable 2M pages
Disable usage of 2M pages if VMX_EPT_2MB_PAGE_BIT (bit 16) is clear
in MSR_IA32_VMX_EPT_VPID_CAP and EPT is enabled.

[avi: s/largepages_disabled/largepages_enabled/ to avoid negative logic]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
68f89400bc KVM: VMX: EPT misconfiguration handler
Handler for EPT misconfiguration which checks for valid state
in the shadow pagetables, printing the spte on each level.

The separate WARN_ONs are useful for kerneloops.org.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
94d8b056a2 KVM: MMU: add kvm_mmu_get_spte_hierarchy helper
Required by EPT misconfiguration handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
4d88954d62 KVM: MMU: make for_each_shadow_entry aware of largepages
This way there is no need to add explicit checks in every
for_each_shadow_entry user.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:55 +03:00
Marcelo Tosatti
e799794e02 KVM: VMX: more MSR_IA32_VMX_EPT_VPID_CAP capability bits
Required for EPT misconfiguration handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:55 +03:00
Andre Przywara
71db602322 KVM: Move performance counter MSR access interception to generic x86 path
The performance counter MSRs are different for AMD and Intel CPUs and they
are chosen mainly by the CPUID vendor string. This patch catches writes to
all addresses (regardless of VMX/SVM path) and handles them in the generic
MSR handler routine. Writing a 0 into the event select register is something
we perfectly emulate ;-), so don't print out a warning to dmesg in this
case.
This fixes booting a 64bit Windows guest with an AMD CPUID on an Intel host.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
2920d72857 KVM: MMU audit: largepage handling
Make the audit code aware of largepages.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
2aaf65e8c4 KVM: MMU audit: audit_mappings tweaks
- Fail early in case gfn_to_pfn returns is_error_pfn.
- For the pre pte write case, avoid spurious "gva is valid but spte is notrap"
  messages (the emulation code does the guest write first, so this particular
  case is OK).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
48fc03174b KVM: MMU audit: nontrapping ptes in nonleaf level
It is valid to set non leaf sptes as notrap.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
e58b0f9e0e KVM: MMU audit: update audit_write_protection
- Unsync pages contain writable sptes in the rmap.
- rmaps do not exclusively contain writable sptes anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:53 +03:00
Marcelo Tosatti
08a3732bf2 KVM: MMU audit: update count_writable_mappings / count_rmaps
Under testing, count_writable_mappings returns a value that is 2 integers
larger than what count_rmaps returns.

Suspicion is that either of the two functions is counting a duplicate (either
positively or negatively).

Modifying check_writable_mappings_rmap to check for rmap existance on
all present MMU pages fails to trigger an error, which should keep Avi
happy.

Also introduce mmu_spte_walk to invoke a callback on all present sptes visible
to the current vcpu, might be useful in the future.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:53 +03:00
Marcelo Tosatti
776e663336 KVM: MMU: introduce is_last_spte helper
Hiding some of the last largepage / level interaction (which is useful
for gbpages and for zero based levels).

Also merge the PT_PAGE_TABLE_LEVEL clearing loop in unlink_children.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:53 +03:00
Avi Kivity
3f5d18a965 KVM: Return to userspace on emulation failure
Instead of mindlessly retrying to execute the instruction, report the
failure to userspace.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:52 +03:00
Gleb Natapov
988a2cae6a KVM: Use macro to iterate over vcpus.
[christian: remove unused variables on s390]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:52 +03:00
Gleb Natapov
73880c80aa KVM: Break dependency between vcpu index in vcpus array and vcpu_id.
Archs are free to use vcpu_id as they see fit. For x86 it is used as
vcpu's apic id. New ioctl is added to configure boot vcpu id that was
assumed to be 0 till now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:52 +03:00
Gleb Natapov
1ed0ce000a KVM: Use pointer to vcpu instead of vcpu_id in timer code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:52 +03:00
Gleb Natapov
c5af89b68a KVM: Introduce kvm_vcpu_is_bsp() function.
Use it instead of open code "vcpu_id zero is BSP" assumption.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:51 +03:00
Avi Kivity
d555c333aa KVM: MMU: s/shadow_pte/spte/
We use shadow_pte and spte inconsistently, switch to the shorter spelling.

Rename set_shadow_pte() to __set_spte() to avoid a conflict with the
existing set_spte(), and to indicate its lowlevelness.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:51 +03:00
Avi Kivity
43a3795a3a KVM: MMU: Adjust pte accessors to explicitly indicate guest or shadow pte
Since the guest and host ptes can have wildly different format, adjust
the pte accessor names to indicate on which type of pte they operate on.

No functional changes.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:51 +03:00
Avi Kivity
439e218a6f KVM: MMU: Fix is_dirty_pte()
is_dirty_pte() is used on guest ptes, not shadow ptes, so it needs to avoid
shadow_dirty_mask and use PT_DIRTY_MASK instead.

Misdetecting dirty pages could lead to unnecessarily setting the dirty bit
under EPT.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:50 +03:00
Avi Kivity
7ffd92c53c KVM: VMX: Move rmode structure to vmx-specific code
rmode is only used in vmx, so move it to vmx.c

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:50 +03:00
Nitin A Kamble
3a624e29c7 KVM: VMX: Support Unrestricted Guest feature
"Unrestricted Guest" feature is added in the VMX specification.
Intel Westmere and onwards processors will support this feature.

    It allows kvm guests to run real mode and unpaged mode
code natively in the VMX mode when EPT is turned on. With the
unrestricted guest there is no need to emulate the guest real mode code
in the vm86 container or in the emulator. Also the guest big real mode
code works like native.

  The attached patch enhances KVM to use the unrestricted guest feature
if available on the processor. It also adds a new kernel/module
parameter to disable the unrestricted guest feature at the boot time.

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:49 +03:00
Marcelo Tosatti
fa40a8214b KVM: switch irq injection/acking data structures to irq_lock
Protect irq injection/acking data structures with a separate irq_lock
mutex. This fixes the following deadlock:

CPU A                               CPU B
kvm_vm_ioctl_deassign_dev_irq()
  mutex_lock(&kvm->lock);            worker_thread()
  -> kvm_deassign_irq()                -> kvm_assigned_dev_interrupt_work_handler()
    -> deassign_host_irq()               mutex_lock(&kvm->lock);
      -> cancel_work_sync() [blocked]

[gleb: fix ia64 path]

Reported-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:49 +03:00
Marcelo Tosatti
9f4cc12765 KVM: Grab pic lock in kvm_pic_clear_isr_ack
isr_ack is protected by kvm_pic->lock.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:48 +03:00
Jan Kiszka
238adc7705 KVM: Cleanup LAPIC interface
None of the interface services the LAPIC emulation provides need to be
exported to modules, and kvm_lapic_get_base is even totally unused
today.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:48 +03:00
Avi Kivity
596ae89565 KVM: VMX: Fix reporting of unhandled EPT violations
Instead of returning -ENOTSUPP, exit normally but indicate the hardware
exit reason.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Avi Kivity
6de4f3ada4 KVM: Cache pdptrs
Instead of reloading the pdptrs on every entry and exit (vmcs writes on vmx,
guest memory access on svm) extract them on demand.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Avi Kivity
8f5d549f02 KVM: VMX: Simplify pdptr and cr3 management
Instead of reading the PDPTRs from memory after every exit (which is slow
and wrong, as the PDPTRs are stored on the cpu), sync the PDPTRs from
memory to the VMCS before entry, and from the VMCS to memory after exit.
Do the same for cr3.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Avi Kivity
2d84e993a8 KVM: VMX: Avoid duplicate ept tlb flush when setting cr3
vmx_set_cr3() will call vmx_tlb_flush(), which will flush the ept context.
So there is no need to call ept_sync_context() explicitly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:46 +03:00
Gregory Haskins
6b66ac1ae3 KVM: do not register i8254 PIO regions until we are initialized
We currently publish the i8254 resources to the pio_bus before the devices
are fully initialized.  Since we hold the pit_lock, its probably not
a real issue.  But lets clean this up anyway.

Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:45 +03:00
Gregory Haskins
d76685c4a0 KVM: cleanup io_device code
We modernize the io_device code so that we use container_of() instead of
dev->private, and move the vtable to a separate ops structure
(theoretically allows better caching for multiple instances of the same
ops structure)

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:45 +03:00
Avi Kivity
6c8166a77c KVM: SVM: Fold kvm_svm.h info svm.c
kvm_svm.h is only included from svm.c, so fold it in.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:44 +03:00
Andre Przywara
017cb99e87 KVM: SVM: use explicit 64bit storage for sysenter values
Since AMD does not support sysenter in 64bit mode, the VMCB fields storing
the MSRs are truncated to 32bit upon VMRUN/#VMEXIT. So store the values
in a separate 64bit storage to avoid truncation.

[andre: fix amd->amd migration]

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:43 +03:00
Jan Kiszka
c5ff41ce66 KVM: Allow PIT emulation without speaker port
The in-kernel speaker emulation is only a dummy and also unneeded from
the performance point of view. Rather, it takes user space support to
generate sound output on the host, e.g. console beeps.

To allow this, introduce KVM_CREATE_PIT2 which controls in-kernel
speaker port emulation via a flag passed along the new IOCTL. It also
leaves room for future extensions of the PIT configuration interface.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:41 +03:00
Gregory Haskins
721eecbf4f KVM: irqfd
KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure.  This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism:  Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:41 +03:00
Avi Kivity
0ba12d1081 KVM: Move common KVM Kconfig items to new file virt/kvm/Kconfig
Reduce Kconfig code duplication.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:41 +03:00
Gleb Natapov
787ff73637 KVM: Drop interrupt shadow when single stepping should be done only on VMX
The problem exists only on VMX. Also currently we skip this step if
there is pending exception. The patch fixes this too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:41 +03:00
Christoph Hellwig
284e9b0f5a KVM: cleanup arch/x86/kvm/Makefile
Use proper foo-y style list additions to cleanup all the conditionals,
move module selection after compound object selection and remove the
superflous comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:40 +03:00
Avi Kivity
ee3d29e8be KVM: x86 emulator: fix jmp far decoding (opcode 0xea)
The jump target should not be sign extened; use an unsigned decode flag.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:40 +03:00
Avi Kivity
c9eaf20f26 KVM: x86 emulator: Implement zero-extended immediate decoding
Absolute jumps use zero extended immediate operands.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:39 +03:00
Mark McLoughlin
cb007648de KVM: fix cpuid E2BIG handling for extended request types
If we run out of cpuid entries for extended request types
we should return -E2BIG, just like we do for the standard
request types.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:39 +03:00
Jaswinder Singh Rajput
60af2ecdc5 KVM: Use MSR names in place of address
Replace 0xc0010010 with MSR_K8_SYSCFG and 0xc0010015 with MSR_K7_HWCR.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:39 +03:00
Huang Ying
890ca9aefa KVM: Add MCE support
The related MSRs are emulated. MCE capability is exported via
extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED.  A new
vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation
such as the mcg_cap. MCE is injected via vcpu ioctl command
KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are
not implemented.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:39 +03:00
Jaswinder Singh Rajput
af24a4e4ae KVM: Replace MSR_IA32_TIME_STAMP_COUNTER with MSR_IA32_TSC of msr-index.h
Use standard msr-index.h's MSR declaration.

MSR_IA32_TSC is better than MSR_IA32_TIME_STAMP_COUNTER as it also solves
80 column issue.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:38 +03:00
Gleb Natapov
ae0bb3e011 KVM: VMX: Properly handle software interrupt re-injection in real mode
When reinjecting a software interrupt or exception, use the correct
instruction length provided by the hardware instead of a hardcoded 1.

Fixes problems running the suse 9.1 livecd boot loader.

Problem introduced by commit f0a3602c20 ("KVM: Move interrupt injection
logic to x86.c").

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:38 +03:00
Ingo Molnar
5f9ece0240 Merge commit 'v2.6.31-rc7' into x86/cleanups
Merge reason: we were on -rc1 before - go up to -rc7

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-24 12:25:54 +02:00
Marcin Slusarz
9f51e24ee8 x86: Use printk_once()
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
LKML-Reference: <1249847649-11631-6-git-send-email-marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-09 22:28:34 +02:00
Marcelo Tosatti
53a27b39ff KVM: MMU: limit rmap chain length
Otherwise the host can spend too long traversing an rmap chain, which
happens under a spinlock.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-06 12:06:54 +03:00
Jan Kiszka
263799a361 KVM: VMX: Fix locking imbalance on emulation failure
We have to disable preemption and IRQs on every exit from
handle_invalid_guest_state, otherwise we generate at least a
preempt_disable imbalance.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:59:45 +03:00
Jan Kiszka
34f0c1ad27 KVM: VMX: Fix locking order in handle_invalid_guest_state
Release and re-acquire preemption and IRQ lock in the same order as
vcpu_enter_guest does.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:59:44 +03:00
Marcelo Tosatti
025dbbf36a KVM: MMU: handle n_free_mmu_pages > n_alloc_mmu_pages in kvm_mmu_change_mmu_pages
kvm_mmu_change_mmu_pages mishandles the case where n_alloc_mmu_pages is
smaller then n_free_mmu_pages, by not checking if the result of
the subtraction is negative.

Its a valid condition which can happen if a large number of pages has
been recently freed.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:59:43 +03:00
Marcelo Tosatti
4b656b1202 KVM: SVM: force new asid on vcpu migration
If a migrated vcpu matches the asid_generation value of the target pcpu,
there will be no TLB flush via TLB_CONTROL_FLUSH_ALL_ASID.

The check for vcpu.cpu in pre_svm_run is meaningless since svm_vcpu_load
already updated it on schedule in.

Such vcpu will VMRUN with stale TLB entries.

Based on original patch from Joerg Roedel (http://patchwork.kernel.org/patch/10021/)

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:59:29 +03:00
Marcelo Tosatti
d6289b9365 KVM: x86: verify MTRR/PAT validity
Do not allow invalid memory types in MTRR/PAT (generating a #GP
otherwise).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:58:16 +03:00
Marcelo Tosatti
0ff77873b1 KVM: PIT: fix kpit_elapsed division by zero
Fix division by zero triggered by latch count command on uninitialized
counter.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:58:11 +03:00
Jan Kiszka
e125e7b694 KVM: Fix KVM_GET_MSR_INDEX_LIST
So far, KVM copied the emulated_msrs (only MSR_IA32_MISC_ENABLE) to a
wrong address in user space due to broken pointer arithmetic. This
caused subtle corruption up there (missing MSR_IA32_MISC_ENABLE had
probably no practical relevance). Moreover, the size check for the
user-provided kvm_msr_list forgot about emulated MSRs.

Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-08-05 13:58:03 +03:00
Jaswinder Singh Rajput
bde8922325 KVM: shut up uninit compiler warning in paging_tmpl.h
Dixes compilation warning:
  CC      arch/x86/kernel/io_delay.o
 arch/x86/kvm/paging_tmpl.h: In function ‘paging64_fetch’:
 arch/x86/kvm/paging_tmpl.h:279: warning: ‘sptep’ may be used uninitialized in this function
 arch/x86/kvm/paging_tmpl.h: In function ‘paging32_fetch’:
 arch/x86/kvm/paging_tmpl.h:279: warning: ‘sptep’ may be used uninitialized in this function

warning is bogus (always have a least one level), but need to shut the compiler
up.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-28 14:10:32 +03:00
Amit Shah
9e6996240a KVM: Ignore reads to K7 EVNTSEL MSRs
In commit 7fe29e0faa we ignored the
reads to the P6 EVNTSEL MSRs. That fixed crashes on Intel machines.

Ignore the reads to K7 EVNTSEL MSRs as well to fix this on AMD
hosts.

This fixes Kaspersky antivirus crashing Windows guests on AMD hosts.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-28 14:10:31 +03:00
Avi Kivity
e3c7cb6ad7 KVM: VMX: Handle vmx instruction vmexits
IF a guest tries to use vmx instructions, inject a #UD to let it know the
instruction is not implemented, rather than crashing.

This prevents guest userspace from crashing the guest kernel.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-28 14:10:31 +03:00
Jaswinder Singh Rajput
a3f9d3981c KVM: kvm/x86_emulate.c toggle_interruptibility() should be static
toggle_interruptibility() is used only by same file, it should be static.

Fixed following sparse warning :

  arch/x86/kvm/x86_emulate.c:1364:6: warning: symbol 'toggle_interruptibility' was not declared. Should it be static?

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-28 14:10:30 +03:00
Avi Kivity
29a4b9333b KVM: MMU: Allow 4K ptes with bit 7 (PAT) set
Bit 7 is perfectly legal in the 4K page leve; it is used for the PAT.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-28 14:10:29 +03:00
Mel Gorman
6484eb3e2a page allocator: do not check NUMA node ID when the caller knows the node is valid
Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node".  However, a number of the callers in
fast paths know for a fact their node is valid.  To avoid a comparison and
branch, this patch adds alloc_pages_exact_node() that only checks the nid
with VM_BUG_ON().  Callers that know their node is valid are then
converted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:32 -07:00
Andi Kleen
a0861c02a9 KVM: Add VT-x machine check support
VT-x needs an explicit MC vector intercept to handle machine checks in the
hyper visor.

It also has a special option to catch machine checks that happen
during VT entry.

Do these interceptions and forward them to the Linux machine check
handler. Make it always look like user space is interrupted because
the machine check handler treats kernel/user space differently.

Thanks to Jiang Yunhong for help and testing.

Cc: stable@kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 12:27:08 +03:00
Nitin A Kamble
56b237e31a KVM: VMX: Rename rmode.active to rmode.vm86_active
That way the interpretation of rmode.active becomes more clear with
unrestricted guest code.

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:49:00 +03:00
Gleb Natapov
20f65983e3 KVM: Move "exit due to NMI" handling into vmx_complete_interrupts()
To save us one reading of VM_EXIT_INTR_INFO.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:59 +03:00
Gleb Natapov
8db3baa2db KVM: Disable CR8 intercept if tpr patching is active
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:59 +03:00
Gleb Natapov
36752c9b91 KVM: Do not migrate pending software interrupts.
INTn will be re-executed after migration. If we wanted to migrate
pending software interrupt we would need to migrate interrupt type
and instruction length too, but we do not have all required info on
SVM, so SVM->VMX migration would need to re-execute INTn anyway. To
make it simple never migrate pending soft interrupt.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:59 +03:00
Gleb Natapov
44c11430b5 KVM: inject NMI after IRET from a previous NMI, not before.
If NMI is received during handling of another NMI it should be injected
immediately after IRET from previous NMI handler, but SVM intercept IRET
before instruction execution so we can't inject pending NMI at this
point and there is not way to request exit when NMI window opens. This
patch fix SVM code to open NMI window after IRET by single stepping over
IRET instruction.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:59 +03:00
Gleb Natapov
6a8b1d1312 KVM: Always request IRQ/NMI window if an interrupt is pending
Currently they are not requested if there is pending exception.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:58 +03:00
Gleb Natapov
66fd3f7f90 KVM: Do not re-execute INTn instruction.
Re-inject event instead. This is what Intel suggest. Also use correct
instruction length when re-injecting soft fault/interrupt.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:58 +03:00
Gleb Natapov
f629cf8485 KVM: skip_emulated_instruction() decode instruction if size is not known
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:58 +03:00
Gleb Natapov
923c61bbc6 KVM: Remove irq_pending bitmap
Only one interrupt vector can be injected from userspace irqchip at
any given time so no need to store it in a bitmap. Put it into interrupt
queue directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:57 +03:00
Gleb Natapov
fa9726b073 KVM: Do not allow interrupt injection from userspace if there is a pending event.
The exception will immediately close the interrupt window.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:57 +03:00
Gleb Natapov
3298b75c88 KVM: Unprotect a page if #PF happens during NMI injection.
It is done for exception and interrupt already.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:57 +03:00
Robert P. J. Day
58f8ac279a KVM: Expand on "help" info to specify kvm intel and amd module names
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:55 +03:00
Marcelo Tosatti
8986ecc0ef KVM: x86: check for cr3 validity in mmu_alloc_roots
Verify the cr3 address stored in vcpu->arch.cr3 points to an existant
memslot. If not, inject a triple fault.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:55 +03:00
Marcelo Tosatti
7c8a83b75a KVM: MMU: protect kvm_mmu_change_mmu_pages with mmu_lock
kvm_handle_hva, called by MMU notifiers, manipulates mmu data only with
the protection of mmu_lock.

Update kvm_mmu_change_mmu_pages callers to take mmu_lock, thus protecting
against kvm_handle_hva.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:54 +03:00
Glauber Costa
310b5d306c KVM: Deal with interrupt shadow state for emulated instructions
We currently unblock shadow interrupt state when we skip an instruction,
but failing to do so when we actually emulate one. This blocks interrupts
in key instruction blocks, in particular sti; hlt; sequences

If the instruction emulated is an sti, we have to block shadow interrupts.
The same goes for mov ss. pop ss also needs it, but we don't currently
emulate it.

Without this patch, I cannot boot gpxe option roms at vmx machines.
This is described at https://bugzilla.redhat.com/show_bug.cgi?id=494469

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:54 +03:00
Glauber Costa
2809f5d2c4 KVM: Replace ->drop_interrupt_shadow() by ->set_interrupt_shadow()
This patch replaces drop_interrupt_shadow with the more
general set_interrupt_shadow, that can either drop or raise
it, depending on its parameter.  It also adds ->get_interrupt_shadow()
for future use.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:54 +03:00
Marcelo Tosatti
32f8840064 KVM: use smp_send_reschedule in kvm_vcpu_kick
KVM uses a function call IPI to cause the exit of a guest running on a
physical cpu. For virtual interrupt notification there is no need to
wait on IPI receival, or to execute any function.

This is exactly what the reschedule IPI does, without the overhead
of function IPI. So use it instead of smp_call_function_single in
kvm_vcpu_kick.

Also change the "guest_mode" variable to a bit in vcpu->requests, and
use that to collapse multiple IPI's that would be issued between the
first one and zeroing of guest mode.

This allows kvm_vcpu_kick to called with interrupts disabled.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:53 +03:00
Avi Kivity
d149c731e4 KVM: Update cpuid 1.ecx reporting
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:53 +03:00
Avi Kivity
7faa4ee1c7 KVM: Add AMD cpuid bit: cr8_legacy, abm, misaligned sse, sse4, 3dnow prefetch
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:52 +03:00
Avi Kivity
8d753f369b KVM: Fix cpuid feature misreporting
MTRR, PAT, MCE, and MCA are all supported (to some extent) but not reported.
Vista requires these features, so if userspace relies on kernel cpuid
reporting, it loses support for Vista.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:52 +03:00
Jan Kiszka
d6a8c875f3 KVM: Drop request_nmi from stats
The stats entry request_nmi is no longer used as the related user space
interface was dropped. So clean it up.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:51 +03:00
Gleb Natapov
fe8e7f83de KVM: SVM: Don't reinject event that caused a task switch
If a task switch caused by an event remove it from the event queue.
VMX already does that.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:51 +03:00
Andre Przywara
b586eb0253 KVM: SVM: Fix cross vendor migration issue in segment segment descriptor
On AMD CPUs sometimes the DB bit in the stack segment
descriptor is left as 1, although the whole segment has
been made unusable. Clear it here to pass an Intel VMX
entry check when cross vendor migrating.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:51 +03:00
Glauber Costa
9b5843ddd2 KVM: fix apic_debug instances
Apparently nobody turned this on in a while...
setting apic_debug to something compilable, generates
some errors. This patch fixes it.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:50 +03:00
Sheng Yang
522c68c441 KVM: Enable snooping control for supported hardware
Memory aliases with different memory type is a problem for guest. For the guest
without assigned device, the memory type of guest memory would always been the
same as host(WB); but for the assigned device, some part of memory may be used
as DMA and then set to uncacheable memory type(UC/WC), which would be a conflict of
host memory type then be a potential issue.

Snooping control can guarantee the cache correctness of memory go through the
DMA engine of VT-d.

[avi: fix build on ia64]

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:50 +03:00
Sheng Yang
4b12f0de33 KVM: Replace get_mt_mask_shift with get_mt_mask
Shadow_mt_mask is out of date, now it have only been used as a flag to indicate
if TDP enabled. Get rid of it and use tdp_enabled instead.

Also put memory type logical in kvm_x86_ops->get_mt_mask().

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:49 +03:00
Jan Blunck
9b62e5b10f KVM: Wake up waitqueue before calling get_cpu()
This moves the get_cpu() call down to be called after we wake up the
waiters. Therefore the waitqueue locks can safely be rt mutex.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:49 +03:00
Gleb Natapov
14d0bc1f7c KVM: Get rid of get_irq() callback
It just returns pending IRQ vector from the queue for VMX/SVM.
Get IRQ directly from the queue before migration and put it back
after.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:49 +03:00
Gleb Natapov
16d7a19117 KVM: Fix userspace IRQ chip migration
Re-put pending IRQ vector into interrupt_bitmap before migration.
Otherwise it will be lost if migration happens in the wrong time.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:48 +03:00
Gleb Natapov
95ba827313 KVM: SVM: Add NMI injection support
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:48 +03:00
Gleb Natapov
c4282df98a KVM: Get rid of arch.interrupt_window_open & arch.nmi_window_open
They are recalculated before each use anyway.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:48 +03:00
Gleb Natapov
0a5fff1923 KVM: Do not report TPR write to userspace if new value bigger or equal to a previous one.
Saves many exits to userspace in a case of IRQ chip in userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:47 +03:00
Gleb Natapov
615d519305 KVM: sync_lapic_to_cr8() should always sync cr8 to V_TPR
Even if IRQ chip is in userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:47 +03:00
Gleb Natapov
115666dfc7 KVM: Remove kvm_push_irq()
No longer used.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:47 +03:00
Gleb Natapov
1d6ed0cb95 KVM: Remove inject_pending_vectors() callback
It is the same as inject_pending_irq() for VMX/SVM now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:47 +03:00
Gleb Natapov
1cb948ae86 KVM: Remove exception_injected() callback.
It always return false for VMX/SVM now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:46 +03:00
Gleb Natapov
9222be18f7 KVM: SVM: Coalesce userspace/kernel irqchip interrupt injection logic
Start to use interrupt/exception queues like VMX does.
This also fix the bug that if exit was caused by a guest
internal exception access to IDT the exception was not
reinjected.

Use EVENTINJ to inject interrupts.  Use VINT only for detecting when IRQ
windows is open again.  EVENTINJ ensures
the interrupt is injected immediately and not delayed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:46 +03:00
Gleb Natapov
5df5664647 KVM: Use kvm_arch_interrupt_allowed() instead of checking interrupt_window_open directly
kvm_arch_interrupt_allowed() also checks IF so drop the check.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:46 +03:00
Gleb Natapov
1f21e79aac KVM: VMX: Cleanup vmx_intr_assist()
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
863e8e658e KVM: VMX: Consolidate userspace and kernel interrupt injection for VMX
Use the same callback to inject irq/nmi events no matter what irqchip is
in use. Only from VMX for now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
8061823a25 KVM: Make kvm_cpu_(has|get)_interrupt() work for userspace irqchip too
At the vector level, kernel and userspace irqchip are fairly similar.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Jan Kiszka
3438253926 KVM: MMU: Fix auditing code
Fix build breakage of hpa lookup in audit_mappings_page. Moreover, make
this function robust against shadow_notrap_nonpresent_pte entries.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Marcelo Tosatti
59839dfff5 KVM: x86: check for cr3 validity in ioctl_set_sregs
Matt T. Yourst notes that kvm_arch_vcpu_ioctl_set_sregs lacks validity
checking for the new cr3 value:

"Userspace callers of KVM_SET_SREGS can pass a bogus value of cr3 to
the kernel. This will trigger a NULL pointer access in gfn_to_rmap()
when userspace next tries to call KVM_RUN on the affected VCPU and kvm
attempts to activate the new non-existent page table root.

This happens since kvm only validates that cr3 points to a valid guest
physical memory page when code *inside* the guest sets cr3. However, kvm
currently trusts the userspace caller (e.g. QEMU) on the host machine to
always supply a valid page table root, rather than properly validating
it along with the rest of the reloaded guest state."

http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2687641&group_id=180599

Check for a valid cr3 address in kvm_arch_vcpu_ioctl_set_sregs, triple
fault in case of failure.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:43 +03:00
Avi Kivity
463656c000 KVM: Replace kvmclock open-coded get_cpu_var() with the real thing
Suggested by Ingo Molnar.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:42 +03:00
Gleb Natapov
8317c298ea KVM: SVM: Skip instruction on a task switch only when appropriate
If a task switch was initiated because off a task gate in IDT and IDT
was accessed because of an external even the instruction should not
be skipped.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:42 +03:00
Gleb Natapov
ba8afb6b0a KVM: x86 emulator: Add new mode of instruction emulation: skip
In the new mode instruction is decoded, but not executed. The EIP
is moved to point after the instruction.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:42 +03:00
Gleb Natapov
e637b8238a KVM: x86 emulator: Decode soft interrupt instructions
Do not emulate them yet.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
84ce66a686 KVM: x86 emulator: Completely decode in/out at decoding stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
341de7e372 KVM: x86 emulator: Add unsigned byte immediate decode
Extend "Source operand type" opcode description field to 4 bites
to accommodate new option.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
d53c4777b3 KVM: x86 emulator: Complete decoding of call near in decode stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
b2833e3cde KVM: x86 emulator: Complete short/near jcc decoding in decode stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Gleb Natapov
782b877c80 KVM: x86 emulator: Complete ljmp decoding at decode stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Gleb Natapov
0654169e73 KVM: x86 emulator: Add lcall decoding
No emulation yet.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Gleb Natapov
a5f868bd45 KVM: x86 emulator: Add decoding of 16bit second immediate argument
Such as segment number in lcall/ljmp

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Marcelo Tosatti
c2d0ee46e6 KVM: MMU: remove global page optimization logic
Complexity to fix it not worthwhile the gains, as discussed
in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:39 +03:00
Marcelo Tosatti
ede2ccc517 KVM: PIT: fix count read and mode 0 handling
Commit 46ee278652f4cbd51013471b64c7897ba9bcd1b1 causes Solaris 10
to hang on boot.

Assuming that PIT counter reads should return 0 for an expired timer
is wrong: when it is active, the counter never stops (see comment on
__kpit_elapsed).

Also arm a one shot timer for mode 0.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:39 +03:00
Gleb Natapov
2d03319654 KVM: x86 emulator: fix call near emulation
The length of pushed on to the stack return address depends on operand
size not address size.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Sheng Yang
4c26b4cd6f KVM: MMU: Discard reserved bits checking on PDE bit 7-8
1. It's related to a Linux kernel bug which fixed by Ingo on
07a66d7c53. The original code exists for quite a
long time, and it would convert a PDE for large page into a normal PDE. But it
fail to fit normal PDE well.  With the code before Ingo's fix, the kernel would
fall reserved bit checking with bit 8 - the remaining global bit of PTE. So the
kernel would receive a double-fault.

2. After discussion, we decide to discard PDE bit 7-8 reserved checking for now.
For this marked as reserved in SDM, but didn't checked by the processor in
fact...

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Gleb Natapov
64a7ec0668 KVM: Fix unneeded instruction skipping during task switching.
There is no need to skip instruction if the reason for a task switch
is a task gate in IDT and access to it is caused by an external even.
The problem  is currently solved only for VMX since there is no reliable
way to skip an instruction in SVM. We should emulate it instead.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Gleb Natapov
b237ac37a1 KVM: Fix task switch back link handling.
Back link is written to a wrong TSS now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
8843419048 KVM: VMX: Do not zero idt_vectoring_info in vmx_complete_interrupts().
We will need it later in task_switch().
Code in handle_exception() is dead. is_external_interrupt(vect_info)
will always be false since idt_vectoring_info is zeroed in
vmx_complete_interrupts().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
37b96e9880 KVM: VMX: Rewrite vmx_complete_interrupt()'s twisted maze of if() statements
...with a more straightforward switch().

Also fix a bug when NMI could be dropped on exit. Although this should
never happen in practice, since NMIs can only be injected, never triggered
internally by the guest like exceptions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
7b4a25cb29 KVM: VMX: Fix handling of a fault during NMI unblocked due to IRET
Bit 12 is undefined in any of the following cases:
 If the VM exit sets the valid bit in the IDT-vectoring information field.
 If the VM exit is due to a double fault.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Dong, Eddie
20c466b561 KVM: Use rsvd_bits_mask in load_pdptrs()
Also remove bit 5-6 from rsvd_bits_mask per latest SDM.

Signed-off-by: Eddie Dong <Eddie.Dong@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
93ba03c2e2 KVM: VMX: Fix feature testing
The testing of feature is too early now, before vmcs_config complete initialization.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
045471563d KVM: VMX: Clean up Flex Priority related
And clean paranthes on returns.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Wei Yongjun
7a6ce84c74 KVM: remove pointless conditional before kfree() in lapic initialization
Remove pointless conditional before kfree().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:35 +03:00
Avi Kivity
9645bb56b3 KVM: MMU: Use different shadows when EFER.NXE changes
A pte that is shadowed when the guest EFER.NXE=1 is not valid when
EFER.NXE=0; if bit 63 is set, the pte should cause a fault, and since the
shadow EFER always has NX enabled, this won't happen.

Fix by using a different shadow page table for different EFER.NXE bits.  This
allows vcpus to run correctly with different values of EFER.NXE, and for
transitions on this bit to be handled correctly without requiring a full
flush.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:35 +03:00
Dong, Eddie
82725b20e2 KVM: MMU: Emulate #PF error code of reserved bits violation
Detect, indicate, and propagate page faults where reserved bits are set.
Take care to handle the different paging modes, each of which has different
sets of reserved bits.

[avi: fix pte reserved bits for efer.nxe=0]

Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:35 +03:00
Eddie Dong
a8b876b1a4 KVM: MMU: Fix comment in page_fault()
The original one is for the code before refactoring.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Sheng Yang
f9c617f611 KVM: VMX: Correct wrong vmcs field sizes
EXIT_QUALIFICATION and GUEST_LINEAR_ADDRESS are natural width, not 64-bit.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Avi Kivity
7d433b9f94 KVM: VMX: Make flexpriority module parameter reflect hardware capability
If the hardware does not support flexpriority, zero the module parameter.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Gleb Natapov
78646121e9 KVM: Fix interrupt unhalting a vcpu when it shouldn't
kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking
if interrupt window is actually opened.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:33 +03:00
Gleb Natapov
09cec75488 KVM: Timer event should not unconditionally unhalt vcpu.
Currently timer events are processed before entering guest mode. Move it
to main vcpu event loop since timer events should be processed even while
vcpu is halted.  Timer may cause interrupt/nmi to be injected and only then
vcpu will be unhalted.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:33 +03:00
Avi Kivity
089d034e0c KVM: VMX: Fold vm_need_ept() into callers
Trivial.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
575ff2dcb2 KVM: VMX: Zero ept module parameter if ept is not present
Allows reading back hardware capability.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
919818abc2 KVM: VMX: Zero the vpid module parameter if vpid is not supported
This allows reading back how the hardware is configured.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
4462d21a61 KVM: VMX: Annotate module parameters as __read_mostly
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
736caefe15 KVM: VMX: Simplify module parameter names
Instead of 'enable_vpid=1', use a simple 'vpid=1'.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
6062d012ed KVM: VMX: Rename kvm_handle_exit() to vmx_handle_exit()
It is a static vmx-specific function.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
c1f8bc04c6 KVM: VMX: Make module parameters readable
Useful to see how the module was loaded.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Gleb Natapov
fe4c7b1914 KVM: reuse (pop|push)_irq from svm.c in vmx.c
The prioritized bit vector manipulation functions are useful in both vmx and
svm.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Gleb Natapov
61c50edfcd KVM: SVM: Remove duplicate code in svm_do_inject_vector()
svm_do_inject_vector() reimplements pop_irq().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:30 +03:00
Amit Shah
7fe29e0faa KVM: x86: Ignore reads to EVNTSEL MSRs
We ignore writes to the performance counters and performance event
selector registers already. Kaspersky antivirus reads the eventsel
MSR causing it to crash with the current behaviour.

Return 0 as data when the eventsel registers are read to stop the
crash.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:30 +03:00
Gleb Natapov
f00be0cae4 KVM: MMU: do not free active mmu pages in free_mmu_pages()
free_mmu_pages() should only undo what alloc_mmu_pages() does.
Free mmu pages from the generic VM destruction function, kvm_destroy_vm().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:30 +03:00
Sheng Yang
e56d532f20 KVM: Device assignment framework rework
After discussion with Marcelo, we decided to rework device assignment framework
together. The old problems are kernel logic is unnecessary complex. So Marcelo
suggest to split it into a more elegant way:

1. Split host IRQ assign and guest IRQ assign. And userspace determine the
combination. Also discard msi2intx parameter, userspace can specific
KVM_DEV_IRQ_HOST_MSI | KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to
enable MSI to INTx convertion.

2. Split assign IRQ and deassign IRQ. Import two new ioctls:
KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ.

This patch also fixed the reversed _IOR vs _IOW in definition(by deprecated the
old interface).

[avi: replace homemade bitcount() by hweight_long()]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:29 +03:00
Hannes Eder
386eb6e8b3 KVM: make 'lapic_timer_ops' and 'kpit_ops' static
Fix this sparse warnings:
  arch/x86/kvm/lapic.c:916:22: warning: symbol 'lapic_timer_ops' was not declared. Should it be static?
  arch/x86/kvm/i8254.c:268:22: warning: symbol 'kpit_ops' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:29 +03:00
Gleb Natapov
58c2dde17d KVM: APIC: get rid of deliver_bitmask
Deliver interrupt during destination matching loop.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:27 +03:00
Gleb Natapov
e1035715ef KVM: change the way how lowest priority vcpu is calculated
The new way does not require additional loop over vcpus to calculate
the one with lowest priority as one is chosen during delivery bitmap
construction.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:27 +03:00
Gleb Natapov
343f94fe4d KVM: consolidate ioapic/ipi interrupt delivery logic
Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead
of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in
apic_send_ipi() to figure out ipi destination instead of reimplementing
the logic.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:27 +03:00
Gleb Natapov
6da7e3f643 KVM: APIC: kvm_apic_set_irq deliver all kinds of interrupts
Get rid of ioapic_inj_irq() and ioapic_inj_nmi() functions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:26 +03:00
Joerg Roedel
f5a1e9f895 KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr
There is no reason to update the shadow pte here because the guest pte
is only changed to dirty state.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:26 +03:00
Marcelo Tosatti
d3c7b77d1a KVM: unify part of generic timer handling
Hide the internals of vcpu awakening / injection from the in-kernel
emulated timers. This makes future changes in this logic easier and
decreases the distance to more generic timer handling.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:25 +03:00
Marcelo Tosatti
fd66842370 KVM: PIT: remove usage of count_load_time for channel 0
We can infer elapsed time from hrtimer_expires_remaining.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:25 +03:00
Marcelo Tosatti
5a05d54554 KVM: PIT: remove unused scheduled variable
Unused.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:25 +03:00
Matt T. Yourst
2dea4c84bc KVM: x86: silence preempt warning on kvm_write_guest_time
This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64)
with PREEMPT enabled.

We're getting syslog warnings like this many (but not all) times qemu
tells KVM to run the VCPU:

BUG: using smp_processor_id() in preemptible [00000000] code:
qemu-system-x86/28938
caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit
Call Trace:
debug_smp_processor_id+0xf7/0x100
kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
? __wake_up+0x4e/0x70
? wake_futex+0x27/0x40
kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm]
enqueue_hrtimer+0x8a/0x110
_spin_unlock_irqrestore+0x27/0x50
vfs_ioctl+0x31/0xa0
do_vfs_ioctl+0x74/0x480
sys_futex+0xb4/0x140
sys_ioctl+0x99/0xa0
system_call_fastpath+0x16/0x1b

As it turns out, the call trace is messed up due to gcc's inlining, but
I isolated the problem anyway: kvm_write_guest_time() is being used in a
non-thread-safe manner on preemptable kernels.

Basically kvm_write_guest_time()'s body needs to be surrounded by
preempt_disable() and preempt_enable(), since the kernel won't let us
query any per-CPU data (indirectly using smp_processor_id()) without
preemption disabled. The attached patch fixes this issue by disabling
preemption inside kvm_write_guest_time().

[marcelo: surround only __get_cpu_var calls since the warning
is harmless]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:24 +03:00
Sheng Yang
bfd349d073 KVM: bit ops for deliver_bitmap
It's also convenient when we extend KVM supported vcpu number in the future.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:22 +03:00
Sheng Yang
110c2faeba KVM: Update intr delivery func to accept unsigned long* bitmap
Would be used with bit ops, and would be easily extended if KVM_MAX_VCPUS is
increased.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:22 +03:00
Avi Kivity
5897297bc2 KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE
Windows 2008 accesses this MSR often on context switch intensive workloads;
since we run in guest context with the guest MSR value loaded (so swapgs can
work correctly), we can simply disable interception of rdmsr/wrmsr for this
MSR.

A complication occurs since in legacy mode, we run with the host MSR value
loaded. In this case we enable interception.  This means we need two MSR
bitmaps, one for legacy mode and one for long mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:21 +03:00
Avi Kivity
3e7c73e9b1 KVM: VMX: Don't use highmem pages for the msr and pio bitmaps
Highmem pages are a pain, and saving three lowmem pages on i386 isn't worth
the extra code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:21 +03:00
Avi Kivity
a2edf57f51 KVM: Fix PDPTR reloading on CR4 writes
The processor is documented to reload the PDPTRs while in PAE mode if any
of the CR4 bits PSE, PGE, or PAE change.  Linux relies on this
behaviour when zapping the low mappings of PAE kernels during boot.

The code already handled changes to CR4.PAE; augment it to also notice changes
to PSE and PGE.

This triggered while booting an F11 PAE kernel; the futex initialization code
runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem
ended up uninitialized, killing PI futexes and pulseaudio which uses them.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-25 20:00:53 +03:00
Avi Kivity
a8cd0244e9 KVM: Make paravirt tlb flush also reload the PAE PDPTRs
The paravirt tlb flush may be used not only to flush TLBs, but also
to reload the four page-directory-pointer-table entries, as it is used
as a replacement for reloading CR3.  Change the code to do the entire
CR3 reloading dance instead of simply flushing the TLB.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-25 20:00:50 +03:00
Avi Kivity
99f85a28a7 KVM: SVM: Remove port 80 passthrough
KVM optimizes guest port 80 accesses by passthing them through to the host.
Some AMD machines die on port 80 writes, allowing the guest to hard-lock the
host.

Remove the port passthrough to avoid the problem.

Cc: stable@kernel.org
Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
Tested-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 14:40:51 +03:00
Avi Kivity
e286e86e6d KVM: Make EFER reads safe when EFER does not exist
Some processors don't have EFER; don't oops if userspace wants us to
read EFER when we check NX.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 11:19:00 +03:00
Avi Kivity
334b8ad7b1 KVM: Fix NX support reporting
NX support is bit 20, not bit 1.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 11:18:48 +03:00
Andre Przywara
19bca6ab75 KVM: SVM: Fix cross vendor migration issue with unusable bit
AMDs VMCB does not have an explicit unusable segment descriptor field,
so we emulate it by using "not present". This has to be setup before
the fixups, because this field is used there.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 11:18:04 +03:00
Jan Kiszka
888d256e9c KVM: Unregister cpufreq notifier on unload
Properly unregister cpufreq notifier on onload if it was registered
during init.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-04-22 13:54:33 +03:00
Joerg Roedel
7f1ea20896 KVM: x86: release time_page on vcpu destruction
Not releasing the time_page causes a leak of that page or the compound
page it is situated in.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-04-22 13:52:10 +03:00
Marcelo Tosatti
bf47a760f6 KVM: MMU: disable global page optimization
Complexity to fix it not worthwhile the gains, as discussed
in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-04-22 13:52:09 +03:00
Ingo Molnar
8302294f43 Merge branch 'tracing/core-v2' into tracing-for-linus
Conflicts:
	include/linux/slub_def.h
	lib/Kconfig.debug
	mm/slob.c
	mm/slub.c
2009-04-02 00:49:02 +02:00
Avi Kivity
16175a796d KVM: VMX: Don't allow uninhibited access to EFER on i386
vmx_set_msr() does not allow i386 guests to touch EFER, but they can still
do so through the default: label in the switch.  If they set EFER_LME, they
can oops the host.

Fix by having EFER access through the normal channel (which will check for
EFER_LME) even on i386.

Reported-and-tested-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:15 +02:00
Andrea Arcangeli
4539b35881 KVM: Fix missing smp tlb flush in invlpg
When kvm emulates an invlpg instruction, it can drop a shadow pte, but
leaves the guest tlbs intact.  This can cause memory corruption when
swapping out.

Without this the other cpu can still write to a freed host physical page.
tlb smp flush must happen if rmap_remove is called always before mmu_lock
is released because the VM will take the mmu_lock before it can finally add
the page to the freelist after swapout. mmu notifier makes it safe to flush
the tlb after freeing the page (otherwise it would never be safe) so we can do
a single flush for multiple sptes invalidated.

Cc: stable@kernel.org
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:14 +02:00
Hannes Eder
cded19f396 KVM: fix sparse warnings: Should it be static?
Impact: Make symbols static.

Fix this sparse warnings:
  arch/x86/kvm/mmu.c:992:5: warning: symbol 'mmu_pages_add' was not declared. Should it be static?
  arch/x86/kvm/mmu.c:1124:5: warning: symbol 'mmu_pages_next' was not declared. Should it be static?
  arch/x86/kvm/mmu.c:1144:6: warning: symbol 'mmu_pages_clear_parents' was not declared. Should it be static?
  arch/x86/kvm/x86.c:2037:5: warning: symbol 'kvm_read_guest_virt' was not declared. Should it be static?
  arch/x86/kvm/x86.c:2067:5: warning: symbol 'kvm_write_guest_virt' was not declared. Should it be static?
  virt/kvm/irq_comm.c:220:5: warning: symbol 'setup_routing_entry' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:14 +02:00
Hannes Eder
d7364a29b3 KVM: fix sparse warnings: context imbalance
Impact: Attribute function with __acquires(...) resp. __releases(...).

Fix this sparse warnings:
  arch/x86/kvm/i8259.c:34:13: warning: context imbalance in 'pic_lock' - wrong count at exit
  arch/x86/kvm/i8259.c:39:13: warning: context imbalance in 'pic_unlock' - unexpected unlock

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:13 +02:00
Amit Shah
41d6af1192 KVM: is_long_mode() should check for EFER.LMA
is_long_mode currently checks the LongModeEnable bit in
EFER instead of the LongModeActive bit. This is wrong, but
we survived this till now since it wasn't triggered. This
breaks guests that go from long mode to compatibility mode.

This is noticed on a solaris guest and fixes bug #1842160

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2009-03-24 11:03:13 +02:00
Amit Shah
401d10dee0 KVM: VMX: Update necessary state when guest enters long mode
setup_msrs() should be called when entering long mode to save the
shadow state for the 64-bit guest state.

Using vmx_set_efer() in enter_lmode() removes some duplicated code
and also ensures we call setup_msrs(). We can safely pass the value
of shadow_efer to vmx_set_efer() as no other bits in the efer change
while enabling long mode (guest first sets EFER.LME, then sets CR0.PG
which causes a vmexit where we activate long mode).

With this fix, is_long_mode() can check for EFER.LMA set instead of
EFER.LME and 5e23049e86dd298b72e206b420513dbc3a240cd9 can be reverted.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:13 +02:00
Joerg Roedel
c5bc224240 KVM: MMU: Fix another largepage memory leak
In the paging_fetch function rmap_remove is called after setting a large
pte to non-present. This causes rmap_remove to not drop the reference to
the large page. The result is a memory leak of that page.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:11 +02:00
Andre Przywara
1fbdc7a585 KVM: SVM: set accessed bit for VMCB segment selectors
In the segment descriptor _cache_ the accessed bit is always set
(although it can be cleared in the descriptor itself). Since Intel
checks for this condition on a VMENTRY, set this bit in the AMD path
to enable cross vendor migration.

Cc: stable@kernel.org
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Acked-By: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:11 +02:00
Gleb Natapov
4925663a07 KVM: Report IRQ injection status to userspace.
IRQ injection status is either -1 (if there was no CPU found
that should except the interrupt because IRQ was masked or
ioapic was misconfigured or ...) or >= 0 in that case the
number indicates to how many CPUs interrupt was injected.
If the value is 0 it means that the interrupt was coalesced
and probably should be reinjected.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:11 +02:00
Joerg Roedel
452425dbaa KVM: MMU: remove assertion in kvm_mmu_alloc_page
The assertion no longer makes sense since we don't clear page tables on
allocation; instead we clear them during prefetch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:10 +02:00
Joerg Roedel
6bed6b9e84 KVM: MMU: remove redundant check in mmu_set_spte
The following code flow is unnecessary:

	if (largepage)
		was_rmapped = is_large_pte(*shadow_pte);
	 else
	 	was_rmapped = 1;

The is_large_pte() function will always evaluate to one here because the
(largepage && !is_large_pte) case is already handled in the first
if-clause. So we can remove this check and set was_rmapped to one always
here.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:10 +02:00
Gerd Hoffmann
c807660407 KVM: Fix kvmclock on !constant_tsc boxes
kvmclock currently falls apart on machines without constant tsc.
This patch fixes it.  Changes:

  * keep tsc frequency in a per-cpu variable.
  * handle kvmclock update using a new request flag, thus checking
    whenever we need an update each time we enter guest context.
  * use a cpufreq notifier to track frequency changes and force
    kvmclock updates.
  * send ipis to kick cpu out of guest context if needed to make
    sure the guest doesn't see stale values.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:09 +02:00
Sheng Yang
49cd7d2238 KVM: VMX: Use kvm_mmu_page_fault() handle EPT violation mmio
Removed duplicated code.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:09 +02:00
Jan Kiszka
34c33d163f KVM: Drop unused evaluations from string pio handlers
Looks like neither the direction nor the rep prefix are used anymore.
Drop related evaluations from SVM's and VMX's I/O exit handlers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:08 +02:00
Alexander Graf
1b2fd70c4e KVM: Add FFXSR support
AMD K10 CPUs implement the FFXSR feature that gets enabled using
EFER. Let's check if the virtual CPU description includes that
CPUID feature bit and allow enabling it then.

This is required for Windows Server 2008 in Hyper-V mode.

v2 adds CPUID capability exposure

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:08 +02:00
Marcelo Tosatti
44882eed2e KVM: make irq ack notifications aware of routing table
IRQ ack notifications assume an identity mapping between pin->gsi,
which might not be the case with, for example, HPET.

Translate before acking.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
2009-03-24 11:03:08 +02:00
Avi Kivity
399ec807dd KVM: Userspace controlled irq routing
Currently KVM has a static routing from GSI numbers to interrupts (namely,
0-15 are mapped 1:1 to both PIC and IOAPIC, and 16:23 are mapped 1:1 to
the IOAPIC).  This is insufficient for several reasons:

- HPET requires non 1:1 mapping for the timer interrupt
- MSIs need a new method to assign interrupt numbers and dispatch them
- ACPI APIC mode needs to be able to reassign the PCI LINK interrupts to the
  ioapics

This patch implements an interrupt routing table (as a linked list, but this
can be easily changed) and a userspace interface to replace the table.  The
routing table is initialized according to the current hardwired mapping.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:06 +02:00
Amit Shah
1935547504 KVM: x86: Fix typos and whitespace errors
Some typos, comments, whitespace errors corrected in the cpuid code

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:05 +02:00
Avi Kivity
5a41accd3f KVM: MMU: Only enable cr4_pge role in shadow mode
Two dimensional paging is only confused by it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Avi Kivity
f6e2c02b6d KVM: MMU: Rename "metaphysical" attribute to "direct"
This actually describes what is going on, rather than alerting the reader
that something strange is going on.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Marcelo Tosatti
9903a927a4 KVM: MMU: drop zeroing on mmu_memory_cache_alloc
Zeroing on mmu_memory_cache_alloc is unnecessary since:

- Smaller areas are pre-allocated with kmem_cache_zalloc.
- Page pointed by ->spt is overwritten with prefetch_page
  and entries in page pointed by ->gfns are initialized
  before reading.

[avi: zeroing pages is unnecessary]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Joe Perches
ff81ff10b4 KVM: SVM: Fix typo in has_svm()
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Avi Kivity
4780c65904 KVM: Reset PIT irq injection logic when the PIT IRQ is unmasked
While the PIT is masked the guest cannot ack the irq, so the reinject logic
will never allow the interrupt to be injected.

Fix by resetting the reinjection counters on unmask.

Unbreaks Xen.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:03 +02:00
Avi Kivity
5d9b8e30f5 KVM: Add CONFIG_HAVE_KVM_IRQCHIP
Two KVM archs support irqchips and two don't.  Add a Kconfig item to
make selecting between the two models easier.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:02 +02:00
Avi Kivity
4677a3b693 KVM: MMU: Optimize page unshadowing
Using kvm_mmu_lookup_page() will result in multiple scans of the hash chains;
use hlist_for_each_entry_safe() to achieve a single scan instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:02 +02:00
Alexander Graf
c8a73f186b KVM: SVM: Add microcode patch level dummy
VMware ESX checks if the microcode level is correct when using a barcelona
CPU, in order to see if it actually can use SVM. Let's tell it we're on the
safe side...

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:02 +02:00
Avi Kivity
269e05e485 KVM: Properly lock PIT creation
Otherwise, two threads can create a PIT in parallel and cause a memory leak.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:01 +02:00
Avi Kivity
a77ab5ead5 KVM: x86 emulator: implement 'ret far' instruction (opcode 0xcb)
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:01 +02:00
Avi Kivity
8b3079a5c0 KVM: VMX: When emulating on invalid vmx state, don't return to userspace unnecessarily
If we aren't doing mmio there's no need to exit to userspace (which will
just be confused).

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
350f69dcd1 KVM: x86 emulator: Make emulate_pop() a little more generic
Allow emulate_pop() to read into arbitrary memory rather than just the
source operand.  Needed for complicated instructions like far returns.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
10f32d84c7 KVM: VMX: Prevent exit handler from running if emulating due to invalid state
If we've just emulated an instruction, we won't have any valid exit
reason and associated information.

Fix by moving the clearing of the emulation_required flag to the exit handler.
This way the exit handler can notice that we've been emulating and abort
early.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
9fd4a3b7a4 KVM: VMX: don't clobber segment AR if emulating invalid state
The ususable bit is important for determining state validity; don't
clobber it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:00 +02:00
Avi Kivity
1872a3f411 KVM: VMX: Fix guest state validity checks
The vmx guest state validity checks are full of bugs.  Make them
conform to the manual.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:59 +02:00
Marcelo Tosatti
52d939a0bf KVM: PIT: provide an option to disable interrupt reinjection
Certain clocks (such as TSC) in older 2.6 guests overaccount for lost
ticks, causing severe time drift. Interrupt reinjection magnifies the
problem.

Provide an option to disable it.

[avi: allow room for expansion in case we want to disable reinjection
      of other timers]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:55 +02:00
Avi Kivity
61a6bd672b KVM: Fallback support for MSR_VM_HSAVE_PA
Since we advertise MSR_VM_HSAVE_PA, userspace will attempt to read it
even on Intel.  Implement fake support for this MSR to avoid the
warnings.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:54 +02:00
Izik Eidus
0f34607440 KVM: remove the vmap usage
vmap() on guest pages hides those pages from the Linux mm for an extended
(userspace determined) amount of time.  Get rid of it.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:54 +02:00
Izik Eidus
77c2002e7c KVM: introduce kvm_read_guest_virt, kvm_write_guest_virt
This commit change the name of emulator_read_std into kvm_read_guest_virt,
and add new function name kvm_write_guest_virt that allow writing into a
guest virtual address.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:54 +02:00
Marcelo Tosatti
53f658b3c3 KVM: VMX: initialize TSC offset relative to vm creation time
VMX initializes the TSC offset for each vcpu at different times, and
also reinitializes it for vcpus other than 0 on APIC SIPI message.

This bug causes the TSC's to appear unsynchronized in the guest, even if
the host is good.

Older Linux kernels don't handle the situation very well, so
gettimeofday is likely to go backwards in time:

http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831

Fix it by initializating the offset of each vcpu relative to vm creation
time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
APIC MP init path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:53 +02:00
Avi Kivity
e8c4a4e8a7 KVM: MMU: Drop walk_shadow()
No longer used.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:53 +02:00
Avi Kivity
a461930bc3 KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in invlpg()
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:53 +02:00
Avi Kivity
e7a04c99b5 KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in fetch()
Effectively reverting to the pre walk_shadow() version -- but now
with the reusable for_each().

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:52 +02:00
Avi Kivity
9f652d21c3 KVM: MMU: Use for_each_shadow_entry() in __direct_map()
Eliminating a callback and a useless structure.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:52 +02:00
Avi Kivity
2d11123a77 KVM: MMU: Add for_each_shadow_entry(), a simpler alternative to walk_shadow()
Using a for_each loop style removes the need to write callback and nasty
casts.

Implement the walk_shadow() using the for_each_shadow_entry().

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:52 +02:00
Avi Kivity
2b3d2a2060 KVM: Fix vmload and friends misinterpreted as lidt
The AMD SVM instruction family all overload the 0f 01 /3 opcode, further
multiplexing on the three r/m bits.  But the code decided that anything that
isn't a vmmcall must be an lidt (which shares the 0f 01 /3 opcode, for the
case that mod = 3).

Fix by aborting emulation if this isn't a vmmcall.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:51 +02:00
Avi Kivity
e207831804 KVM: MMU: Initialize a shadow page's global attribute from cr4.pge
If cr4.pge is cleared, we ought to treat any ptes in the page as non-global.
This allows us to remove the check from set_spte().

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:51 +02:00
Avi Kivity
2f0b3d60b2 KVM: MMU: Segregate mmu pages created with different cr4.pge settings
Don't allow a vcpu with cr4.pge cleared to use a shadow page created with
cr4.pge set; this might cause a cr3 switch not to sync ptes that have the
global bit set (the global bit has no effect if !cr4.pge).

This can only occur on smp with different cr4.pge settings for different
vcpus (since a cr4 change will resync the shadow ptes), but there's no
cost to being correct here.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:51 +02:00
Avi Kivity
a770f6f28b KVM: MMU: Inherit a shadow page's guest level count from vcpu setup
Instead of "calculating" it on every shadow page allocation, set it once
when switching modes, and copy it when allocating pages.

This doesn't buy us much, but sets up the stage for inheriting more
information related to the mmu setup.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:51 +02:00
Jan Kiszka
ae675ef01c KVM: x86: Wire-up hardware breakpoints for guest debugging
Add the remaining bits to make use of debug registers also for guest
debugging, thus enabling the use of hardware breakpoints and
watchpoints.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:50 +02:00
Jan Kiszka
42dbaa5a05 KVM: x86: Virtualize debug registers
So far KVM only had basic x86 debug register support, once introduced to
realize guest debugging that way. The guest itself was not able to use
those registers.

This patch now adds (almost) full support for guest self-debugging via
hardware registers. It refactors the code, moving generic parts out of
SVM (VMX was already cleaned up by the KVM_SET_GUEST_DEBUG patches), and
it ensures that the registers are properly switched between host and
guest.

This patch also prepares debug register usage by the host. The latter
will (once wired-up by the following patch) allow for hardware
breakpoints/watchpoints in guest code. If this is enabled, the guest
will only see faked debug registers without functionality, but with
content reflecting the guest's modifications.

Tested on Intel only, but SVM /should/ work as well, but who knows...

Known limitations: Trapping on tss switch won't work - most probably on
Intel.

Credits also go to Joerg Roedel - I used his once posted debugging
series as platform for this patch.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:49 +02:00
Jan Kiszka
55934c0bd3 KVM: VMX: Allow single-stepping when uninterruptible
When single-stepping over STI and MOV SS, we must clear the
corresponding interruptibility bits in the guest state. Otherwise
vmentry fails as it then expects bit 14 (BS) in pending debug exceptions
being set, but that's not correct for the guest debugging case.

Note that clearing those bits is safe as we check for interruptibility
based on the original state and do not inject interrupts or NMIs if
guest interruptibility was blocked.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:49 +02:00
Jan Kiszka
d0bfb940ec KVM: New guest debug interface
This rips out the support for KVM_DEBUG_GUEST and introduces a new IOCTL
instead: KVM_SET_GUEST_DEBUG. The IOCTL payload consists of a generic
part, controlling the "main switch" and the single-step feature. The
arch specific part adds an x86 interface for intercepting both types of
debug exceptions separately and re-injecting them when the host was not
interested. Moveover, the foundation for guest debugging via debug
registers is layed.

To signal breakpoint events properly back to userland, an arch-specific
data block is now returned along KVM_EXIT_DEBUG. For x86, the arch block
contains the PC, the debug exception, and relevant debug registers to
tell debug events properly apart.

The availability of this new interface is signaled by
KVM_CAP_SET_GUEST_DEBUG. Empty stubs for not yet supported archs are
provided.

Note that both SVM and VTX are supported, but only the latter was tested
yet. Based on the experience with all those VTX corner case, I would be
fairly surprised if SVM will work out of the box.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:49 +02:00
Jan Kiszka
8ab2d2e231 KVM: VMX: Support for injecting software exceptions
VMX differentiates between processor and software generated exceptions
when injecting them into the guest. Extend vmx_queue_exception
accordingly (and refactor related constants) so that we can use this
service reliably for the new guest debugging framework.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:48 +02:00
Alexander Graf
d80174745b KVM: SVM: Only allow setting of EFER_SVME when CPUID SVM is set
Userspace has to tell the kernel module somehow that nested SVM should be used.
The easiest way that doesn't break anything I could think of is to implement

if (cpuid & svm)
    allow write to efer
else
    deny write to efer

Old userspaces mask the SVM capability bit, so they don't break.
In order to find out that the SVM capability is set, I had to split the
kvm_emulate_cpuid into a finding and an emulating part.

(introduced in v6)

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:48 +02:00
Alexander Graf
236de05553 KVM: SVM: Allow setting the SVME bit
Normally setting the SVME bit in EFER is not allowed, as we did
not support SVM. Not since we do, we should also allow enabling
SVM mode.

v2 comes as last patch, so we don't enable half-ready code
v4 introduces a module option to enable SVM
v6 warns that nesting is enabled

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:48 +02:00
Joerg Roedel
eb6f302edf KVM: SVM: Allow read access to MSR_VM_VR
KVM tries to read the VM_CR MSR to find out if SVM was disabled by
the BIOS. So implement read support for this MSR to make nested
SVM running.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:47 +02:00
Alexander Graf
cf74a78b22 KVM: SVM: Add VMEXIT handler and intercepts
This adds the #VMEXIT intercept, so we return to the level 1 guest
when something happens in the level 2 guest that should return to
the level 1 guest.

v2 implements HIF handling and cleans up exception interception
v3 adds support for V_INTR_MASKING_MASK
v4 uses the host page hsave
v5 removes IOPM merging code
v6 moves mmu code out of the atomic section

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:47 +02:00
Alexander Graf
3d6368ef58 KVM: SVM: Add VMRUN handler
This patch implements VMRUN. VMRUN enters a virtual CPU and runs that
in the same context as the normal guest CPU would run.
So basically it is implemented the same way, a normal CPU would do it.

We also prepare all intercepts that get OR'ed with the original
intercepts, as we do not allow a level 2 guest to be intercepted less
than the first level guest.

v2 implements the following improvements:

- fixes the CPL check
- does not allocate iopm when not used
- remembers the host's IF in the HIF bit in the hflags

v3:

- make use of the new permission checking
- add support for V_INTR_MASKING_MASK

v4:

- use host page backed hsave

v5:

- remove IOPM merging code

v6:

- save cr4 so PAE l1 guests work

v7:

- return 0 on vmrun so we check the MSRs too
- fix MSR check to use the correct variable

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:47 +02:00
Alexander Graf
5542675baa KVM: SVM: Add VMLOAD and VMSAVE handlers
This implements the VMLOAD and VMSAVE instructions, that usually surround
the VMRUN instructions. Both instructions load / restore the same elements,
so we only need to implement them once.

v2 fixes CPL checking and replaces memcpy by assignments
v3 makes use of the new permission checking

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:47 +02:00
Alexander Graf
b286d5d8b0 KVM: SVM: Implement hsave
Implement the hsave MSR, that gives the VCPU a GPA to save the
old guest state in.

v2 allows userspace to save/restore hsave
v4 dummys out the hsave MSR, so we use a host page
v6 remembers the guest's hsave and exports the MSR

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:46 +02:00
Alexander Graf
1371d90460 KVM: SVM: Implement GIF, clgi and stgi
This patch implements the GIF flag and the clgi and stgi instructions that
set this flag. Only if the flag is set (default), interrupts can be received by
the CPU.

To keep the information about that somewhere, this patch adds a new hidden
flags vector. that is used to store information that does not go into the
vmcb, but is SVM specific.

I tried to write some code to make -no-kvm-irqchip work too, but the first
level guest won't even boot with that atm, so I ditched it.

v2 moves the hflags to x86 generic code
v3 makes use of the new permission helper
v6 only enables interrupt_window if GIF=1

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:46 +02:00
Alexander Graf
c0725420cf KVM: SVM: Add helper functions for nested SVM
These are helpers for the nested SVM implementation.

- nsvm_printk implements a debug printk variant
- nested_svm_do calls a handler that can accesses gpa-based memory

v3 makes use of the new permission checker
v6 changes:
- streamline nsvm_debug()
- remove printk(KERN_ERR)
- SVME check before CPL check
- give GP error code
- use new EFER constant

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:46 +02:00
Alexander Graf
9962d032bb KVM: SVM: Move EFER and MSR constants to generic x86 code
MSR_EFER_SVME_MASK, MSR_VM_CR and MSR_VM_HSAVE_PA are set in KVM
specific headers. Linux does have nice header files to collect
EFER bits and MSR IDs, so IMHO we should put them there.

While at it, I also changed the naming scheme to match that
of the other defines.

(introduced in v6)

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:46 +02:00
Alexander Graf
f0b85051d0 KVM: SVM: Clean up VINTR setting
The current VINTR intercept setters don't look clean to me. To make
the code easier to read and enable the possibilty to trap on a VINTR
set, this uses a helper function to set the VINTR intercept.

v2 uses two distinct functions for setting and clearing the bit

Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:02:45 +02:00
Ingo Molnar
72c26c9a26 Merge branch 'linus' into tracing/blktrace
Conflicts:
	block/blktrace.c

Semantic merge:
	kernel/trace/blktrace.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 09:00:35 +01:00
Avi Kivity
516a1a7e9d KVM: VMX: Flush volatile msrs before emulating rdmsr
Some msrs (notable MSR_KERNEL_GS_BASE) are held in the processor registers
and need to be flushed to the vcpu struture before they can be read.

This fixes cygwin longjmp() failure on Windows x64.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:39 +02:00
Marcelo Tosatti
b682b814e3 KVM: x86: fix LAPIC pending count calculation
Simplify LAPIC TMCCT calculation by using hrtimer provided
function to query remaining time until expiration.

Fixes host hang with nested ESX.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:38 +02:00
Sheng Yang
2aaf69dcee KVM: MMU: Map device MMIO as UC in EPT
Software are not allow to access device MMIO using cacheable memory type, the
patch limit MMIO region with UC and WC(guest can select WC using PAT and
PCD/PWT).

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:37 +02:00
Marcelo Tosatti
abe6655dd6 KVM: x86: disable kvmclock on non constant TSC hosts
This is better.

Currently, this code path is posing us big troubles,
and we won't have a decent patch in time. So, temporarily
disable it.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Marcelo Tosatti
d2a8284e8f KVM: PIT: fix i8254 pending count read
count_load_time assignment is bogus: its supposed to contain what it
means, not the expiration time.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Sheng Yang
ba4cef31d5 KVM: Fix racy in kvm_free_assigned_irq
In the past, kvm_get_kvm() and kvm_put_kvm() was called in assigned device irq
handler and interrupt_work, in order to prevent cancel_work_sync() in
kvm_free_assigned_irq got a illegal state when waiting for interrupt_work done.
But it's tricky and still got two problems:

1. A bug ignored two conditions that cancel_work_sync() would return true result
in a additional kvm_put_kvm().

2. If interrupt type is MSI, we would got a window between cancel_work_sync()
and free_irq(), which interrupt would be injected again...

This patch discard the reference count used for irq handler and interrupt_work,
and ensure the legal state by moving the free function at the very beginning of
kvm_destroy_vm(). And the patch fix the second bug by disable irq before
cancel_work_sync(), which may result in nested disable of irq but OK for we are
going to free it.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Sheng Yang
ad8ba2cd44 KVM: Add kvm_arch_sync_events to sync with asynchronize events
kvm_arch_sync_events is introduced to quiet down all other events may happen
contemporary with VM destroy process, like IRQ handler and work struct for
assigned device.

For kvm_arch_sync_events is called at the very beginning of kvm_destroy_vm(), so
the state of KVM here is legal and can provide a environment to quiet down other
events.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Ingo Molnar
3d7a96f5a4 Merge branch 'linus' into tracing/kmemtrace2 2009-01-06 09:53:05 +01:00
Joerg Roedel
19de40a847 KVM: change KVM to use IOMMU API
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2009-01-03 14:11:07 +01:00
Joerg Roedel
c4fa386428 KVM: rename vtd.c to iommu.c
Impact: file renamed

The code in the vtd.c file can be reused for other IOMMUs as well. So
rename it to make it clear that it handle more than VT-d.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2009-01-03 14:10:09 +01:00
Marcelo Tosatti
8791723920 KVM: MMU: handle large host sptes on invlpg/resync
The invlpg and sync walkers lack knowledge of large host sptes,
descending to non-existant pagetable level.

Stop at directory level in such case.

Fixes SMP Windows XP with hugepages.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:49 +02:00
Avi Kivity
3f353858c9 KVM: Add locking to virtual i8259 interrupt controller
While most accesses to the i8259 are with the kvm mutex taken, the call
to kvm_pic_read_irq() is not.  We can't easily take the kvm mutex there
since the function is called with interrupts disabled.

Fix by adding a spinlock to the virtual interrupt controller.  Since we
can't send an IPI under the spinlock (we also take the same spinlock in
an irq disabled context), we defer the IPI until the spinlock is released.
Similarly, we defer irq ack notifications until after spinlock release to
avoid lock recursion.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:48 +02:00
Avi Kivity
25e2343246 KVM: MMU: Don't treat a global pte as such if cr4.pge is cleared
The pte.g bit is meaningless if global pages are disabled; deferring
mmu page synchronization on these ptes will lead to the guest using stale
shadow ptes.

Fixes Vista x86 smp bootloader failure.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:48 +02:00
Jan Kiszka
4531220b71 KVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI
There is no point in doing the ready_for_nmi_injection/
request_nmi_window dance with user space. First, we don't do this for
in-kernel irqchip anyway, while the code path is the same as for user
space irqchip mode. And second, there is nothing to loose if a pending
NMI is overwritten by another one (in contrast to IRQs where we have to
save the number). Actually, there is even the risk of raising spurious
NMIs this way because the reason for the held-back NMI might already be
handled while processing the first one.

Therefore this patch creates a simplified user space NMI injection
interface, exporting it under KVM_CAP_USER_NMI and dropping the old
KVM_CAP_NMI capability. And this time we also take care to provide the
interface only on archs supporting NMIs via KVM (right now only x86).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:47 +02:00
Jan Kiszka
264ff01d55 KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip
As with the kernel irqchip, don't allow an NMI to stomp over an already
injected IRQ; instead wait for the IRQ injection to be completed.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:47 +02:00
Marcelo Tosatti
eb64f1e8cd KVM: MMU: check for present pdptr shadow page in walk_shadow
walk_shadow assumes the caller verified validity of the pdptr pointer in
question, which is not the case for the invlpg handler.

Fixes oops during Solaris 10 install.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:46 +02:00
Avi Kivity
ca9edaee1a KVM: Consolidate userspace memory capability reporting into common code
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:46 +02:00
Marcelo Tosatti
ad218f85e3 KVM: MMU: prepopulate the shadow on invlpg
If the guest executes invlpg, peek into the pagetable and attempt to
prepopulate the shadow entry.

Also stop dirty fault updates from interfering with the fork detector.

2% improvement on RHEL3/AIM7.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:44 +02:00
Marcelo Tosatti
6cffe8ca4a KVM: MMU: skip global pgtables on sync due to cr3 switch
Skip syncing global pages on cr3 switch (but not on cr4/cr0). This is
important for Linux 32-bit guests with PAE, where the kmap page is
marked as global.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:44 +02:00
Marcelo Tosatti
b1a368218a KVM: MMU: collapse remote TLB flushes on root sync
Collapse remote TLB flushes on root sync.

kernbench is 2.7% faster on 4-way guest. Improvements have been seen
with other loads such as AIM7.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:43 +02:00
Marcelo Tosatti
60c8aec6e2 KVM: MMU: use page array in unsync walk
Instead of invoking the handler directly collect pages into
an array so the caller can work with it.

Simplifies TLB flush collapsing.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:43 +02:00
Amit Shah
fbce554e94 KVM: x86 emulator: Fix handling of VMMCALL instruction
The VMMCALL instruction doesn't get recognised and isn't processed
by the emulator.

This is seen on an Intel host that tries to execute the VMMCALL
instruction after a guest live migrates from an AMD host.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:43 +02:00
Guillaume Thouvenin
9bf8ea42fe KVM: x86 emulator: add the emulation of shld and shrd instructions
Add emulation of shld and shrd instructions

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:43 +02:00
Guillaume Thouvenin
d175226a5f KVM: x86 emulator: add the assembler code for three operands
Add the assembler code for instruction with three operands and one
operand is stored in ECX register

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:42 +02:00
Guillaume Thouvenin
bfcadf83ec KVM: x86 emulator: add a new "implied 1" Src decode type
Add SrcOne operand type when we need to decode an implied '1' like with
regular shift instruction

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:42 +02:00
Guillaume Thouvenin
0dc8d10f7d KVM: x86 emulator: add Src2 decode set
Instruction like shld has three operands, so we need to add a Src2
decode set. We start with Src2None, Src2CL, and Src2ImmByte, Src2One to
support shld/shrd and we will expand it later.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:42 +02:00
Guillaume Thouvenin
45ed60b371 KVM: x86 emulator: Extend the opcode descriptor
Extend the opcode descriptor to 32 bits. This is needed by the
introduction of a new Src2 operand type.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:41 +02:00
Hannes Eder
efff9e538f KVM: VMX: fix sparse warning
Impact: make global function static

  arch/x86/kvm/vmx.c:134:3: warning: symbol 'vmx_capability' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:06 +02:00
Avi Kivity
f3fd92fbdb KVM: Remove extraneous semicolon after do/while
Notices by Guillaume Thouvenin.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:05 +02:00
Avi Kivity
2b48cc75b2 KVM: x86 emulator: fix popf emulation
Set operand type and size to get correct writeback behavior.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:05 +02:00
Avi Kivity
cf5de4f886 KVM: x86 emulator: fix ret emulation
'ret' did not set the operand type or size for the destination, so
writeback ignored it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:05 +02:00
Avi Kivity
8a09b6877f KVM: x86 emulator: switch 'pop reg' instruction to emulate_pop()
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:04 +02:00
Avi Kivity
781d0edc5f KVM: x86 emulator: allow pop from mmio
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:04 +02:00
Avi Kivity
faa5a3ae39 KVM: x86 emulator: Extract 'pop' sequence into a function
Switch 'pop r/m' instruction to use the new function.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:04 +02:00
Avi Kivity
6b7ad61ffb KVM: x86 emulator: consolidate emulation of two operand instructions
No need to repeat the same assembly block over and over.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:03 +02:00
Avi Kivity
dda96d8f1b KVM: x86 emulator: reduce duplication in one operand emulation thunks
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:03 +02:00
Marcelo Tosatti
ecc5589f19 KVM: MMU: optimize set_spte for page sync
The write protect verification in set_spte is unnecessary for page sync.

Its guaranteed that, if the unsync spte was writable, the target page
does not have a write protected shadow (if it had, the spte would have
been write protected under mmu_lock by rmap_write_protect before).

Same reasoning applies to mark_page_dirty: the gfn has been marked as
dirty via the pagefault path.

The cost of hash table and memslot lookups are quite significant if the
workload is pagetable write intensive resulting in increased mmu_lock
contention.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:02 +02:00
Avi Kivity
df203ec9a7 KVM: VMX: Conditionally request interrupt window after injecting irq
If we're injecting an interrupt, and another one is pending, request
an interrupt window notification so we don't have excess latency on the
second interrupt.

This shouldn't happen in practice since an EOI will be issued, giving a second
chance to request an interrupt window, but...

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:55:00 +02:00
Eduardo Habkost
2c8dceebb2 KVM: SVM: move svm_hardware_disable() code to asm/virtext.h
Create cpu_svm_disable() function.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:30 +02:00
Eduardo Habkost
63d1142f8f KVM: SVM: move has_svm() code to asm/virtext.h
Use a trick to keep the printk()s on has_svm() working as before. gcc
will take care of not generating code for the 'msg' stuff when the
function is called with a NULL msg argument.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:29 +02:00
Eduardo Habkost
710ff4a855 KVM: VMX: extract kvm_cpu_vmxoff() from hardware_disable()
Along with some comments on why it is different from the core cpu_vmxoff()
function.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:29 +02:00
Eduardo Habkost
6210e37b12 KVM: VMX: move cpu_has_kvm_support() to an inline on asm/virtext.h
It will be used by core code on kdump and reboot, to disable
vmx if needed.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:28 +02:00
Eduardo Habkost
c2cedf7be2 KVM: SVM: move svm.h to include/asm
svm.h will be used by core code that is independent of KVM, so I am
moving it outside the arch/x86/kvm directory.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:28 +02:00
Eduardo Habkost
13673a90f1 KVM: VMX: move vmx.h to include/asm
vmx.h will be used by core code that is independent of KVM, so I am
moving it outside the arch/x86/kvm directory.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:27 +02:00
Nitin A Kamble
0fdf8e59fa KVM: Fix cpuid iteration on multiple leaves per eac
The code to traverse the cpuid data array list for counting type of leaves is
currently broken.

This patches fixes the 2 things in it.

 1. Set the 1st counting entry's flag KVM_CPUID_FLAG_STATE_READ_NEXT. Without
    it the code will never find a valid entry.

 2. Also the stop condition in the for loop while looking for the next unflaged
    entry is broken. It needs to stop when it find one matching entry;
    and in the case of count of 1, it will be the same entry found in this
    iteration.

Signed-Off-By: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:24 +02:00
Nitin A Kamble
0853d2c1d8 KVM: Fix cpuid leaf 0xb loop termination
For cpuid leaf 0xb the bits 8-15 in ECX register define the end of counting
leaf.      The previous code was using bits 0-7 for this purpose, which is
a bug.

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:52:24 +02:00
Izik Eidus
2843099fee KVM: MMU: Fix aliased gfns treated as unaliased
Some areas of kvm x86 mmu are using gfn offset inside a slot without
unaliasing the gfn first.  This patch makes sure that the gfn will be
unaliased and add gfn_to_memslot_unaliased() to save the calculating
of the gfn unaliasing in case we have it unaliased already.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:50 +02:00
Sheng Yang
6eb55818c0 KVM: Enable Function Level Reset for assigned device
Ideally, every assigned device should in a clear condition before and after
assignment, so that the former state of device won't affect later work.
Some devices provide a mechanism named Function Level Reset, which is
defined in PCI/PCI-e document. We should execute it before and after device
assignment.

(But sadly, the feature is new, and most device on the market now don't
support it. We are considering using D0/D3hot transmit to emulate it later,
but not that elegant and reliable as FLR itself.)

[Update: Reminded by Xiantao, execute FLR after we ensure that the device can
be assigned to the guest.]

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:49 +02:00
Guillaume Thouvenin
1d5a4d9b92 KVM: VMX: Handle mmio emulation when guest state is invalid
If emulate_invalid_guest_state is enabled, the emulator is called
when guest state is invalid.  Until now, we reported an mmio failure
when emulate_instruction() returned EMULATE_DO_MMIO.  This patch adds
the case where emulate_instruction() failed and an MMIO emulation
is needed.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:48 +02:00
Guillaume Thouvenin
e93f36bcfa KVM: allow emulator to adjust rip for emulated pio instructions
If we call the emulator we shouldn't call skip_emulated_instruction()
in the first place, since the emulator already computes the next rip
for us. Thus we move ->skip_emulated_instruction() out of
kvm_emulate_pio() and into handle_io() (and the svm equivalent). We
also replaced "return 0" by "break" in the "do_io:" case because now
the shadow register state needs to be committed. Otherwise eip will never
be updated.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:48 +02:00
Amit Shah
c0d09828c8 KVM: SVM: Set the 'busy' flag of the TR selector
The busy flag of the TR selector is not set by the hardware. This breaks
migration from amd hosts to intel hosts.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:48 +02:00
Amit Shah
25022acc3d KVM: SVM: Set the 'g' bit of the cs selector for cross-vendor migration
The hardware does not set the 'g' bit of the cs selector and this breaks
migration from amd hosts to intel hosts. Set this bit if the segment
limit is beyond 1 MB.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:48 +02:00
Amit Shah
b8222ad2e5 KVM: x86: Fix typo in function name
get_segment_descritptor_dtable() contains an obvious type.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:47 +02:00
Jan Kiszka
cc6e462cd5 KVM: x86: Optimize NMI watchdog delivery
As suggested by Avi, this patch introduces a counter of VCPUs that have
LVT0 set to NMI mode. Only if the counter > 0, we push the PIT ticks via
all LAPIC LVT0 lines to enable NMI watchdog support.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:47 +02:00
Jan Kiszka
8fdb2351d5 KVM: x86: Fix and refactor NMI watchdog emulation
This patch refactors the NMI watchdog delivery patch, consolidating
tests and providing a proper API for delivering watchdog events.

An included micro-optimization is to check only for apic_hw_enabled in
kvm_apic_local_deliver (the test for LVT mask is covering the
soft-disabled case already).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:46 +02:00
Guillaume Thouvenin
291fd39bfc KVM: x86 emulator: Add decode entries for 0x04 and 0x05 opcodes (add acc, imm)
Add decode entries for 0x04 and 0x05 (ADD) opcodes, execution is already
implemented.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:46 +02:00
Sheng Yang
6fe639792c KVM: VMX: Move private memory slot position
PCI device assignment would map guest MMIO spaces as separate slot, so it is
possible that the device has more than 2 MMIO spaces and overwrite current
private memslot.

The patch move private memory slot to the top of userspace visible memory slots.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:46 +02:00
Sheng Yang
291f26bc0f KVM: MMU: Extend kvm_mmu_page->slot_bitmap size
Otherwise set_bit() for private memory slot(above KVM_MEMORY_SLOTS) would
corrupted memory in 32bit host.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:45 +02:00
Sheng Yang
64d4d52175 KVM: Enable MTRR for EPT
The effective memory type of EPT is the mixture of MSR_IA32_CR_PAT and memory
type field of EPT entry.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:45 +02:00
Sheng Yang
74be52e3e6 KVM: Add local get_mtrr_type() to support MTRR
For EPT memory type support.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:45 +02:00
Sheng Yang
468d472f3f KVM: VMX: Add PAT support for EPT
GUEST_PAT support is a new feature introduced by Intel Core i7 architecture.
With this, cpu would save/load guest and host PAT automatically, for EPT memory
type in guest depends on MSR_IA32_CR_PAT.

Also add save/restore for MSR_IA32_CR_PAT.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:44 +02:00
Sheng Yang
0bed3b568b KVM: Improve MTRR structure
As well as reset mmu context when set MTRR.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:44 +02:00
Gleb Natapov
5f179287fa KVM: call kvm_arch_vcpu_reset() instead of the kvm_x86_ops callback
Call kvm_arch_vcpu_reset() instead of directly using arch callback.
The function does additional things.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:43 +02:00
Jan Kiszka
3b86cd9967 KVM: VMX: work around lacking VNMI support
Older VMX supporting CPUs do not provide the "Virtual NMI" feature for
tracking the NMI-blocked state after injecting such events. For now
KVM is unable to inject NMIs on those CPUs.

Derived from Sheng Yang's suggestion to use the IRQ window notification
for detecting the end of NMI handlers, this patch implements virtual
NMI support without impact on the host's ability to receive real NMIs.
The downside is that the given approach requires some heuristics that
can cause NMI nesting in vary rare corner cases.

The approach works as follows:
 - inject NMI and set a software-based NMI-blocked flag
 - arm the IRQ window start notification whenever an NMI window is
   requested
 - if the guest exits due to an opening IRQ window, clear the emulated
   NMI-blocked flag
 - if the guest net execution time with NMI-blocked but without an IRQ
   window exceeds 1 second, force NMI-blocked reset and inject anyway

This approach covers most practical scenarios:
 - succeeding NMIs are seperated by at least one open IRQ window
 - the guest may spin with IRQs disabled (e.g. due to a bug), but
   leaving the NMI handler takes much less time than one second
 - the guest does not rely on strict ordering or timing of NMIs
   (would be problematic in virtualized environments anyway)

Successfully tested with the 'nmi n' monitor command, the kgdbts
testsuite on smp guests (additional patches required to add debug
register support to kvm) + the kernel's nmi_watchdog=1, and a Siemens-
specific board emulation (+ guest) that comes with its own NMI
watchdog mechanism.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:43 +02:00
Jan Kiszka
487b391d6e KVM: VMX: Provide support for user space injected NMIs
This patch adds the required bits to the VMX side for user space
injected NMIs. As with the preexisting in-kernel irqchip support, the
CPU must provide the "virtual NMI" feature for proper tracking of the
NMI blocking state.

Based on the original patch by Sheng Yang.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:43 +02:00
Jan Kiszka
c4abb7c9cd KVM: x86: Support for user space injected NMIs
Introduces the KVM_NMI IOCTL to the generic x86 part of KVM for
injecting NMIs from user space and also extends the statistic report
accordingly.

Based on the original patch by Sheng Yang.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:42 +02:00
Jan Kiszka
26df99c6c5 KVM: Kick NMI receiving VCPU
Kick the NMI receiving VCPU in case the triggering caller runs in a
different context.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:42 +02:00
Jan Kiszka
0496fbb973 KVM: x86: VCPU with pending NMI is runnabled
Ensure that a VCPU with pending NMIs is considered runnable.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:42 +02:00
Jan Kiszka
23930f9521 KVM: x86: Enable NMI Watchdog via in-kernel PIT source
LINT0 of the LAPIC can be used to route PIT events as NMI watchdog ticks
into the guest. This patch aligns the in-kernel irqchip emulation with
the user space irqchip with already supports this feature. The trick is
to route PIT interrupts to all LAPIC's LVT0 lines.

Rebased and slightly polished patch originally posted by Sheng Yang.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:41 +02:00
Jan Kiszka
66a5a347c2 KVM: VMX: fix real-mode NMI support
Fix NMI injection in real-mode with the same pattern we perform IRQ
injection.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:41 +02:00
Jan Kiszka
f460ee43e2 KVM: VMX: refactor IRQ and NMI window enabling
do_interrupt_requests and vmx_intr_assist go different way for
achieving the same: enabling the nmi/irq window start notification.
Unify their code over enable_{irq|nmi}_window, get rid of a redundant
call to enable_intr_window instead of direct enable_nmi_window
invocation and unroll enable_intr_window for both in-kernel and user
space irq injection accordingly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:41 +02:00
Jan Kiszka
33f089ca5a KVM: VMX: refactor/fix IRQ and NMI injectability determination
There are currently two ways in VMX to check if an IRQ or NMI can be
injected:
 - vmx_{nmi|irq}_enabled and
 - vcpu.arch.{nmi|interrupt}_window_open.
Even worse, one test (at the end of vmx_vcpu_run) uses an inconsistent,
likely incorrect logic.

This patch consolidates and unifies the tests over
{nmi|interrupt}_window_open as cache + vmx_update_window_states
for updating the cache content.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
448fa4a9c5 KVM: x86: Reset pending/inject NMI state on CPU reset
CPU reset invalidates pending or already injected NMIs, therefore reset
the related state variables.

Based on original patch by Gleb Natapov.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
60637aacfd KVM: VMX: Support for NMI task gates
Properly set GUEST_INTR_STATE_NMI and reset nmi_injected when a
task-switch vmexit happened due to a task gate being used for handling
NMIs. Also avoid the false warning about valid vectoring info in
kvm_handle_exit.

Based on original patch by Gleb Natapov.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
e4a41889ec KVM: VMX: Use INTR_TYPE_NMI_INTR instead of magic value
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:40 +02:00
Jan Kiszka
a26bf12afb KVM: VMX: include all IRQ window exits in statistics
irq_window_exits only tracks IRQ window exits due to user space
requests, nmi_window_exits include all exits. The latter makes more
sense, so let's adjust irq_window_exits accounting.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:39 +02:00
Guillaume Thouvenin
2786b014ec KVM: x86 emulator: consolidate push reg
This patch consolidate the emulation of push reg instruction.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-12-31 16:51:39 +02:00
Ingo Molnar
b6ab4afee4 tracing, kvm: change MARKERS to select instead of depends on
Impact: build fix

fix:

 kernel/trace/Kconfig:42:error: found recursive dependency: TRACING ->
 TRACEPOINTS -> MARKERS -> KVM_TRACE -> RELAY -> KMEMTRACE -> TRACING

markers is a facility that should be selected - not depended on
by an interactive Kconfig entry.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-30 09:41:04 +01:00
Marcelo Tosatti
6c475352e8 KVM: MMU: avoid creation of unreachable pages in the shadow
It is possible for a shadow page to have a parent link
pointing to a freed page. When zapping a high level table,
kvm_mmu_page_unlink_children fails to remove the parent_pte link.
For that to happen, the child must be unreachable via the shadow
tree, which can happen in shadow_walk_entry if the guest pte was
modified in between walk() and fetch(). Remove the parent pte
reference in such case.

Possible cause for oops in bug #2217430.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-26 12:34:27 +02:00
Marcelo Tosatti
0c0f40bdbe KVM: MMU: fix sync of ptes addressed at owner pagetable
During page sync, if a pagetable contains a self referencing pte (that
points to the pagetable), the corresponding spte may be marked as
writable even though all mappings are supposed to be write protected.

Fix by clearing page unsync before syncing individual sptes.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-23 15:24:19 +02:00
Avi Kivity
bd2b3ca768 KVM: VMX: Fix interrupt loss during race with NMI
If an interrupt cannot be injected for some reason (say, page fault
when fetching the IDT descriptor), the interrupt is marked for
reinjection.  However, if an NMI is queued at this time, the NMI
will be injected instead and the NMI will be lost.

Fix by deferring the NMI injection until the interrupt has been
injected successfully.

Analyzed by Jan Kiszka.

Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-23 14:52:29 +02:00
Avi Kivity
e17d1dc086 KVM: Fix pit memory leak if unable to allocate irq source id
Reported-By: Daniel Marjamäki <danielm77@spray.se>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-11-11 21:01:51 +02:00
Sheng Yang
928d4bf747 KVM: VMX: Set IGMT bit in EPT entry
There is a potential issue that, when guest using pagetable without vmexit when
EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
which would be inconsistent with host side and would cause host MCE due to
inconsistent cache attribute.

The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
memory type to protect host (notice that all memory mapped by KVM should be WB).

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-11 21:00:37 +02:00
Avi Kivity
ca93e992fd KVM: Require the PCI subsystem
PCI device assignment makes calls to pci code, so require it to be built
into the kernel.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-11-11 20:56:13 +02:00
Marcelo Tosatti
c41ef344de KVM: MMU: increase per-vcpu rmap cache alloc size
The page fault path can use two rmap_desc structures, if:

- walk_addr's dirty pte update allocates one rmap_desc.
- mmu_lock is dropped, sptes are zapped resulting in rmap_desc being
freed.
- fetch->mmu_set_spte allocates another rmap_desc.

Increase to 4 for safety.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-11-11 20:53:34 +02:00
Sheng Yang
5550af4df1 KVM: Fix guest shared interrupt with in-kernel irqchip
Every call of kvm_set_irq() should offer an irq_source_id, which is
allocated by kvm_request_irq_source_id(). Based on irq_source_id, we
identify the irq source and implement logical OR for shared level
interrupts.

The allocated irq_source_id can be freed by kvm_free_irq_source_id().

Currently, we support at most sizeof(unsigned long) different irq sources.

[Amit: - rebase to kvm.git HEAD
       - move definition of KVM_USERSPACE_IRQ_SOURCE_ID to common file
       - move kvm_request_irq_source_id to the update_irq ioctl]

[Xiantao: - Add kvm/ia64 stuff and make it work for kvm/ia64 guests]

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-28 14:21:34 +02:00
Marcelo Tosatti
6ad9f15c94 KVM: MMU: sync root on paravirt TLB flush
The pvmmu TLB flush handler should request a root sync, similarly to
a native read-write CR3.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-28 14:09:27 +02:00
Arjan van de Ven
651dab4264 Merge commit 'linus/master' into merge-linus
Conflicts:

	arch/x86/kvm/i8254.c
2008-10-17 09:20:26 -07:00
Linus Torvalds
08d19f51f0 Merge branch 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (134 commits)
  KVM: ia64: Add intel iommu support for guests.
  KVM: ia64: add directed mmio range support for kvm guests
  KVM: ia64: Make pmt table be able to hold physical mmio entries.
  KVM: Move irqchip_in_kernel() from ioapic.h to irq.h
  KVM: Separate irq ack notification out of arch/x86/kvm/irq.c
  KVM: Change is_mmio_pfn to kvm_is_mmio_pfn, and make it common for all archs
  KVM: Move device assignment logic to common code
  KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/
  KVM: VMX: enable invlpg exiting if EPT is disabled
  KVM: x86: Silence various LAPIC-related host kernel messages
  KVM: Device Assignment: Map mmio pages into VT-d page table
  KVM: PIC: enhance IPI avoidance
  KVM: MMU: add "oos_shadow" parameter to disable oos
  KVM: MMU: speed up mmu_unsync_walk
  KVM: MMU: out of sync shadow core
  KVM: MMU: mmu_convert_notrap helper
  KVM: MMU: awareness of new kvm_mmu_zap_page behaviour
  KVM: MMU: mmu_parent_walk
  KVM: x86: trap invlpg
  KVM: MMU: sync roots on mmu reload
  ...
2008-10-16 15:36:00 -07:00
Harvey Harrison
80a914dc05 misc: replace __FUNCTION__ with __func__
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:30 -07:00
Xiantao Zhang
3de42dc094 KVM: Separate irq ack notification out of arch/x86/kvm/irq.c
Moving irq ack notification logic as common, and make
it shared with ia64 side.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 14:25:35 +02:00
Xiantao Zhang
8a98f6648a KVM: Move device assignment logic to common code
To share with other archs, this patch moves device assignment
logic to common parts.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 14:25:33 +02:00
Zhang xiantao
371c01b28e KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/
Preparation for kvm/ia64 VT-d support.

Signed-off-by: Zhang xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 14:25:32 +02:00
Marcelo Tosatti
83dbc83a0d KVM: VMX: enable invlpg exiting if EPT is disabled
Manually disabling EPT via module option fails to re-enable INVLPG
exiting.

Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 14:25:31 +02:00
Jan Kiszka
1b10bf31a5 KVM: x86: Silence various LAPIC-related host kernel messages
KVM-x86 dumps a lot of debug messages that have no meaning for normal
operation:
 - INIT de-assertion is ignored
 - SIPIs are sent and received
 - APIC writes are unaligned or < 4 byte long
   (Windows Server 2003 triggers this on SMP)

Degrade them to true debug messages, keeping the host kernel log clean
for real problems.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:30 +02:00
Weidong Han
e5fcfc821a KVM: Device Assignment: Map mmio pages into VT-d page table
Assigned device could DMA to mmio pages, so also need to map mmio pages
into VT-d page table.

Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:29 +02:00
Marcelo Tosatti
e48258009d KVM: PIC: enhance IPI avoidance
The PIC code makes little effort to avoid kvm_vcpu_kick(), resulting in
unnecessary guest exits in some conditions.

For example, if the timer interrupt is routed through the IOAPIC, IRR
for IRQ 0 will get set but not cleared, since the APIC is handling the
acks.

This means that everytime an interrupt < 16 is triggered, the priority
logic will find IRQ0 pending and send an IPI to vcpu0 (in case IRQ0 is
not masked, which is Linux's case).

Introduce a new variable isr_ack to represent the IRQ's for which the
guest has been signalled / cleared the ISR. Use it to avoid more than
one IPI per trigger-ack cycle, in addition to the avoidance when ISR is
set in get_priority().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:28 +02:00
Marcelo Tosatti
582801a95d KVM: MMU: add "oos_shadow" parameter to disable oos
Subject says it all.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:27 +02:00
Marcelo Tosatti
0074ff63eb KVM: MMU: speed up mmu_unsync_walk
Cache the unsynced children information in a per-page bitmap.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:26 +02:00
Marcelo Tosatti
4731d4c7a0 KVM: MMU: out of sync shadow core
Allow guest pagetables to go out of sync.  Instead of emulating write
accesses to guest pagetables, or unshadowing them, we un-write-protect
the page table and allow the guest to modify it at will.  We rely on
invlpg executions to synchronize individual ptes, and will synchronize
the entire pagetable on tlb flushes.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:25 +02:00
Marcelo Tosatti
6844dec694 KVM: MMU: mmu_convert_notrap helper
Need to convert shadow_notrap_nonpresent -> shadow_trap_nonpresent when
unsyncing pages.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:24 +02:00
Marcelo Tosatti
0738541396 KVM: MMU: awareness of new kvm_mmu_zap_page behaviour
kvm_mmu_zap_page will soon zap the unsynced children of a page. Restart
list walk in such case.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:23 +02:00
Marcelo Tosatti
ad8cfbe3ff KVM: MMU: mmu_parent_walk
Introduce a function to walk all parents of a given page, invoking a handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:22 +02:00
Marcelo Tosatti
a7052897b3 KVM: x86: trap invlpg
With pages out of sync invlpg needs to be trapped. For now simply nuke
the entry.

Untested on AMD.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:21 +02:00
Marcelo Tosatti
0ba73cdadb KVM: MMU: sync roots on mmu reload
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:20 +02:00
Marcelo Tosatti
e8bc217aef KVM: MMU: mode specific sync_page
Examine guest pagetable and bring the shadow back in sync. Caller is responsible
for local TLB flush before re-entering guest mode.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:19 +02:00
Marcelo Tosatti
38187c830c KVM: MMU: do not write-protect large mappings
There is not much point in write protecting large mappings. This
can only happen when a page is shadowed during the window between
is_largepage_backed and mmu_lock acquision. Zap the entry instead, so
the next pagefault will find a shadowed page via is_largepage_backed and
fallback to 4k translations.

Simplifies out of sync shadow.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:18 +02:00
Marcelo Tosatti
a378b4e64c KVM: MMU: move local TLB flush to mmu_set_spte
Since the sync page path can collapse flushes.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:17 +02:00
Marcelo Tosatti
1e73f9dd88 KVM: MMU: split mmu_set_spte
Split the spte entry creation code into a new set_spte function.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:16 +02:00
Marcelo Tosatti
93a423e704 KVM: MMU: flush remote TLBs on large->normal entry overwrite
It is necessary to flush all TLB's when a large spte entry is
overwritten with a normal page directory pointer.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:15 +02:00
Gleb Natapov
af2152f545 KVM: don't enter guest after SIPI was received by a CPU
The vcpu should process pending SIPI message before entering guest mode again.
kvm_arch_vcpu_runnable() returns true if the vcpu is in SIPI state, so
we can't call it here.

Signed-off-by: Gleb Natapov <gleb@qumranet.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:09 +02:00
Harvey Harrison
2259e3a7a6 KVM: x86.c make kvm_load_realmode_segment static
Noticed by sparse:
arch/x86/kvm/x86.c:3591:5: warning: symbol 'kvm_load_realmode_segment' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:07 +02:00
Marcelo Tosatti
4c2155ce81 KVM: switch to get_user_pages_fast
Convert gfn_to_pfn to use get_user_pages_fast, which can do lockless
pagetable lookups on x86. Kernel compilation on 4-way guest is 3.7%
faster on VMX.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:06 +02:00
Amit Shah
bfadaded0d KVM: Device Assignment: Free device structures if IRQ allocation fails
When an IRQ allocation fails, we free up the device structures and
disable the device so that we can unregister the device in the
userspace and not expose it to the guest at all.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2008-10-15 14:25:04 +02:00
Ben-Ami Yassour
62c476c7c7 KVM: Device Assignment with VT-d
Based on a patch by: Kay, Allen M <allen.m.kay@intel.com>

This patch enables PCI device assignment based on VT-d support.
When a device is assigned to the guest, the guest memory is pinned and
the mapping is updated in the VT-d IOMMU.

[Amit: Expose KVM_CAP_IOMMU so we can check if an IOMMU is present
and also control enable/disable from userspace]

Signed-off-by: Kay, Allen M <allen.m.kay@intel.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Amit Shah <amit.shah@qumranet.com>

Acked-by: Mark Gross <mgross@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 14:25:04 +02:00
Guillaume Thouvenin
aa3a816b6d KVM: x86 emulator: Use DstAcc for 'and'
For instruction 'and al,imm' we use DstAcc instead of doing
the emulation directly into the instruction's opcode.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:16:14 +02:00
Guillaume Thouvenin
8a9fee67fb KVM: x86 emulator: Add cmp al, imm and cmp ax, imm instructions (ocodes 3c, 3d)
Add decode entries for these opcodes; execution is already implemented.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:16:14 +02:00
Guillaume Thouvenin
9c9fddd0e7 KVM: x86 emulator: Add DstAcc operand type
Add DstAcc operand type. That means that there are 4 bits now for
DstMask.

"In the good old days cpus would have only one register that was able to
 fully participate in arithmetic operations, typically called A for
 Accumulator.  The x86 retains this tradition by having special, shorter
 encodings for the A register (like the cmp opcode), and even some
 instructions that only operate on A (like mul).

 SrcAcc and DstAcc would accommodate these instructions by decoding A
 into the corresponding 'struct operand'."
  -- Avi Kivity

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:16:14 +02:00
Sheng Yang
defed7ed92 x86: Move FEATURE_CONTROL bits to msr-index.h
For MSR_IA32_FEATURE_CONTROL is already there.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:16:14 +02:00
Sheng Yang
9ea542facb KVM: VMX: Rename IA32_FEATURE_CONTROL bits
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:16:14 +02:00
Avi Kivity
ef46f18ea0 KVM: x86 emulator: fix jmp r/m64 instruction
jmp r/m64 doesn't require the rex.w prefix to indicate the operand size
is 64 bits.  Set the Stack attribute (even though it doesn't involve the
stack, really) to indicate this.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:27 +02:00
Jan Kiszka
4b92fe0c9d KVM: VMX: Cleanup stalled INTR_INFO read
Commit 1c0f4f5011829dac96347b5f84ba37c2252e1e08 left a useless access
of VM_ENTRY_INTR_INFO_FIELD in vmx_intr_assist behind. Clean this up.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:26 +02:00
Marcelo Tosatti
9c3e4aab5a KVM: x86: unhalt vcpu0 on reset
Since "KVM: x86: do not execute halted vcpus", HLT by vcpu0 before system
reset by the IO thread will hang the guest.

Mark vcpu as runnable in such case.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:26 +02:00
Mohammed Gamal
d19292e457 KVM: x86 emulator: Add call near absolute instruction (opcode 0xff/2)
Add call near absolute instruction.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:26 +02:00
Marcelo Tosatti
d76901750a KVM: x86: do not execute halted vcpus
Offline or uninitialized vcpu's can be executed if requested to perform
userspace work.

Follow Avi's suggestion to handle halted vcpu's in the main loop,
simplifying kvm_emulate_halt(). Introduce a new vcpu->requests bit to
indicate events that promote state from halted to running.

Also standardize vcpu wake sites.

Signed-off-by: Marcelo Tosatti <mtosatti <at> redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:26 +02:00
Mohammed Gamal
a6a3034cb9 KVM: x86 emulator: Add in/out instructions (opcodes 0xe4-0xe7, 0xec-0xef)
The patch adds in/out instructions to the x86 emulator.

The instruction was encountered while running the BIOS while using
the invalid guest state emulation patch.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:25 +02:00
Avi Kivity
fa89a81766 KVM: Add statistics for guest irq injections
These can help show whether a guest is making progress or not.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:25 +02:00
Sheng Yang
d40a1ee485 KVM: MMU: Modify kvm_shadow_walk.entry to accept u64 addr
EPT is 4 level by default in 32pae(48 bits), but the addr parameter
of kvm_shadow_walk->entry() only accept unsigned long as virtual
address, which is 32bit in 32pae. This result in SHADOW_PT_INDEX()
overflow when try to fetch level 4 index.

Fix it by extend kvm_shadow_walk->entry() to accept 64bit addr in
parameter.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:25 +02:00
Mohammed Gamal
fb4616f431 KVM: x86 emulator: Add std and cld instructions (opcodes 0xfc-0xfd)
This adds the std and cld instructions to the emulator.

Encountered while running the BIOS with invalid guest
state emulation enabled.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:25 +02:00
Joerg Roedel
a89c1ad270 KVM: add MC5_MISC msr read support
Currently KVM implements MC0-MC4_MISC read support. When booting Linux this
results in KVM warnings in the kernel log when the guest tries to read
MC5_MISC. Fix this warnings with this patch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:24 +02:00
Avi Kivity
48d1503949 KVM: SVM: No need to unprotect memory during event injection when using npt
No memory is protected anyway.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:24 +02:00
Avi Kivity
3201b5d9f0 KVM: MMU: Fix setting the accessed bit on non-speculative sptes
The accessed bit was accidentally turned on in a random flag word, rather
than, the spte itself, which was lucky, since it used the non-EPT compatible
PT_ACCESSED_MASK.

Fix by turning the bit on in the spte and changing it to use the portable
accessed mask.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:24 +02:00
Avi Kivity
171d595d3b KVM: MMU: Flush tlbs after clearing write permission when accessing dirty log
Otherwise, the cpu may allow writes to the tracked pages, and we lose
some display bits or fail to migrate correctly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:24 +02:00
Avi Kivity
2245a28fe2 KVM: MMU: Add locking around kvm_mmu_slot_remove_write_access()
It was generally safe due to slots_lock being held for write, but it wasn't
very nice.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:24 +02:00
Avi Kivity
bc2d429979 KVM: MMU: Account for npt/ept/realmode page faults
Now that two-dimensional paging is becoming common, account for tdp page
faults.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-10-15 10:15:23 +02:00