Commit Graph

1112 Commits

Author SHA1 Message Date
Marcelo Tosatti
886b470cb1 KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Marcelo Tosatti
78c0337a38 KVM: x86: retain pvclock guest stopped bit in guest memory
Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
migration) to clear the bit.

Noticed by Paolo Bonzini.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:05 -02:00
Guo Chao
807f12e57c KVM: remove unnecessary return value check
No need to check return value before breaking switch.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-13 22:14:29 -02:00
Guo Chao
951179ce86 KVM: x86: fix return value of kvm_vm_ioctl_set_tss_addr()
Return value of this function will be that of ioctl().

#include <stdio.h>
#include <linux/kvm.h>

int main () {
	int fd;
	fd = open ("/dev/kvm", 0);
	fd = ioctl (fd, KVM_CREATE_VM, 0);
	ioctl (fd, KVM_SET_TSS_ADDR, 0xfffff000);
	perror ("");
	return 0;
}

Output is "Operation not permitted". That's not what
we want.

Return -EINVAL in this case.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-13 22:14:29 -02:00
Guo Chao
18595411a7 KVM: do not kfree error pointer
We should avoid kfree()ing error pointer in kvm_vcpu_ioctl() and
kvm_arch_vcpu_ioctl().

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-13 22:14:28 -02:00
Petr Matousek
6d1068b3a9 KVM: x86: invalid opcode oops on SET_SREGS with OSXSAVE bit set (CVE-2012-4461)
On hosts without the XSAVE support unprivileged local user can trigger
oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest
cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN
ioctl.

invalid opcode: 0000 [#2] SMP
Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables
...
Pid: 24935, comm: zoog_kvm_monito Tainted: G      D      3.2.0-3-686-pae
EIP: 0060:[<f8b9550c>] EFLAGS: 00210246 CPU: 0
EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm]
EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0
task.ti=d7c62000)
Stack:
 00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000
 ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0
 c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80
Call Trace:
 [<f8b940a9>] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm]
...
 [<c12bfb44>] ? syscall_call+0x7/0xb
Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74
1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01
d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89
EIP: [<f8b9550c>] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP
0068:d7c63e70

QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID
and then sets them later. So guest's X86_FEATURE_XSAVE should be masked
out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with
X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with
X86_FEATURE_XSAVE even on hosts that do not support it, might be
susceptible to this attack from inside the guest as well.

Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support.

Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-12 21:16:45 -02:00
Xiao Guangrong
87da7e66a4 KVM: x86: fix vcpu->mmio_fragments overflow
After commit b3356bf0db (KVM: emulator: optimize "rep ins" handling),
the pieces of io data can be collected and write them to the guest memory
or MMIO together

Unfortunately, kvm splits the mmio access into 8 bytes and store them to
vcpu->mmio_fragments. If the guest uses "rep ins" to move large data, it
will cause vcpu->mmio_fragments overflow

The bug can be exposed by isapc (-M isapc):

[23154.818733] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ ......]
[23154.858083] Call Trace:
[23154.859874]  [<ffffffffa04f0e17>] kvm_get_cr8+0x1d/0x28 [kvm]
[23154.861677]  [<ffffffffa04fa6d4>] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm]
[23154.863604]  [<ffffffffa04f5a1a>] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm]

Actually, we can use one mmio_fragment to store a large mmio access then
split it when we pass the mmio-exit-info to userspace. After that, we only
need two entries to store mmio info for the cross-mmio pages access

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-31 20:36:30 -02:00
Xiao Guangrong
81c52c56e2 KVM: do not treat noslot pfn as a error pfn
This patch filters noslot pfn out from error pfns based on Marcelo comment:
noslot pfn is not a error pfn

After this patch,
- is_noslot_pfn indicates that the gfn is not in slot
- is_error_pfn indicates that the gfn is in slot but the error is occurred
  when translate the gfn to pfn
- is_error_noslot_pfn indicates that the pfn either it is error pfns or it
  is noslot pfn
And is_invalid_pfn can be removed, it makes the code more clean

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-29 20:31:04 -02:00
Gleb Natapov
471842ec49 KVM: do not de-cache cr4 bits needlessly
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 14:49:28 +02:00
Jan Kiszka
b6785def83 KVM: x86: Make emulator_fix_hypercall static
No users outside of kvm/x86.c.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-08 14:03:16 -03:00
Jan Kiszka
8b6e4547e0 KVM: x86: Convert kvm_arch_vcpu_reset into private kvm_vcpu_reset
There are no external callers of this function as there is no concept of
resetting a vcpu from generic code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-08 14:02:35 -03:00
Linus Torvalds
ecefbd94b8 KVM updates for the 3.7 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
 WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
 R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
 StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
 VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
 VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
 TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
 rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
 B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
 lb/nbAsXf9UJLgGir4I1
 =kN6V
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights of the changes for this release include support for vfio
  level triggered interrupts, improved big real mode support on older
  Intels, a streamlines guest page table walker, guest APIC speedups,
  PIO optimizations, better overcommit handling, and read-only memory."

* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
  KVM: s390: Fix vcpu_load handling in interrupt code
  KVM: x86: Fix guest debug across vcpu INIT reset
  KVM: Add resampling irqfds for level triggered interrupts
  KVM: optimize apic interrupt delivery
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly
  KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
  ...

Conflicts:
	arch/s390/include/asm/processor.h
	arch/x86/kvm/i8259.c
2012-10-04 09:30:33 -07:00
Linus Torvalds
ac07f5c3cb Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu update from Ingo Molnar:
 "The biggest change is the addition of the non-lazy (eager) FPU saving
  support model and enabling it on CPUs with optimized xsaveopt/xrstor
  FPU state saving instructions.

  There are also various Sparse fixes"

Fix up trivial add-add conflict in arch/x86/kernel/traps.c

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
  x86, fpu: remove cpu_has_xmm check in the fx_finit()
  x86, fpu: make eagerfpu= boot param tri-state
  x86, fpu: enable eagerfpu by default for xsaveopt
  x86, fpu: decouple non-lazy/eager fpu restore from xsave
  x86, fpu: use non-lazy fpu restore for processors supporting xsave
  lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
  x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
  x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
  x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
  x86, fpu: drop_fpu() before restoring new state from sigframe
  x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
  x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
  x86, signal: Cleanup ifdefs and is_ia32, is_x32
2012-10-01 11:10:52 -07:00
Jan Kiszka
c863901075 KVM: x86: Fix guest debug across vcpu INIT reset
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.

Found while trying to stop on start_secondary.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 15:00:07 +02:00
Alex Williamson
7a84428af7 KVM: Add resampling irqfds for level triggered interrupts
To emulate level triggered interrupts, add a resample option to
KVM_IRQFD.  When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM.  This may, for
instance, indicate an EOI.  Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt.  On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.

All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 13:50:15 +02:00
Suresh Siddha
b1a74bf821 x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.

kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().

So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.

Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-09-21 16:59:04 -07:00
Gleb Natapov
1e08ec4a13 KVM: optimize apic interrupt delivery
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 15:05:26 +03:00
Avi Kivity
97d64b7881 KVM: MMU: Optimize pte permission checks
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup.  It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.

Optimize this away by precalculating all variants and storing them in a
bitmap.  The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).

The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.

The result is short, branch-free code.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Suresh Siddha
9c1c3fac53 x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
kvm's guest fpu save/restore should be wrapped around
kernel_fpu_begin/end(). This will avoid for example taking a DNA
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.

More importantly this will prevent the host process fpu from being
corrupted.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.siddha@intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-09-18 15:52:07 -07:00
Michael S. Tsirkin
9fc77441e5 KVM: make processes waiting on vcpu mutex killable
vcpu mutex can be held for unlimited time so
taking it with mutex_lock on an ioctl is wrong:
one process could be passed a vcpu fd and
call this ioctl on the vcpu used by another process,
it will then be unkillable until the owner exits.

Call mutex_lock_killable instead and return status.
Note: mutex_lock_interruptible would be even nicer,
but I am not sure all users are prepared to handle EINTR
from these ioctls. They might misinterpret it as an error.

Cleanup paths expect a vcpu that can't be used by
any userspace so this will always succeed - catch bugs
by calling BUG_ON.

Catch callers that don't check return state by adding
__must_check.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 13:46:32 -03:00
Xiao Guangrong
4484141a94 KVM: fix error paths for failed gfn_to_page() calls
This bug was triggered:
[ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34
......
[ 4220.237326] Call Trace:
[ 4220.237361]  [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
[ 4220.237382]  [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm]
[ 4220.237401]  [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm]
[ 4220.237407]  [<ffffffff81145425>] __fput+0x111/0x1ed
[ 4220.237411]  [<ffffffff8114550f>] ____fput+0xe/0x10
[ 4220.237418]  [<ffffffff81063511>] task_work_run+0x5d/0x88
[ 4220.237424]  [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca

The test case:

	printf(fmt, ##args);		\
	exit(-1);} while (0)

static int create_vm(void)
{
	int sys_fd, vm_fd;

	sys_fd = open("/dev/kvm", O_RDWR);
	if (sys_fd < 0)
		die("open /dev/kvm fail.\n");

	vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		die("KVM_CREATE_VM fail.\n");

	return vm_fd;
}

static int create_vcpu(int vm_fd)
{
	int vcpu_fd;

	vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	if (vcpu_fd < 0)
		die("KVM_CREATE_VCPU ioctl.\n");
	printf("Create vcpu.\n");
	return vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
	int vm_fd = (int)(long)arg;

	create_vcpu(vm_fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thread;
	int vm_fd;

	(void)argc;
	(void)argv;

	vm_fd = create_vm();
	pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
	printf("Exit.\n");
	return 0;
}

It caused by release kvm->arch.ept_identity_map_addr which is the
error page.

The parent thread can send KILL signal to the vcpu thread when it was
exiting which stops faulting pages and potentially allocating memory.
So gfn_to_pfn/gfn_to_page may fail at this time

Fixed by checking the page before it is used

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:34:11 +03:00
Michael S. Tsirkin
a50abc3b2b KVM: use symbolic constant for nr interrupts
interrupt_bitmap is KVM_NR_INTERRUPTS bits in size,
so just use that instead of hard-coded constants
and math.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:37:44 +03:00
Gleb Natapov
716d51abff KVM: Provide userspace IO exit completion callback
Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:06:37 +03:00
Marcelo Tosatti
3b4dc3a031 KVM: move postcommit flush to x86, as mmio sptes are x86 specific
Other arches do not need this.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya)
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 16:37:30 +03:00
Marcelo Tosatti
2df72e9bc4 KVM: split kvm_arch_flush_shadow
Introducing kvm_arch_flush_shadow_memslot, to invalidate the
translations of a single memory slot.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 16:37:25 +03:00
Mathias Krause
f1d248315a KVM: x86: more constification
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:42:02 +03:00
Mathias Krause
0fbe9b0b19 KVM: x86: constify read_write_emulator_ops
We never change those, make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:54 +03:00
Mathias Krause
0225fb509d KVM: x86 emulator: constify emulate_ops
We never change emulate_ops[] at runtime so it should be r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:48 +03:00
Marcelo Tosatti
9a7819774e KVM: x86: remove unused variable from kvm_task_switch()
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-30 17:45:54 -03:00
Avi Kivity
dd856efafe KVM: x86 emulator: access GPRs on demand
Instead of populating the entire register file, read in registers
as they are accessed, and write back only the modified ones.  This
saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually
used during emulation), and a two 128-byte copies for the registers.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 18:38:55 -03:00
Michael S. Tsirkin
1d92128fe9 KVM: x86: fix KVM_GET_MSR for PV EOI
KVM_GET_MSR was missing support for PV EOI,
which is needed for migration.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 18:03:05 -03:00
Xiao Guangrong
4d8b81abc4 KVM: introduce readonly memslot
In current code, if we map a readonly memory space from host to guest
and the page is not currently mapped in the host, we will get a fault
pfn and async is not allowed, then the vm will crash

We introduce readonly memory region to map ROM/ROMD to the guest, read access
is happy for readonly memslot, write access on readonly memslot will cause
KVM_EXIT_MMIO exit

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:09:03 +03:00
Xiao Guangrong
8e3d9d061b KVM: x86: fix possible infinite loop caused by reexecute_instruction
Currently, we reexecute all unhandleable instructions if they do not
access on the mmio, however, it can not work if host map the readonly
memory to guest. If the instruction try to write this kind of memory,
it will fault again when guest retry it, then we will goto a infinite
loop: retry instruction -> write #PF -> emulation fail ->
retry instruction -> ...

Fix it by retrying the instruction only when it faults on the writable
memory

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:08:49 +03:00
Marcelo Tosatti
51d59c6b42 KVM: x86: fix pvclock guest stopped flag reporting
kvm_guest_time_update unconditionally clears hv_clock.flags field,
so the notification never reaches the guest.

Fix it by allowing PVCLOCK_GUEST_STOPPED to passthrough.

Reviewed-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 16:10:45 -03:00
Gleb Natapov
64eb062029 KVM: correctly detect APIC SW state in kvm_apic_post_state_restore()
For apic_set_spiv() to track APIC SW state correctly it needs to see
previous and next values of the spurious vector register, but currently
memset() overwrite the old value before apic_set_spiv() get a chance to
do tracking. Fix it by calling apic_set_spiv() before overwriting old
value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-09 12:44:46 +03:00
Gleb Natapov
54e9818f39 KVM: use jump label to optimize checking for in kernel local apic presence
Usually all vcpus have local apic pointer initialized, so the check may
be completely skipped.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 19:00:44 +03:00
Gleb Natapov
c5cc421ba3 KVM: use jump label to optimize checking for HW enabled APIC in APIC_BASE MSR
Usually all APICs are HW enabled so the check can be optimized out.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 19:00:43 +03:00
Gleb Natapov
8a5a87d9b7 KVM: clean up kvm_(set|get)_apic_base
kvm_get_apic_base() needlessly checks irqchip_in_kernel although it does
the same no matter what result of the check is. kvm_set_apic_base() also
checks for irqchip_in_kernel, but kvm_lapic_set_base() can handle this
case.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:20:03 +03:00
Xiao Guangrong
32cad84f44 KVM: do not release the error page
After commit a2766325cf, the error page is replaced by the
error code, it need not be released anymore

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:58 +03:00
Takuya Yoshikawa
d89cc617b9 KVM: Push rmap into kvm_arch_memory_slot
Two reasons:
 - x86 can integrate rmap and rmap_pde and remove heuristics in
   __gfn_to_rmap().
 - Some architectures do not need rmap.

Since rmap is one of the most memory consuming stuff in KVM, ppc'd
better restrict the allocation to Book3S HV.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:30 +03:00
Takuya Yoshikawa
aab2eb7a38 KVM: Stop checking rmap to see if slot is being created
Instead, check npages consistently.  This helps to make rmap
architecture specific in a later patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:02 +03:00
Gleb Natapov
439793d4b3 KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
When MSR_KVM_PV_EOI_EN was added to msrs_to_save array
KVM_SAVE_MSRS_BEGIN was not updated accordingly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-05 16:52:38 +03:00
Avi Kivity
fe56097b23 Merge remote-tracking branch 'upstream' into next
- bring back critical fixes (esp. aa67f6096c)
 - provide an updated base for development

* upstream: (4334 commits)
  missed mnt_drop_write() in do_dentry_open()
  UBIFS: nuke pdflush from comments
  gfs2: nuke pdflush from comments
  drbd: nuke pdflush from comments
  nilfs2: nuke write_super from comments
  hfs: nuke write_super from comments
  vfs: nuke pdflush from comments
  jbd/jbd2: nuke write_super from comments
  btrfs: nuke pdflush from comments
  btrfs: nuke write_super from comments
  ext4: nuke pdflush from comments
  ext4: nuke write_super from comments
  ext3: nuke write_super from comments
  Documentation: fix the VM knobs descritpion WRT pdflush
  Documentation: get rid of write_super
  vfs: kill write_super and sync_supers
  ACPI processor: Fix tick_broadcast_mask online/offline regression
  ACPI: Only count valid srat memory structures
  ACPI: Untangle a return statement for better readability
  Linux 3.6-rc1
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-05 13:25:10 +03:00
Gleb Natapov
e115676e04 KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
When MSR_KVM_PV_EOI_EN was added to msrs_to_save array
KVM_SAVE_MSRS_BEGIN was not updated accordingly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-03 15:08:22 -03:00
Bruce Rogers
4b6486659a KVM: x86: apply kvmclock offset to guest wall clock time
When a guest migrates to a new host, the system time difference from the
previous host is used in the updates to the kvmclock system time visible
to the guest, resulting in a continuation of correct kvmclock based guest
timekeeping.

The wall clock component of the kvmclock provided time is currently not
updated with this same time offset. Since the Linux guest caches the
wall clock based time, this discrepency is not noticed until the guest is
rebooted. After reboot the guest's time calculations are off.

This patch adjusts the wall clock by the kvmclock_offset, resulting in
correct guest time after a reboot.

Cc: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 17:23:50 -03:00
Avi Kivity
26ef19242f KVM: fold kvm_pit_timer into kvm_kpit_state
One structure nests inside the other, providing no value at all.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 00:21:07 -03:00
Christoffer Dall
23d43cf998 KVM: Move KVM_IRQ_LINE to arch-generic code
Handle KVM_IRQ_LINE and KVM_IRQ_LINE_STATUS in the generic
kvm_vm_ioctl() function and call into kvm_vm_ioctl_irq_line().

This is even more relevant when KVM/ARM also uses this ioctl.

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 12:23:25 +03:00
Guo Chao
4a9699807c KVM: x86: Fix typos in x86.c
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 15:26:36 -03:00
Xiao Guangrong
9d3c92af47 KVM: x86: remove unnecessary mark_page_dirty
fix:
[  132.474633] 3.5.0-rc1+ #50 Not tainted
[  132.474634] -------------------------------
[  132.474635] include/linux/kvm_host.h:369 suspicious rcu_dereference_check() usage!
[  132.474636]
[  132.474636] other info that might help us debug this:
[  132.474636]
[  132.474638]
[  132.474638] rcu_scheduler_active = 1, debug_locks = 1
[  132.474640] 1 lock held by qemu-kvm/2832:
[  132.474657]  #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffa01e1636>] vcpu_load+0x1e/0x91 [kvm]
[  132.474658]
[  132.474658] stack backtrace:
[  132.474660] Pid: 2832, comm: qemu-kvm Not tainted 3.5.0-rc1+ #50
[  132.474661] Call Trace:
[  132.474665]  [<ffffffff81092f40>] lockdep_rcu_suspicious+0xfc/0x105
[  132.474675]  [<ffffffffa01e0c85>] kvm_memslots+0x6d/0x75 [kvm]
[  132.474683]  [<ffffffffa01e0ca1>] gfn_to_memslot+0x14/0x4c [kvm]
[  132.474693]  [<ffffffffa01e3575>] mark_page_dirty+0x17/0x2a [kvm]
[  132.474706]  [<ffffffffa01f21ea>] kvm_arch_vcpu_ioctl+0xbcf/0xc07 [kvm]

Actually, we do not write vcpu->arch.time at this time, mark_page_dirty
should be removed.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-19 21:00:20 -03:00
Takuya Yoshikawa
77d11309b3 KVM: Separate rmap_pde from kvm_lpage_info->write_count
This makes it possible to loop over rmap_pde arrays in the same way as
we do over rmap so that we can optimize kvm_handle_hva_range() easily in
the following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Mao, Junjie
ad756a1603 KVM: VMX: Implement PCID/INVPCID for guests with EPT
This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-12 13:07:34 +03:00
Avi Kivity
0017f93a27 KVM: x86 emulator: change ->get_cpuid() accessor to use the x86 semantics
Instead of getting an exact leaf, follow the spec and fall back to the last
main leaf instead.  This lets us easily emulate the cpuid instruction in the
emulator.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Michael S. Tsirkin
ae7a2a3fb6 KVM: host side for eoi optimization
Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes as much as by half.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit

For testing results see description of previous patch
'kvm_para: guest side for eoi avoidance'.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-25 12:40:55 +03:00
Michael S. Tsirkin
d905c06935 KVM: rearrange injection cancelling code
Each time we need to cancel injection we invoke same code
(cancel_injection callback).  Move it towards the end of function using
the familiar goto on error pattern.

Will make it easier to do more cleanups for PV EOI.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-25 12:40:50 +03:00
Michael S. Tsirkin
5cfb1d5a65 KVM: only sync when attention bits set
Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic
attention bitmask but kvm still syncs lapic unconditionally.
As that commit suggested and in anticipation of adding more attention
bits, only sync lapic if(apic_attention).

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-25 12:40:40 +03:00
Takuya Yoshikawa
9e40b67bf2 KVM: Use kvm_kvfree() to free memory allocated by kvm_kvzalloc()
The following commit did not care about the error handling path:

  commit c1a7b32a14
  KVM: Avoid wasting pages for small lpage_info arrays

If memory allocation fails, vfree() will be called with the address
returned by kzalloc().  This patch fixes this issue.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-19 16:10:25 +03:00
Christoffer Dall
a737f256bf KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.

Functions introduced or modified are:
 - kvm_err(fmt, ...)
 - kvm_info(fmt, ...)
 - kvm_debug(fmt, ...)
 - kvm_pr_unimpl(fmt, ...)
 - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-06 15:24:00 +03:00
Takuya Yoshikawa
c1a7b32a14 KVM: Avoid wasting pages for small lpage_info arrays
lpage_info is created for each large level even when the memory slot is
not for RAM.  This means that when we add one slot for a PCI device, we
end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc().

To make things worse, there is an increasing number of devices which
would result in more pages being wasted this way.

This patch mitigates this problem by using kvm_kvzalloc().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:29:49 +03:00
Linus Torvalds
07acfc2a93 Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Avi Kivity:
 "Changes include additional instruction emulation, page-crossing MMIO,
  faster dirty logging, preventing the watchdog from killing a stopped
  guest, module autoload, a new MSI ABI, and some minor optimizations
  and fixes.  Outside x86 we have a small s390 and a very large ppc
  update.

  Regarding the new (for kvm) rebaseless workflow, some of the patches
  that were merged before we switch trees had to be rebased, while
  others are true pulls.  In either case the signoffs should be correct
  now."

Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.

I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)

* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
  KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
  KVM: s390: onereg for timer related registers
  KVM: s390: epoch difference and TOD programmable field
  KVM: s390: KVM_GET/SET_ONEREG for s390
  KVM: s390: add capability indicating COW support
  KVM: Fix mmu_reload() clash with nested vmx event injection
  KVM: MMU: Don't use RCU for lockless shadow walking
  KVM: VMX: Optimize %ds, %es reload
  KVM: VMX: Fix %ds/%es clobber
  KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
  KVM: VMX: unlike vmcs on fail path
  KVM: PPC: Emulator: clean up SPR reads and writes
  KVM: PPC: Emulator: clean up instruction parsing
  kvm/powerpc: Add new ioctl to retreive server MMU infos
  kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
  KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
  KVM: PPC: Book3S: Enable IRQs during exit handling
  KVM: PPC: Fix PR KVM on POWER7 bare metal
  KVM: PPC: Fix stbux emulation
  KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
  ...
2012-05-24 16:17:30 -07:00
Avi Kivity
d8368af8b4 KVM: Fix mmu_reload() clash with nested vmx event injection
Currently the inject_pending_event() call during guest entry happens after
kvm_mmu_reload().  This is for historical reasons - we used to
inject_pending_event() in atomic context, while kvm_mmu_reload() needs task
context.

A problem is that nested vmx can cause the mmu context to be reset, if event
injection is intercepted and causes a #VMEXIT instead (the #VMEXIT resets
CR0/CR3/CR4).  If this happens, we end up with invalid root_hpa, and since
kvm_mmu_reload() has already run, no one will fix it and we end up entering
the guest this way.

Fix by reordering event injection to be before kvm_mmu_reload().  Use
->cancel_injection() to undo if kvm_mmu_reload() fails.

https://bugzilla.kernel.org/show_bug.cgi?id=42980

Reported-by: Luke-Jr <luke-jr+linuxbugs@utopios.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 18:09:26 -03:00
Gleb Natapov
a4fa163531 KVM: ensure async PF event wakes up vcpu from halt
If vcpu executes hlt instruction while async PF is waiting to be delivered
vcpu can block and deliver async PF only after another even wakes it
up. This happens because kvm_check_async_pf_completion() will remove
completion event from vcpu->async_pf.done before entering kvm_vcpu_block()
and this will make kvm_arch_vcpu_runnable() return false. The solution
is to make vcpu runnable when processing completion.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-06 14:56:54 +03:00
Al Viro
bfce281c28 kill mm argument of vm_munmap()
it's always current->mm

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-04-21 01:58:20 -04:00
Linus Torvalds
6be5ceb02e VM: add "vm_mmap()" helper function
This continues the theme started with vm_brk() and vm_munmap():
vm_mmap() does the same thing as do_mmap(), but additionally does the
required VM locking.

This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
duplicates it in mm/mmap.c and mm/nommu.c.  But that way we don't have
to export our internal do_mmap_pgoff() function.

Some day we hopefully don't have to export do_mmap() either, if all
modular users can become the simpler vm_mmap() instead.  We're actually
very close to that already, with the notable exception of the (broken)
use in i810, and a couple of stragglers in binfmt_elf.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-20 17:29:13 -07:00
Linus Torvalds
a46ef99d80 VM: add "vm_munmap()" helper function
Like the vm_brk() function, this is the same as "do_munmap()", except it
does the VM locking for the caller.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-20 17:29:13 -07:00
Avi Kivity
f78146b0f9 KVM: Fix page-crossing MMIO
MMIO that are split across a page boundary are currently broken - the
code does not expect to be aborted by the exit to userspace for the
first MMIO fragment.

This patch fixes the problem by generalizing the current code for handling
16-byte MMIOs to handle a number of "fragments", and changes the MMIO
code to create those fragments.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-19 20:35:07 -03:00
Takuya Yoshikawa
60c34612b7 KVM: Switch to srcu-less get_dirty_log()
We have seen some problems of the current implementation of
get_dirty_log() which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
order of latency when we use VGA displays.

Furthermore the recent discussion on the following thread
    "srcu: Implement call_srcu()"
    http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves this goal without sacrificing the performance of
both VGA and live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of xchg() operation for
clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about the problem of using the atomic
memory instruction many times to the concurrently accessible bitmap,
it is usually accessed with mmu_lock held and we rarely see concurrent
accesses: so what we need to care about is the pure xchg() overheads.

Another point to note is that we do not use for_each_set_bit() to check
which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() in a loop.  This is much faster than repeatedly call
find_next_bit().

Performance:

The dirty-log-perf unit test showed nice improvements, some times faster
than before, except for some extreme cases; for such cases the speed of
getting dirty page information is much faster than we process it in the
userspace.

For real workloads, both VGA and live migration, we have observed pure
improvements: when the guest was reading a file during live migration,
we originally saw a few ms of latency, but with the new method the
latency was less than 200us.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:50:00 +03:00
Takuya Yoshikawa
5dc99b2380 KVM: Avoid checking huge page mappings in get_dirty_log()
Dropped such mappings when we enabled dirty logging and we will never
create new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:58 +03:00
Eric B Munson
1c0b28c2a4 KVM: x86: Add ioctl for KVM_KVMCLOCK_CTRL
Now that we have a flag that will tell the guest it was suspended, create an
interface for that communication using a KVM ioctl.

Signed-off-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:01 +03:00
Christoffer Dall
b6d33834bd KVM: Factor out kvm_vcpu_kick to arch-generic code
The kvm_vcpu_kick function performs roughly the same funcitonality on
most all architectures, so we shouldn't have separate copies.

PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch
structure and to accomodate this special need a
__KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function
kvm_arch_vcpu_wq have been defined. For all other architectures this
is a generic inline that just returns &vcpu->wq;

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:47:47 +03:00
Linus Torvalds
2e7580b0e7 Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Avi Kivity:
 "Changes include timekeeping improvements, support for assigning host
  PCI devices that share interrupt lines, s390 user-controlled guests, a
  large ppc update, and random fixes."

This is with the sign-off's fixed, hopefully next merge window we won't
have rebased commits.

* 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
  KVM: Convert intx_mask_lock to spin lock
  KVM: x86: fix kvm_write_tsc() TSC matching thinko
  x86: kvmclock: abstract save/restore sched_clock_state
  KVM: nVMX: Fix erroneous exception bitmap check
  KVM: Ignore the writes to MSR_K7_HWCR(3)
  KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
  KVM: PMU: add proper support for fixed counter 2
  KVM: PMU: Fix raw event check
  KVM: PMU: warn when pin control is set in eventsel msr
  KVM: VMX: Fix delayed load of shared MSRs
  KVM: use correct tlbs dirty type in cmpxchg
  KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
  KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
  KVM: x86 emulator: Allow PM/VM86 switch during task switch
  KVM: SVM: Fix CPL updates
  KVM: x86 emulator: VM86 segments must have DPL 3
  KVM: x86 emulator: Fix task switch privilege checks
  arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice
  KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
  KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  ...
2012-03-28 14:35:31 -07:00
Linus Torvalds
35cb8d9e18 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu changes from Ingo Molnar.

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  i387: Split up <asm/i387.h> into exported and internal interfaces
  i387: Uninline the generic FP helpers that we expose to kernel modules
2012-03-22 09:41:22 -07:00
Cong Wang
8fd75e1216 x86: remove the second argument of k[un]map_atomic()
Acked-by: Avi Kivity <avi@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:15 +08:00
Marcelo Tosatti
02626b6af5 KVM: x86: fix kvm_write_tsc() TSC matching thinko
kvm_write_tsc() converts from guest TSC to microseconds, not nanoseconds
as intended. The result is that the window for matching is 1000 seconds,
not 1 second.

Microsecond precision is enough for checking whether the TSC write delta
is within the heuristic values, so use it instead of nanoseconds.

Noted by Avi Kivity.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-20 12:40:36 +02:00
Nicolae Mogoreanu
a223c313cb KVM: Ignore the writes to MSR_K7_HWCR(3)
When CPUID Fn8000_0001_EAX reports 0x00100f22 Windows 7 x64 guest
tries to set bit 3 in MSRC001_0015 in nt!KiDisableCacheErrataSource
and fails. This patch will ignore this step and allow things to move
on without having to fake CPUID value.

Signed-off-by: Nicolae Mogoreanu <mogoreanu@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:14:14 +02:00
Jan Kiszka
07700a94b0 KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share legacy IRQs of such devices with other host devices
when passing them to a guest.

The new IRQ sharing feature introduced here is optional, user space has
to request it explicitly. Moreover, user space can inform us about its
view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
interrupt and signaling it if the guest masked it via the virtualized
PCI config space.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:11:36 +02:00
Avi Kivity
3e515705a1 KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
If some vcpus are created before KVM_CREATE_IRQCHIP, then
irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading
to potential NULL pointer dereferences.

Fix by:
- ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called
- ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP

This is somewhat long winded because vcpu->arch.apic is created without
kvm->lock held.

Based on earlier patch by Michael Ellerman.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:30 +02:00
Kevin Wolf
4cee4798a3 KVM: x86 emulator: Allow PM/VM86 switch during task switch
Task switches can switch between Protected Mode and VM86. The current
mode must be updated during the task switch emulation so that the new
segment selectors are interpreted correctly.

In order to let privilege checks succeed, rflags needs to be updated in
the vcpu struct as this causes a CPL update.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:29 +02:00
Kevin Wolf
7f3d35fddd KVM: x86 emulator: Fix task switch privilege checks
Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this take gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

[avi: kill kvm-kmod remnants]

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:26 +02:00
Takuya Yoshikawa
db3fe4eb45 KVM: Introduce kvm_memory_slot::arch and move lpage_info into it
Some members of kvm_memory_slot are not used by every architecture.

This patch is the first step to make this difference clear by
introducing kvm_memory_slot::arch;  lpage_info is moved into it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:22 +02:00
Takuya Yoshikawa
6dbf79e716 KVM: Fix write protection race during dirty logging
This patch fixes a race introduced by:

  commit 95d4c16ce7
  KVM: Optimize dirty logging by rmap_write_protect()

During protecting pages for dirty logging, other threads may also try
to protect a page in mmu_sync_children() or kvm_mmu_get_page().

In such a case, because get_dirty_log releases mmu_lock before flushing
TLB's, the following race condition can happen:

  A (get_dirty_log)     B (another thread)

  lock(mmu_lock)
  clear pte.w
  unlock(mmu_lock)
                        lock(mmu_lock)
                        pte.w is already cleared
                        unlock(mmu_lock)
                        skip TLB flush
                        return
  ...
  TLB flush

Though thread B assumes the page has already been protected when it
returns, the remaining TLB entry will break that assumption.

This patch fixes this problem by making get_dirty_log hold the mmu_lock
until it flushes the TLB's.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:12 +02:00
Zachary Amsden
e26101b116 KVM: Track TSC synchronization in generations
This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest.  This gets us perfect
offset matching for stable TSC systems, and perfect software
computed TSC matching for machines with unstable TSC.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:09 +02:00
Zachary Amsden
0dd6a6edb0 KVM: Dont mark TSC unstable due to S4 suspend
During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but we
need to be adjusting the TSC offsets in such a case.

Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running.  On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:08 +02:00
Marcelo Tosatti
f1e2b26003 KVM: Allow adjust_tsc_offset to be in host or guest cycles
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:07 +02:00
Zachary Amsden
6f526ec538 KVM: Add last_host_tsc tracking back to KVM
The variable last_host_tsc was removed from upstream code.  I am adding
it back for two reasons.  First, it is unnecessary to use guest TSC
computation to conclude information about the host TSC.  The guest may
set the TSC backwards (this case handled by the previous patch), but
the computation of guest TSC (and fetching an MSR) is significanlty more
work and complexity than simply reading the hardware counter.  In addition,
we don't actually need the guest TSC for any part of the computation,
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.

The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:06 +02:00
Zachary Amsden
b183aa580a KVM: Fix last_guest_tsc / tsc_offset semantics
The variable last_guest_tsc was being used as an ad-hoc indicator
that guest TSC has been initialized and recorded correctly.  However,
it may not have been, it could be that guest TSC has been set to some
large value, the back to a small value (by, say, a software reboot).

This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.

In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset.  This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.

Doing this, and also recording last_guest_tsc when the TSC is written
allow us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset of uninitialized value.

Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:05 +02:00
Zachary Amsden
4dd7980b21 KVM: Leave TSC synchronization window open with each new sync
Currently, when the TSC is written by the guest, the variable
ns is updated to force the current write to appear to have taken
place at the time of the first write in this sync phase.  This
leaves a cliff at the end of the match window where updates will
fall of the end.  There are two scenarios where this can be a
problem in practe - first, on a system with a large number of
VCPUs, the sync period may last for an extended period of time.

The second way this can happen is if the VM reboots very rapidly
and we catch a VCPU TSC synchronization just around the edge.
We may be unaware of the reboot, and thus the first VCPU might
synchronize with an old set of the timer (at, say 0.97 seconds
ago, when first powered on).  The second VCPU can come in 0.04
seconds later to try to synchronize, but it misses the window
because it is just over the threshold.

Instead, stop doing this artificial setback of the ns variable
and just update it with every write of the TSC.

It may be observed that doing so causes values computed by
compute_guest_tsc to diverge slightly across CPUs - note that
the last_tsc_ns and last_tsc_write variable are used here, and
now they last_tsc_ns will be different for each VCPU, reflecting
the actual time of the update.

However, compute_guest_tsc is used only for guests which already
have TSC stability issues, and further, note that the previous
patch has caused last_tsc_write to be incremented by the difference
in nanoseconds, converted back into guest cycles.  As such, only
boundary rounding errors should be visible, which given the
resolution in nanoseconds, is going to only be a few cycles and
only visible in cross-CPU consistency tests.  The problem can be
fixed by adding a new set of variables to track the start offset
and start write value for the current sync cycle.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:04 +02:00
Zachary Amsden
5d3cb0f6a8 KVM: Improve TSC offset matching
There are a few improvements that can be made to the TSC offset
matching code.  First, we don't need to call the 128-bit multiply
(especially on a constant number), the code works much nicer to
do computation in nanosecond units.

Second, the way everything is setup with software TSC rate scaling,
we currently have per-cpu rates.  Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.

Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: Fix 64-bit division in kvm_write_tsc()

Breaks i386 build.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:03 +02:00
Zachary Amsden
cc578287e3 KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is the half the
threshold for NTP adjustment, allowing for some hardware variation.

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:09:35 +02:00
Boris Ostrovsky
2b036c6b86 KVM: SVM: Add support for AMD's OSVW feature in guests
In some cases guests should not provide workarounds for errata even when the
physical processor is affected. For example, because of erratum 400 on family
10h processors a Linux guest will read an MSR (resulting in VMEXIT) before
going to idle in order to avoid getting stuck in a non-C0 state. This is not
necessary: HLT and IO instructions are intercepted and therefore there is no
reason for erratum 400 workaround in the guest.

This patch allows us to present a guest with certain errata as fixed,
regardless of the state of actual hardware.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:21 +02:00
Carsten Otte
5b1c1493af KVM: s390: ucontrol: export SIE control block to user
This patch exports the s390 SIE hardware control block to userspace
via the mapping of the vcpu file descriptor. In order to do so,
a new arch callback named kvm_arch_vcpu_fault  is introduced for all
architectures. It allows to map architecture specific pages.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:19 +02:00
Carsten Otte
e08b963716 KVM: s390: add parameter for KVM_CREATE_VM
This patch introduces a new config option for user controlled kernel
virtual machines. It introduces a parameter to KVM_CREATE_VM that
allows to set bits that alter the capabilities of the newly created
virtual machine.
The parameter is passed to kvm_arch_init_vm for all architectures.
The only valid modifier bit for now is KVM_VM_S390_UCONTROL.
This requires CAP_SYS_ADMIN privileges and creates a user controlled
virtual machine on s390 architectures.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:18 +02:00
Linus Torvalds
1361b83a13 i387: Split up <asm/i387.h> into exported and internal interfaces
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained.  Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:54 -08:00
Gleb Natapov
5753785fa9 KVM: do not #GP on perf MSR writes when vPMU is disabled
Return to behaviour perf MSR had before introducing vPMU in case vPMU
is disabled. Some guests access those registers unconditionally and do
not expect it to fail.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:44:46 +02:00
Stephan Bärwolf
bdb42f5afe KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"
In order to be able to proceed checks on CPU-specific properties
within the emulator, function "get_cpuid" is introduced.
With "get_cpuid" it is possible to virtually call the guests
"cpuid"-opcode without changing the VM's context.

[mtosatti: cleanup/beautify code]

Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:43:33 +02:00
Rusty Russell
476bc0015b module_param: make bool parameters really bool (arch)
module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-13 09:32:18 +10:30
Avi Kivity
222d21aa07 KVM: x86 emulator: implement RDPMC (0F 33)
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:43 +02:00
Avi Kivity
022cd0e840 KVM: Add generic RDPMC support
Add a helper function that emulates the RDPMC instruction operation.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:35 +02:00
Gleb Natapov
f5132b0138 KVM: Expose a version 2 architectural PMU to a guests
Use perf_events to emulate an architectural PMU, version 2.

Based on PMU version 1 emulation by Avi Kivity.

[avi: adjust for cpuid.c]
[jan: fix anonymous field initialization for older gcc]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:29 +02:00
Sasha Levin
ff5c2c0316 KVM: Use memdup_user instead of kmalloc/copy_from_user
Switch to using memdup_user when possible. This makes code more
smaller and compact, and prevents errors.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:21 +02:00
Sasha Levin
cdfca7b346 KVM: Use kmemdup() instead of kmalloc/memcpy
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:20 +02:00
Alex,Shi
086c985501 KVM: use this_cpu_xxx replace percpu_xxx funcs
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.

And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
performance since __this_cpu_xxx has no redundant preempt_disable()

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:13 +02:00
Xiao Guangrong
e459e3228d KVM: MMU: move the relevant mmu code to mmu.c
Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:09 +02:00
Xiao Guangrong
9edb17d55f KVM: x86: remove the dead code of KVM_EXIT_HYPERCALL
KVM_EXIT_HYPERCALL is not used anymore, so remove the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:07 +02:00
Avi Kivity
00b27a3efb KVM: Move cpuid code to new file
The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:21:49 +02:00
Xiao Guangrong
28a37544fb KVM: introduce id_to_memslot function
Introduce id_to_memslot to get memslot by slot id

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:39 +02:00
Xiao Guangrong
be593d6286 KVM: introduce update_memslots function
Introduce update_memslots to update slot which will be update to
kvm->memslots

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:35 +02:00
Takuya Yoshikawa
95d4c16ce7 KVM: Optimize dirty logging by rmap_write_protect()
Currently, write protecting a slot needs to walk all the shadow pages
and checks ones which have a pte mapping a page in it.

The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages are less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became some times faster than before when the ratio of
dirty pages are low and was not worse even when the ratio was near the
criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is descripted in the comments.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:20 +02:00
Takuya Yoshikawa
7850ac5420 KVM: Count the number of dirty pages for dirty logging
Needed for the next patch which uses this number to decide how to write
protect a slot.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:19 +02:00
Chris Wright
fb92045843 KVM: MMU: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in
January 2011.  It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden.  It's November 2011, time to
remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:10 +02:00
Xiao Guangrong
f57f2ef58f KVM: MMU: fast prefetch spte on invlpg path
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:56 +02:00
Xiao Guangrong
6f6fbe98c3 KVM: x86: cleanup port-in/port-out emulated
Remove the same code between emulator_pio_in_emulated and
emulator_pio_out_emulated

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:51 +02:00
Xiao Guangrong
1cb3f3ae5a KVM: x86: retry non-page-table writing instructions
If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:50 +02:00
Nadav Har'El
d6185f20a0 KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:43 +02:00
Jan Kiszka
4d25a066b6 KVM: Don't automatically expose the TSC deadline timer in cpuid
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.

This is broken in several ways:
 - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
   deadline timer feature, a guest that uses the feature will break
 - live migration to older host kernels that don't support the TSC deadline
   timer will cause the feature to be pulled from under the guest's feet;
   breaking it
 - guests that are broken wrt the feature will fail.

Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.

Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.

[avi: add the KVM_CAP + documentation]

Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-26 13:27:44 +02:00
Linus Torvalds
0cfdc72439 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
  iommu/core: Remove global iommu_ops and register_iommu
  iommu/msm: Use bus_set_iommu instead of register_iommu
  iommu/omap: Use bus_set_iommu instead of register_iommu
  iommu/vt-d: Use bus_set_iommu instead of register_iommu
  iommu/amd: Use bus_set_iommu instead of register_iommu
  iommu/core: Use bus->iommu_ops in the iommu-api
  iommu/core: Convert iommu_found to iommu_present
  iommu/core: Add bus_type parameter to iommu_domain_alloc
  Driver core: Add iommu_ops to bus_type
  iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
  iommu/amd: Fix wrong shift direction
  iommu/omap: always provide iommu debug code
  iommu/core: let drivers know if an iommu fault handler isn't installed
  iommu/core: export iommu_set_fault_handler()
  iommu/omap: Fix build error with !IOMMU_SUPPORT
  iommu/omap: Migrate to the generic fault report mechanism
  iommu/core: Add fault reporting mechanism
  iommu/core: Use PAGE_SIZE instead of hard-coded value
  iommu/core: use the existing IS_ALIGNED macro
  iommu/msm: ->unmap() should return order of unmapped page
  ...

Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.
2011-10-30 15:46:19 -07:00
Joerg Roedel
a1b60c1cd9 iommu/core: Convert iommu_found to iommu_present
With per-bus iommu_ops the iommu_found function needs to
work on a bus_type too. This patch adds a bus_type parameter
to that function and converts all call-places.
The function is also renamed to iommu_present because the
function now checks if an iommu is present for a given bus
and does not check for a global iommu anymore.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2011-10-21 14:37:20 +02:00
Liu, Jinsong
a3e06bbe84 KVM: emulate lapic tsc deadline timer for guest
This patch emulate lapic tsc deadline timer for guest:
Enumerate tsc deadline timer capability by CPUID;
Enable tsc deadline timer mode by lapic MMIO;
Start tsc deadline timer by WRMSR;

[jan: use do_div()]
[avi: fix for !irqchip_in_kernel()]
[marcelo: another fix for !irqchip_in_kernel()]

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-10-05 15:34:56 +02:00
Avi Kivity
7460fb4a34 KVM: Fix simultaneous NMIs
If simultaneous NMIs happen, we're supposed to queue the second
and next (collapsing them), but currently we sometimes collapse
the second into the first.

Fix by using a counter for pending NMIs instead of a bool; since
the counter limit depends on whether the processor is currently
in an NMI handler, which can only be checked in vcpu context
(via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
of the counter.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:59 +03:00
Nadav Har'El
d5c1785d2f KVM: L1 TSC handling
KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().

[avi: moved callback to kvm_x86_ops tsc block]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Zachary Amsdem <zamsden@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Marcelo Tosatti
742bc67042 KVM: x86: report valid microcode update ID
Windows Server 2008 SP2 checked build with smp > 1 BSOD's during
boot due to lack of microcode update:

*** Assertion failed: The system BIOS on this machine does not properly
support the processor.  The system BIOS did not load any microcode update.
A BIOS containing the latest microcode update is needed for system reliability.
(CurrentUpdateRevision != 0)
***   Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440

Report a non-zero microcode update signature to make it happy.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:01 +03:00
Takuya Yoshikawa
1d2887e2d8 KVM: x86 emulator: Make x86_decode_insn() return proper macros
Return EMULATION_OK/FAILED consistently.  Also treat instruction fetch
errors, not restricted to X86EMUL_UNHANDLEABLE, as EMULATION_FAILED;
although this cannot happen in practice, the current logic will continue
the emulation even if the decoder fails to fetch the instruction.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:01 +03:00
Sasha Levin
743eeb0b01 KVM: Intelligent device lookup on I/O bus
Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
is to call the read or write callback for each device registered
on the bus until we find a device which handles it.

Since the number of devices on a bus can be significant due to ioeventfds
and coalesced MMIO zones, this leads to a lot of overhead on each IO
operation.

Instead of registering devices, we now register ranges which points to
a device. Lookup is done using an efficient bsearch instead of a linear
search.

Performance test was conducted by comparing exit count per second with
200 ioeventfds created on one byte and the guest is trying to access a
different byte continuously (triggering usermode exits).
Before the patch the guest has achieved 259k exits per second, after the
patch the guest does 274k exits per second.

Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Mike Waychison
d1613ad5d0 KVM: Really fix HV_X64_MSR_APIC_ASSIST_PAGE
Commit 0945d4b228 tried to fix the get_msr path for the
HV_X64_MSR_APIC_ASSIST_PAGE msr, but was poorly tested.  We should be
returning 0 if the read succeeded, and passing the value back to the
caller via the pdata out argument, not returning the value directly.

Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:58 +03:00
Mike Waychison
14fa67ee95 KVM: x86: get_msr support for HV_X64_MSR_APIC_ASSIST_PAGE
"get" support for the HV_X64_MSR_APIC_ASSIST_PAGE msr was missing, even
though it is explicitly enumerated as something the vmm should save in
msrs_to_save and reported to userland via the KVM_GET_MSR_INDEX_LIST
ioctl.

Add "get" support for HV_X64_MSR_APIC_ASSIST_PAGE.  We simply return the
guest visible value of this register, which seems to be correct as a set
on the register is validated for us already.

Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:17:58 +03:00
Sasha Levin
8c3ba334f8 KVM: x86: Raise the hard VCPU count limit
The patch raises the hard limit of VCPU count to 254.

This will allow developers to easily work on scalability
and will allow users to test high VCPU setups easily without
patching the kernel.

To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
now returns the recommended VCPU limit (which is still 64) - this
should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
returns the hard limit which is now 254.

Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:17:57 +03:00
Xiao Guangrong
22388a3c8c KVM: x86: cleanup the code of read/write emulation
Using the read/write operation to remove the same code

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:17 +03:00
Xiao Guangrong
77d197b2ca KVM: x86: abstract the operation for read/write emulation
The operations of read emulation and write emulation are very similar, so we
can abstract the operation of them, in larter patch, it is used to cleanup the
same code

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:17 +03:00
Xiao Guangrong
ca7d58f375 KVM: x86: fix broken read emulation spans a page boundary
If the range spans a page boundary, the mmio access can be broke, fix it as
write emulation.

And we already get the guest physical address, so use it to read guest data
directly to avoid walking guest page table again

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:17 +03:00
Xiao Guangrong
4f0226482d KVM: MMU: trace mmio page fault
Add tracepoints to trace mmio page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:41 +03:00
Xiao Guangrong
ce88decffd KVM: MMU: mmio page fault support
The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
is caused again, we can quickly identify it and emulate it directly

Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
can be reduced by this feature, and also avoid walking guest page table for
soft mmu.

[jan: fix operator precedence issue]

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:40 +03:00
Xiao Guangrong
c37079586f KVM: MMU: remove bypass_guest_pf
The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1.  It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:33 +03:00
Xiao Guangrong
bebb106a5a KVM: MMU: cache mmio info on page fault path
If the page fault is caused by mmio, we can cache the mmio info, later, we do
not need to walk guest page table and quickly know it is a mmio fault while we
emulate the mmio instruction

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:26 +03:00
Xiao Guangrong
af7cc7d1ee KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
Introduce vcpu_mmio_gva_to_gpa to translate the gva to gpa, we can use it
to cleanup the code between read emulation and write emulation

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:25 +03:00
Glauber Costa
c9aaa8957f KVM: Steal time implementation
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is a per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time

This patch contains the hypervisor part of the steal time infrasructure,
and can be backported independently of the guest portion.

[avi, yongjie: export delayacct_on, to avoid build failures in some configs]

Signed-off-by: Glauber Costa <glommer@redhat.com>
Tested-by: Eric B Munson <emunson@mgebm.net>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-14 12:59:14 +03:00
Yang, Wei
a01c8f9b4e KVM: Enable ERMS feature support for KVM
This patch exposes ERMS feature to KVM guests.

The REP MOVSB/STOSB instruction can enhance fast strings attempts to
move as much of the data with larger size load/stores as possible.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:25 +03:00
Yang, Wei
176f61da82 KVM: Expose RDWRGSFS bit to KVM guests
This patch exposes RDWRGSFS bit to KVM guests.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:24 +03:00
Yang, Wei
74dc2b4ffe KVM: Add RDWRGSFS support when setting CR4
This patch adds RDWRGSFS support when setting CR4.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:23 +03:00
Yang, Wei Y
4a00efdf0c KVM: Enable DRNG feature support for KVM
This patch exposes DRNG feature to KVM guests.

The RDRAND instruction can provide software with sequences of
random numbers generated from white noise.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:21 +03:00
Andre Przywara
02668b061d KVM: fix XSAVE bit scanning (now properly)
commit 123108f1c1aafd51d6a5c79cc04d7999dd88a930 tried to fix KVMs
XSAVE valid feature scanning, but it was wrong. It was not considering
the sparse nature of this bitfield, instead reading values from
uninitialized members of the entries array.
This patch now separates subleaf indicies from KVM's array indicies
and fills the entry before querying it's value.
This fixes AVX support in KVM guests.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:20 +03:00
Yang, Wei Y
611c120f74 KVM: Mask function7 ebx against host capability word9
This patch masks CPUID leaf 7 ebx against host capability word9.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Shan, Haitao <haitao.shan@intel.com>
Signed-off-by: Li, Xin <xin.li@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:14 +03:00
Yang, Wei Y
c68b734fba KVM: Add SMEP support when setting CR4
This patch adds SMEP handling when setting CR4.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Shan, Haitao <haitao.shan@intel.com>
Signed-off-by: Li, Xin <xin.li@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:13 +03:00
Avi Kivity
9dac77fa40 KVM: x86 emulator: fold decode_cache into x86_emulate_ctxt
This saves a lot of pointless casts x86_emulate_ctxt and decode_cache.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 13:16:09 +03:00
Avi Kivity
36dd9bb5ce KVM: x86 emulator: rename decode_cache::eip to _eip
The name eip conflicts with a field of the same name in x86_emulate_ctxt,
which we plan to fold decode_cache into.

The name _eip is unfortunate, but what's really needed is a refactoring
here, not a better name.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 13:16:09 +03:00
Takuya Yoshikawa
9d74191ab1 KVM: x86 emulator: Use the pointers ctxt and c consistently
We should use the local variables ctxt and c when the emulate_ctxt and
decode appears many times.  At least, we need to be consistent about
how we use these in a function.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 13:15:57 +03:00
Nadav Har'El
6a4d755060 KVM: nVMX: Implement VMPTRST
This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:13 +03:00
Nadav Har'El
27d6c86521 KVM: nVMX: Implement VMCLEAR
This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:12 +03:00
Nadav Har'El
064aea7747 KVM: nVMX: Decoding memory operands of VMX instructions
This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor)

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:11 +03:00
Nadav Har'El
5e1746d620 KVM: nVMX: Allow setting the VMXE bit in CR4
This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function
now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() will throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:10 +03:00
Takuya Yoshikawa
b5c9ff731f KVM: x86 emulator: Avoid clearing the whole decode_cache
During tracing the emulator, we noticed that init_emulate_ctxt()
sometimes took a bit longer time than we expected.

This patch is for mitigating the problem by some degree.

By looking into the function, we soon notice that it clears the whole
decode_cache whose size is about 2.5K bytes now.  Furthermore, most of
the bytes are taken for the two read_cache arrays, which are used only
by a few instructions.

Considering the fact that we are not assuming the cache arrays have
been cleared when we store actual data, we do not need to clear the
arrays: 2K bytes elimination.  In addition, we can avoid clearing the
fetch_cache and regs arrays.

This patch changes the initialization not to clear the arrays.

On our 64-bit host, init_emulate_ctxt() becomes 0.3 to 0.5us faster with
this patch applied.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 11:45:09 +03:00
Takuya Yoshikawa
adf52235b4 KVM: x86 emulator: Clean up init_emulate_ctxt()
Use a local pointer to the emulate_ctxt for simplicity.  Then, arrange
the hard-to-read mode selection lines neatly.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 11:45:08 +03:00
Jan Kiszka
d780592b99 KVM: Clean up error handling during VCPU creation
So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if
it fails. Move this confusing resonsibility back into the hands of
kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected,
all other archs cannot fail.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 11:45:08 +03:00
Avi Kivity
24c82e576b KVM: Sanitize cpuid
Instead of blacklisting known-unsupported cpuid leaves, whitelist known-
supported leaves.  This is more conservative and prevents us from reporting
features we don't support.  Also whitelist a few more leaves while at it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:07 +03:00
Xiao Guangrong
8b0cedff04 KVM: use __copy_to_user/__clear_user to write guest page
Simply use __copy_to_user/__clear_user to write guest page since we have
already verified the user address when the memslot is set

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:03 +03:00
Takuya Yoshikawa
7b105ca290 KVM: x86 emulator: Stop passing ctxt->ops as arg of emul functions
Dereference it in the actual users.

This not only cleans up the emulator but also makes it easy to convert
the old emulation functions to the new em_xxx() form later.

Note: Remove some inline keywords to let the compiler decide inlining.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:44:59 +03:00
Avi Kivity
1aa366163b KVM: x86 emulator: consolidate segment accessors
Instead of separate accessors for the segment selector and cached descriptor,
use one accessor for both.  This simplifies the code somewhat.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:47:39 -04:00
BrillyWu@viatech.com.cn
4429d5dc11 KVM: Add CPUID support for VIA CPU
The CPUIDs for Centaur are added, and then the features of
PadLock hardware engine on VIA CPU, such as "ace", "ace_en"
and so on, can be passed into the kvm guest.

Signed-off-by: Brilly Wu <brillywu@viatech.com.cn>
Signed-off-by: Kary Jin <karyjin@viatech.com.cn>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:55 -04:00
Gleb Natapov
2aab2c5b2b KVM: call cache_all_regs() only once during instruction emulation
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:54 -04:00
Gleb Natapov
0004c7c257 KVM: Fix compound mmio
mmio_index should be taken into account when copying data from
userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:52 -04:00
Gleb Natapov
8d7d810255 KVM: mmio_fault_cr2 is not used
Remove unused variable mmio_fault_cr2.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:49 -04:00
Avi Kivity
13db70eca6 KVM: x86 emulator: drop x86_emulate_ctxt::vcpu
No longer used.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:24 -04:00
Avi Kivity
5197b808a7 KVM: Avoid using x86_emulate_ctxt.vcpu
We can use container_of() instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:22 -04:00
Avi Kivity
bcaf5cc543 KVM: x86 emulator: add new ->wbinvd() callback
Instead of calling kvm_emulate_wbinvd() directly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:20 -04:00
Avi Kivity
d6aa10003b KVM: x86 emulator: add ->fix_hypercall() callback
Artificial, but needed to remove direct calls to KVM.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:18 -04:00
Avi Kivity
6c3287f7c5 KVM: x86 emulator: add new ->halt() callback
Instead of reaching into vcpu internals.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:17 -04:00
Avi Kivity
3cb16fe78c KVM: x86 emulator: make emulate_invlpg() an emulator callback
Removing direct calls to KVM.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:15 -04:00
Avi Kivity
2d04a05bd7 KVM: x86 emulator: emulate CLTS internally
Avoid using ctxt->vcpu; we can do everything with ->get_cr() and ->set_cr().

A side effect is that we no longer activate the fpu on emulated CLTS; but that
should be very rare.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:14 -04:00
Avi Kivity
1ac9d0cfb0 KVM: x86 emulator: add and use new callbacks set_idt(), set_gdt()
Replacing direct calls to realmode_lgdt(), realmode_lidt().

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:08 -04:00
Avi Kivity
2953538ebb KVM: x86 emulator: drop vcpu argument from intercept callback
Making the emulator caller agnostic.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:05 -04:00
Avi Kivity
717746e382 KVM: x86 emulator: drop vcpu argument from cr/dr/cpl/msr callbacks
Making the emulator caller agnostic.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:03 -04:00
Avi Kivity
4bff1e86ad KVM: x86 emulator: drop vcpu argument from segment/gdt/idt callbacks
Making the emulator caller agnostic.

[Takuya Yoshikawa: fix typo leading to LDT failures]

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:35:20 -04:00
Avi Kivity
ca1d4a9e77 KVM: x86 emulator: drop vcpu argument from pio callbacks
Making the emulator caller agnostic.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:11 -04:00
Avi Kivity
0f65dd70a4 KVM: x86 emulator: drop vcpu argument from memory read/write callbacks
Making the emulator caller agnostic.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:10 -04:00
Joerg Roedel
7c4c0f4fd5 KVM: X86: Update last_guest_tsc in vcpu_put
The last_guest_tsc is used in vcpu_load to adjust the
tsc_offset since tsc-scaling is merged. So the
last_guest_tsc needs to be updated in vcpu_put instead of
the the last_host_tsc. This is fixed with this patch.

Reported-by: Jan Kiszka <jan.kiszka@web.de>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:10 -04:00
Serge E. Hallyn
71f9833bb1 KVM: fix push of wrong eip when doing softint
When doing a soft int, we need to bump eip before pushing it to
the stack.  Otherwise we'll do the int a second time.

[apw@canonical.com: merged eip update as per Jan's recommendation.]
Signed-off-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:09 -04:00
Gleb Natapov
7ae441eac5 KVM: emulator: do not needlesly sync registers from emulator ctxt to vcpu
Currently we sync registers back and forth before/after exiting
to userspace for IO, but during IO device model shouldn't need to
read/write the registers, so we can as well skip those sync points. The
only exaception is broken vmware backdor interface. The new code sync
registers content during IO only if registers are read from/written to
by userspace in the middle of the IO operation and this almost never
happens in practise.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-05-11 07:57:08 -04:00
Joerg Roedel
92a1f12d25 KVM: X86: Implement userspace interface to set virtual_tsc_khz
This patch implements two new vm-ioctls to get and set the
virtual_tsc_khz if the machine supports tsc-scaling. Setting
the tsc-frequency is only possible before userspace creates
any vcpu.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:06 -04:00
Joerg Roedel
857e40999e KVM: X86: Delegate tsc-offset calculation to architecture code
With TSC scaling in SVM the tsc-offset needs to be
calculated differently. This patch propagates this
calculation into the architecture specific modules so that
this complexity can be handled there.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:05 -04:00
Joerg Roedel
8f6055cbaf KVM: X86: Make tsc_delta calculation a function of guest tsc
The calculation of the tsc_delta value to ensure a
forward-going tsc for the guest is a function of the
host-tsc. This works as long as the guests tsc_khz is equal
to the hosts tsc_khz. With tsc-scaling hardware support this
is not longer true and the tsc_delta needs to be calculated
using guest_tsc values.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:04 -04:00
Joerg Roedel
1e993611d0 KVM: X86: Let kvm-clock report the right tsc frequency
This patch changes the kvm_guest_time_update function to use
TSC frequency the guest actually has for updating its clock.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:04 -04:00
Joerg Roedel
cfec82cb7d KVM: SVM: Add intercept check for emulated cr accesses
This patch adds all necessary intercept checks for
instructions that access the crX registers.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:01 -04:00
Joerg Roedel
8a76d7f25f KVM: x86: Add x86 callback for intercept check
This patch adds a callback into kvm_x86_ops so that svm and
vmx code can do intercept checks on emulated instructions.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:01 -04:00
Joerg Roedel
775fde8648 KVM: x86 emulator: Don't write-back cpu-state on X86EMUL_INTERCEPTED
This patch prevents the changed CPU state to be written back
when the emulator detected that the instruction was
intercepted by the guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:00 -04:00
Avi Kivity
c4f035c60d KVM: x86 emulator: add framework for instruction intercepts
When running in guest mode, certain instructions can be intercepted by
hardware.  This also holds for nested guests running on emulated
virtualization hardware, in particular instructions emulated by kvm
itself.

This patch adds a framework for intercepting instructions.  If an
instruction is marked for interception, and if we're running in guest
mode, a callback is called to check whether an intercept is needed or
not.  The callback is called at three points in time: immediately after
beginning execution, after checking privilge exceptions, and after
checking memory exception.  This suits the different interception points
defined for different instructions and for the various virtualization
instruction sets.

In addition, a new X86EMUL_INTERCEPT is defined, which any callback or
memory access may define, allowing the more complicated intercepts to be
implemented in existing callbacks.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:00 -04:00
Avi Kivity
5037f6f324 KVM: x86 emulator: define callbacks for using the guest fpu within the emulator
Needed for emulating fpu instructions.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:58 -04:00
Avi Kivity
cef4dea07f KVM: 16-byte mmio support
Since sse instructions can issue 16-byte mmios, we need to support them.  We
can't increase the kvm_run mmio buffer size to 16 bytes without breaking
compatibility, so instead we break the large mmios into two smaller 8-byte
ones.  Since the bus is 64-bit we aren't breaking any atomicity guarantees.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:58 -04:00
Avi Kivity
5287f194bf KVM: Split mmio completion into a function
Make room for sse mmio completions.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:58 -04:00
Avi Kivity
70252a1053 KVM: extend in-kernel mmio to handle >8 byte transactions
Needed for coalesced mmio using sse.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:58 -04:00
Gleb Natapov
1499e54af0 KVM: x86: better fix for race between nmi injection and enabling nmi window
Fix race between nmi injection and enabling nmi window in a simpler way.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-05-11 07:56:57 -04:00
Marcelo Tosatti
c761e5868e Revert "KVM: Fix race between nmi injection and enabling nmi window"
This reverts commit f86368493e.

Simpler fix to follow.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-05-11 07:56:57 -04:00
Glauber Costa
3291892450 KVM: expose async pf through our standard mechanism
As Avi recently mentioned, the new standard mechanism for exposing features
is KVM_GET_SUPPORTED_CPUID, not spamming CAPs. For some reason async pf
missed that.

So expose async_pf here.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Gleb Natapov <gleb@redhat.com>
CC: Avi Kivity <avi@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:57 -04:00
Avi Kivity
f6e7847589 KVM: Use kvm_get_rflags() and kvm_set_rflags() instead of the raw versions
Some rflags bits are owned by the host, not guest, so we need to use
kvm_get_rflags() to strip those bits away or kvm_set_rflags() to add them
back.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:56:54 -04:00
Andre Przywara
bd22f5cfcf KVM: move and fix substitue search for missing CPUID entries
If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
it's failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Beside that this rule only applies to CPUID intercepts, which is not
the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and call it from kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-04-06 13:15:56 +03:00
Andre Przywara
20800bc940 KVM: fix XSAVE bit scanning
When KVM scans the 0xD CPUID leaf for propagating the XSAVE save area
leaves, it assumes that the leaves are contigious and stops at the
first zero one. On AMD hardware there is a gap, though, as LWP uses
leaf 62 to announce it's state save area.
So lets iterate through all 64 possible leaves and simply skip zero
ones to also cover later features.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-04-06 13:15:55 +03:00
Linus Torvalds
f2e1fbb5f2 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Flush TLB if PGD entry is changed in i386 PAE mode
  x86, dumpstack: Correct stack dump info when frame pointer is available
  x86: Clean up csum-copy_64.S a bit
  x86: Fix common misspellings
  x86: Fix misspelling and align params
  x86: Use PentiumPro-optimized partial_csum() on VIA C7
2011-03-18 10:45:21 -07:00
Lucas De Marchi
0d2eb44f63 x86: Fix common misspellings
They were generated by 'codespell' and then manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-18 10:39:30 +01:00
Nikola Ciprich
1aa8ceef03 KVM: fix kvmclock regression due to missing clock update
commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
to proper place.

Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:33 -03:00
Gleb Natapov
5601d05b8c KVM: emulator: Fix io permission checking for 64bit guest
Current implementation truncates upper 32bit of TR base address during IO
permission bitmap check. The patch fixes this.

Reported-and-tested-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:33 -03:00
Xiao Guangrong
48c0e4e906 KVM: MMU: move mmu pages calculated out of mmu lock
kvm_mmu_calculate_mmu_pages need to walk all memslots and it's protected by
kvm->slots_lock, so move it out of mmu spinlock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:32 -03:00
Lai Jiangshan
1260edbe7d KVM: better readability of efer_reserved_bits
use EFER_SCE, EFER_LME and EFER_LMA instead of magic numbers.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:31 -03:00
Lai Jiangshan
d170c41906 KVM: Clear async page fault hash after switching to real mode
The hash array of async gfns may still contain some left gfns after
kvm_clear_async_pf_completion_queue() called, need to clear them.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:31 -03:00
Jan Kiszka
038f8c110e KVM: x86: Convert tsc_write_lock to raw_spinlock
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:30 -03:00
Gleb Natapov
7049467b53 KVM: remove isr_ack logic from PIC
isr_ack logic was added by e48258009d to avoid unnecessary IPIs. Back
then it made sense, but now the code checks that vcpu is ready to accept
interrupt before sending IPI, so this logic is no longer needed. The
patch removes it.

Fixes a regression with Debian/Hurd.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reported-and-tested-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:30 -03:00
Jan Kiszka
e935b8372c KVM: Convert kvm_lock to raw_spinlock
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:30 -03:00
Avi Kivity
f86368493e KVM: Fix race between nmi injection and enabling nmi window
The interrupt injection logic looks something like

  if an nmi is pending, and nmi injection allowed
    inject nmi
  if an nmi is pending
    request exit on nmi window

the problem is that "nmi is pending" can be set asynchronously by
the PIT; if it happens to fire between the two if statements, we
will request an nmi window even though nmi injection is allowed.  On
SVM, this has disasterous results, since it causes eflags.TF to be
set in random guest code.

The fix is simple; make nmi_pending synchronous using the standard
vcpu->requests mechanism; this ensures the code above is completely
synchronous wrt nmi_pending.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:30 -03:00
Avi Kivity
4005996e42 KVM: Drop ad-hoc vendor specific instruction restriction
Use the new support in the emulator, and drop the ad-hoc code in x86.c.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:28 -03:00
Avi Kivity
3e90943907 KVM: Drop bogus x86_decode_insn() error check
x86_decode_insn() doesn't return X86EMUL_* values, so the check
for X86EMUL_PROPOGATE_FAULT will always fail.  There is a proper
check later on, so there is no need for a replacement for this
code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:28 -03:00
Glauber Costa
12f9a48f7b KVM: x86: release kvmclock page on reset
When a vcpu is reset, kvmclock page keeps being written to this days.
This is wrong and inconsistent: a cpu reset should take it to its
initial state.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:28 -03:00
john cooper
91c9c3eda4 KVM: x86: handle guest access to BBL_CR_CTL3 MSR
A correction to Intel cpu model CPUID data (patch queued)
caused winxp to BSOD when booted with a Penryn model.
This was traced to the CPUID "model" field correction from
6 -> 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp appears to be setting bit #19
called out as "reserved" in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.

Signed-off-by: john cooper <john.cooper@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:27 -03:00
Xiao Guangrong
6b7e2d0991 KVM: Add "exiting guest mode" state
Currently we keep track of only two states: guest mode and host
mode.  This patch adds an "exiting guest mode" state that tells
us that an IPI will happen soon, so unless we need to wait for the
IPI, we can avoid it completely.

Also
1: No need atomically to read/write ->mode in vcpu's thread

2: reorganize struct kvm_vcpu to make ->mode and ->requests
   in the same cache line explicitly

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:26 -03:00
Jan Kiszka
9ca5231831 KVM: x86: Remove user space triggerable MCE error message
This case is a pure user space error we do not need to record. Moreover,
it can be misused to flood the kernel log. Remove it.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:26 -03:00
Xiao Guangrong
63f42e023e KVM: fix rcu usage warning in kvm_arch_vcpu_ioctl_set_sregs()
Fix:

[ 1001.499596] ===================================================
[ 1001.499599] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 1001.499601] ---------------------------------------------------
[ 1001.499604] include/linux/kvm_host.h:301 invoked rcu_dereference_check() without protection!
	......
[ 1001.499636] Pid: 6035, comm: qemu-system-x86 Not tainted 2.6.37-rc6+ #62
[ 1001.499638] Call Trace:
[ 1001.499644]  [] lockdep_rcu_dereference+0x9d/0xa5
[ 1001.499653]  [] gfn_to_memslot+0x8d/0xc8 [kvm]
[ 1001.499661]  [] gfn_to_hva+0x16/0x3f [kvm]
[ 1001.499669]  [] kvm_read_guest_page+0x1e/0x5e [kvm]
[ 1001.499681]  [] kvm_read_guest_page_mmu+0x53/0x5e [kvm]
[ 1001.499699]  [] load_pdptrs+0x3f/0x9c [kvm]
[ 1001.499705]  [] ? vmx_set_cr0+0x507/0x517 [kvm_intel]
[ 1001.499717]  [] kvm_arch_vcpu_ioctl_set_sregs+0x1f3/0x3c0 [kvm]
[ 1001.499727]  [] kvm_vcpu_ioctl+0x6a5/0xbc5 [kvm]

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:26 -03:00
Linus Torvalds
55065bc527 Merge branch 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
  KVM: Initialize fpu state in preemptible context
  KVM: VMX: when entering real mode align segment base to 16 bytes
  KVM: MMU: handle 'map_writable' in set_spte() function
  KVM: MMU: audit: allow audit more guests at the same time
  KVM: Fetch guest cr3 from hardware on demand
  KVM: Replace reads of vcpu->arch.cr3 by an accessor
  KVM: MMU: only write protect mappings at pagetable level
  KVM: VMX: Correct asm constraint in vmcs_load()/vmcs_clear()
  KVM: MMU: Initialize base_role for tdp mmus
  KVM: VMX: Optimize atomic EFER load
  KVM: VMX: Add definitions for more vm entry/exit control bits
  KVM: SVM: copy instruction bytes from VMCB
  KVM: SVM: implement enhanced INVLPG intercept
  KVM: SVM: enhance mov DR intercept handler
  KVM: SVM: enhance MOV CR intercept handler
  KVM: SVM: add new SVM feature bit names
  KVM: cleanup emulate_instruction
  KVM: move complete_insn_gp() into x86.c
  KVM: x86: fix CR8 handling
  KVM guest: Fix kvm clock initialization when it's configured out
  ...
2011-01-13 10:14:24 -08:00
Avi Kivity
e5c3014282 KVM: Initialize fpu state in preemptible context
init_fpu() (which is indirectly called by the fpu switching code) assumes
it is in process context.  Rather than makeing init_fpu() use an atomic
allocation, which can cause a task to be killed, make sure the fpu is
already initialized when we enter the run loop.

KVM-Stable-Tag.
Reported-and-tested-by: Kirill A. Shutemov <kas@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 12:02:26 +02:00
Avi Kivity
aff48baa34 KVM: Fetch guest cr3 from hardware on demand
Instead of syncing the guest cr3 every exit, which is expensince on vmx
with ept enabled, sync it only on demand.

[sheng: fix incorrect cr3 seen by Windows XP]

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:16 +02:00
Avi Kivity
9f8fe5043f KVM: Replace reads of vcpu->arch.cr3 by an accessor
This allows us to keep cr3 in the VMCS, later on.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:15 +02:00
Andre Przywara
dc25e89e07 KVM: SVM: copy instruction bytes from VMCB
In case of a nested page fault or an intercepted #PF newer SVM
implementations provide a copy of the faulting instruction bytes
in the VMCB.
Use these bytes to feed the instruction emulator and avoid the costly
guest instruction fetch in this case.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:07 +02:00
Andre Przywara
51d8b66199 KVM: cleanup emulate_instruction
emulate_instruction had many callers, but only one used all
parameters. One parameter was unused, another one is now
hidden by a wrapper function (required for a future addition
anyway), so most callers use now a shorter parameter list.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:00 +02:00
Andre Przywara
db8fcefaa7 KVM: move complete_insn_gp() into x86.c
move the complete_insn_gp() helper function out of the VMX part
into the generic x86 part to make it usable by SVM.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:59 +02:00
Andre Przywara
eea1cff9ab KVM: x86: fix CR8 handling
The handling of CR8 writes in KVM is currently somewhat cumbersome.
This patch makes it look like the other CR register handlers
and fixes a possible issue in VMX, where the RIP would be incremented
despite an injected #GP.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:58 +02:00
Takuya Yoshikawa
175504cdbf KVM: Take missing slots_lock for kvm_io_bus_unregister_dev()
In KVM_CREATE_IRQCHIP, kvm_io_bus_unregister_dev() is called without taking
slots_lock in the error handling path.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:55 +02:00
Lai Jiangshan
a355c85c5f KVM: return true when user space query KVM_CAP_USER_NMI extension
userspace may check this extension in runtime.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:54 +02:00
Avi Kivity
61cfab2e83 KVM: Correct kvm_pio tracepoint count field
Currently, we record '1' for count regardless of the real count.  Fix.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:52 +02:00
Xiao Guangrong
fb67e14fc9 KVM: MMU: retry #PF for softmmu
Retry #PF for softmmu only when the current vcpu has the same cr3 as the time
when #PF occurs

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:41 +02:00
Joerg Roedel
fc3a9157d3 KVM: X86: Don't report L2 emulation failures to user-space
This patch prevents that emulation failures which result
from emulating an instruction for an L2-Guest results in
being reported to userspace.
Without this patch a malicious L2-Guest would be able to
kill the L1 by triggering a race-condition between an vmexit
and the instruction emulator.
With this patch the L2 will most likely only kill itself in
this situation.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:07 +02:00
Avi Kivity
6389ee9463 KVM: Pull extra page fault information into struct x86_exception
Currently page fault cr2 and nesting infomation are carried outside
the fault data structure.  Instead they are placed in the vcpu struct,
which results in confusion as global variables are manipulated instead
of passing parameters.

Fix this issue by adding address and nested fields to struct x86_exception,
so this struct can carry all information associated with a fault.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Tested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:02 +02:00
Avi Kivity
ab9ae31387 KVM: Push struct x86_exception info the various gva_to_gpa variants
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:59 +02:00
Avi Kivity
bcc55cba9f KVM: x86 emulator: make emulator memory callbacks return full exception
This way, they can return #GP, not just #PF.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:54 +02:00
Avi Kivity
da9cb575b1 KVM: x86 emulator: introduce struct x86_exception to communicate faults
Introduce a structure that can contain an exception to be passed back
to main kvm code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:53 +02:00
Avi Kivity
945ee35e07 KVM: Mask KVM_GET_SUPPORTED_CPUID data with Linux cpuid info
This allows Linux to mask cpuid bits if, for example, nx is enabled on only
some cpus.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:17 +02:00
Xiao Guangrong
c4806acdce KVM: MMU: fix apf prefault if nested guest is enabled
If apf is generated in L2 guest and is completed in L1 guest, it will
prefault this apf in L1 guest's mmu context.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:29:14 +02:00
Xiao Guangrong
e5f3f02796 KVM: MMU: clear apfs if page state is changed
If CR0.PG is changed, the page fault cann't be avoid when the prefault address
is accessed later

And it also fix a bug: it can retry a page enabled #PF in page disabled context
if mmu is shadow page

This idear is from Gleb Natapov

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:29:12 +02:00
Jan Kiszka
d89f5eff70 KVM: Clean up vm creation and release
IA64 support forces us to abstract the allocation of the kvm structure.
But instead of mixing this up with arch-specific initialization and
doing the same on destruction, split both steps. This allows to move
generic destruction calls into generic code.

It also fixes error clean-up on failures of kvm_create_vm for IA64.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:29:09 +02:00
Xiao Guangrong
e6d53e3b0d KVM: avoid unnecessary wait for a async pf
In current code, it checks async pf completion out of the wait context,
like this:

if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
		    !vcpu->arch.apf.halted)
			r = vcpu_enter_guest(vcpu);
		else {
			......
			kvm_vcpu_block(vcpu)
			 ^- waiting until 'async_pf.done' is not empty
}

kvm_check_async_pf_completion(vcpu)
 ^- delete list from async_pf.done

So, if we check aysnc pf completion first, it can be blocked at
kvm_vcpu_block

Fixed by mark the vcpu is unhalted in kvm_check_async_pf_completion()
path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:00 +02:00
Xiao Guangrong
c7d28c2404 KVM: fix searching async gfn in kvm_async_pf_gfn_slot
Don't search later slots if the slot is empty

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:28:59 +02:00
Jan Kiszka
2eec734374 KVM: x86: Avoid issuing wbinvd twice
Micro optimization to avoid calling wbinvd twice on the CPU that has to
emulate it. As we might be preempted between smp_call_function_many and
the local wbinvd, the cache might be filled again so that real work
could be done uselessly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:28:52 +02:00
Takuya Yoshikawa
515a01279a KVM: pre-allocate one more dirty bitmap to avoid vmalloc()
Currently x86's kvm_vm_ioctl_get_dirty_log() needs to allocate a bitmap by
vmalloc() which will be used in the next logging and this has been causing
bad effect to VGA and live-migration: vmalloc() consumes extra systime,
triggers tlb flush, etc.

This patch resolves this issue by pre-allocating one more bitmap and switching
between two bitmaps during dirty logging.

Performance improvement:
  I measured performance for the case of VGA update by trace-cmd.
  The result was 1.5 times faster than the original one.

  In the case of live migration, the improvement ratio depends on the workload
  and the guest memory size. In general, the larger the memory size is the more
  benefits we get.

Note:
  This does not change other architectures's logic but the allocation size
  becomes twice. This will increase the actual memory consumption only when
  the new size changes the number of pages allocated by vmalloc().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:28:46 +02:00
Marcelo Tosatti
982c25658c KVM: MMU: remove kvm_mmu_set_base_ptes
Unused.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:23:38 +02:00
Gleb Natapov
fc5f06fac6 KVM: Send async PF when guest is not in userspace too.
If guest indicates that it can handle async pf in kernel mode too send
it, but only if interrupts are enabled.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:22 +02:00
Gleb Natapov
6adba52742 KVM: Let host know whether the guest can handle async PF in non-userspace context.
If guest can detect that it runs in non-preemptable context it can
handle async PFs at any time, so let host know that it can send async
PF even if guest cpu is not in userspace.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:21 +02:00
Gleb Natapov
7c90705bf2 KVM: Inject asynchronous page fault into a PV guest if page is swapped out.
Send async page fault to a PV guest if it accesses swapped out memory.
Guest will choose another task to run upon receiving the fault.

Allow async page fault injection only when guest is in user mode since
otherwise guest may be in non-sleepable context and will not be able
to reschedule.

Vcpu will be halted if guest will fault on the same page again or if
vcpu executes kernel code.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:17 +02:00
Gleb Natapov
344d9588a9 KVM: Add PV MSR to enable asynchronous page faults delivery.
Guest enables async PF vcpu functionality using this MSR.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:12 +02:00
Gleb Natapov
49c7754ce5 KVM: Add memory slot versioning and use it to provide fast guest write interface
Keep track of memslots changes by keeping generation number in memslots
structure. Provide kvm_write_guest_cached() function that skips
gfn_to_hva() translation if memslots was not changed since previous
invocation.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:08 +02:00
Gleb Natapov
56028d0861 KVM: Retry fault before vmentry
When page is swapped in it is mapped into guest memory only after guest
tries to access it again and generate another fault. To save this fault
we can map it immediately since we know that guest is going to access
the page. Do it only when tdp is enabled for now. Shadow paging case is
more complicated. CR[034] and EFER registers should be switched before
doing mapping and then switched back.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:23:06 +02:00
Gleb Natapov
af585b921e KVM: Halt vcpu if page it tries to access is swapped out
If a guest accesses swapped out memory do not swap it in from vcpu thread
context. Schedule work to do swapping and put vcpu into halted state
instead.

Interrupts will still be delivered to the guest and if interrupt will
cause reschedule guest will continue to run another task.

[avi: remove call to get_user_pages_noio(), nacked by Linus; this
      makes everything synchrnous again]

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:21:39 +02:00
Linus Torvalds
72eb6a7914 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
  gameport: use this_cpu_read instead of lookup
  x86: udelay: Use this_cpu_read to avoid address calculation
  x86: Use this_cpu_inc_return for nmi counter
  x86: Replace uses of current_cpu_data with this_cpu ops
  x86: Use this_cpu_ops to optimize code
  vmstat: User per cpu atomics to avoid interrupt disable / enable
  irq_work: Use per cpu atomics instead of regular atomics
  cpuops: Use cmpxchg for xchg to avoid lock semantics
  x86: this_cpu_cmpxchg and this_cpu_xchg operations
  percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
  percpu,x86: relocate this_cpu_add_return() and friends
  connector: Use this_cpu operations
  xen: Use this_cpu_inc_return
  taskstats: Use this_cpu_ops
  random: Use this_cpu_inc_return
  fs: Use this_cpu_inc_return in buffer.c
  highmem: Use this_cpu_xx_return() operations
  vmstat: Use this_cpu_inc_return for vm statistics
  x86: Support for this_cpu_add, sub, dec, inc_return
  percpu: Generic support for this_cpu_add, sub, dec, inc_return
  ...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
2011-01-07 17:02:58 -08:00
Avi Kivity
010c520e20 KVM: Don't reset mmu context unnecessarily when updating EFER
The only bit of EFER that affects the mmu is NX, and this is already
accounted for (LME only takes effect when changing cr0).

Based on a patch by Hillf Danton.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-02 12:05:15 +02:00
Tejun Heo
0a3aee0da4 x86: Use this_cpu_ops to optimize code
Go through x86 code and replace __get_cpu_var and get_cpu_var
instances that refer to a scalar and are not used for address
determinations.

Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-30 12:20:28 +01:00
Avi Kivity
3e26f23091 KVM: Fix preemption counter leak in kvm_timer_init()
Based on a patch from Thomas Meyer.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-12-16 12:39:31 +02:00
Joerg Roedel
24d1b15f72 KVM: SVM: Do not report xsave in supported cpuid
To support xsave properly for the guest the SVM module need
software support for it. As long as this is not present do
not report the xsave as supported feature in cpuid.
As a side-effect this patch moves the bit() helper function
into the x86.h file so that it can be used in svm.c too.

KVM-Stable-Tag.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-12-08 17:28:37 +02:00
Sheng Yang
3ea3aa8cf6 KVM: Fix OSXSAVE after migration
CPUID's OSXSAVE is a mirror of CR4.OSXSAVE bit. We need to update the CPUID
after migration.

KVM-Stable-Tag.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-12-08 17:28:31 +02:00
Jan Kiszka
453d9c57e2 KVM: x86: Issue smp_call_function_many with preemption disabled
smp_call_function_many is specified to be called only with preemption
disabled. Fulfill this requirement.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-11-05 14:42:27 -02:00
Vasiliy Kulikov
97e69aa62f KVM: x86: fix information leak to userland
Structures kvm_vcpu_events, kvm_debugregs, kvm_pit_state2 and
kvm_clock_data are copied to userland with some padding and reserved
fields unitialized.  It leads to leaking of contents of kernel stack
memory.  We have to initialize them to zero.

In patch v1 Jan Kiszka suggested to fill reserved fields with zeros
instead of memset'ting the whole struct.  It makes sense as these
fields are explicitly marked as padding.  No more fields need zeroing.

KVM-Stable-Tag.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-11-05 14:42:27 -02:00
Michael S. Tsirkin
edde99ce05 KVM: Write protect memory after slot swap
I have observed the following bug trigger:

1. userspace calls GET_DIRTY_LOG
2. kvm_mmu_slot_remove_write_access is called and makes a page ro
3. page fault happens and makes the page writeable
   fault is logged in the bitmap appropriately
4. kvm_vm_ioctl_get_dirty_log swaps slot pointers

a lot of time passes

5. guest writes into the page
6. userspace calls GET_DIRTY_LOG

At point (5), bitmap is clean and page is writeable,
thus, guest modification of memory is not logged
and GET_DIRTY_LOG returns an empty bitmap.

The rule is that all pages are either dirty in the current bitmap,
or write-protected, which is violated here.

It seems that just moving kvm_mmu_slot_remove_write_access down
to after the slot pointer swap should fix this bug.

KVM-Stable-Tag.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-11-05 14:42:25 -02:00
Linus Torvalds
1765a1fe5d Merge branch 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits)
  KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages
  KVM: Fix signature of kvm_iommu_map_pages stub
  KVM: MCE: Send SRAR SIGBUS directly
  KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
  KVM: fix typo in copyright notice
  KVM: Disable interrupts around get_kernel_ns()
  KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address
  KVM: MMU: move access code parsing to FNAME(walk_addr) function
  KVM: MMU: audit: check whether have unsync sps after root sync
  KVM: MMU: audit: introduce audit_printk to cleanup audit code
  KVM: MMU: audit: unregister audit tracepoints before module unloaded
  KVM: MMU: audit: fix vcpu's spte walking
  KVM: MMU: set access bit for direct mapping
  KVM: MMU: cleanup for error mask set while walk guest page table
  KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
  KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
  KVM: x86: Fix constant type in kvm_get_time_scale
  KVM: VMX: Add AX to list of registers clobbered by guest switch
  KVM guest: Move a printk that's using the clock before it's ready
  KVM: x86: TSC catchup mode
  ...
2010-10-24 12:47:25 -07:00
Huang Ying
5854dbca9b KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
Now we have MCG_SER_P (and corresponding SRAO/SRAR MCE) support in
kernel and QEMU-KVM, the MCG_SER_P should be added into
KVM_MCE_CAP_SUPPORTED to make all these code really works.

Reported-by: Dean Nelson <dnelson@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:15 +02:00
Nicolas Kaiser
9611c18777 KVM: fix typo in copyright notice
Fix typo in copyright notice.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:14 +02:00
Avi Kivity
395c6b0a9d KVM: Disable interrupts around get_kernel_ns()
get_kernel_ns() wants preemption disabled.  It doesn't make a lot of sense
during the get/set ioctls (no way to make them non-racy) but the callee wants
it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:53:14 +02:00
Jan Kiszka
50933623e5 KVM: x86: Fix constant type in kvm_get_time_scale
Older gcc versions complain about the improper type (for x86-32), 4.5
seems to fix this silently. However, we should better use the right type
initially.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:53:08 +02:00
Zachary Amsden
c285545f81 KVM: x86: TSC catchup mode
Negate the effects of AN TYM spell while kvm thread is preempted by tracking
conversion factor to the highest TSC rate and catching the TSC up when it has
fallen behind the kernel view of time.  Note that once triggered, we don't
turn off catchup mode.

A slightly more clever version of this is possible, which only does catchup
when TSC rate drops, and which specifically targets only CPUs with broken
TSC, but since these all are considered unstable_tsc(), this patch covers
all necessary cases.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:05 +02:00
Zachary Amsden
34c238a1d1 KVM: x86: Rename timer function
This just changes some names to better reflect the usage they
will be given.  Separated out to keep confusion to a minimum.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:05 +02:00
Zachary Amsden
5f4e3f8827 KVM: x86: Make math work for other scales
The math in kvm_get_time_scale relies on the fact that
NSEC_PER_SEC < 2^32.  To use the same function to compute
arbitrary time scales, we must extend the first reduction
step to shrink the base rate to a 32-bit value, and
possibly reduce the scaled rate into a 32-bit as well.

Note we must take care to avoid an arithmetic overflow
when scaling up the tps32 value (this could not happen
with the fixed scaled value of NSEC_PER_SEC, but can
happen with scaled rates above 2^31.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:04 +02:00
Mohammed Gamal
63995653ad KVM: Add kvm_inject_realmode_interrupt() wrapper
This adds a wrapper function kvm_inject_realmode_interrupt() around the
emulator function emulate_int_real() to allow real mode interrupt injection.

[avi: initialize operand and address sizes before emulating interrupts]
[avi: initialize rip for real mode interrupt injection]
[avi: clear interrupt pending flag after emulating interrupt injection]

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:53:01 +02:00
Avi Kivity
f4f5105087 KVM: Convert PIC lock from raw spinlock to ordinary spinlock
The PIC code used to be called from preempt_disable() context, which
wasn't very good for PREEMPT_RT.  That is no longer the case, so move
back from raw_spinlock_t to spinlock_t.

Signed-off-by: Avi Kivity <avi@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:52:56 +02:00
Zachary Amsden
28e4639adf KVM: x86: Fix kvmclock bug
If preempted after kvmclock values are updated, but before hardware
virtualization is entered, the last tsc time as read by the guest is
never set.  It underflows the next time kvmclock is updated if there
has not yet been a successful entry / exit into hardware virt.

Fix this by simply setting last_tsc to the newly read tsc value so
that any computed nsec advance of kvmclock is nulled.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:52:56 +02:00
Joerg Roedel
0959ffacf3 KVM: MMU: Don't track nested fault info in error-code
This patch moves the detection whether a page-fault was
nested or not out of the error code and moves it into a
separate variable in the fault struct.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:55 +02:00
Avi Kivity
b463a6f744 KVM: Non-atomic interrupt injection
Change the interrupt injection code to work from preemptible, interrupts
enabled context.  This works by adding a ->cancel_injection() operation
that undoes an injection in case we were not able to actually enter the guest
(this condition could never happen with atomic injection).

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:54 +02:00
Avi Kivity
3842d135ff KVM: Check for pending events before attempting injection
Instead of blindly attempting to inject an event before each guest entry,
check for a possible event first in vcpu->requests.  Sites that can trigger
event injection are modified to set KVM_REQ_EVENT:

- interrupt, nmi window opening
- ppr updates
- i8259 output changes
- local apic irr changes
- rflags updates
- gif flag set
- event set on exit

This improves non-injecting entry performance, and sets the stage for
non-atomic injection.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:50 +02:00
Joerg Roedel
4c62a2dc92 KVM: X86: Report SVM bit to userspace only when supported
This patch fixes a bug in KVM where it _always_ reports the
support of the SVM feature to userspace. But KVM only
supports SVM on AMD hardware and only when it is enabled in
the kernel module. This patch fixes the wrong reporting.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:48 +02:00
Joerg Roedel
ff03a073e7 KVM: MMU: Add kvm_mmu parameter to load_pdptrs function
This function need to be able to load the pdptrs from any
mmu context currently in use. So change this function to
take an kvm_mmu parameter to fit these needs.
As a side effect this patch also moves the cached pdptrs
from vcpu_arch into the kvm_mmu struct.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:41 +02:00
Joerg Roedel
d47f00a62b KVM: X86: Propagate fetch faults
KVM currently ignores fetch faults in the instruction
emulator. With nested-npt we could have such faults. This
patch adds the code to handle these.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:41 +02:00
Joerg Roedel
d4f8cf664e KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
This patch implements logic to make sure that either a
page-fault/page-fault-vmexit or a nested-page-fault-vmexit
is propagated back to the guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:40 +02:00
Joerg Roedel
02f59dc9f1 KVM: MMU: Introduce init_kvm_nested_mmu()
This patch introduces the init_kvm_nested_mmu() function
which is used to re-initialize the nested mmu when the l2
guest changes its paging mode.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:39 +02:00
Joerg Roedel
3d06b8bfd4 KVM: MMU: Introduce kvm_read_nested_guest_page()
This patch introduces the kvm_read_guest_page_x86 function
which reads from the physical memory of the guest. If the
guest is running in guest-mode itself with nested paging
enabled it will read from the guest's guest physical memory
instead.
The patch also changes changes the code to use this function
where it is necessary.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:38 +02:00
Joerg Roedel
ec92fe44e7 KVM: X86: Add kvm_read_guest_page_mmu function
This patch adds a function which can read from the guests
physical memory or from the guest's guest physical memory.
This will be used in the two-dimensional page table walker.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:37 +02:00
Joerg Roedel
14dfe855f9 KVM: X86: Introduce pointer to mmu context used for gva_to_gpa
This patch introduces the walk_mmu pointer which points to
the mmu-context currently used for gva_to_gpa translations.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:35 +02:00
Joerg Roedel
c30a358d33 KVM: MMU: Add infrastructure for two-level page walker
This patch introduces a mmu-callback to translate gpa
addresses in the walk_addr code. This is later used to
translate l2_gpa addresses into l1_gpa addresses.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:34 +02:00
Joerg Roedel
8df25a328a KVM: MMU: Track page fault data in struct vcpu
This patch introduces a struct with two new fields in
vcpu_arch for x86:

	* fault.address
	* fault.error_code

This will be used to correctly propagate page faults back
into the guest when we could have either an ordinary page
fault or a nested page fault. In the case of a nested page
fault the fault-address is different from the original
address that should be walked. So we need to keep track
about the real fault-address.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:52:33 +02:00
Jes Sorensen
7b91409822 KVM: x86: Emulate MSR_EBC_FREQUENCY_ID
Some operating systems store data about the host processor at the
time of installation, and when booted on a more uptodate cpu tries
to read MSR_EBC_FREQUENCY_ID. This has been found with XP.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:52:27 +02:00
Jes Sorensen
84e0cefa8d KVM: Fix guest kernel crash on MSR_K7_CLK_CTL
MSR_K7_CLK_CTL is a no longer documented MSR, which is only relevant
on said old AMD CPU models. This change returns the expected value,
which the Linux kernel is expecting to avoid writing back the MSR,
plus it ignores all writes to the MSR.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:51:55 +02:00
Avi Kivity
e90aa41e6c KVM: Don't save/restore MSR_IA32_PERF_STATUS
It is read/only; restoring it only results in annoying messages.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:51:51 +02:00
Avi Kivity
c41a15dd46 KVM: Fix pio trace direction
out = write, in = read, not the other way round.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:51:49 +02:00
Avi Kivity
217fc9cfca KVM: Fix build error due to 64-bit division in nsec_to_cycles()
Use do_div() instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:43 +02:00
Gleb Natapov
d2ddd1c483 KVM: x86 emulator: get rid of "restart" in emulation context.
x86_emulate_insn() will return 1 if instruction can be restarted
without re-entering a guest.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:34 +02:00
Zachary Amsden
1d5f066e0b KVM: x86: Fix a possible backwards warp of kvmclock
Kernel time, which advances in discrete steps may progress much slower
than TSC.  As a result, when kvmclock is adjusted to a new base, the
apparent time to the guest, which runs at a much higher, nsec scaled
rate based on the current TSC, may have already been observed to have
a larger value (kernel_ns + scaled tsc) than the value to which we are
setting it (kernel_ns + 0).

We must instead compute the clock as potentially observed by the guest
for kernel_ns to make sure it does not go backwards.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:24 +02:00
Zachary Amsden
ca84d1a24c KVM: x86: Add clock sync request to hardware enable
If there are active VCPUs which are marked as belonging to
a particular hardware CPU, request a clock sync for them when
enabling hardware; the TSC could be desynchronized on a newly
arriving CPU, and we need to recompute guests system time
relative to boot after a suspend event.

This covers both cases.

Note that it is acceptable to take the spinlock, as either
no other tasks will be running and no locks held (BSP after
resume), or other tasks will be guaranteed to drop the lock
relatively quickly (AP on CPU_STARTING).

Noting we now get clock synchronization requests for VCPUs
which are starting up (or restarting), it is tempting to
attempt to remove the arch/x86/kvm/x86.c CPU hot-notifiers
at this time, however it is not correct to do so; they are
required for systems with non-constant TSC as the frequency
may not be known immediately after the processor has started
until the cpufreq driver has had a chance to run and query
the chipset.

Updated: implement better locking semantics for hardware_enable

Removed the hack of dropping and retaking the lock by adding the
semantic that we always hold kvm_lock when hardware_enable is
called.  The one place that doesn't need to worry about it is
resume, as resuming a frozen CPU, the spinlock won't be taken.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:24 +02:00
Zachary Amsden
46543ba45f KVM: x86: Robust TSC compensation
Make the match of TSC find TSC writes that are close to each other
instead of perfectly identical; this allows the compensator to also
work in migration / suspend scenarios.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:23 +02:00
Zachary Amsden
759379dd68 KVM: x86: Add helper functions for time computation
Add a helper function to compute the kernel time and convert nanoseconds
back to CPU specific cycles.  Note that these must not be called in preemptible
context, as that would mean the kernel could enter software suspend state,
which would cause non-atomic operation.

Also, convert the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls to use the kernel
time helper, these should be bootbased as well.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:23 +02:00
Zachary Amsden
48434c20e1 KVM: x86: Fix deep C-state TSC desynchronization
When CPUs with unstable TSCs enter deep C-state, TSC may stop
running.  This causes us to require resynchronization.  Since
we can't tell when this may potentially happen, we assume the
worst by forcing re-compensation for it at every point the VCPU
task is descheduled.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:23 +02:00
Zachary Amsden
e48672fa25 KVM: x86: Unify TSC logic
Move the TSC control logic from the vendor backends into x86.c
by adding adjust_tsc_offset to x86 ops.  Now all TSC decisions
can be done in one place.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:23 +02:00
Zachary Amsden
6755bae8e6 KVM: x86: Warn about unstable TSC
If creating an SMP guest with unstable host TSC, issue a warning

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:22 +02:00
Zachary Amsden
8cfdc00085 KVM: x86: Make cpu_tsc_khz updates use local CPU
This simplifies much of the init code; we can now simply always
call tsc_khz_changed, optionally passing it a new value, or letting
it figure out the existing value (while interrupts are disabled, and
thus, by inference from the rule, not raceful against CPU hotplug or
frequency updates, which will issue IPIs to the local CPU to perform
this very same task).

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:22 +02:00
Zachary Amsden
f38e098ff3 KVM: x86: TSC reset compensation
Attempt to synchronize TSCs which are reset to the same value.  In the
case of a reliable hardware TSC, we can just re-use the same offset, but
on non-reliable hardware, we can get closer by adjusting the offset to
match the elapsed time.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:22 +02:00
Zachary Amsden
99e3e30aee KVM: x86: Move TSC offset writes to common code
Also, ensure that the storing of the offset and the reading of the TSC
are never preempted by taking a spinlock.  While the lock is overkill
now, it is useful later in this patch series.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:22 +02:00
Zachary Amsden
ae38436b78 KVM: x86: Drop vm_init_tsc
This is used only by the VMX code, and is not done properly;
if the TSC is indeed backwards, it is out of sync, and will
need proper handling in the logic at each and every CPU change.
For now, drop this test during init as misguided.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:51:21 +02:00
Dave Hansen
39de71ec53 KVM: rename x86 kvm->arch.n_alloc_mmu_pages
arch.n_alloc_mmu_pages is a poor choice of name. This value truly
means, "the number of pages which _may_ be allocated".  But,
reading the name, "n_alloc_mmu_pages" implies "the number of allocated
mmu pages", which is dead wrong.

It's really the high watermark, so let's give it a name to match:
nr_max_mmu_pages.  This change will make the next few patches
much more obvious and easy to read.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:51:18 +02:00
Mohammed Gamal
8ec4722dd2 KVM: Separate emulation context initialization in a separate function
The code for initializing the emulation context is duplicated at two
locations (emulate_instruction() and kvm_task_switch()). Separate it
in a separate function and call it from there.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:51:04 +02:00
Mohammed Gamal
160ce1f1a8 KVM: x86 emulator: Allow accessing IDT via emulator ops
The patch adds a new member get_idt() to x86_emulate_ops.
It also adds a function to get the idt in order to be used by the emulator.

This is needed for real mode interrupt injection and the emulation of int
instructions.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:50:59 +02:00
Gleb Natapov
4fc40f076f KVM: x86 emulator: check io permissions only once for string pio
Do not recheck io permission on every iteration.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-10-24 10:50:29 +02:00
Gleb Natapov
e85d28f8e8 KVM: x86 emulator: don't update vcpu state if instruction is restarted
No need to update vcpu state since instruction is in the middle of the
emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:50:27 +02:00
Avi Kivity
9aabc88fc8 KVM: x86 emulator: store x86_emulate_ops in emulation context
It doesn't ever change, so we don't need to pass it around everywhere.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-24 10:50:21 +02:00
Andre Przywara
6d886fd042 x86, cpu: Fix allowed CPUID bits for KVM guests
The AMD extensions to AVX (FMA4, XOP) work on the same YMM register set
as AVX, so they are safe for guests to use, as long as AVX itself
is allowed. Add F16C and AES on the way for the same reasons.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
LKML-Reference: <1283778860-26843-4-git-send-email-andre.przywara@amd.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-09-08 13:34:15 -07:00
Andre Przywara
7ef8aa72ab x86, cpu: Fix renamed, not-yet-shipping AMD CPUID feature bit
The AMD SSE5 feature set as-it has been replaced by some extensions
to the AVX instruction set. Thus the bit formerly advertised as SSE5
is re-used for one of these extensions (XOP).
Although this changes the /proc/cpuinfo output, it is not user visible, as
there are no CPUs (yet) having this feature.
To avoid confusion this should be added to the stable series, too.

Cc: stable@kernel.org [.32.x .34.x, .35.x]
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
LKML-Reference: <1283778860-26843-2-git-send-email-andre.przywara@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-09-08 13:32:55 -07:00
Linus Torvalds
3dc8d7f07e Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: PIT: free irq source id in handling error path
  KVM: destroy workqueue on kvm_create_pit() failures
  KVM: fix poison overwritten caused by using wrong xstate size
2010-08-22 11:27:36 -07:00
Xiaotian Feng
f45755b834 KVM: fix poison overwritten caused by using wrong xstate size
fpu.state is allocated from task_xstate_cachep, the size of task_xstate_cachep
is xstate_size. xstate_size is set from cpuid instruction, which is often
smaller than sizeof(struct xsave_struct). kvm is using sizeof(struct xsave_struct)
to fill in/out fpu.state.xsave, as what we allocated for fpu.state is
xstate_size, kernel will write out of memory and caused poison/redzone/padding
overwritten warnings.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Sheng Yang <sheng@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-15 14:10:15 +03:00
Linus Torvalds
d9a73c0016 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  um, x86: Cast to (u64 *) inside set_64bit()
  x86-32, asm: Directly access per-cpu GDT
  x86-64, asm: Directly access per-cpu IST
  x86, asm: Merge cmpxchg_486_u64() and cmpxchg8b_emu()
  x86, asm: Move cmpxchg emulation code to arch/x86/lib
  x86, asm: Clean up and simplify <asm/cmpxchg.h>
  x86, asm: Clean up and simplify set_64bit()
  x86: Add memory modify constraints to xchg() and cmpxchg()
  x86-64: Simplify loading initial_gs
  x86: Use symbolic MSR names
  x86: Remove redundant K6 MSRs
2010-08-06 10:07:34 -07:00
Wei Yongjun
c19b8bd60e KVM: x86 emulator: fix xchg instruction emulation
If the destination is a memory operand and the memory cannot
map to a valid page, the xchg instruction emulation and locked
instruction will not work on io regions and stuck in endless
loop. We should emulate exchange as write to fix it.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-02 06:40:53 +03:00
Gleb Natapov
68be080345 KVM: x86: never re-execute instruction with enabled tdp
With tdp enabled we should get into emulator only when emulating io, so
reexecution will always bring us back into emulator.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-02 06:40:51 +03:00
Avi Kivity
908e75f3e7 KVM: Expose MCE control MSRs to userspace
Userspace needs to reset and save/restore these MSRs.

The MCE banks are not exposed since their number varies from vcpu to vcpu.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-02 06:40:36 +03:00
Xiao Guangrong
aea924f606 KVM: PIT: stop vpit before freeing irq_routing
Fix:
general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
......
Call Trace:
 [<ffffffffa0159bd1>] ? kvm_set_irq+0xdd/0x24b [kvm]
 [<ffffffff8106ea8b>] ? trace_hardirqs_off_caller+0x1f/0x10e
 [<ffffffff813ad17f>] ? sub_preempt_count+0xe/0xb6
 [<ffffffff8106d273>] ? put_lock_stats+0xe/0x27
...
RIP  [<ffffffffa0159c72>] kvm_set_irq+0x17e/0x24b [kvm]

This bug is triggered when guest is shutdown, is because we freed
irq_routing before pit thread stopped

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-02 06:40:35 +03:00
Gleb Natapov
a6f177efaa KVM: Reenter guest after emulation failure if due to access to non-mmio address
When shadow pages are in use sometimes KVM try to emulate an instruction
when it accesses a shadowed page. If emulation fails KVM un-shadows the
page and reenter guest to allow vcpu to execute the instruction. If page
is not in shadow page hash KVM assumes that this was attempt to do MMIO
and reports emulation failure to userspace since there is no way to fix
the situation. This logic has a race though. If two vcpus tries to write
to the same shadowed page simultaneously both will enter emulator, but
only one of them will find the page in shadow page hash since the one who
founds it also removes it from there, so another cpu will report failure
to userspace and will abort the guest.

Fix this by checking (in addition to checking shadowed page hash) that
page that caused the emulation belongs to valid memory slot. If it is
then reenter the guest to allow vcpu to reexecute the instruction.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-02 06:40:34 +03:00
Sheng Yang
f5f48ee15c KVM: VMX: Execute WBINVD to keep data consistency with assigned devices
Some guest device driver may leverage the "Non-Snoop" I/O, and explicitly
WBINVD or CLFLUSH to a RAM space. Since migration may occur before WBINVD or
CLFLUSH, we need to maintain data consistency either by:
1: flushing cache (wbinvd) when the guest is scheduled out if there is no
wbinvd exit, or
2: execute wbinvd on all dirty physical CPUs when guest wbinvd exits.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:47:21 +03:00
Avi Kivity
3e00750947 KVM: Simplify vcpu_enter_guest() mmu reload logic slightly
No need to reload the mmu in between two different vcpu->requests checks.

kvm_mmu_reload() may trigger KVM_REQ_TRIPLE_FAULT, but that will be caught
during atomic guest entry later.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:47:19 +03:00
Sheng Yang
6c3f604117 KVM: x86: Enable AVX for guest
Enable Intel(R) Advanced Vector Extension(AVX) for guest.

The detection of AVX feature includes OSXSAVE bit testing. When OSXSAVE bit is
not set, even if AVX is supported, the AVX instruction would result in UD as
well. So we're safe to expose AVX bits to guest directly.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:47:10 +03:00
Avi Kivity
7ac77099ce KVM: Prevent internal slots from being COWed
If a process with a memory slot is COWed, the page will change its address
(despite having an elevated reference count).  This breaks internal memory
slots which have their physical addresses loaded into vmcs registers (see
the APIC access memory slot).

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:47:08 +03:00
Avi Kivity
a8eeb04a44 KVM: Add mini-API for vcpu->requests
Makes it a little more readable and hackable.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:47:05 +03:00
Avi Kivity
b74a07beed KVM: Remove kernel-allocated memory regions
Equivalent (and better) functionality is provided by user-allocated memory
regions.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:47:01 +03:00
Avi Kivity
a1f4d39500 KVM: Remove memory alias support
As advertised in feature-removal-schedule.txt.  Equivalent support is provided
by overlapping memory regions.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:47:00 +03:00
Avi Kivity
d1ac91d8a2 KVM: Consolidate load/save temporary buffer allocation and freeing
Instead of three temporary variables and three free calls, have one temporary
variable (with four names) and one free call.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:57 +03:00
Avi Kivity
a1a005f36e KVM: Fix xsave and xcr save/restore memory leak
We allocate temporary kernel buffers for these structures, but never free them.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:56 +03:00
Sheng Yang
2d5b5a6655 KVM: x86: XSAVE/XRSTOR live migration support
This patch enable save/restore of xsave state.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:37 +03:00
Avi Kivity
2390218b6a KVM: Fix mov cr3 #GP at wrong instruction
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:35 +03:00
Avi Kivity
a83b29c6ad KVM: Fix mov cr4 #GP at wrong instruction
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:34 +03:00
Avi Kivity
49a9b07edc KVM: Fix mov cr0 #GP at wrong instruction
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:46:32 +03:00
Dexuan Cui
2acf923e38 KVM: VMX: Enable XSAVE/XRSTOR for guest
This patch enable guest to use XSAVE/XRSTOR instructions.

We assume that host_xcr0 would use all possible bits that OS supported.

And we loaded xcr0 in the same way we handled fpu - do it as late as we can.

Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:46:31 +03:00
Lai Jiangshan
7bee342a9e KVM: x86: use linux/uaccess.h instead of asm/uaccess.h
Should use linux/uaccess.h instead of asm/uaccess.h

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:39:25 +03:00
Jan Kiszka
10ab25cd6b KVM: x86: Propagate fpu_alloc errors
Memory allocation may fail. Propagate such errors.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:39:22 +03:00
Avi Kivity
221d059d15 KVM: Update Red Hat copyrights
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:51 +03:00
Sheng Yang
98918833a3 KVM: x86: Use FPU API
Convert KVM to use generic FPU API.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:49 +03:00
Sheng Yang
7cf30855e0 KVM: x86: Use unlazy_fpu() for host FPU
We can avoid unnecessary fpu load when userspace process
didn't use FPU frequently.

Derived from Avi's idea.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:48 +03:00
Avi Kivity
9373662463 KVM: Consolidate arch specific vcpu ioctl locking
Now that all arch specific ioctls have centralized locking, it is easy to
move it to the central dispatcher.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:48 +03:00
Avi Kivity
526b78ad1a KVM: x86: Lock arch specific vcpu ioctls centrally
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:47 +03:00
Avi Kivity
2122ff5eab KVM: move vcpu locking to dispatcher for generic vcpu ioctls
All vcpu ioctls need to be locked, so instead of locking each one specifically
we lock at the generic dispatcher.

This patch only updates generic ioctls and leaves arch specific ioctls alone.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:47 +03:00
Xiao Guangrong
1683b2416e KVM: x86: cleanup unused local variable
fix:
 arch/x86/kvm/x86.c: In function ‘handle_emulation_failure’:
 arch/x86/kvm/x86.c:3844: warning: unused variable ‘ctxt’

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:46 +03:00
Sheng Yang
aad827034e KVM: VMX: Only reset MMU when necessary
Only modifying some bits of CR0/CR4 needs paging mode switch.

Modify EFER.NXE bit would result in reserved bit updates.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:45 +03:00
Sheng Yang
62ad07551a KVM: x86: Clean up duplicate assignment
mmu.free() already set root_hpa to INVALID_PAGE, no need to do it again in the
destory_kvm_mmu().

kvm_x86_ops->set_cr4() and set_efer() already assign cr4/efer to
vcpu->arch.cr4/efer, no need to do it again later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:44 +03:00
Gleb Natapov
6d77dbfc88 KVM: inject #UD if instruction emulation fails and exit to userspace
Do not kill VM when instruction emulation fails. Inject #UD and report
failure to userspace instead. Userspace may choose to reenter guest if
vcpu is in userspace (cpl == 3) in which case guest OS will kill
offending process and continue running.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-08-01 10:35:40 +03:00
Avi Kivity
d94e1dc9af KVM: Get rid of KVM_REQ_KICK
KVM_REQ_KICK poisons vcpu->requests by having a bit set during normal
operation.  This causes the fast path check for a clear vcpu->requests
to fail all the time, triggering tons of atomic operations.

Fix by replacing KVM_REQ_KICK with a vcpu->guest_mode atomic.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:37 +03:00
Gleb Natapov
54b8486f46 KVM: x86 emulator: do not inject exception directly into vcpu
Return exception as a result of instruction emulation and handle
injection in KVM code.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:37 +03:00
Gleb Natapov
95cb229530 KVM: x86 emulator: move interruptibility state tracking out of emulator
Emulator shouldn't access vcpu directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:36 +03:00
Gleb Natapov
4d2179e1e9 KVM: x86 emulator: handle shadowed registers outside emulator
Emulator shouldn't access vcpu directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:36 +03:00
Gleb Natapov
ef050dc039 KVM: x86 emulator: set RFLAGS outside x86 emulator code
Removes the need for set_flags() callback.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:35 +03:00
Gleb Natapov
95c5588652 KVM: x86 emulator: advance RIP outside x86 emulator code
Return new RIP as part of instruction emulation result instead of
updating KVM's RIP from x86 emulator code.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:35 +03:00
Gleb Natapov
3457e4192e KVM: handle emulation failure case first
If emulation failed return immediately.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:34 +03:00
Gleb Natapov
8fe681e984 KVM: do not inject #PF in (read|write)_emulated() callbacks
Return error to x86 emulator instead of injection exception behind its back.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:34 +03:00
Gleb Natapov
f181b96d4c KVM: remove export of emulator_write_emulated()
It is not called directly outside of the file it's defined in anymore.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:34 +03:00
Gleb Natapov
c3cd7ffaf5 KVM: x86 emulator: x86_emulate_insn() return -1 only in case of emulation failure
Currently emulator returns -1 when emulation failed or IO is needed.
Caller tries to guess whether emulation failed by looking at other
variables. Make it easier for caller to recognise error condition by
always returning -1 in case of failure. For this new emulator
internal return value X86EMUL_IO_NEEDED is introduced. It is used to
distinguish between error condition (which returns X86EMUL_UNHANDLEABLE)
and condition that requires IO exit to userspace to continue emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:33 +03:00
Gleb Natapov
411c35b7ef KVM: fill in run->mmio details in (read|write)_emulated function
Fill in run->mmio details in (read|write)_emulated function just like
pio does. There is no point in filling only vcpu fields there just to
copy them into vcpu->run a little bit later.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:33 +03:00
Gleb Natapov
338dbc9781 KVM: x86 emulator: make (get|set)_dr() callback return error if it fails
Make (get|set)_dr() callback return error if it fails instead of
injecting exception behind emulator's back.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:32 +03:00
Gleb Natapov
0f12244fe7 KVM: x86 emulator: make set_cr() callback return error if it fails
Make set_cr() callback return error if it fails instead of injecting #GP
behind emulator's back.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:31 +03:00
Gleb Natapov
5951c44237 KVM: x86 emulator: add get_cached_segment_base() callback to x86_emulate_ops
On VMX it is expensive to call get_cached_descriptor() just to get segment
base since multiple vmcs_reads are done instead of only one. Introduce
new call back get_cached_segment_base() for efficiency.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:31 +03:00
Gleb Natapov
3fb1b5dbd3 KVM: x86 emulator: add (set|get)_msr callbacks to x86_emulate_ops
Add (set|get)_msr callbacks to x86_emulate_ops instead of calling
them directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:30 +03:00
Gleb Natapov
35aa5375d4 KVM: x86 emulator: add (set|get)_dr callbacks to x86_emulate_ops
Add (set|get)_dr callbacks to x86_emulate_ops instead of calling
them directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:30 +03:00
Avi Kivity
1c11e71357 KVM: VMX: Avoid writing HOST_CR0 every entry
cr0.ts may change between entries, so we copy cr0 to HOST_CR0 before each
entry.  That is slow, so instead, set HOST_CR0 to have TS set unconditionally
(which is a safe value), and issue a clts() just before exiting vcpu context
if the task indeed owns the fpu.

Saves ~50 cycles/exit.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:28 +03:00
Takuya Yoshikawa
914ebccd2d KVM: x86: avoid unnecessary bitmap allocation when memslot is clean
Although we always allocate a new dirty bitmap in x86's get_dirty_log(),
it is only used as a zero-source of copy_to_user() and freed right after
that when memslot is clean. This patch uses clear_user() instead of doing
this unnecessary zero-source allocation.

Performance improvement: as we can expect easily, the time needed to
allocate a bitmap is completely reduced. In my test, the improved ioctl
was about 4 to 10 times faster than the original one for clean slots.
Furthermore, reducing memory allocations and copies will produce good
effects to caches too.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-01 10:35:27 +03:00
H. Peter Anvin
d3608b5681 Merge remote branch 'origin/x86/urgent' into x86/asm 2010-07-27 23:28:28 -07:00
Avi Kivity
7a73c0283d KVM: Use kmalloc() instead of vmalloc() for KVM_[GS]ET_MSR
We don't need more than a page, and vmalloc() is slower (much
slower recently due to a regression).

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-07-23 09:07:14 +03:00
Brian Gerst
8c06585d64 x86: Remove redundant K6 MSRs
MSR_K6_EFER is unused, and MSR_K6_STAR is redundant with MSR_STAR.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1279371808-24804-1-git-send-email-brgerst@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-07-21 21:23:05 -07:00
Linus Torvalds
98edb6ca41 Merge branch 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (269 commits)
  KVM: x86: Add missing locking to arch specific vcpu ioctls
  KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls
  KVM: MMU: Segregate shadow pages with different cr0.wp
  KVM: x86: Check LMA bit before set_efer
  KVM: Don't allow lmsw to clear cr0.pe
  KVM: Add cpuid.txt file
  KVM: x86: Tell the guest we'll warn it about tsc stability
  x86, paravirt: don't compute pvclock adjustments if we trust the tsc
  x86: KVM guest: Try using new kvm clock msrs
  KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
  KVM: x86: add new KVMCLOCK cpuid feature
  KVM: x86: change msr numbers for kvmclock
  x86, paravirt: Add a global synchronization point for pvclock
  x86, paravirt: Enable pvclock flags in vcpu_time_info structure
  KVM: x86: Inject #GP with the right rip on efer writes
  KVM: SVM: Don't allow nested guest to VMMCALL into host
  KVM: x86: Fix exception reinjection forced to true
  KVM: Fix wallclock version writing race
  KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
  KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
  ...
2010-05-21 17:16:21 -07:00
Avi Kivity
8fbf065d62 KVM: x86: Add missing locking to arch specific vcpu ioctls
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:41:11 +03:00
Sheng Yang
a3d204e285 KVM: x86: Check LMA bit before set_efer
kvm_x86_ops->set_efer() would execute vcpu->arch.efer = efer, so the
checking of LMA bit didn't work.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:09 +03:00
Avi Kivity
f78e917688 KVM: Don't allow lmsw to clear cr0.pe
The current lmsw implementation allows the guest to clear cr0.pe, contrary
to the manual, which breaks EMM386.EXE.

Fix by ORing the old cr0.pe with lmsw's operand.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:08 +03:00
Glauber Costa
371bcf646d KVM: x86: Tell the guest we'll warn it about tsc stability
This patch puts up the flag that tells the guest that we'll warn it
about the tsc being trustworthy or not. By now, we also say
it is not.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:06 +03:00
Glauber Costa
84478c829d KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
Right now, we were using individual KVM_CAP entities to communicate
userspace about which cpuids we support. This is suboptimal, since it
generates a delay between the feature arriving in the host, and
being available at the guest.

A much better mechanism is to list para features in KVM_GET_SUPPORTED_CPUID.
This makes userspace automatically aware of what we provide. And if we
ever add a new cpuid bit in the future, we have to do that again,
which create some complexity and delay in feature adoption.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:03 +03:00
Glauber Costa
11c6bffa42 KVM: x86: change msr numbers for kvmclock
Avi pointed out a while ago that those MSRs falls into the pentium
PMU range. So the idea here is to add new ones, and after a while,
deprecate the old ones.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:01 +03:00
Roedel, Joerg
b69e8caef5 KVM: x86: Inject #GP with the right rip on efer writes
This patch fixes a bug in the KVM efer-msr write path. If a
guest writes to a reserved efer bit the set_efer function
injects the #GP directly. The architecture dependent wrmsr
function does not see this, assumes success and advances the
rip. This results in a #GP in the guest with the wrong rip.
This patch fixes this by reporting efer write errors back to
the architectural wrmsr function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:39 +03:00
Joerg Roedel
3f0fd2927b KVM: x86: Fix exception reinjection forced to true
The patch merged recently which allowed to mark an exception
as reinjected has a bug as it always marks the exception as
reinjected. This breaks nested-svm shadow-on-shadow
implementation.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:37 +03:00
Avi Kivity
9ed3c444ab KVM: Fix wallclock version writing race
Wallclock writing uses an unprotected global variable to hold the version;
this can cause one guest to interfere with another if both write their
wallclock at the same time.

Acked-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:36 +03:00
Marcelo Tosatti
f1d86e469b KVM: x86: properly update ready_for_interrupt_injection
The recent changes to emulate string instructions without entering guest
mode exposed a bug where pending interrupts are not properly reflected
in ready_for_interrupt_injection.

The result is that userspace overwrites a previously queued interrupt,
when irqchip's are emulated in userspace.

Fix by always updating state before returning to userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:33 +03:00
Linus Torvalds
4d7b4ac22f Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
  perf tools: Add mode to build without newt support
  perf symbols: symbol inconsistency message should be done only at verbose=1
  perf tui: Add explicit -lslang option
  perf options: Type check all the remaining OPT_ variants
  perf options: Type check OPT_BOOLEAN and fix the offenders
  perf options: Check v type in OPT_U?INTEGER
  perf options: Introduce OPT_UINTEGER
  perf tui: Add workaround for slang < 2.1.4
  perf record: Fix bug mismatch with -c option definition
  perf options: Introduce OPT_U64
  perf tui: Add help window to show key associations
  perf tui: Make <- exit menus too
  perf newt: Add single key shortcuts for zoom into DSO and threads
  perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
  perf newt: Fix the 'A'/'a' shortcut for annotate
  perf newt: Make <- exit the ui_browser
  x86, perf: P4 PMU - fix counters management logic
  perf newt: Make <- zoom out filters
  perf report: Report number of events, not samples
  perf hist: Clarify events_stats fields usage
  ...

Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
2010-05-18 08:19:03 -07:00
Joerg Roedel
ce7ddec4bb KVM: x86: Allow marking an exception as reinjected
This patch adds logic to kvm/x86 which allows to mark an
injected exception as reinjected. This allows to remove an
ugly hack from svm_complete_interrupts that prevented
exceptions from being reinjected at all in the nested case.
The hack was necessary because an reinjected exception into
the nested guest could cause a nested vmexit emulation. But
reinjected exceptions must not intercept. The downside of
the hack is that a exception that in injected could get
lost.
This patch fixes the problem and puts the code for it into
generic x86 files because. Nested-VMX will likely have the
same problem and could reuse the code.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:26 +03:00
Joerg Roedel
d4330ef2fb KVM: x86: Add callback to let modules decide over some supported cpuid bits
This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decission about some supported cpuid bits to the
architecture modules.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:23 +03:00
Avi Kivity
8d3b932309 Merge remote branch 'tip/perf/core'
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:15 +03:00
Avi Kivity
87bc3bf972 KVM: MMU: Drop cr4.pge from shadow page role
Since commit bf47a760f6, we no longer handle ptes with the global bit
set specially, so there is no reason to distinguish between shadow pages
created with cr4.gpe set and clear.

Such tracking is expensive when the guest toggles cr4.pge, so drop it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:03 +03:00
Lai Jiangshan
90d83dc3d4 KVM: use the correct RCU API for PROVE_RCU=y
The RCU/SRCU API have already changed for proving RCU usage.

I got the following dmesg when PROVE_RCU=y because we used incorrect API.
This patch coverts rcu_deference() to srcu_dereference() or family API.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
arch/x86/kvm/mmu.c:3020 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by qemu-system-x86/8550:
 #0:  (&kvm->slots_lock){+.+.+.}, at: [<ffffffffa011a6ac>] kvm_set_memory_region+0x29/0x50 [kvm]
 #1:  (&(&kvm->mmu_lock)->rlock){+.+...}, at: [<ffffffffa012262d>] kvm_arch_commit_memory_region+0xa6/0xe2 [kvm]

stack backtrace:
Pid: 8550, comm: qemu-system-x86 Not tainted 2.6.34-rc4-tip-01028-g939eab1 #27
Call Trace:
 [<ffffffff8106c59e>] lockdep_rcu_dereference+0xaa/0xb3
 [<ffffffffa012f6c1>] kvm_mmu_calculate_mmu_pages+0x44/0x7d [kvm]
 [<ffffffffa012263e>] kvm_arch_commit_memory_region+0xb7/0xe2 [kvm]
 [<ffffffffa011a5d7>] __kvm_set_memory_region+0x636/0x6e2 [kvm]
 [<ffffffffa011a6ba>] kvm_set_memory_region+0x37/0x50 [kvm]
 [<ffffffffa015e956>] vmx_set_tss_addr+0x46/0x5a [kvm_intel]
 [<ffffffffa0126592>] kvm_arch_vm_ioctl+0x17a/0xcf8 [kvm]
 [<ffffffff810a8692>] ? unlock_page+0x27/0x2c
 [<ffffffff810bf879>] ? __do_fault+0x3a9/0x3e1
 [<ffffffffa011b12f>] kvm_vm_ioctl+0x364/0x38d [kvm]
 [<ffffffff81060cfa>] ? up_read+0x23/0x3d
 [<ffffffff810f3587>] vfs_ioctl+0x32/0xa6
 [<ffffffff810f3b19>] do_vfs_ioctl+0x495/0x4db
 [<ffffffff810e6b2f>] ? fget_light+0xc2/0x241
 [<ffffffff810e416c>] ? do_sys_open+0x104/0x116
 [<ffffffff81382d6d>] ? retint_swapgs+0xe/0x13
 [<ffffffff810f3ba6>] sys_ioctl+0x47/0x6a
 [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:01 +03:00
Avi Kivity
9beeaa2d68 Merge branch 'perf'
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:17:58 +03:00
Gleb Natapov
19d0443726 KVM: fix emulator_task_switch() return value.
emulator_task_switch() should return -1 for failure and 0 for success to
the caller, just like x86_emulate_insn() does.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:49 +03:00
Jan Kiszka
e269fb2189 KVM: x86: Push potential exception error code on task switches
When a fault triggers a task switch, the error code, if existent, has to
be pushed on the new task's stack. Implement the missing bits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:46 +03:00
Gleb Natapov
8f6abd06f5 KVM: x86: get rid of mmu_only parameter in emulator_write_emulated()
We can call kvm_mmu_pte_write() directly from
emulator_cmpxchg_emulated() instead of passing mmu_only down to
emulator_write_emulated_onepage() and call it there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:43 +03:00
Gleb Natapov
020df0794f KVM: move DR register access handling into generic code
Currently both SVM and VMX have their own DR handling code. Move it to
x86.c.

Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:39 +03:00
Avi Kivity
f7a711971e KVM: Fix MAXPHYADDR calculation when cpuid does not support it
MAXPHYADDR is derived from cpuid 0x80000008, but when that isn't present, we
get some random value.

Fix by checking first that cpuid 0x80000008 is supported.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:36 +03:00
Avi Kivity
e46479f852 KVM: Trace emulated instructions
Log emulated instructions in ftrace, especially if they failed.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:35 +03:00
Gleb Natapov
482ac18ae2 KVM: x86 emulator: commit rflags as part of registers commit
Make sure that rflags is committed only after successful instruction
emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:35 +03:00
Jan Kiszka
9749a6c0f0 KVM: x86: Fix 32-bit build breakage due to typo
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:34 +03:00
Gleb Natapov
92bf9748b5 KVM: small kvm_arch_vcpu_ioctl_run() cleanup.
Unify all conditions that get us back into emulator after returning from
userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:32 +03:00
Gleb Natapov
5cd21917da KVM: x86 emulator: restart string instruction without going back to a guest.
Currently when string instruction is only partially complete we go back
to a guest mode, guest tries to reexecute instruction and exits again
and at this point emulation continues. Avoid all of this by restarting
instruction without going back to a guest mode, but return to a guest
mode each 1024 iterations to allow interrupt injection. Pending
exception causes immediate guest entry too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:29 +03:00
Gleb Natapov
7972995b0c KVM: x86 emulator: Move string pio emulation into emulator.c
Currently emulation is done outside of emulator so things like doing
ins/outs to/from mmio are broken it also makes it hard (if not impossible)
to implement single stepping in the future. The implementation in this
patch is not efficient since it exits to userspace for each IO while
previous implementation did 'ins' in batches. Further patch that
implements pio in string read ahead address this problem.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:26 +03:00
Gleb Natapov
cf8f70bfe3 KVM: x86 emulator: fix in/out emulation.
in/out emulation is broken now. The breakage is different depending
on where IO device resides. If it is in userspace emulator reports
emulation failure since it incorrectly interprets kvm_emulate_pio()
return value. If IO device is in the kernel emulation of 'in' will do
nothing since kvm_emulate_pio() stores result directly into vcpu
registers, so emulator will overwrite result of emulation during
commit of shadowed register.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:25 +03:00
Gleb Natapov
ceffb45972 KVM: Use task switch from emulator.c
Remove old task switch code from x86.c

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:18 +03:00
Gleb Natapov
2dafc6c234 KVM: x86 emulator: Provide more callbacks for x86 emulator.
Provide get_cached_descriptor(), set_cached_descriptor(),
get_segment_selector(), set_segment_selector(), get_gdt(),
write_std() callbacks.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:14 +03:00
Gleb Natapov
063db061b9 KVM: Provide current eip as part of emulator context.
Eliminate the need to call back into KVM to get it from emulator.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:59 +03:00
Gleb Natapov
9c5372445c KVM: Provide x86_emulate_ctxt callback to get current cpl
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:57 +03:00
Gleb Natapov
93a152be5a KVM: remove realmode_lmsw function.
Use (get|set)_cr callback to emulate lmsw inside emulator.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:56 +03:00
Gleb Natapov
52a4661737 KVM: Provide callback to get/set control registers in emulator ops.
Use this callback instead of directly call kvm function. Also rename
realmode_(set|get)_cr to emulator_(set|get)_cr since function has nothing
to do with real mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:54 +03:00
Gleb Natapov
49c6799a2c KVM: Remove pointer to rflags from realmode_set_cr parameters.
Mov reg, cr instruction doesn't change flags in any meaningful way, so
no need to update rflags after instruction execution.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:50 +03:00
Avi Kivity
4a5f48f666 KVM: Don't follow an atomic operation by a non-atomic one
Currently emulated atomic operations are immediately followed by a non-atomic
operation, so that kvm_mmu_pte_write() can be invoked.  This updates the mmu
but undoes the whole point of doing things atomically.

Fix by only performing the atomic operation and the mmu update, and avoiding
the non-atomic write.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:40 +03:00
Avi Kivity
daea3e73cb KVM: Make locked operations truly atomic
Once upon a time, locked operations were emulated while holding the mmu mutex.
Since mmu pages were write protected, it was safe to emulate the writes in
a non-atomic manner, since there could be no other writer, either in the
guest or in the kernel.

These days emulation takes place without holding the mmu spinlock, so the
write could be preempted by an unshadowing event, which exposes the page
to writes by the guest.  This may cause corruption of guest page tables.

Fix by using an atomic cmpxchg for these operations.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:39 +03:00
Wei Yongjun
160d2f6c0c KVM: x86: fix the error of ioctl KVM_IRQ_LINE if no irq chip
If no irq chip in kernel, ioctl KVM_IRQ_LINE will return -EFAULT.
But I see in other place such as KVM_[GET|SET]IRQCHIP, -ENXIO is
return. So this patch used -ENXIO instead of -EFAULT.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:31 +03:00
Avi Kivity
5c1c85d08d KVM: Trace exception injection
Often an exception can help point out where things start to go wrong.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:27 +03:00
Xiao Guangrong
2ed152afc7 KVM: cleanup kvm trace
This patch does:

 - no need call tracepoint_synchronize_unregister() when kvm module
   is unloaded since ftrace can handle it

 - cleanup ftrace's macro

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:22 +03:00
Dongxiao Xu
fe19c5a46b KVM: x86: Call vcpu_load and vcpu_put in cpuid_update
cpuid_update may operate VMCS, so vcpu_load() and vcpu_put()
should be called to ensure correctness.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-13 01:31:02 -03:00