linux

Author	SHA1	Message	Date
Marcelo Tosatti	b050b015ab	KVM: use SRCU for dirty log Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:44 -03:00
Marcelo Tosatti	bc6678a33d	KVM: introduce kvm->srcu and convert kvm_set_memory_region to SRCU update Use two steps for memslot deletion: mark the slot invalid (which stops instantiation of new shadow pages for that slot, but allows destruction), then instantiate the new empty slot. Also simplifies kvm_handle_hva locking. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:44 -03:00
Marcelo Tosatti	f7784b8ec9	KVM: split kvm_arch_set_memory_region into prepare and commit Required for SRCU convertion later. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:44 -03:00
Marcelo Tosatti	fef9cce0eb	KVM: modify alias layout in x86s struct kvm_arch Have a pointer to an allocated region inside x86's kvm_arch. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:43 -03:00
Marcelo Tosatti	46a26bf557	KVM: modify memslots layout in struct kvm Have a pointer to an allocated region inside struct kvm. [alex: fix ppc book 3s] Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:43 -03:00
Avi Kivity	50eb2a3cd0	KVM: Add KVM_MMIO kconfig item s390 doesn't have mmio, this will simplify ifdefing it out. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:41 -03:00
Joerg Roedel	953899b659	KVM: SVM: Adjust tsc_offset only if tsc_unstable The tsc_offset adjustment in svm_vcpu_load is executed unconditionally even if Linux considers the host tsc as stable. This causes a Linux guest detecting an unstable tsc in any case. This patch removes the tsc_offset adjustment if the host tsc is stable. The guest will now get the benefit of a stable tsc too. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:41 -03:00
Sheng Yang	4e47c7a6d7	KVM: VMX: Add instruction rdtscp support for guest Before enabling, execution of "rdtscp" in guest would result in #UD. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:40 -03:00
Sheng Yang	0e85188049	KVM: Add cpuid_update() callback to kvm_x86_ops Sometime, we need to adjust some state in order to reflect guest CPUID setting, e.g. if we don't expose rdtscp to guest, we won't want to enable it on hardware. cpuid_update() is introduced for this purpose. Also export kvm_find_cpuid_entry() for later use. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:40 -03:00
Sheng Yang	2bf78fa7b9	KVM: Extended shared_msr_global to per CPU shared_msr_global saved host value of relevant MSRs, but it have an assumption that all MSRs it tracked shared the value across the different CPUs. It's not true with some MSRs, e.g. MSR_TSC_AUX. Extend it to per CPU to provide the support of MSR_TSC_AUX, and more alike MSRs. Notice now the shared_msr_global still have one assumption: it can only deal with the MSRs that won't change in host after KVM module loaded. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:40 -03:00
Sheng Yang	8a7e3f01e6	KVM: VMX: Remove redundant variable It's no longer necessary. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:40 -03:00
Avi Kivity	bc23008b61	KVM: VMX: Fold ept_update_paging_mode_cr4() into its caller ept_update_paging_mode_cr4() accesses vcpu->arch.cr4 directly, which usually needs to be accessed via kvm_read_cr4(). In this case, we can't, since cr4 is in the process of being updated. Instead of adding inane comments, fold the function into its caller (vmx_set_cr4), so it can use the not-yet-committed cr4 directly. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:40 -03:00
Avi Kivity	ce03e4f21a	KVM: VMX: When using ept, allow the guest to own cr4.pge We make no use of cr4.pge if ept is enabled, but the guest does (to flush global mappings, as with vmap()), so give the guest ownership of this bit. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:39 -03:00
Avi Kivity	4c38609ac5	KVM: VMX: Make guest cr4 mask more conservative Instead of specifying the bits which we want to trap on, specify the bits which we allow the guest to change transparently. This is safer wrt future changes to cr4. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:39 -03:00
Avi Kivity	fc78f51938	KVM: Add accessor for reading cr4 (or some bits of cr4) Some bits of cr4 can be owned by the guest on vmx, so when we read them, we copy them to the vcpu structure. In preparation for making the set of guest-owned bits dynamic, use helpers to access these bits so we don't need to know where the bit resides. No changes to svm since all bits are host-owned there. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:39 -03:00
Avi Kivity	cdc0e24456	KVM: VMX: Move some cr[04] related constants to vmx.c They have no place in common code. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:39 -03:00
Sheng Yang	59708670b6	KVM: VMX: Trap and invalid MWAIT/MONITOR instruction We don't support these instructions, but guest can execute them even if the feature('monitor') haven't been exposed in CPUID. So we would trap and inject a #UD if guest try this way. Cc: stable@kernel.org Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:39 -03:00
Avi Kivity	186a3e526a	KVM: MMU: Report spte not found in rmap before BUG() In the past we've had errors of single-bit in the other two cases; the printk() may confirm it for the third case (many->many). Signed-off-by: Avi Kivity <avi@redhat.com>	2010-03-01 12:35:39 -03:00
Marcelo Tosatti	cb84b55f6c	KVM: x86: raise TSS exception for NULL CS and SS segments Windows 2003 uses task switch to triple fault and reboot (the other exception being reserved pdptrs bits). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:38 -03:00
Eddie Dong	3fd28fce76	KVM: x86: make double/triple fault promotion generic to all exceptions Move Double-Fault generation logic out of page fault exception generating function to cover more generic case. Signed-off-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-03-01 12:35:38 -03:00
David S. Miller	2bb4646fce	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-02-16 22:09:29 -08:00
Marcelo Tosatti	ee73f656a6	KVM: PIT: control word is write-only PIT control word (address 0x43) is write-only, reads are undefined. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-02-09 19:20:15 +02:00
Jason Wang	923de3cf5b	kvmclock: count total_sleep_time when updating guest clock Current kvm wallclock does not consider the total_sleep_time which could cause wrong wallclock in guest after host suspend/resume. This patch solve this issue by counting total_sleep_time to get the correct host boot time. Cc: stable@kernel.org Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-02-09 19:20:15 +02:00
Wei Yongjun	443c39bc9e	KVM: x86: Fix leak of free lapic date in kvm_arch_vcpu_init() In function kvm_arch_vcpu_init(), if the memory malloc for vcpu->arch.mce_banks is fail, it does not free the memory of lapic date. This patch fixed it. Cc: stable@kernel.org Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-01-25 12:26:40 -02:00
Wei Yongjun	36cb93fd6b	KVM: x86: Fix probable memory leak of vcpu->arch.mce_banks vcpu->arch.mce_banks is malloc in kvm_arch_vcpu_init(), but never free in any place, this may cause memory leak. So this patch fixed to free it in kvm_arch_vcpu_uninit(). Cc: stable@kernel.org Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-01-25 12:26:40 -02:00
Marcelo Tosatti	a6085fbaf6	KVM: MMU: bail out pagewalk on kvm_read_guest error Exit the guest pagetable walk loop if reading gpte failed. Otherwise its possible to enter an endless loop processing the previous present pte. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-01-25 12:26:38 -02:00
Sheng Yang	82b7005f0e	KVM: x86: Fix host_mapping_level() When found a error hva, should not return PAGE_SIZE but the level... Also clean up the coding style of the following loop. Cc: stable@kernel.org Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-01-25 12:26:37 -02:00
Avi Kivity	a5d36f82c4	KVM: Fix race between APIC TMR and IRR When we queue an interrupt to the local apic, we set the IRR before the TMR. The vcpu can pick up the IRR and inject the interrupt before setting the TMR, and perhaps even EOI it, causing incorrect behaviour. The race is really insignificant since it can only occur on the first interrupt (usually following interrupts will not change TMR), but it's better closed than open. Fixed by reordering setting the TMR vs IRR. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-01-25 12:26:36 -02:00
David S. Miller	51c24aaaca	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-01-23 00:31:06 -08:00
Michael S. Tsirkin	3a4d5c94e9	vhost_net: a kernel-level virtio server What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope - uses eventfd for signalling - structures can be moved around in memory at any time (good for migration, bug work-arounds in userspace) - write logging is supported (good for migration) - support memory table and not just an offset (needed for kvm) common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part : this supplied me with witty comments I wouldn't be able to write myself. What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how guest performs hypercalls. Userspace hypervisors are supported as well as kvm. How it works: Basically, we connect virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tap device. Backend is also configured by userspace, including vlan/mac etc. Status: This works for me, and I haven't see any crashes. Compared to userspace, people reported improved latency (as I save up to 4 system calls per packet), as well as better bandwidth and CPU utilization. Features that I plan to look at in the future: - mergeable buffers - zero copy - scalability tuning: figure out the best threading model to use Note on RCU usage (this is also documented in vhost.h, near private_pointer which is the value protected by this variant of RCU): what is happening is that the rcu_dereference() is being used in a workqueue item. The role of rcu_read_lock() is taken on by the start of execution of the workqueue item, of rcu_read_unlock() by the end of execution of the workqueue item, and of synchronize_rcu() by flush_workqueue()/flush_work(). In the future we might need to apply some gcc attribute or sparse annotation to the function passed to INIT_WORK(). Paul's ack below is for this RCU usage. (Includes fixes by Alan Cox <alan@linux.intel.com>, David L Stevens <dlstevens@us.ibm.com>, Chris Wright <chrisw@redhat.com>) Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-01-15 01:43:29 -08:00
Jan Kiszka	dab4b911a5	KVM: x86: Extend KVM_SET_VCPU_EVENTS with selective updates User space may not want to overwrite asynchronously changing VCPU event states on write-back. So allow to skip nmi.pending and sipi_vector by setting corresponding bits in the flags field of kvm_vcpu_events. [avi: advertise the bits in KVM_GET_VCPU_EVENTS] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-27 13:36:33 -02:00
Marcelo Tosatti	6e24a6eff4	KVM: LAPIC: make sure IRR bitmap is scanned after vm load The vcpus are initialized with irr_pending set to false, but loading the LAPIC registers with pending IRR fails to reset the irr_pending variable. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-27 13:36:31 -02:00
Marcelo Tosatti	fb341f572d	KVM: MMU: remove prefault from invlpg handler The invlpg prefault optimization breaks Windows 2008 R2 occasionally. The visible effect is that the invlpg handler instantiates a pte which is, microseconds later, written with a different gfn by another vcpu. The OS could have other mechanisms to prevent a present translation from being used, which the hypervisor is unaware of. While the documentation states that the cpu is at liberty to prefetch tlb entries, it looks like this is not heeded, so remove tlb prefetch from invlpg. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-27 13:36:30 -02:00
Linus Torvalds	d0316554d3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits) m68k: rename global variable vmalloc_end to m68k_vmalloc_end percpu: add missing per_cpu_ptr_to_phys() definition for UP percpu: Fix kdump failure if booted with percpu_alloc=page percpu: make misc percpu symbols unique percpu: make percpu symbols in ia64 unique percpu: make percpu symbols in powerpc unique percpu: make percpu symbols in x86 unique percpu: make percpu symbols in xen unique percpu: make percpu symbols in cpufreq unique percpu: make percpu symbols in oprofile unique percpu: make percpu symbols in tracer unique percpu: make percpu symbols under kernel/ and mm/ unique percpu: remove some sparse warnings percpu: make alloc_percpu() handle array types vmalloc: fix use of non-existent percpu variable in put_cpu_var() this_cpu: Use this_cpu_xx in trace_functions_graph.c this_cpu: Use this_cpu_xx for ftrace this_cpu: Use this_cpu_xx in nmi handling this_cpu: Use this_cpu operations in RCU this_cpu: Use this_cpu ops for VM statistics ... Fix up trivial (famous last words) global per-cpu naming conflicts in arch/x86/kvm/svm.c mm/slab.c	2009-12-14 09:58:24 -08:00
Joe Perches	a78d9626f4	x86: i8254.c: Add pr_fmt(fmt) - Add pr_fmt(fmt) "pit: " fmt - Strip pit: prefixes from pr_debug Signed-off-by: Joe Perches <joe@perches.com> LKML-Reference: <bbd4de532f18bb7c11f64ba20d224c08291cb126.1260383912.git.joe@perches.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 08:57:50 +01:00
Linus Torvalds	ed9216c171	Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits) KVM: VMX: Fix comparison of guest efer with stale host value KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c KVM: Drop user return notifier when disabling virtualization on a cpu KVM: VMX: Disable unrestricted guest when EPT disabled KVM: x86 emulator: limit instructions to 15 bytes KVM: s390: Make psw available on all exits, not just a subset KVM: x86: Add KVM_GET/SET_VCPU_EVENTS KVM: VMX: Report unexpected simultaneous exceptions as internal errors KVM: Allow internal errors reported to userspace to carry extra data KVM: Reorder IOCTLs in main kvm.h KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG KVM: only clear irq_source_id if irqchip is present KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic KVM: x86: disallow multiple KVM_CREATE_IRQCHIP KVM: VMX: Remove vmx->msr_offset_efer KVM: MMU: update invlpg handler comment KVM: VMX: move CR3/PDPTR update to vmx_set_cr3 KVM: remove duplicated task_switch check KVM: powerpc: Fix BUILD_BUG_ON condition KVM: VMX: Use shared msr infrastructure ... Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile	2009-12-08 08:02:38 -08:00
Avi Kivity	d5696725b2	KVM: VMX: Fix comparison of guest efer with stale host value update_transition_efer() masks out some efer bits when deciding whether to switch the msr during guest entry; for example, NX is emulated using the mmu so we don't need to disable it, and LMA/LME are handled by the hardware. However, with shared msrs, the comparison is made against a stale value; at the time of the guest switch we may be running with another guest's efer. Fix by deferring the mask/compare to the actual point of guest entry. Noted by Marcelo. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:34:20 +02:00
Avi Kivity	3548bab501	KVM: Drop user return notifier when disabling virtualization on a cpu This way, we don't leave a dangling notifier on cpu hotunplug or module unload. In particular, module unload leaves the notifier pointing into freed memory. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:26 +02:00
Sheng Yang	046d87103a	KVM: VMX: Disable unrestricted guest when EPT disabled Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest supported processors. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:25 +02:00
Avi Kivity	eb3c79e64a	KVM: x86 emulator: limit instructions to 15 bytes While we are never normally passed an instruction that exceeds 15 bytes, smp games can cause us to attempt to interpret one, which will cause large latencies in non-preempt hosts. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:25 +02:00
Jan Kiszka	3cfc3092f4	KVM: x86: Add KVM_GET/SET_VCPU_EVENTS This new IOCTL exports all yet user-invisible states related to exceptions, interrupts, and NMIs. Together with appropriate user space changes, this fixes sporadic problems of vmsave/restore, live migration and system reset. [avi: future-proof abi by adding a flags field] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:25 +02:00
Avi Kivity	65ac726404	KVM: VMX: Report unexpected simultaneous exceptions as internal errors These happen when we trap an exception when another exception is being delivered; we only expect these with MCEs and page faults. If something unexpected happens, things probably went south and we're better off reporting an internal error and freezing. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:24 +02:00
Avi Kivity	a9c7399d6c	KVM: Allow internal errors reported to userspace to carry extra data Usually userspace will freeze the guest so we can inspect it, but some internal state is not available. Add extra data to internal error reporting so we can expose it to the debugger. Extra data is specific to the suberror. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:24 +02:00
Jan Kiszka	4f926bf291	KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG Decouple KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP from KVM_GUESTDBG_ENABLE, their are actually orthogonal. At this chance, avoid triggering the WARN_ON in kvm_queue_exception if there is already an exception pending and reject such invalid requests. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:24 +02:00
Marcelo Tosatti	2204ae3c96	KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic Otherwise kvm might attempt to dereference a NULL pointer. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:23 +02:00
Marcelo Tosatti	3ddea128ad	KVM: x86: disallow multiple KVM_CREATE_IRQCHIP Otherwise kvm will leak memory on multiple KVM_CREATE_IRQCHIP. Also serialize multiple accesses with kvm->lock. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:23 +02:00
Avi Kivity	92c0d90015	KVM: VMX: Remove vmx->msr_offset_efer This variable is used to communicate between a caller and a callee; switch to a function argument instead. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:23 +02:00
Marcelo Tosatti	5f5c35aad5	KVM: MMU: update invlpg handler comment Large page translations are always synchronized (either in level 3 or level 2), so its not necessary to properly deal with them in the invlpg handler. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:23 +02:00
Marcelo Tosatti	7c93be44a4	KVM: VMX: move CR3/PDPTR update to vmx_set_cr3 GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from outside guest context. Similarly pdptrs are updated via load_pdptrs. Let kvm_set_cr3 perform the update, removing it from the vcpu_run fast path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Acked-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:22 +02:00
Gleb Natapov	1655e3a3dc	KVM: remove duplicated task_switch check Probably introduced by a bad merge. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:22 +02:00
Avi Kivity	26bb0981b3	KVM: VMX: Use shared msr infrastructure Instead of reloading syscall MSRs on every preemption, use the new shared msr infrastructure to reload them at the last possible minute (just before exit to userspace). Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be reloaded as well). [jan: fix slot index missing indirection] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:22 +02:00
Avi Kivity	18863bdd60	KVM: x86 shared msr infrastructure The various syscall-related MSRs are fairly expensive to switch. Currently we switch them on every vcpu preemption, which is far too often: - if we're switching to a kernel thread (idle task, threaded interrupt, kernel-mode virtio server (vhost-net), for example) and back, then there's no need to switch those MSRs since kernel threasd won't be exiting to userspace. - if we're switching to another guest running an identical OS, most likely those MSRs will have the same value, so there's little point in reloading them. - if we're running the same OS on the guest and host, the MSRs will have identical values and reloading is unnecessary. This patch uses the new user return notifiers to implement last-minute switching, and checks the msr values to avoid unnecessary reloading. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:21 +02:00
Avi Kivity	44ea2b1758	KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area Currently MSR_KERNEL_GS_BASE is saved and restored as part of the guest/host msr reloading. Since we wish to lazy-restore all the other msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using the common code. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:21 +02:00
Eduardo Habkost	3ce672d484	KVM: SVM: init_vmcb(): remove redundant save->cr0 initialization The svm_set_cr0() call will initialize save->cr0 properly even when npt is enabled, clearing the NW and CD bits as expected, so we don't need to initialize it manually for npt_enabled anymore. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:21 +02:00
Eduardo Habkost	18fa000ae4	KVM: SVM: Reset cr0 properly on vcpu reset svm_vcpu_reset() was not properly resetting the contents of the guest-visible cr0 register, causing the following issue: https://bugzilla.redhat.com/show_bug.cgi?id=525699 Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine with paging enabled, making the vcpu get a pagefault exception while trying to run it. Instead of setting vmcb->save.cr0 directly, the new code just resets kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0(). kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:21 +02:00
Eduardo Habkost	fa40052ca0	KVM: VMX: Use macros instead of hex value on cr0 initialization This should have no effect, it is just to make the code clearer. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:21 +02:00
Glauber Costa	afbcf7ab8d	KVM: allow userspace to adjust kvmclock offset When we migrate a kvm guest that uses pvclock between two hosts, we may suffer a large skew. This is because there can be significant differences between the monotonic clock of the hosts involved. When a new host with a much larger monotonic time starts running the guest, the view of time will be significantly impacted. Situation is much worse when we do the opposite, and migrate to a host with a smaller monotonic clock. This proposed ioctl will allow userspace to inform us what is the monotonic clock value in the source host, so we can keep the time skew short, and more importantly, never goes backwards. Userspace may also need to trigger the current data, since from the first migration onwards, it won't be reflected by a simple call to clock_gettime() anymore. [marcelo: future-proof abi with a flags field] [jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it] Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:19 +02:00
Jan Kiszka	6be7d3062b	KVM: SVM: Cleanup NMI singlestep Push the NMI-related singlestep variable into vcpu_svm. It's dealing with an AMD-specific deficit, nothing generic for x86. Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> arch/x86/include/asm/kvm_host.h \| 1 - arch/x86/kvm/svm.c \| 12 +++++++----- 2 files changed, 7 insertions(+), 6 deletions(-) Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:19 +02:00
Jan Kiszka	94fe45da48	KVM: x86: Fix guest single-stepping while interruptible Commit 705c5323 opened the doors of hell by unconditionally injecting single-step flags as long as guest_debug signaled this. This doesn't work when the guest branches into some interrupt or exception handler and triggers a vmexit with flag reloading. Fix it by saving cs:rip when user space requests single-stepping and restricting the trace flag injection to this guest code position. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:19 +02:00
Ed Swierk	ffde22ac53	KVM: Xen PV-on-HVM guest support Support for Xen PV-on-HVM guests can be implemented almost entirely in userspace, except for handling one annoying MSR that maps a Xen hypercall blob into guest address space. A generic mechanism to delegate MSR writes to userspace seems overkill and risks encouraging similar MSR abuse in the future. Thus this patch adds special support for the Xen HVM MSR. I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell KVM which MSR the guest will write to, as well as the starting address and size of the hypercall blobs (one each for 32-bit and 64-bit) that userspace has loaded from files. When the guest writes to the MSR, KVM copies one page of the blob from userspace to the guest. I've tested this patch with a hacked-up version of Gerd's userspace code, booting a number of guests (CentOS 5.3 i386 and x86_64, and FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices. [jan: fix i386 build warning] [avi: future proof abi with a flags field] Signed-off-by: Ed Swierk <eswierk@aristanetworks.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:18 +02:00
Jan Kiszka	94c30d9ca6	KVM: x86: Drop unneeded CONFIG_HAS_IOMEM check This (broken) check dates back to the days when this code was shared across architectures. x86 has IOMEM, so drop it. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:18 +02:00
Marcelo Tosatti	9fb41ba896	KVM: VMX: fix handle_pause declaration There's no kvm_run argument anymore. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:18 +02:00
Zachary Amsden	6b7d7e762b	KVM: x86: Harden against cpufreq If cpufreq can't determine the CPU khz, or cpufreq is not compiled in, we should fallback to the measured TSC khz. Signed-off-by: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:18 +02:00
Mark Langsdorf	565d0998ec	KVM: SVM: Support Pause Filter in AMD processors New AMD processors (Family 0x10 models 8+) support the Pause Filter Feature. This feature creates a new field in the VMCB called Pause Filter Count. If Pause Filter Count is greater than 0 and intercepting PAUSEs is enabled, the processor will increment an internal counter when a PAUSE instruction occurs instead of intercepting. When the internal counter reaches the Pause Filter Count value, a PAUSE intercept will occur. This feature can be used to detect contended spinlocks, especially when the lock holding VCPU is not scheduled. Rescheduling another VCPU prevents the VCPU seeking the lock from wasting its quantum by spinning idly. Experimental results show that most spinlocks are held for less than 1000 PAUSE cycles or more than a few thousand. Default the Pause Filter Counter to 3000 to detect the contended spinlocks. Processor support for this feature is indicated by a CPUID bit. On a 24 core system running 4 guests each with 16 VCPUs, this patch improved overall performance of each guest's 32 job kernbench by approximately 3-5% when combined with a scheduler algorithm thati caused the VCPU to sleep for a brief period. Further performance improvement may be possible with a more sophisticated yield algorithm. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:17 +02:00
Zhai, Edwin	4b8d54f972	KVM: VMX: Add support for Pause-Loop Exiting New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution control fields: PLE_Gap - upper bound on the amount of time between two successive executions of PAUSE in a loop. PLE_Window - upper bound on the amount of time a guest is allowed to execute in a PAUSE loop If the time, between this execution of PAUSE and previous one, exceeds the PLE_Gap, processor consider this PAUSE belongs to a new loop. Otherwise, processor determins the the total execution time of this loop(since 1st PAUSE in this loop), and triggers a VM exit if total time exceeds the PLE_Window. * Refer SDM volume 3b section 21.6.13 & 22.1.3. Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one VP is sched-out after hold a spinlock, then other VPs for same lock are sched-in to waste the CPU time. Our tests indicate that most spinlocks are held for less than 212 cycles. Performance tests show that with 2X LP over-commitment we can get +2% perf improvement for kernel build(Even more perf gain with more LPs). Signed-off-by: Zhai Edwin <edwin.zhai@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:17 +02:00
Joerg Roedel	d36f19e9ec	KVM: SVM: Remove nsvm_printk debugging code With all important informations now delivered through tracepoints we can savely remove the nsvm_printk debugging code for nested svm. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:17 +02:00
Joerg Roedel	532a46b989	KVM: SVM: Add tracepoint for skinit instruction This patch adds a tracepoint for the event that the guest executed the SKINIT instruction. This information is important because SKINIT is an SVM extenstion not yet implemented by nested SVM and we may need this information for debugging hypervisors that do not yet run on nested SVM. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:16 +02:00
Joerg Roedel	ec1ff79084	KVM: SVM: Add tracepoint for invlpga instruction This patch adds a tracepoint for the event that the guest executed the INVLPGA instruction. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:16 +02:00
Joerg Roedel	236649de33	KVM: SVM: Add tracepoint for #vmexit because intr pending This patch adds a special tracepoint for the event that a nested #vmexit is injected because kvm wants to inject an interrupt into the guest. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:16 +02:00
Joerg Roedel	17897f3668	KVM: SVM: Add tracepoint for injected #vmexit This patch adds a tracepoint for a nested #vmexit that gets re-injected to the guest. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:15 +02:00
Joerg Roedel	d8cabddf7e	KVM: SVM: Add tracepoint for nested #vmexit This patch adds a tracepoint for every #vmexit we get from a nested guest. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:15 +02:00
Joerg Roedel	0ac406de8f	KVM: SVM: Add tracepoint for nested vmrun This patch adds a dedicated kvm tracepoint for a nested vmrun. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:15 +02:00
Joerg Roedel	cd3ff653ae	KVM: SVM: Move INTR vmexit out of atomic code The nested SVM code emulates a #vmexit caused by a request to open the irq window right in the request function. This is a bug because the request function runs with preemption and interrupts disabled but the #vmexit emulation might sleep. This can cause a schedule()-while-atomic bug and is fixed with this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:15 +02:00
Alexander Graf	8d23c46624	KVM: SVM: Notify nested hypervisor of lost event injections If event_inj is valid on a #vmexit the host CPU would write the contents to exit_int_info, so the hypervisor knows that the event wasn't injected. We don't do this in nested SVM by now which is a bug and fixed by this patch. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:14 +02:00
Glauber Costa	e3267cbbbf	KVM: x86: include pvclock MSRs in msrs_to_save For a while now, we are issuing a rdmsr instruction to find out which msrs in our save list are really supported by the underlying machine. However, it fails to account for kvm-specific msrs, such as the pvclock ones. This patch moves then to the beginning of the list, and skip testing them. Cc: stable@kernel.org Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:14 +02:00
Jan Kiszka	91586a3b7d	KVM: x86: Rework guest single-step flag injection and filtering Push TF and RF injection and filtering on guest single-stepping into the vender get/set_rflags callbacks. This makes the whole mechanism more robust wrt user space IOCTL order and instruction emulations. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:14 +02:00
Marcelo Tosatti	a68a6a7282	KVM: x86: disable paravirt mmu reporting Disable paravirt MMU capability reporting, so that new (or rebooted) guests switch to native operation. Paravirt MMU is a burden to maintain and does not bring significant advantages compared to shadow anymore. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:14 +02:00
Jan Kiszka	355be0b930	KVM: x86: Refactor guest debug IOCTL handling Much of so far vendor-specific code for setting up guest debug can actually be handled by the generic code. This also fixes a minor deficit in the SVM part /wrt processing KVM_GUESTDBG_ENABLE. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:14 +02:00
Juan Quintela	201d945bcf	KVM: remove pre_task_link setting in save_state_to_tss16 Now, also remove pre_task_link setting in save_state_to_tss16. commit `b237ac37a1` Author: Gleb Natapov <gleb@redhat.com> Date: Mon Mar 30 16:03:24 2009 +0300 KVM: Fix task switch back link handling. CC: Gleb Natapov <gleb@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:13 +02:00
Zachary Amsden	3230bb4707	KVM: Fix hotplug of CPUs Both VMX and SVM require per-cpu memory allocation, which is done at module init time, for only online cpus. Backend was not allocating enough structure for all possible CPUs, so new CPUs coming online could not be hardware enabled. Signed-off-by: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:13 +02:00
Zachary Amsden	e6732a5af9	KVM: Fix printk name error in svm.c Signed-off-by: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:13 +02:00
Zachary Amsden	0cca790753	KVM: Kill the confusing tsc_ref_khz and ref_freq variables They are globals, not clearly protected by any ordering or locking, and vulnerable to various startup races. Instead, for variable TSC machines, register the cpufreq notifier and get the TSC frequency directly from the cpufreq machinery. Not only is it always right, it is also perfectly accurate, as no error prone measurement is required. On such machines, when a new CPU online is brought online, it isn't clear what frequency it will start with, and it may not correspond to the reference, thus in hardware_enable we clear the cpu_tsc_khz variable to zero and make sure it is set before running on a VCPU. Signed-off-by: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:12 +02:00
Zachary Amsden	b820cc0ca2	KVM: Separate timer intialization into an indepedent function Signed-off-by: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:12 +02:00
Joerg Roedel	e935d48e1b	KVM: SVM: Remove remaining occurences of rdtscll This patch replaces them with native_read_tsc() which can also be used in expressions and saves a variable on the stack in this case. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:12 +02:00
Joerg Roedel	33527ad7e1	KVM: SVM: don't copy exit_int_info on nested vmrun The exit_int_info field is only written by the hardware and never read. So it does not need to be copied on a vmrun emulation. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:11 +02:00
Joerg Roedel	7fcdb5103d	KVM: SVM: reorganize svm_interrupt_allowed This patch reorganizes the logic in svm_interrupt_allowed to make it better to read. This is important because the logic is a lot more complicated with Nested SVM. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:11 +02:00
Huang Weiyi	bfc33beaed	KVM: remove duplicated #include Remove duplicated #include('s) in arch/x86/kvm/lapic.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:10 +02:00
Alexander Graf	10474ae894	KVM: Activate Virtualization On Demand X86 CPUs need to have some magic happening to enable the virtualization extensions on them. This magic can result in unpleasant results for users, like blocking other VMMs from working (vmx) or using invalid TLB entries (svm). Currently KVM activates virtualization when the respective kernel module is loaded. This blocks us from autoloading KVM modules without breaking other VMMs. To circumvent this problem at least a bit, this patch introduces on demand activation of virtualization. This means, that instead virtualization is enabled on creation of the first virtual machine and disabled on destruction of the last one. So using this, KVM can be easily autoloaded, while keeping other hypervisors usable. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:10 +02:00
Marcelo Tosatti	e8b3433a5c	KVM: SVM: remove needless mmap_sem acquision from nested_svm_map nested_svm_map unnecessarily takes mmap_sem around gfn_to_page, since gfn_to_page / get_user_pages are responsible for it. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:10 +02:00
Mohammed Gamal	80ced186d1	KVM: VMX: Enhance invalid guest state emulation - Change returned handle_invalid_guest_state() to return relevant exit codes - Move triggering the emulation from vmx_vcpu_run() to vmx_handle_exit() - Return to userspace instead of repeatedly trying to emulate instructions that have already failed Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-12-03 09:32:09 +02:00
Mohammed Gamal	abcf14b560	KVM: x86 emulator: Add pusha and popa instructions This adds pusha and popa instructions (opcodes 0x60-0x61), this enables booting MINIX with invalid guest state emulation on. [marcelo: remove unused variable] Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:09 +02:00
Mohammed Gamal	94677e61fd	KVM: x86 emulator: Add missing decoder flags for 'or' instructions Add missing decoder flags for or instructions (0xc-0xd). Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:09 +02:00
Avi Kivity	bfd99ff5d4	KVM: Move assigned device code to own file Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:09 +02:00
Avi Kivity	367e1319b2	KVM: Return -ENOTTY on unrecognized ioctls Not the incorrect -EINVAL. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:08 +02:00
Gleb Natapov	680b3648ba	KVM: Drop kvm->irq_lock lock from irq injection path The only thing it protects now is interrupt injection into lapic and this can work lockless. Even now with kvm->irq_lock in place access to lapic is not entirely serialized since vcpu access doesn't take kvm->irq_lock. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:08 +02:00
Gleb Natapov	eba0226bdf	KVM: Move IO APIC to its own lock The allows removal of irq_lock from the injection path. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:08 +02:00
Gleb Natapov	1a6e4a8c27	KVM: Move irq sharing information to irqchip level This removes assumptions that max GSIs is smaller than number of pins. Sharing is tracked on pin level not GSI level. [avi: no PIC on ia64] Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:06 +02:00
Gleb Natapov	79c727d437	KVM: Call pic_clear_isr() on pic reset to reuse logic there Also move call of ack notifiers after pic state change. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:06 +02:00
Avi Kivity	851ba6922a	KVM: Don't pass kvm_run arguments They're just copies of vcpu->run, which is readily accessible. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:06 +02:00
Mohammed Gamal	d8769fedd4	KVM: x86 emulator: Introduce No64 decode option Introduces a new decode option "No64", which is used for instructions that are invalid in long mode. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:05 +02:00
Mohammed Gamal	0934ac9d13	KVM: x86 emulator: Add 'push/pop sreg' instructions [avi: avoid buffer overflow] Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:32:05 +02:00
Ingo Molnar	96200591a3	Merge branch 'tracing/hw-breakpoints' into perf/core Conflicts: arch/x86/kernel/kprobes.c kernel/trace/Makefile Merge reason: hw-breakpoints perf integration is looking good in testing and in reviews, plus conflicts are mounting up - so merge & resolve. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:07:23 +01:00
Frederic Weisbecker	59d8eb53ea	hw-breakpoints: Wrap in the KVM breakpoint active state check Wrap in the cpu dr7 check that tells if we have active breakpoints that need to be restored in the cpu. This wrapper makes the check more self-explainable and also reusable for any further other uses. Reported-by: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Avi Kivity <avi@redhat.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>	2009-11-10 11:23:43 +01:00
Frederic Weisbecker	24f1e32c60	hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events This patch rebase the implementation of the breakpoints API on top of perf events instances. Each breakpoints are now perf events that handle the register scheduling, thread/cpu attachment, etc.. The new layering is now made as follows: ptrace kgdb ftrace perf syscall \ \| / / \ \| / / / Core breakpoint API / / \| / \| / Breakpoints perf events \| \| Breakpoints PMU ---- Debug Register constraints handling (Part of core breakpoint API) \| \| Hardware debug registers Reasons of this rewrite: - Use the centralized/optimized pmu registers scheduling, implying an easier arch integration - More powerful register handling: perf attributes (pinned/flexible events, exclusive/non-exclusive, tunable period, etc...) Impact: - New perf ABI: the hardware breakpoints counters - Ptrace breakpoints setting remains tricky and still needs some per thread breakpoints references. Todo (in the order): - Support breakpoints perf counter events for perf tools (ie: implement perf_bpcounter_event()) - Support from perf tools Changes in v2: - Follow the perf "event " rename - The ptrace regression have been fixed (ptrace breakpoint perf events weren't released when a task ended) - Drop the struct hw_breakpoint and store generic fields in perf_event_attr. - Separate core and arch specific headers, drop asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h - Use new generic len/type for breakpoint - Handle off case: when breakpoints api is not supported by an arch Changes in v3: - Fix broken CONFIG_KVM, we need to propagate the breakpoint api changes to kvm when we exit the guest and restore the bp registers to the host. Changes in v4: - Drop the hw_breakpoint_restore() stub as it is only used by KVM - EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a module - Restore the breakpoints unconditionally on kvm guest exit: TIF_DEBUG_THREAD doesn't anymore cover every cases of running breakpoints and vcpu->arch.switch_db_regs might not always be set when the guest used debug registers. (Waiting for a reliable optimization) Changes in v5: - Split-up the asm-generic/hw-breakpoint.h moving to linux/hw_breakpoint.h into a separate patch - Optimize the breakpoints restoring while switching from kvm guest to host. We only want to restore the state if we have active breakpoints to the host, otherwise we don't care about messed-up address registers. - Add asm/hw_breakpoint.h to Kbuild - Fix bad breakpoint type in trace_selftest.c Changes in v6: - Fix wrong header inclusion in trace.h (triggered a build error with CONFIG_FTRACE_SELFTEST Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-08 15:34:42 +01:00
Gleb Natapov	abb3911965	KVM: get_tss_base_addr() should return a gpa_t If TSS we are switching to resides in high memory task switch will fail since address will be truncated. Windows2k3 does this sometimes when running with more then 4G Cc: stable@kernel.org Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-11-04 12:42:36 -02:00
Jan Kiszka	a9e38c3e01	KVM: x86: Catch potential overrun in MCE setup We only allocate memory for 32 MCE banks (KVM_MAX_MCE_BANKS) but we allow user space to fill up to 255 on setup (mcg_cap & 0xff), corrupting kernel memory. Catch these overflows. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-11-04 12:42:35 -02:00
Tejun Heo	0fe1e00954	percpu: make percpu symbols in x86 unique This patch updates percpu related symbols in x86 such that percpu symbols are unique and don't clash with local symbols. This serves two purposes of decreasing the possibility of global percpu symbol collision and allowing dropping per_cpu__ prefix from percpu symbols. * arch/x86/kernel/cpu/common.c: rename local variable to avoid collision * arch/x86/kvm/svm.c: s/svm_data/sd/ for local variables to avoid collision * arch/x86/kernel/cpu/cpu_debug.c: s/cpu_arr/cpud_arr/ s/priv_arr/cpud_priv_arr/ s/cpu_priv_count/cpud_priv_count/ * arch/x86/kernel/cpu/intel_cacheinfo.c: s/cpuid4_info/ici_cpuid4_info/ s/cache_kobject/ici_cache_kobject/ s/index_kobject/ici_index_kobject/ * arch/x86/kernel/ds.c: s/cpu_context/cpu_ds_context/ Partly based on Rusty Russell's "alloc_percpu: rename percpu vars which cause name clashes" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: (kvm) Avi Kivity <avi@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: x86@kernel.org	2009-10-29 22:34:14 +09:00
Frederic Weisbecker	0f8f86c7bd	Merge commit 'perf/core' into perf/hw-breakpoint Conflicts: kernel/Makefile kernel/trace/Makefile kernel/trace/trace.h samples/Makefile Merge reason: We need to be uptodate with the perf events development branch because we plan to rewrite the breakpoints API on top of perf events.	2009-10-18 01:12:33 +02:00
Frederik Deweerdt	8a8365c560	KVM: MMU: fix pointer cast On a 32 bits compile, commit `3da0dd433d` introduced the following warnings: arch/x86/kvm/mmu.c: In function ‘kvm_set_pte_rmapp’: arch/x86/kvm/mmu.c:770: warning: cast to pointer from integer of different size arch/x86/kvm/mmu.c: In function ‘kvm_set_spte_hva’: arch/x86/kvm/mmu.c:849: warning: cast from pointer to integer of different size The following patch uses 'unsigned long' instead of u64 to match the pointer size on both arches. Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-16 12:30:26 -03:00
Marcelo Tosatti	ace1546487	KVM: use proper hrtimer function to retrieve expiration time hrtimer->base can be temporarily NULL due to racing hrtimer_start. See switch_hrtimer_base/lock_hrtimer_base. Use hrtimer_get_remaining which is robust against it. CC: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-10-16 12:30:25 -03:00
Izik Eidus	3da0dd433d	KVM: add support for change_pte mmu notifiers this is needed for kvm if it want ksm to directly map pages into its shadow page tables. [marcelo: cast pfn assignment to u64] Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 17:04:53 +02:00
Izik Eidus	1403283acc	KVM: MMU: add SPTE_HOST_WRITEABLE flag to the shadow ptes this flag notify that the host physical page we are pointing to from the spte is write protected, and therefore we cant change its access to be write unless we run get_user_pages(write = 1). (this is needed for change_pte support in kvm) Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 17:04:50 +02:00
Izik Eidus	acb66dd051	KVM: MMU: dont hold pagecount reference for mapped sptes pages When using mmu notifiers, we are allowed to remove the page count reference tooken by get_user_pages to a specific page that is mapped inside the shadow page tables. This is needed so we can balance the pagecount against mapcount checking. (Right now kvm increase the pagecount and does not increase the mapcount when mapping page into shadow page table entry, so when comparing pagecount against mapcount, you have no reliable result.) Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 17:04:48 +02:00
Avi Kivity	6a54435560	KVM: Prevent overflow in KVM_GET_SUPPORTED_CPUID The number of entries is multiplied by the entry size, which can overflow on 32-bit hosts. Bound the entry count instead. Reported-by: David Wagner <daw@cs.berkeley.edu> Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-10-04 17:04:16 +02:00
Marcelo Tosatti	eb5109e311	KVM: VMX: flush TLB with INVEPT on cpu migration It is possible that stale EPTP-tagged mappings are used, if a vcpu migrates to a different pcpu. Set KVM_REQ_TLB_FLUSH in vmx_vcpu_load, when switching pcpus, which will invalidate both VPID and EPT mappings on the next vm-entry. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 13:57:24 +02:00
Aurelien Jarno	b2d83cfa3f	KVM: fix LAPIC timer period overflow Don't overflow when computing the 64-bit period from 32-bit registers. Fixes sourceforge bug #2826486. Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 13:57:23 +02:00
Joerg Roedel	20824f30bb	KVM: SVM: Handle tsc in svm_get_msr/svm_set_msr correctly When running nested we need to touch the l1 guests tsc_offset. Otherwise changes will be lost or a wrong value be read. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 13:57:23 +02:00
Joerg Roedel	77b1ab1732	KVM: SVM: Fix tsc offset adjustment when running nested When svm_vcpu_load is called while the vcpu is running in guest mode the tsc adjustment made there is lost on the next emulated #vmexit. This causes the tsc running backwards in the guest. This patch fixes the issue by also adjusting the tsc_offset in the emulated hsave area so that it will not get lost. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-10-04 13:57:22 +02:00
Ingo Molnar	dca2d6ac09	Merge branch 'linus' into tracing/hw-breakpoints Conflicts: arch/x86/kernel/process_64.c Semantic conflict fixed in: arch/x86/kvm/x86.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-15 12:18:15 +02:00
Linus Torvalds	69def9f05d	Merge branch 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (202 commits) MAINTAINERS: update KVM entry KVM: correct error-handling code KVM: fix compile warnings on s390 KVM: VMX: Check cpl before emulating debug register access KVM: fix misreporting of coalesced interrupts by kvm tracer KVM: x86: drop duplicate kvm_flush_remote_tlb calls KVM: VMX: call vmx_load_host_state() only if msr is cached KVM: VMX: Conditionally reload debug register 6 KVM: Use thread debug register storage instead of kvm specific data KVM guest: do not batch pte updates from interrupt context KVM: Fix coalesced interrupt reporting in IOAPIC KVM guest: fix bogus wallclock physical address calculation KVM: VMX: Fix cr8 exiting control clobbering by EPT KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp KVM: Document KVM_CAP_IRQCHIP KVM: Protect update_cr8_intercept() when running without an apic KVM: VMX: Fix EPT with WP bit change during paging KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors KVM: x86 emulator: Add adc and sbb missing decoder flags KVM: Add missing #include ...	2009-09-14 17:43:43 -07:00
Avi Kivity	0a79b00952	KVM: VMX: Check cpl before emulating debug register access Debug registers may only be accessed from cpl 0. Unfortunately, vmx will code to emulate the instruction even though it was issued from guest userspace, possibly leading to an unexpected trap later. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 18:11:10 +03:00
Gleb Natapov	4da748960a	KVM: fix misreporting of coalesced interrupts by kvm tracer Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 18:11:09 +03:00
Marcelo Tosatti	e3904e6ed0	KVM: x86: drop duplicate kvm_flush_remote_tlb calls kvm_mmu_slot_remove_write_access already calls it. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 18:11:08 +03:00
Gleb Natapov	542423b0dd	KVM: VMX: call vmx_load_host_state() only if msr is cached No need to call it before each kvm_(set\|get)_msr_common() Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 18:11:07 +03:00
Avi Kivity	e8a48342e9	KVM: VMX: Conditionally reload debug register 6 Only reload debug register 6 if we're running with the guest's debug registers. Saves around 150 cycles from the guest lightweight exit path. dr6 contains a couple of bits that are updated on #DB, so intercept that unconditionally and update those bits then. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 18:11:06 +03:00
Avi Kivity	3d53c27d05	KVM: Use thread debug register storage instead of kvm specific data Instead of saving the debug registers from the processor to a kvm data structure, rely in the debug registers stored in the thread structure. This allows us not to save dr6 and dr7. Reduces lightweight vmexit cost by 350 cycles, or 11 percent. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 18:11:04 +03:00
Gleb Natapov	5fff7d270b	KVM: VMX: Fix cr8 exiting control clobbering by EPT Don't call adjust_vmx_controls() two times for the same control. It restores options that were dropped earlier. This loses us the cr8 exit control, which causes a massive performance regression Windows x64. Cc: stable@kernel.org Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:57 +03:00
Avi Kivity	60f24784a9	KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp We know no pages are protected, so we can short-circuit the whole thing (including fairly nasty guest memory accesses). Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:56 +03:00
Avi Kivity	88c808fd42	KVM: Protect update_cr8_intercept() when running without an apic update_cr8_intercept() can be triggered from userspace while there is no apic present. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:54 +03:00
Sheng Yang	95eb84a758	KVM: VMX: Fix EPT with WP bit change during paging QNX update WP bit when paging enabled, which is not covered yet. This one fix QNX boot with EPT. Cc: stable@kernel.org Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:53 +03:00
Mikhail Ershov	d9048d3278	KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors Segment descriptors tables can be placed on two non-contiguous pages. This patch makes reading segment descriptors by linear address. Signed-off-by: Mikhail Ershov <Mike.Ershov@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:52 +03:00
Mohammed Gamal	7bdb588827	KVM: x86 emulator: Add adc and sbb missing decoder flags Add missing decoder flags for adc and sbb instructions (opcodes 0x14-0x15, 0x1c-0x1d) Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:51 +03:00
Avi Kivity	56e8231841	KVM: Rename x86_emulate.c to emulate.c We're in arch/x86, what could we possibly be emulating? Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:45 +03:00
Anthony Liguori	c0c7c04b87	KVM: When switching to a vm8086 task, load segments as 16-bit According to 16.2.5 in the SDM, eflags.vm in the tss is consulted before loading and new segments. If eflags.vm == 1, then the segments are treated as 16-bit segments. The LDTR and TR are not normally available in vm86 mode so if they happen to somehow get loaded, they need to be treated as 32-bit segments. This fixes an invalid vmentry failure in a custom OS that was happening after a task switch into vm8086 mode. Since the segments were being mistakenly treated as 32-bit, we loaded garbage state. Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:44 +03:00
Avi Kivity	345dcaa8fd	KVM: VMX: Adjust rflags if in real mode emulation We set rflags.vm86 when virtualizing real mode to do through vm8086 mode; so we need to take it out again when reading rflags. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:43 +03:00
Avi Kivity	52c7847d12	KVM: SVM: Drop tlb flush workaround in npt It is no longer possible to reproduce the problem any more, so presumably it has been fixed. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:40 +03:00
Gleb Natapov	cb142eb743	KVM: Update cr8 intercept when APIC TPR is changed by userspace Since on vcpu entry we do it only if apic is enabled we should do it when TPR is changed while apic is disabled. This happens when windows resets HW without setting TPR to zero. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:39 +03:00
Joerg Roedel	4b6e4dca70	KVM: SVM: enable nested svm by default Nested SVM is (in my experience) stable enough to be enabled by default. So omit the requirement to pass a module parameter. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:38 +03:00
Joerg Roedel	108768de55	KVM: SVM: check for nested VINTR flag in svm_interrupt_allowed Not checking for this flag breaks any nested hypervisor that does not set VINTR. So fix it with this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:37 +03:00
Joerg Roedel	26666957a5	KVM: SVM: move nested_svm_intr main logic out of if-clause This patch removes one indentation level from nested_svm_intr and makes the logic more readable. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:36 +03:00
Joerg Roedel	cda0ffdd86	KVM: SVM: remove unnecessary is_nested check from svm_cpu_run This check is not necessary. We have to sync the vcpu->arch.cr2 always back to the VMCB. This patch remove the is_nested check. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:34 +03:00
Joerg Roedel	410e4d573d	KVM: SVM: move special nested exit handling to separate function This patch moves the handling for special nested vmexits like #pf to a separate function. This makes the kvm_override parameter obsolete and makes the code more readable. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:33 +03:00
Joerg Roedel	1f8da47805	KVM: SVM: handle errors in vmrun emulation path appropriatly If nested svm fails to load the msrpm the vmrun succeeds with the old msrpm which is not correct. This patch changes the logic to roll back to host mode in case the msrpm cannot be loaded. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:32 +03:00
Joerg Roedel	ea8e064fe2	KVM: SVM: remove nested_svm_do and helper functions This function is not longer required. So remove it. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:31 +03:00
Joerg Roedel	9738b2c97d	KVM: SVM: clean up nested vmrun path This patch removes the usage of nested_svm_do from the vmrun emulation path. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:29 +03:00
Joerg Roedel	9966bf6872	KVM: SVM: clean up nestec vmload/vmsave paths This patch removes the usage of nested_svm_do from the vmload and vmsave emulation code paths. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:46:28 +03:00
Joerg Roedel	3d62d9aa98	KVM: SVM: clean up nested_svm_exit_handled_msr This patch changes nested svm to call nested_svm_exit_handled_msr directly and not through nested_svm_do. [alex: fix oops due to nested kmap_atomics] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 10:45:43 +03:00
Joerg Roedel	34f80cfad5	KVM: SVM: get rid of nested_svm_vmexit_real This patch is the starting point of removing nested_svm_do from the nested svm code. The nested_svm_do function basically maps two guest physical pages to host virtual addresses and calls a passed function on it. This function pointer code flow is hard to read and not the best technical solution here. As a side effect this patch indroduces the nested_svm_[un]map helper functions. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:26 +03:00
Joerg Roedel	0295ad7de8	KVM: SVM: simplify nested_svm_check_exception Makes the code of this function more readable by removing on indentation level for the core logic. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:25 +03:00
Joerg Roedel	9c4e40b994	KVM: SVM: do nested vmexit in nested_svm_exit_handled If this function returns true a nested vmexit is required. Move that vmexit into the nested_svm_exit_handled function. This also simplifies the handling of nested #pf intercepts in this function. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:25 +03:00
Joerg Roedel	4c2161aed5	KVM: SVM: consolidate nested_svm_exit_handled When caching guest intercepts there is no need anymore for the nested_svm_exit_handled_real function. So move its code into nested_svm_exit_handled. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:24 +03:00
Joerg Roedel	aad42c641c	KVM: SVM: cache nested intercepts When the nested intercepts are cached we don't need to call get_user_pages and/or map the nested vmcb on every nested #vmexit to check who will handle the intercept. Further this patch aligns the emulated svm behavior better to real hardware. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:24 +03:00
Joerg Roedel	e6aa9abd73	KVM: SVM: move nested svm state into seperate struct This makes it more clear for which purpose these members in the vcpu_svm exist. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:24 +03:00
Joerg Roedel	a5c3832dfe	KVM: SVM: complete interrupts after handling nested exits The interrupt completion code must run after nested exits are handled because not injected interrupts or exceptions may be handled by the l1 guest first. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:24 +03:00
Joerg Roedel	0460a979b4	KVM: SVM: copy only necessary parts of the control area on vmrun/vmexit The vmcb control area contains more then 800 bytes of reserved fields which are unnecessarily copied. Fix this by introducing a copy function which only copies the relevant part and saves time. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:23 +03:00
Joerg Roedel	defbba5660	KVM: SVM: optimize nested vmrun Only copy the necessary parts of the vmcb save area on vmrun and save precious time. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:23 +03:00
Joerg Roedel	33740e4009	KVM: SVM: optimize nested #vmexit It is more efficient to copy only the relevant parts of the vmcb back to the nested vmcb when we emulate an vmexit. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:23 +03:00
Joerg Roedel	2af9194d1b	KVM: SVM: add helper functions for global interrupt flag This patch makes the code easier to read when it comes to setting, clearing and checking the status of the virtualized global interrupt flag for the VCPU. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:23 +03:00
Gleb Natapov	88ba63c265	KVM: Replace pic_lock()/pic_unlock() with direct call to spinlock functions They are not doing anything else now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:22 +03:00
Gleb Natapov	938396a234	KVM: Call ack notifiers from PIC when guest OS acks an IRQ. Currently they are called when irq vector is been delivered. Calling ack notifiers at this point is wrong. Device assignment ack notifier enables host interrupts, but guest not yet had a chance to clear interrupt condition in a device. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:22 +03:00
Gleb Natapov	956f97cf66	KVM: Call kvm_vcpu_kick() inside pic spinlock d5ecfdd25 moved it out because back than it was impossible to call it inside spinlock. This restriction no longer exists. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:21 +03:00
Roel Kluin	3a34a8810b	KVM: fix EFER read buffer overflow Check whether index is within bounds before grabbing the element. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:21 +03:00
Amit Shah	1f3ee616dd	KVM: ignore reads to perfctr msrs We ignore writes to the perfctr msrs. Ignore reads as well. Kaspersky antivirus crashes Windows guests if it can't read these MSRs. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:21 +03:00
Avi Kivity	eab4b8aa34	KVM: VMX: Optimize vmx_get_cpl() Instead of calling vmx_get_segment() (which reads a whole bunch of vmcs fields), read only the cs selector which contains the cpl. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:21 +03:00
Jan Kiszka	07708c4af1	KVM: x86: Disallow hypercalls for guest callers in rings > 0 So far unprivileged guest callers running in ring 3 can issue, e.g., MMU hypercalls. Normally, such callers cannot provide any hand-crafted MMU command structure as it has to be passed by its physical address, but they can still crash the guest kernel by passing random addresses. To close the hole, this patch considers hypercalls valid only if issued from guest ring 0. This may still be relaxed on a per-hypercall base in the future once required. Cc: stable@kernel.org Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:20 +03:00
Marcelo Tosatti	b90c062c65	KVM: MMU: fix bogus alloc_mmu_pages assignment Remove the bogus n_free_mmu_pages assignment from alloc_mmu_pages. It breaks accounting of mmu pages, since n_free_mmu_pages is modified but the real number of pages remains the same. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:20 +03:00
Izik Eidus	3b80fffe2b	KVM: MMU: make __kvm_mmu_free_some_pages handle empty list First check if the list is empty before attempting to look at list entries. Cc: stable@kernel.org Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:20 +03:00
Bartlomiej Zolnierkiewicz	95fb4eb698	KVM: remove superfluous NULL pointer check in kvm_inject_pit_timer_irqs() This takes care of the following entries from Dan's list: arch/x86/kvm/i8254.c +714 kvm_inject_pit_timer_irqs(6) warning: variable derefenced in initializer 'vcpu' arch/x86/kvm/i8254.c +714 kvm_inject_pit_timer_irqs(6) warning: variable derefenced before check 'vcpu' Reported-by: Dan Carpenter <error27@gmail.com> Cc: corbet@lwn.net Cc: eteo@redhat.com Cc: Julia Lawall <julia@diku.dk> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Acked-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:19 +03:00
Joerg Roedel	344f414fa0	KVM: report 1GB page support to userspace If userspace knows that the kernel part supports 1GB pages it can enable the corresponding cpuid bit so that guests actually use GB pages. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:19 +03:00
Joerg Roedel	7e4e4056f7	KVM: MMU: shadow support for 1gb pages This patch adds support for shadow paging to the 1gb page table code in KVM. With this code the guest can use 1gb pages even if the host does not support them. [ Marcelo: fix shadow page collision on pmd level if a guest 1gb page is mapped with 4kb ptes on host level ] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:19 +03:00
Joerg Roedel	e04da980c3	KVM: MMU: make page walker aware of mapping levels The page walker may be used with nested paging too when accessing mmio areas. Make it support the additional page-level too. [ Marcelo: fix reserved bit check for 1gb pte ] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:18 +03:00
Joerg Roedel	852e3c19ac	KVM: MMU: make direct mapping paths aware of mapping levels Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:18 +03:00
Joerg Roedel	d25797b24c	KVM: MMU: rename is_largepage_backed to mapping_level With the new name and the corresponding backend changes this function can now support multiple hugepage sizes. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:18 +03:00
Joerg Roedel	44ad9944f1	KVM: MMU: make rmap code aware of mapping levels This patch removes the largepage parameter from the rmap_add function. Together with rmap_remove this function now uses the role.level field to find determine if the page is a huge page. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:18 +03:00
Marcelo Tosatti	1444885a04	KVM: limit lapic periodic timer frequency Otherwise its possible to starve the host by programming lapic timer with a very high frequency. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:17 +03:00
Mikhail Ershov	5f0269f5d7	KVM: Align cr8 threshold when userspace changes cr8 Commit f0a3602c20 ("KVM: Move interrupt injection logic to x86.c") does not update the cr8 intercept if the lapic is disabled, so when userspace updates cr8, the cr8 threshold control is not updated and we are left with illegal control fields. Fix by explicitly resetting the cr8 threshold. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:17 +03:00
Jan Kiszka	7f582ab6d8	KVM: VMX: Avoid to return ENOTSUPP to userland Choose some allowed error values for the cases VMX returned ENOTSUPP so far as these values could be returned by the KVM_RUN IOCTL. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:17 +03:00
Gleb Natapov	84fde248fe	KVM: PIT: Unregister ack notifier callback when freeing Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:16 +03:00
Sheng Yang	b927a3cec0	KVM: VMX: Introduce KVM_SET_IDENTITY_MAP_ADDR ioctl Now KVM allow guest to modify guest's physical address of EPT's identity mapping page. (change from v1, discard unnecessary check, change ioctl to accept parameter address rather than value) Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:16 +03:00
Akinobu Mita	b792c344df	KVM: x86: use kvm_get_gdt() and kvm_read_ldt() Use kvm_get_gdt() and kvm_read_ldt() to reduce inline assembly code. Cc: Avi Kivity <avi@redhat.com> Cc: kvm@vger.kernel.org Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:15 +03:00
Akinobu Mita	46a359e715	KVM: x86: use get_desc_base() and get_desc_limit() Use get_desc_base() and get_desc_limit() to get the base address and limit in desc_struct. Cc: Avi Kivity <avi@redhat.com> Cc: kvm@vger.kernel.org Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:15 +03:00
Marcelo Tosatti	6a1ac77110	KVM: MMU: fix missing locking in alloc_mmu_pages n_requested_mmu_pages/n_free_mmu_pages are used by kvm_mmu_change_mmu_pages to calculate the number of pages to zap. alloc_mmu_pages, called from the vcpu initialization path, modifies this variables without proper locking, which can result in a negative value in kvm_mmu_change_mmu_pages (say, with cpu hotplug). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:14 +03:00
Sheng Yang	3662cb1cd6	KVM: Discard unnecessary kvm_mmu_flush_tlb() in kvm_mmu_load() set_cr3() should already cover the TLB flushing. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:14 +03:00
Gleb Natapov	4088bb3cae	KVM: silence lapic kernel messages that can be triggered by a guest Some Linux versions (f8) try to read EOI register that is write only. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-09-10 08:33:14 +03:00
Gleb Natapov	a1b37100d9	KVM: Reduce runnability interface with arch support code Remove kvm_cpu_has_interrupt() and kvm_arch_interrupt_allowed() from interface between general code and arch code. kvm_arch_vcpu_runnable() checks for interrupts instead. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:13 +03:00
Gleb Natapov	b59bb7bdf0	KVM: Move exception handling to the same place as other events Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:13 +03:00
Joerg Roedel	a205bc19f0	KVM: MMU: Fix MMU_DEBUG compile breakage Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:13 +03:00
Gregory Haskins	d34e6b175e	KVM: add ioeventfd support ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd signal when written to by a guest. Host userspace can register any arbitrary IO address with a corresponding eventfd and then pass the eventfd to a specific end-point of interest for handling. Normal IO requires a blocking round-trip since the operation may cause side-effects in the emulated model or may return data to the caller. Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM "heavy-weight" exit back to userspace, and is ultimately serviced by qemu's device model synchronously before returning control back to the vcpu. However, there is a subclass of IO which acts purely as a trigger for other IO (such as to kick off an out-of-band DMA request, etc). For these patterns, the synchronous call is particularly expensive since we really only want to simply get our notification transmitted asychronously and return as quickly as possible. All the sychronous infrastructure to ensure proper data-dependencies are met in the normal IO case are just unecessary overhead for signalling. This adds additional computational load on the system, as well as latency to the signalling path. Therefore, we provide a mechanism for registration of an in-kernel trigger point that allows the VCPU to only require a very brief, lightweight exit just long enough to signal an eventfd. This also means that any clients compatible with the eventfd interface (which includes userspace and kernelspace equally well) can now register to be notified. The end result should be a more flexible and higher performance notification API for the backend KVM hypervisor and perhipheral components. To test this theory, we built a test-harness called "doorbell". This module has a function called "doorbell_ring()" which simply increments a counter for each time the doorbell is signaled. It supports signalling from either an eventfd, or an ioctl(). We then wired up two paths to the doorbell: One via QEMU via a registered io region and through the doorbell ioctl(). The other is direct via ioeventfd. You can download this test harness here: ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2 The measured results are as follows: qemu-mmio: 110000 iops, 9.09us rtt ioeventfd-mmio: 200100 iops, 5.00us rtt ioeventfd-pio: 367300 iops, 2.72us rtt I didn't measure qemu-pio, because I have to figure out how to register a PIO region with qemu's device model, and I got lazy. However, for now we can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO, and -350ns for HC, we get: qemu-pio: 153139 iops, 6.53us rtt ioeventfd-hc: 412585 iops, 2.37us rtt these are just for fun, for now, until I can gather more data. Here is a graph for your convenience: http://developer.novell.com/wiki/images/7/76/Iofd-chart.png The conclusion to draw is that we save about 4us by skipping the userspace hop. -------------------- Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:12 +03:00
Gregory Haskins	090b7aff27	KVM: make io_bus interface more robust Today kvm_io_bus_regsiter_dev() returns void and will internally BUG_ON if it fails. We want to create dynamic MMIO/PIO entries driven from userspace later in the series, so we need to enhance the code to be more robust with the following changes: 1) Add a return value to the registration function 2) Fix up all the callsites to check the return code, handle any failures, and percolate the error up to the caller. 3) Add an unregister function that collapses holes in the array Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:12 +03:00
Beth Kon	e9f4275732	KVM: PIT support for HPET legacy mode When kvm is in hpet_legacy_mode, the hpet is providing the timer interrupt and the pit should not be. So in legacy mode, the pit timer is destroyed, but the state of the pit is maintained. So if kvm or the guest tries to modify the state of the pit, this modification is accepted, except that the timer isn't actually started. When we exit hpet_legacy_mode, the current state of the pit (which is up to date since we've been accepting modifications) is used to restart the pit timer. The saved_mode code in kvm_pit_load_count temporarily changes mode to 0xff in order to destroy the timer, but then restores the actual value, again maintaining "current" state of the pit for possible later reenablement. [avi: add some reserved storage in the ioctl; make SET_PIT2 IOW] [marcelo: fix memory corruption due to reserved storage] Signed-off-by: Beth Kon <eak@us.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:12 +03:00
Gleb Natapov	0d1de2d901	KVM: Always report x2apic as supported feature We emulate x2apic in software, so host support is not required. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:11 +03:00
Gleb Natapov	c7f0f24b1f	KVM: No need to kick cpu if not in a guest mode This will save a couple of IPIs. Signed-off-by: Gleb Natapov <gleb@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:11 +03:00
Gleb Natapov	1000ff8d89	KVM: Add trace points in irqchip code Add tracepoint in msi/ioapic/pic set_irq() functions, in IPI sending and in the point where IRQ is placed into apic's IRR. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:11 +03:00
Andre Przywara	f7c6d14003	KVM: fix MMIO_CONF_BASE MSR access Some Windows versions check whether the BIOS has setup MMI/O for config space accesses on AMD Fam10h CPUs, we say "no" by returning 0 on reads and only allow disabling of MMI/O CfgSpace setup by igoring "0" writes. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:10 +03:00
Avi Kivity	f691fe1da7	KVM: Trace shadow page lifecycle Create, sync, unsync, zap. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:10 +03:00
Avi Kivity	0742017159	KVM: MMU: Trace guest pagetable walker Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:09 +03:00
Jan Kiszka	dc7e795e3d	Revert "KVM: x86: check for cr3 validity in ioctl_set_sregs" This reverts commit 6c20e1442bb1c62914bb85b7f4a38973d2a423ba. To my understanding, it became obsolete with the advent of the more robust check in mmu_alloc_roots (89da4ff17f). Moreover, it prevents the conceptually safe pattern 1. set sregs 2. register mem-slots 3. run vcpu by setting a sticky triple fault during step 1. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:09 +03:00
Andre Przywara	6098ca939e	KVM: handle AMD microcode MSR Windows 7 tries to update the CPU's microcode on some processors, so we ignore the MSR write here. The patchlevel register is already handled (returning 0), because the MSR number is the same as Intel's. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:08 +03:00
Sheng Yang	756975bbfd	KVM: Fix apic_mmio_write return for unaligned write Some in-famous OS do unaligned writing for APIC MMIO, and the return value has been missed in recent change, then the OS hangs. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:08 +03:00
Gleb Natapov	0105d1a526	KVM: x2apic interface to lapic This patch implements MSR interface to local apic as defines by x2apic Intel specification. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:08 +03:00
Gleb Natapov	fc61b800f9	KVM: Add Directed EOI support to APIC emulation Directed EOI is specified by x2APIC, but is available even when lapic is in xAPIC mode. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:07 +03:00
Avi Kivity	cb24772140	KVM: Trace apic registers using their symbolic names Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:07 +03:00
Avi Kivity	aec51dc4f1	KVM: Trace mmio Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:07 +03:00
Andre Przywara	c323c0e5f0	KVM: Ignore PCI ECS I/O enablement Linux guests will try to enable access to the extended PCI config space via the I/O ports 0xCF8/0xCFC on AMD Fam10h CPU. Since we (currently?) don't use ECS, simply ignore write and read attempts. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:06 +03:00
Michael S. Tsirkin	bda9020e24	KVM: remove in_range from io devices This changes bus accesses to use high-level kvm_io_bus_read/kvm_io_bus_write functions. in_range now becomes unused so it is removed from device ops in favor of read/write callbacks performing range checks internally. This allows aliasing (mostly for in-kernel virtio), as well as better error handling by making it possible to pass errors up to userspace. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:05 +03:00
Michael S. Tsirkin	6c47469453	KVM: convert bus to slots_lock Use slots_lock to protect device list on the bus. slots_lock is already taken for read everywhere, so we only need to take it for write when registering devices. This is in preparation to removing in_range and kvm->lock around it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:05 +03:00
Michael S. Tsirkin	108b56690f	KVM: switch pit creation to slots_lock switch pit creation to slots_lock. slots_lock is already taken for read everywhere, so we only need to take it for write when creating pit. This is in preparation to removing in_range and kvm->lock around it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:05 +03:00
Marcelo Tosatti	2023a29cbe	KVM: remove old KVMTRACE support code Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:03 +03:00
Andre Przywara	ed85c06853	KVM: introduce module parameter for ignoring unknown MSRs accesses KVM will inject a #GP into the guest if that tries to access unhandled MSRs. This will crash many guests. Although it would be the correct way to actually handle these MSRs, we introduce a runtime switchable module param called "ignore_msrs" (defaults to 0). If this is Y, unknown MSR reads will return 0, while MSR writes are simply dropped. In both cases we print a message to dmesg to inform the user about that. You can change the behaviour at any time by saying: # echo 1 > /sys/modules/kvm/parameters/ignore_msrs Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:03 +03:00
Andre Przywara	1fdbd48c24	KVM: ignore reads from AMDs C1E enabled MSR If the Linux kernel detects an C1E capable AMD processor (K8 RevF and higher), it will access a certain MSR on every attempt to go to halt. Explicitly handle this read and return 0 to let KVM run a Linux guest with the native AMD host CPU propagated to the guest. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:03 +03:00
Andre Przywara	8f1589d95e	KVM: ignore AMDs HWCR register access to set the FFDIS bit Linux tries to disable the flush filter on all AMD K8 CPUs. Since KVM does not handle the needed MSR, the injected #GP will panic the Linux kernel. Ignore setting of the HWCR.FFDIS bit in this MSR to let Linux boot with an AMD K8 family guest CPU. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:02 +03:00
Marcelo Tosatti	894a9c5543	KVM: x86: missing locking in PIT/IRQCHIP/SET_BSP_CPU ioctl paths Correct missing locking in a few places in x86's vm_ioctl handling path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:02 +03:00
Joerg Roedel	ec04b2604c	KVM: Prepare memslot data structures for multiple hugepage sizes [avi: fix build on non-x86] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:02 +03:00
Andre Przywara	4668f05078	KVM: x86 emulator: Add sysexit emulation Handle #UD intercept of the sysexit instruction in 64bit mode returning to 32bit compat mode on an AMD host. Setup the segment descriptors for CS and SS and the EIP/ESP registers according to the manual. Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:01 +03:00
Andre Przywara	8c60435261	KVM: x86 emulator: Add sysenter emulation Handle #UD intercept of the sysenter instruction in 32bit compat mode on an AMD host. Setup the segment descriptors for CS and SS and the EIP/ESP registers according to the manual. Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:01 +03:00
Andre Przywara	e66bb2ccdc	KVM: x86 emulator: add syscall emulation Handle #UD intercept of the syscall instruction in 32bit compat mode on an Intel host. Setup the segment descriptors for CS and SS and the EIP/ESP registers according to the manual. Save the RIP and EFLAGS to the correct registers. [avi: fix build on i386 due to missing R11] Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:00 +03:00
Andre Przywara	e99f050712	KVM: x86 emulator: Prepare for emulation of syscall instructions Add the flags needed for syscall, sysenter and sysexit to the opcode table. Catch (but for now ignore) the opcodes in the emulation switch/case. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:00 +03:00
Andre Przywara	b1d861431e	KVM: x86 emulator: Add missing EFLAGS bit definitions Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:00 +03:00
Andre Przywara	0cb5762ed2	KVM: Allow emulation of syscalls instructions on #UD Add the opcodes for syscall, sysenter and sysexit to the list of instructions handled by the undefined opcode handler. Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:59 +03:00
Marcelo Tosatti	229456fc34	KVM: convert custom marker based tracing to event traces This allows use of the powerful ftrace infrastructure. See Documentation/trace/ for usage information. [avi, stephen: various build fixes] [sheng: fix control register breakage] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:59 +03:00
Alexander Graf	219b65dcf6	KVM: SVM: Improve nested interrupt injection While trying to get Hyper-V running, I realized that the interrupt injection mechanisms that are in place right now are not 100% correct. This patch makes nested SVM's interrupt injection behave more like on a real machine. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:59 +03:00
Alexander Graf	ff092385e8	KVM: SVM: Implement INVLPGA SVM adds another way to do INVLPG by ASID which Hyper-V makes use of, so let's implement it! For now we just do the same thing invlpg does, as asid switching means we flush the mmu anyways. That might change one day though. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:58 +03:00
Alexander Graf	3c5d0a44b0	KVM: Implement MSRs used by Hyper-V Hyper-V uses some MSRs, some of which are actually reserved for BIOS usage. But let's be nice today and have it its way, because otherwise it fails terribly. [jaswinder: fix build for linux-next changes] Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:58 +03:00
Avi Kivity	b3dbf89e67	KVM: SVM: Don't save/restore host cr2 The host never reads cr2 in process context, so are free to clobber it. The vmx code does this, so we can safely remove the save/restore code. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:58 +03:00
Avi Kivity	d3edefc003	KVM: VMX: Only reload guest cr2 if different from host cr2 cr2 changes only rarely, and writing it is expensive. Avoid the costly cr2 writes by checking if it does not already hold the desired value. Shaves 70 cycles off the vmexit latency. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:57 +03:00
Jan Kiszka	681405bfc7	KVM: Drop useless atomic test from timer function The current code tries to optimize the setting of KVM_REQ_PENDING_TIMER but used atomic_inc_and_test - which always returns true unless pending had the invalid value of -1 on entry. This patch drops the test part preserving the original semantic but expressing it less confusingly. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:57 +03:00
Jan Kiszka	f7104db26a	KVM: Fix racy event propagation in timer Minor issue that likely had no practical relevance: the kvm timer function so far incremented the pending counter and then may reset it again to 1 in case reinjection was disabled. This opened a small racy window with the corresponding VCPU loop that may have happened to run on another (real) CPU and already consumed the value. Fix it by skipping the incrementation in case pending is already > 0. This opens a different race windows, but may only rarely cause lost events in case we do not care about them anyway (!reinject). Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:57 +03:00
Gleb Natapov	33e4c68656	KVM: Optimize searching for highest IRR Most of the time IRR is empty, so instead of scanning the whole IRR on each VM entry keep a variable that tells us if IRR is not empty. IRR will have to be scanned twice on each IRQ delivery, but this is much more rare than VM entry. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:57 +03:00
Gleb Natapov	6edf14d8d0	KVM: Replace pending exception by PF if it happens serially Replace previous exception with a new one in a hope that instruction re-execution will regenerate lost exception. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:56 +03:00
Marcelo Tosatti	54dee9933e	KVM: VMX: conditionally disable 2M pages Disable usage of 2M pages if VMX_EPT_2MB_PAGE_BIT (bit 16) is clear in MSR_IA32_VMX_EPT_VPID_CAP and EPT is enabled. [avi: s/largepages_disabled/largepages_enabled/ to avoid negative logic] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:56 +03:00
Marcelo Tosatti	68f89400bc	KVM: VMX: EPT misconfiguration handler Handler for EPT misconfiguration which checks for valid state in the shadow pagetables, printing the spte on each level. The separate WARN_ONs are useful for kerneloops.org. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:56 +03:00
Marcelo Tosatti	94d8b056a2	KVM: MMU: add kvm_mmu_get_spte_hierarchy helper Required by EPT misconfiguration handler. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:56 +03:00
Marcelo Tosatti	4d88954d62	KVM: MMU: make for_each_shadow_entry aware of largepages This way there is no need to add explicit checks in every for_each_shadow_entry user. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:55 +03:00
Marcelo Tosatti	e799794e02	KVM: VMX: more MSR_IA32_VMX_EPT_VPID_CAP capability bits Required for EPT misconfiguration handler. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:55 +03:00
Andre Przywara	71db602322	KVM: Move performance counter MSR access interception to generic x86 path The performance counter MSRs are different for AMD and Intel CPUs and they are chosen mainly by the CPUID vendor string. This patch catches writes to all addresses (regardless of VMX/SVM path) and handles them in the generic MSR handler routine. Writing a 0 into the event select register is something we perfectly emulate ;-), so don't print out a warning to dmesg in this case. This fixes booting a 64bit Windows guest with an AMD CPUID on an Intel host. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:54 +03:00
Marcelo Tosatti	2920d72857	KVM: MMU audit: largepage handling Make the audit code aware of largepages. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:54 +03:00
Marcelo Tosatti	2aaf65e8c4	KVM: MMU audit: audit_mappings tweaks - Fail early in case gfn_to_pfn returns is_error_pfn. - For the pre pte write case, avoid spurious "gva is valid but spte is notrap" messages (the emulation code does the guest write first, so this particular case is OK). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:54 +03:00
Marcelo Tosatti	48fc03174b	KVM: MMU audit: nontrapping ptes in nonleaf level It is valid to set non leaf sptes as notrap. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:54 +03:00
Marcelo Tosatti	e58b0f9e0e	KVM: MMU audit: update audit_write_protection - Unsync pages contain writable sptes in the rmap. - rmaps do not exclusively contain writable sptes anymore. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:53 +03:00
Marcelo Tosatti	08a3732bf2	KVM: MMU audit: update count_writable_mappings / count_rmaps Under testing, count_writable_mappings returns a value that is 2 integers larger than what count_rmaps returns. Suspicion is that either of the two functions is counting a duplicate (either positively or negatively). Modifying check_writable_mappings_rmap to check for rmap existance on all present MMU pages fails to trigger an error, which should keep Avi happy. Also introduce mmu_spte_walk to invoke a callback on all present sptes visible to the current vcpu, might be useful in the future. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:53 +03:00
Marcelo Tosatti	776e663336	KVM: MMU: introduce is_last_spte helper Hiding some of the last largepage / level interaction (which is useful for gbpages and for zero based levels). Also merge the PT_PAGE_TABLE_LEVEL clearing loop in unlink_children. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:53 +03:00
Avi Kivity	3f5d18a965	KVM: Return to userspace on emulation failure Instead of mindlessly retrying to execute the instruction, report the failure to userspace. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:52 +03:00
Gleb Natapov	988a2cae6a	KVM: Use macro to iterate over vcpus. [christian: remove unused variables on s390] Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:52 +03:00
Gleb Natapov	73880c80aa	KVM: Break dependency between vcpu index in vcpus array and vcpu_id. Archs are free to use vcpu_id as they see fit. For x86 it is used as vcpu's apic id. New ioctl is added to configure boot vcpu id that was assumed to be 0 till now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:52 +03:00
Gleb Natapov	1ed0ce000a	KVM: Use pointer to vcpu instead of vcpu_id in timer code. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:52 +03:00
Gleb Natapov	c5af89b68a	KVM: Introduce kvm_vcpu_is_bsp() function. Use it instead of open code "vcpu_id zero is BSP" assumption. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:51 +03:00
Avi Kivity	d555c333aa	KVM: MMU: s/shadow_pte/spte/ We use shadow_pte and spte inconsistently, switch to the shorter spelling. Rename set_shadow_pte() to __set_spte() to avoid a conflict with the existing set_spte(), and to indicate its lowlevelness. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:51 +03:00
Avi Kivity	43a3795a3a	KVM: MMU: Adjust pte accessors to explicitly indicate guest or shadow pte Since the guest and host ptes can have wildly different format, adjust the pte accessor names to indicate on which type of pte they operate on. No functional changes. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:51 +03:00
Avi Kivity	439e218a6f	KVM: MMU: Fix is_dirty_pte() is_dirty_pte() is used on guest ptes, not shadow ptes, so it needs to avoid shadow_dirty_mask and use PT_DIRTY_MASK instead. Misdetecting dirty pages could lead to unnecessarily setting the dirty bit under EPT. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:50 +03:00
Avi Kivity	7ffd92c53c	KVM: VMX: Move rmode structure to vmx-specific code rmode is only used in vmx, so move it to vmx.c Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:50 +03:00
Nitin A Kamble	3a624e29c7	KVM: VMX: Support Unrestricted Guest feature "Unrestricted Guest" feature is added in the VMX specification. Intel Westmere and onwards processors will support this feature. It allows kvm guests to run real mode and unpaged mode code natively in the VMX mode when EPT is turned on. With the unrestricted guest there is no need to emulate the guest real mode code in the vm86 container or in the emulator. Also the guest big real mode code works like native. The attached patch enhances KVM to use the unrestricted guest feature if available on the processor. It also adds a new kernel/module parameter to disable the unrestricted guest feature at the boot time. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:49 +03:00
Marcelo Tosatti	fa40a8214b	KVM: switch irq injection/acking data structures to irq_lock Protect irq injection/acking data structures with a separate irq_lock mutex. This fixes the following deadlock: CPU A CPU B kvm_vm_ioctl_deassign_dev_irq() mutex_lock(&kvm->lock); worker_thread() -> kvm_deassign_irq() -> kvm_assigned_dev_interrupt_work_handler() -> deassign_host_irq() mutex_lock(&kvm->lock); -> cancel_work_sync() [blocked] [gleb: fix ia64 path] Reported-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:49 +03:00
Marcelo Tosatti	9f4cc12765	KVM: Grab pic lock in kvm_pic_clear_isr_ack isr_ack is protected by kvm_pic->lock. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:48 +03:00
Jan Kiszka	238adc7705	KVM: Cleanup LAPIC interface None of the interface services the LAPIC emulation provides need to be exported to modules, and kvm_lapic_get_base is even totally unused today. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:48 +03:00
Avi Kivity	596ae89565	KVM: VMX: Fix reporting of unhandled EPT violations Instead of returning -ENOTSUPP, exit normally but indicate the hardware exit reason. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:46 +03:00
Avi Kivity	6de4f3ada4	KVM: Cache pdptrs Instead of reloading the pdptrs on every entry and exit (vmcs writes on vmx, guest memory access on svm) extract them on demand. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:46 +03:00
Avi Kivity	8f5d549f02	KVM: VMX: Simplify pdptr and cr3 management Instead of reading the PDPTRs from memory after every exit (which is slow and wrong, as the PDPTRs are stored on the cpu), sync the PDPTRs from memory to the VMCS before entry, and from the VMCS to memory after exit. Do the same for cr3. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:46 +03:00
Avi Kivity	2d84e993a8	KVM: VMX: Avoid duplicate ept tlb flush when setting cr3 vmx_set_cr3() will call vmx_tlb_flush(), which will flush the ept context. So there is no need to call ept_sync_context() explicitly. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:46 +03:00
Gregory Haskins	6b66ac1ae3	KVM: do not register i8254 PIO regions until we are initialized We currently publish the i8254 resources to the pio_bus before the devices are fully initialized. Since we hold the pit_lock, its probably not a real issue. But lets clean this up anyway. Reported-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:45 +03:00
Gregory Haskins	d76685c4a0	KVM: cleanup io_device code We modernize the io_device code so that we use container_of() instead of dev->private, and move the vtable to a separate ops structure (theoretically allows better caching for multiple instances of the same ops structure) Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:45 +03:00
Avi Kivity	6c8166a77c	KVM: SVM: Fold kvm_svm.h info svm.c kvm_svm.h is only included from svm.c, so fold it in. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:44 +03:00
Andre Przywara	017cb99e87	KVM: SVM: use explicit 64bit storage for sysenter values Since AMD does not support sysenter in 64bit mode, the VMCB fields storing the MSRs are truncated to 32bit upon VMRUN/#VMEXIT. So store the values in a separate 64bit storage to avoid truncation. [andre: fix amd->amd migration] Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:43 +03:00
Jan Kiszka	c5ff41ce66	KVM: Allow PIT emulation without speaker port The in-kernel speaker emulation is only a dummy and also unneeded from the performance point of view. Rather, it takes user space support to generate sound output on the host, e.g. console beeps. To allow this, introduce KVM_CREATE_PIT2 which controls in-kernel speaker port emulation via a flag passed along the new IOCTL. It also leaves room for future extensions of the PIT configuration interface. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:41 +03:00
Gregory Haskins	721eecbf4f	KVM: irqfd KVM provides a complete virtual system environment for guests, including support for injecting interrupts modeled after the real exception/interrupt facilities present on the native platform (such as the IDT on x86). Virtual interrupts can come from a variety of sources (emulated devices, pass-through devices, etc) but all must be injected to the guest via the KVM infrastructure. This patch adds a new mechanism to inject a specific interrupt to a guest using a decoupled eventfd mechnanism: Any legal signal on the irqfd (using eventfd semantics from either userspace or kernel) will translate into an injected interrupt in the guest at the next available interrupt window. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:41 +03:00
Avi Kivity	0ba12d1081	KVM: Move common KVM Kconfig items to new file virt/kvm/Kconfig Reduce Kconfig code duplication. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:41 +03:00
Gleb Natapov	787ff73637	KVM: Drop interrupt shadow when single stepping should be done only on VMX The problem exists only on VMX. Also currently we skip this step if there is pending exception. The patch fixes this too. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:41 +03:00
Christoph Hellwig	284e9b0f5a	KVM: cleanup arch/x86/kvm/Makefile Use proper foo-y style list additions to cleanup all the conditionals, move module selection after compound object selection and remove the superflous comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:40 +03:00
Avi Kivity	ee3d29e8be	KVM: x86 emulator: fix jmp far decoding (opcode 0xea) The jump target should not be sign extened; use an unsigned decode flag. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:40 +03:00
Avi Kivity	c9eaf20f26	KVM: x86 emulator: Implement zero-extended immediate decoding Absolute jumps use zero extended immediate operands. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:39 +03:00
Mark McLoughlin	cb007648de	KVM: fix cpuid E2BIG handling for extended request types If we run out of cpuid entries for extended request types we should return -E2BIG, just like we do for the standard request types. Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:39 +03:00
Jaswinder Singh Rajput	60af2ecdc5	KVM: Use MSR names in place of address Replace 0xc0010010 with MSR_K8_SYSCFG and 0xc0010015 with MSR_K7_HWCR. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:39 +03:00
Huang Ying	890ca9aefa	KVM: Add MCE support The related MSRs are emulated. MCE capability is exported via extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation such as the mcg_cap. MCE is injected via vcpu ioctl command KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are not implemented. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:39 +03:00
Jaswinder Singh Rajput	af24a4e4ae	KVM: Replace MSR_IA32_TIME_STAMP_COUNTER with MSR_IA32_TSC of msr-index.h Use standard msr-index.h's MSR declaration. MSR_IA32_TSC is better than MSR_IA32_TIME_STAMP_COUNTER as it also solves 80 column issue. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:38 +03:00
Gleb Natapov	ae0bb3e011	KVM: VMX: Properly handle software interrupt re-injection in real mode When reinjecting a software interrupt or exception, use the correct instruction length provided by the hardware instead of a hardcoded 1. Fixes problems running the suse 9.1 livecd boot loader. Problem introduced by commit f0a3602c20 ("KVM: Move interrupt injection logic to x86.c"). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:32:38 +03:00
Ingo Molnar	5f9ece0240	Merge commit 'v2.6.31-rc7' into x86/cleanups Merge reason: we were on -rc1 before - go up to -rc7 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-08-24 12:25:54 +02:00
Marcin Slusarz	9f51e24ee8	x86: Use printk_once() Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> LKML-Reference: <1249847649-11631-6-git-send-email-marcin.slusarz@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-08-09 22:28:34 +02:00
Marcelo Tosatti	53a27b39ff	KVM: MMU: limit rmap chain length Otherwise the host can spend too long traversing an rmap chain, which happens under a spinlock. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-06 12:06:54 +03:00
Jan Kiszka	263799a361	KVM: VMX: Fix locking imbalance on emulation failure We have to disable preemption and IRQs on every exit from handle_invalid_guest_state, otherwise we generate at least a preempt_disable imbalance. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:59:45 +03:00
Jan Kiszka	34f0c1ad27	KVM: VMX: Fix locking order in handle_invalid_guest_state Release and re-acquire preemption and IRQ lock in the same order as vcpu_enter_guest does. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:59:44 +03:00
Marcelo Tosatti	025dbbf36a	KVM: MMU: handle n_free_mmu_pages > n_alloc_mmu_pages in kvm_mmu_change_mmu_pages kvm_mmu_change_mmu_pages mishandles the case where n_alloc_mmu_pages is smaller then n_free_mmu_pages, by not checking if the result of the subtraction is negative. Its a valid condition which can happen if a large number of pages has been recently freed. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:59:43 +03:00
Marcelo Tosatti	4b656b1202	KVM: SVM: force new asid on vcpu migration If a migrated vcpu matches the asid_generation value of the target pcpu, there will be no TLB flush via TLB_CONTROL_FLUSH_ALL_ASID. The check for vcpu.cpu in pre_svm_run is meaningless since svm_vcpu_load already updated it on schedule in. Such vcpu will VMRUN with stale TLB entries. Based on original patch from Joerg Roedel (http://patchwork.kernel.org/patch/10021/) Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:59:29 +03:00
Marcelo Tosatti	d6289b9365	KVM: x86: verify MTRR/PAT validity Do not allow invalid memory types in MTRR/PAT (generating a #GP otherwise). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:58:16 +03:00
Marcelo Tosatti	0ff77873b1	KVM: PIT: fix kpit_elapsed division by zero Fix division by zero triggered by latch count command on uninitialized counter. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:58:11 +03:00
Jan Kiszka	e125e7b694	KVM: Fix KVM_GET_MSR_INDEX_LIST So far, KVM copied the emulated_msrs (only MSR_IA32_MISC_ENABLE) to a wrong address in user space due to broken pointer arithmetic. This caused subtle corruption up there (missing MSR_IA32_MISC_ENABLE had probably no practical relevance). Moreover, the size check for the user-provided kvm_msr_list forgot about emulated MSRs. Cc: stable@kernel.org Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-08-05 13:58:03 +03:00
Jaswinder Singh Rajput	bde8922325	KVM: shut up uninit compiler warning in paging_tmpl.h Dixes compilation warning: CC arch/x86/kernel/io_delay.o arch/x86/kvm/paging_tmpl.h: In function ‘paging64_fetch’: arch/x86/kvm/paging_tmpl.h:279: warning: ‘sptep’ may be used uninitialized in this function arch/x86/kvm/paging_tmpl.h: In function ‘paging32_fetch’: arch/x86/kvm/paging_tmpl.h:279: warning: ‘sptep’ may be used uninitialized in this function warning is bogus (always have a least one level), but need to shut the compiler up. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-28 14:10:32 +03:00
Amit Shah	9e6996240a	KVM: Ignore reads to K7 EVNTSEL MSRs In commit `7fe29e0faa` we ignored the reads to the P6 EVNTSEL MSRs. That fixed crashes on Intel machines. Ignore the reads to K7 EVNTSEL MSRs as well to fix this on AMD hosts. This fixes Kaspersky antivirus crashing Windows guests on AMD hosts. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-28 14:10:31 +03:00
Avi Kivity	e3c7cb6ad7	KVM: VMX: Handle vmx instruction vmexits IF a guest tries to use vmx instructions, inject a #UD to let it know the instruction is not implemented, rather than crashing. This prevents guest userspace from crashing the guest kernel. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-28 14:10:31 +03:00
Jaswinder Singh Rajput	a3f9d3981c	KVM: kvm/x86_emulate.c toggle_interruptibility() should be static toggle_interruptibility() is used only by same file, it should be static. Fixed following sparse warning : arch/x86/kvm/x86_emulate.c:1364:6: warning: symbol 'toggle_interruptibility' was not declared. Should it be static? Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-28 14:10:30 +03:00
Avi Kivity	29a4b9333b	KVM: MMU: Allow 4K ptes with bit 7 (PAT) set Bit 7 is perfectly legal in the 4K page leve; it is used for the PAT. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-28 14:10:29 +03:00
Mel Gorman	6484eb3e2a	page allocator: do not check NUMA node ID when the caller knows the node is valid Callers of alloc_pages_node() can optionally specify -1 as a node to mean "allocate from the current node". However, a number of the callers in fast paths know for a fact their node is valid. To avoid a comparison and branch, this patch adds alloc_pages_exact_node() that only checks the nid with VM_BUG_ON(). Callers that know their node is valid are then converted. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:32 -07:00
Andi Kleen	a0861c02a9	KVM: Add VT-x machine check support VT-x needs an explicit MC vector intercept to handle machine checks in the hyper visor. It also has a special option to catch machine checks that happen during VT entry. Do these interceptions and forward them to the Linux machine check handler. Make it always look like user space is interrupted because the machine check handler treats kernel/user space differently. Thanks to Jiang Yunhong for help and testing. Cc: stable@kernel.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 12:27:08 +03:00
Nitin A Kamble	56b237e31a	KVM: VMX: Rename rmode.active to rmode.vm86_active That way the interpretation of rmode.active becomes more clear with unrestricted guest code. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:49:00 +03:00
Gleb Natapov	20f65983e3	KVM: Move "exit due to NMI" handling into vmx_complete_interrupts() To save us one reading of VM_EXIT_INTR_INFO. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	8db3baa2db	KVM: Disable CR8 intercept if tpr patching is active Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	36752c9b91	KVM: Do not migrate pending software interrupts. INTn will be re-executed after migration. If we wanted to migrate pending software interrupt we would need to migrate interrupt type and instruction length too, but we do not have all required info on SVM, so SVM->VMX migration would need to re-execute INTn anyway. To make it simple never migrate pending soft interrupt. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	44c11430b5	KVM: inject NMI after IRET from a previous NMI, not before. If NMI is received during handling of another NMI it should be injected immediately after IRET from previous NMI handler, but SVM intercept IRET before instruction execution so we can't inject pending NMI at this point and there is not way to request exit when NMI window opens. This patch fix SVM code to open NMI window after IRET by single stepping over IRET instruction. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	6a8b1d1312	KVM: Always request IRQ/NMI window if an interrupt is pending Currently they are not requested if there is pending exception. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:58 +03:00
Gleb Natapov	66fd3f7f90	KVM: Do not re-execute INTn instruction. Re-inject event instead. This is what Intel suggest. Also use correct instruction length when re-injecting soft fault/interrupt. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:58 +03:00
Gleb Natapov	f629cf8485	KVM: skip_emulated_instruction() decode instruction if size is not known Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:58 +03:00
Gleb Natapov	923c61bbc6	KVM: Remove irq_pending bitmap Only one interrupt vector can be injected from userspace irqchip at any given time so no need to store it in a bitmap. Put it into interrupt queue directly. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:57 +03:00
Gleb Natapov	fa9726b073	KVM: Do not allow interrupt injection from userspace if there is a pending event. The exception will immediately close the interrupt window. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:57 +03:00
Gleb Natapov	3298b75c88	KVM: Unprotect a page if #PF happens during NMI injection. It is done for exception and interrupt already. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:57 +03:00
Robert P. J. Day	58f8ac279a	KVM: Expand on "help" info to specify kvm intel and amd module names Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:55 +03:00
Marcelo Tosatti	8986ecc0ef	KVM: x86: check for cr3 validity in mmu_alloc_roots Verify the cr3 address stored in vcpu->arch.cr3 points to an existant memslot. If not, inject a triple fault. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:55 +03:00
Marcelo Tosatti	7c8a83b75a	KVM: MMU: protect kvm_mmu_change_mmu_pages with mmu_lock kvm_handle_hva, called by MMU notifiers, manipulates mmu data only with the protection of mmu_lock. Update kvm_mmu_change_mmu_pages callers to take mmu_lock, thus protecting against kvm_handle_hva. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:54 +03:00
Glauber Costa	310b5d306c	KVM: Deal with interrupt shadow state for emulated instructions We currently unblock shadow interrupt state when we skip an instruction, but failing to do so when we actually emulate one. This blocks interrupts in key instruction blocks, in particular sti; hlt; sequences If the instruction emulated is an sti, we have to block shadow interrupts. The same goes for mov ss. pop ss also needs it, but we don't currently emulate it. Without this patch, I cannot boot gpxe option roms at vmx machines. This is described at https://bugzilla.redhat.com/show_bug.cgi?id=494469 Signed-off-by: Glauber Costa <glommer@redhat.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:54 +03:00
Glauber Costa	2809f5d2c4	KVM: Replace ->drop_interrupt_shadow() by ->set_interrupt_shadow() This patch replaces drop_interrupt_shadow with the more general set_interrupt_shadow, that can either drop or raise it, depending on its parameter. It also adds ->get_interrupt_shadow() for future use. Signed-off-by: Glauber Costa <glommer@redhat.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:54 +03:00
Marcelo Tosatti	32f8840064	KVM: use smp_send_reschedule in kvm_vcpu_kick KVM uses a function call IPI to cause the exit of a guest running on a physical cpu. For virtual interrupt notification there is no need to wait on IPI receival, or to execute any function. This is exactly what the reschedule IPI does, without the overhead of function IPI. So use it instead of smp_call_function_single in kvm_vcpu_kick. Also change the "guest_mode" variable to a bit in vcpu->requests, and use that to collapse multiple IPI's that would be issued between the first one and zeroing of guest mode. This allows kvm_vcpu_kick to called with interrupts disabled. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:53 +03:00
Avi Kivity	d149c731e4	KVM: Update cpuid 1.ecx reporting Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:53 +03:00
Avi Kivity	7faa4ee1c7	KVM: Add AMD cpuid bit: cr8_legacy, abm, misaligned sse, sse4, 3dnow prefetch Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:52 +03:00
Avi Kivity	8d753f369b	KVM: Fix cpuid feature misreporting MTRR, PAT, MCE, and MCA are all supported (to some extent) but not reported. Vista requires these features, so if userspace relies on kernel cpuid reporting, it loses support for Vista. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:52 +03:00
Jan Kiszka	d6a8c875f3	KVM: Drop request_nmi from stats The stats entry request_nmi is no longer used as the related user space interface was dropped. So clean it up. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:51 +03:00
Gleb Natapov	fe8e7f83de	KVM: SVM: Don't reinject event that caused a task switch If a task switch caused by an event remove it from the event queue. VMX already does that. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:51 +03:00
Andre Przywara	b586eb0253	KVM: SVM: Fix cross vendor migration issue in segment segment descriptor On AMD CPUs sometimes the DB bit in the stack segment descriptor is left as 1, although the whole segment has been made unusable. Clear it here to pass an Intel VMX entry check when cross vendor migrating. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:51 +03:00
Glauber Costa	9b5843ddd2	KVM: fix apic_debug instances Apparently nobody turned this on in a while... setting apic_debug to something compilable, generates some errors. This patch fixes it. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:50 +03:00
Sheng Yang	522c68c441	KVM: Enable snooping control for supported hardware Memory aliases with different memory type is a problem for guest. For the guest without assigned device, the memory type of guest memory would always been the same as host(WB); but for the assigned device, some part of memory may be used as DMA and then set to uncacheable memory type(UC/WC), which would be a conflict of host memory type then be a potential issue. Snooping control can guarantee the cache correctness of memory go through the DMA engine of VT-d. [avi: fix build on ia64] Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:50 +03:00
Sheng Yang	4b12f0de33	KVM: Replace get_mt_mask_shift with get_mt_mask Shadow_mt_mask is out of date, now it have only been used as a flag to indicate if TDP enabled. Get rid of it and use tdp_enabled instead. Also put memory type logical in kvm_x86_ops->get_mt_mask(). Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:49 +03:00
Jan Blunck	9b62e5b10f	KVM: Wake up waitqueue before calling get_cpu() This moves the get_cpu() call down to be called after we wake up the waiters. Therefore the waitqueue locks can safely be rt mutex. Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:49 +03:00
Gleb Natapov	14d0bc1f7c	KVM: Get rid of get_irq() callback It just returns pending IRQ vector from the queue for VMX/SVM. Get IRQ directly from the queue before migration and put it back after. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:49 +03:00
Gleb Natapov	16d7a19117	KVM: Fix userspace IRQ chip migration Re-put pending IRQ vector into interrupt_bitmap before migration. Otherwise it will be lost if migration happens in the wrong time. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:48 +03:00
Gleb Natapov	95ba827313	KVM: SVM: Add NMI injection support Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:48 +03:00
Gleb Natapov	c4282df98a	KVM: Get rid of arch.interrupt_window_open & arch.nmi_window_open They are recalculated before each use anyway. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:48 +03:00
Gleb Natapov	0a5fff1923	KVM: Do not report TPR write to userspace if new value bigger or equal to a previous one. Saves many exits to userspace in a case of IRQ chip in userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	615d519305	KVM: sync_lapic_to_cr8() should always sync cr8 to V_TPR Even if IRQ chip is in userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	115666dfc7	KVM: Remove kvm_push_irq() No longer used. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	1d6ed0cb95	KVM: Remove inject_pending_vectors() callback It is the same as inject_pending_irq() for VMX/SVM now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	1cb948ae86	KVM: Remove exception_injected() callback. It always return false for VMX/SVM now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:46 +03:00
Gleb Natapov	9222be18f7	KVM: SVM: Coalesce userspace/kernel irqchip interrupt injection logic Start to use interrupt/exception queues like VMX does. This also fix the bug that if exit was caused by a guest internal exception access to IDT the exception was not reinjected. Use EVENTINJ to inject interrupts. Use VINT only for detecting when IRQ windows is open again. EVENTINJ ensures the interrupt is injected immediately and not delayed. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:46 +03:00
Gleb Natapov	5df5664647	KVM: Use kvm_arch_interrupt_allowed() instead of checking interrupt_window_open directly kvm_arch_interrupt_allowed() also checks IF so drop the check. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:46 +03:00
Gleb Natapov	1f21e79aac	KVM: VMX: Cleanup vmx_intr_assist() Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Gleb Natapov	863e8e658e	KVM: VMX: Consolidate userspace and kernel interrupt injection for VMX Use the same callback to inject irq/nmi events no matter what irqchip is in use. Only from VMX for now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Gleb Natapov	8061823a25	KVM: Make kvm_cpu_(has\|get)_interrupt() work for userspace irqchip too At the vector level, kernel and userspace irqchip are fairly similar. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Jan Kiszka	3438253926	KVM: MMU: Fix auditing code Fix build breakage of hpa lookup in audit_mappings_page. Moreover, make this function robust against shadow_notrap_nonpresent_pte entries. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Marcelo Tosatti	59839dfff5	KVM: x86: check for cr3 validity in ioctl_set_sregs Matt T. Yourst notes that kvm_arch_vcpu_ioctl_set_sregs lacks validity checking for the new cr3 value: "Userspace callers of KVM_SET_SREGS can pass a bogus value of cr3 to the kernel. This will trigger a NULL pointer access in gfn_to_rmap() when userspace next tries to call KVM_RUN on the affected VCPU and kvm attempts to activate the new non-existent page table root. This happens since kvm only validates that cr3 points to a valid guest physical memory page when code inside the guest sets cr3. However, kvm currently trusts the userspace caller (e.g. QEMU) on the host machine to always supply a valid page table root, rather than properly validating it along with the rest of the reloaded guest state." http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2687641&group_id=180599 Check for a valid cr3 address in kvm_arch_vcpu_ioctl_set_sregs, triple fault in case of failure. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:43 +03:00
Avi Kivity	463656c000	KVM: Replace kvmclock open-coded get_cpu_var() with the real thing Suggested by Ingo Molnar. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:42 +03:00
Gleb Natapov	8317c298ea	KVM: SVM: Skip instruction on a task switch only when appropriate If a task switch was initiated because off a task gate in IDT and IDT was accessed because of an external even the instruction should not be skipped. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:42 +03:00
Gleb Natapov	ba8afb6b0a	KVM: x86 emulator: Add new mode of instruction emulation: skip In the new mode instruction is decoded, but not executed. The EIP is moved to point after the instruction. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:42 +03:00
Gleb Natapov	e637b8238a	KVM: x86 emulator: Decode soft interrupt instructions Do not emulate them yet. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	84ce66a686	KVM: x86 emulator: Completely decode in/out at decoding stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	341de7e372	KVM: x86 emulator: Add unsigned byte immediate decode Extend "Source operand type" opcode description field to 4 bites to accommodate new option. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	d53c4777b3	KVM: x86 emulator: Complete decoding of call near in decode stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	b2833e3cde	KVM: x86 emulator: Complete short/near jcc decoding in decode stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Gleb Natapov	782b877c80	KVM: x86 emulator: Complete ljmp decoding at decode stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Gleb Natapov	0654169e73	KVM: x86 emulator: Add lcall decoding No emulation yet. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Gleb Natapov	a5f868bd45	KVM: x86 emulator: Add decoding of 16bit second immediate argument Such as segment number in lcall/ljmp Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Marcelo Tosatti	c2d0ee46e6	KVM: MMU: remove global page optimization logic Complexity to fix it not worthwhile the gains, as discussed in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:39 +03:00
Marcelo Tosatti	ede2ccc517	KVM: PIT: fix count read and mode 0 handling Commit 46ee278652f4cbd51013471b64c7897ba9bcd1b1 causes Solaris 10 to hang on boot. Assuming that PIT counter reads should return 0 for an expired timer is wrong: when it is active, the counter never stops (see comment on __kpit_elapsed). Also arm a one shot timer for mode 0. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:39 +03:00
Gleb Natapov	2d03319654	KVM: x86 emulator: fix call near emulation The length of pushed on to the stack return address depends on operand size not address size. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:38 +03:00
Sheng Yang	4c26b4cd6f	KVM: MMU: Discard reserved bits checking on PDE bit 7-8 1. It's related to a Linux kernel bug which fixed by Ingo on `07a66d7c53`. The original code exists for quite a long time, and it would convert a PDE for large page into a normal PDE. But it fail to fit normal PDE well. With the code before Ingo's fix, the kernel would fall reserved bit checking with bit 8 - the remaining global bit of PTE. So the kernel would receive a double-fault. 2. After discussion, we decide to discard PDE bit 7-8 reserved checking for now. For this marked as reserved in SDM, but didn't checked by the processor in fact... Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:38 +03:00
Gleb Natapov	64a7ec0668	KVM: Fix unneeded instruction skipping during task switching. There is no need to skip instruction if the reason for a task switch is a task gate in IDT and access to it is caused by an external even. The problem is currently solved only for VMX since there is no reliable way to skip an instruction in SVM. We should emulate it instead. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:38 +03:00
Gleb Natapov	b237ac37a1	KVM: Fix task switch back link handling. Back link is written to a wrong TSS now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:37 +03:00
Gleb Natapov	8843419048	KVM: VMX: Do not zero idt_vectoring_info in vmx_complete_interrupts(). We will need it later in task_switch(). Code in handle_exception() is dead. is_external_interrupt(vect_info) will always be false since idt_vectoring_info is zeroed in vmx_complete_interrupts(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:37 +03:00
Gleb Natapov	37b96e9880	KVM: VMX: Rewrite vmx_complete_interrupt()'s twisted maze of if() statements ...with a more straightforward switch(). Also fix a bug when NMI could be dropped on exit. Although this should never happen in practice, since NMIs can only be injected, never triggered internally by the guest like exceptions. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:37 +03:00
Gleb Natapov	7b4a25cb29	KVM: VMX: Fix handling of a fault during NMI unblocked due to IRET Bit 12 is undefined in any of the following cases: If the VM exit sets the valid bit in the IDT-vectoring information field. If the VM exit is due to a double fault. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Dong, Eddie	20c466b561	KVM: Use rsvd_bits_mask in load_pdptrs() Also remove bit 5-6 from rsvd_bits_mask per latest SDM. Signed-off-by: Eddie Dong <Eddie.Dong@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Sheng Yang	93ba03c2e2	KVM: VMX: Fix feature testing The testing of feature is too early now, before vmcs_config complete initialization. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Sheng Yang	045471563d	KVM: VMX: Clean up Flex Priority related And clean paranthes on returns. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Wei Yongjun	7a6ce84c74	KVM: remove pointless conditional before kfree() in lapic initialization Remove pointless conditional before kfree(). Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:35 +03:00
Avi Kivity	9645bb56b3	KVM: MMU: Use different shadows when EFER.NXE changes A pte that is shadowed when the guest EFER.NXE=1 is not valid when EFER.NXE=0; if bit 63 is set, the pte should cause a fault, and since the shadow EFER always has NX enabled, this won't happen. Fix by using a different shadow page table for different EFER.NXE bits. This allows vcpus to run correctly with different values of EFER.NXE, and for transitions on this bit to be handled correctly without requiring a full flush. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:35 +03:00
Dong, Eddie	82725b20e2	KVM: MMU: Emulate #PF error code of reserved bits violation Detect, indicate, and propagate page faults where reserved bits are set. Take care to handle the different paging modes, each of which has different sets of reserved bits. [avi: fix pte reserved bits for efer.nxe=0] Signed-off-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:35 +03:00
Eddie Dong	a8b876b1a4	KVM: MMU: Fix comment in page_fault() The original one is for the code before refactoring. Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:34 +03:00
Sheng Yang	f9c617f611	KVM: VMX: Correct wrong vmcs field sizes EXIT_QUALIFICATION and GUEST_LINEAR_ADDRESS are natural width, not 64-bit. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:34 +03:00
Avi Kivity	7d433b9f94	KVM: VMX: Make flexpriority module parameter reflect hardware capability If the hardware does not support flexpriority, zero the module parameter. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:34 +03:00
Gleb Natapov	78646121e9	KVM: Fix interrupt unhalting a vcpu when it shouldn't kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking if interrupt window is actually opened. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:33 +03:00
Gleb Natapov	09cec75488	KVM: Timer event should not unconditionally unhalt vcpu. Currently timer events are processed before entering guest mode. Move it to main vcpu event loop since timer events should be processed even while vcpu is halted. Timer may cause interrupt/nmi to be injected and only then vcpu will be unhalted. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:33 +03:00
Avi Kivity	089d034e0c	KVM: VMX: Fold vm_need_ept() into callers Trivial. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	575ff2dcb2	KVM: VMX: Zero ept module parameter if ept is not present Allows reading back hardware capability. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	919818abc2	KVM: VMX: Zero the vpid module parameter if vpid is not supported This allows reading back how the hardware is configured. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	4462d21a61	KVM: VMX: Annotate module parameters as __read_mostly Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	736caefe15	KVM: VMX: Simplify module parameter names Instead of 'enable_vpid=1', use a simple 'vpid=1'. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Avi Kivity	6062d012ed	KVM: VMX: Rename kvm_handle_exit() to vmx_handle_exit() It is a static vmx-specific function. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Avi Kivity	c1f8bc04c6	KVM: VMX: Make module parameters readable Useful to see how the module was loaded. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Gleb Natapov	fe4c7b1914	KVM: reuse (pop\|push)_irq from svm.c in vmx.c The prioritized bit vector manipulation functions are useful in both vmx and svm. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Gleb Natapov	61c50edfcd	KVM: SVM: Remove duplicate code in svm_do_inject_vector() svm_do_inject_vector() reimplements pop_irq(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:30 +03:00
Amit Shah	7fe29e0faa	KVM: x86: Ignore reads to EVNTSEL MSRs We ignore writes to the performance counters and performance event selector registers already. Kaspersky antivirus reads the eventsel MSR causing it to crash with the current behaviour. Return 0 as data when the eventsel registers are read to stop the crash. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:30 +03:00
Gleb Natapov	f00be0cae4	KVM: MMU: do not free active mmu pages in free_mmu_pages() free_mmu_pages() should only undo what alloc_mmu_pages() does. Free mmu pages from the generic VM destruction function, kvm_destroy_vm(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:30 +03:00
Sheng Yang	e56d532f20	KVM: Device assignment framework rework After discussion with Marcelo, we decided to rework device assignment framework together. The old problems are kernel logic is unnecessary complex. So Marcelo suggest to split it into a more elegant way: 1. Split host IRQ assign and guest IRQ assign. And userspace determine the combination. Also discard msi2intx parameter, userspace can specific KVM_DEV_IRQ_HOST_MSI \| KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to enable MSI to INTx convertion. 2. Split assign IRQ and deassign IRQ. Import two new ioctls: KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ. This patch also fixed the reversed _IOR vs _IOW in definition(by deprecated the old interface). [avi: replace homemade bitcount() by hweight_long()] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:29 +03:00
Hannes Eder	386eb6e8b3	KVM: make 'lapic_timer_ops' and 'kpit_ops' static Fix this sparse warnings: arch/x86/kvm/lapic.c:916:22: warning: symbol 'lapic_timer_ops' was not declared. Should it be static? arch/x86/kvm/i8254.c:268:22: warning: symbol 'kpit_ops' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:29 +03:00
Gleb Natapov	58c2dde17d	KVM: APIC: get rid of deliver_bitmask Deliver interrupt during destination matching loop. Signed-off-by: Gleb Natapov <gleb@redhat.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:27 +03:00
Gleb Natapov	e1035715ef	KVM: change the way how lowest priority vcpu is calculated The new way does not require additional loop over vcpus to calculate the one with lowest priority as one is chosen during delivery bitmap construction. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:27 +03:00
Gleb Natapov	343f94fe4d	KVM: consolidate ioapic/ipi interrupt delivery logic Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in apic_send_ipi() to figure out ipi destination instead of reimplementing the logic. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:27 +03:00
Gleb Natapov	6da7e3f643	KVM: APIC: kvm_apic_set_irq deliver all kinds of interrupts Get rid of ioapic_inj_irq() and ioapic_inj_nmi() functions. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:26 +03:00
Joerg Roedel	f5a1e9f895	KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr There is no reason to update the shadow pte here because the guest pte is only changed to dirty state. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:26 +03:00
Marcelo Tosatti	d3c7b77d1a	KVM: unify part of generic timer handling Hide the internals of vcpu awakening / injection from the in-kernel emulated timers. This makes future changes in this logic easier and decreases the distance to more generic timer handling. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:25 +03:00
Marcelo Tosatti	fd66842370	KVM: PIT: remove usage of count_load_time for channel 0 We can infer elapsed time from hrtimer_expires_remaining. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:25 +03:00
Marcelo Tosatti	5a05d54554	KVM: PIT: remove unused scheduled variable Unused. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:25 +03:00
Matt T. Yourst	2dea4c84bc	KVM: x86: silence preempt warning on kvm_write_guest_time This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64) with PREEMPT enabled. We're getting syslog warnings like this many (but not all) times qemu tells KVM to run the VCPU: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-x86/28938 caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit Call Trace: debug_smp_processor_id+0xf7/0x100 kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] ? __wake_up+0x4e/0x70 ? wake_futex+0x27/0x40 kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm] enqueue_hrtimer+0x8a/0x110 _spin_unlock_irqrestore+0x27/0x50 vfs_ioctl+0x31/0xa0 do_vfs_ioctl+0x74/0x480 sys_futex+0xb4/0x140 sys_ioctl+0x99/0xa0 system_call_fastpath+0x16/0x1b As it turns out, the call trace is messed up due to gcc's inlining, but I isolated the problem anyway: kvm_write_guest_time() is being used in a non-thread-safe manner on preemptable kernels. Basically kvm_write_guest_time()'s body needs to be surrounded by preempt_disable() and preempt_enable(), since the kernel won't let us query any per-CPU data (indirectly using smp_processor_id()) without preemption disabled. The attached patch fixes this issue by disabling preemption inside kvm_write_guest_time(). [marcelo: surround only __get_cpu_var calls since the warning is harmless] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:24 +03:00
Sheng Yang	bfd349d073	KVM: bit ops for deliver_bitmap It's also convenient when we extend KVM supported vcpu number in the future. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:22 +03:00
Sheng Yang	110c2faeba	KVM: Update intr delivery func to accept unsigned long* bitmap Would be used with bit ops, and would be easily extended if KVM_MAX_VCPUS is increased. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:22 +03:00
Avi Kivity	5897297bc2	KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE Windows 2008 accesses this MSR often on context switch intensive workloads; since we run in guest context with the guest MSR value loaded (so swapgs can work correctly), we can simply disable interception of rdmsr/wrmsr for this MSR. A complication occurs since in legacy mode, we run with the host MSR value loaded. In this case we enable interception. This means we need two MSR bitmaps, one for legacy mode and one for long mode. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:21 +03:00
Avi Kivity	3e7c73e9b1	KVM: VMX: Don't use highmem pages for the msr and pio bitmaps Highmem pages are a pain, and saving three lowmem pages on i386 isn't worth the extra code. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:21 +03:00
Avi Kivity	a2edf57f51	KVM: Fix PDPTR reloading on CR4 writes The processor is documented to reload the PDPTRs while in PAE mode if any of the CR4 bits PSE, PGE, or PAE change. Linux relies on this behaviour when zapping the low mappings of PAE kernels during boot. The code already handled changes to CR4.PAE; augment it to also notice changes to PSE and PGE. This triggered while booting an F11 PAE kernel; the futex initialization code runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem ended up uninitialized, killing PI futexes and pulseaudio which uses them. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-25 20:00:53 +03:00
Avi Kivity	a8cd0244e9	KVM: Make paravirt tlb flush also reload the PAE PDPTRs The paravirt tlb flush may be used not only to flush TLBs, but also to reload the four page-directory-pointer-table entries, as it is used as a replacement for reloading CR3. Change the code to do the entire CR3 reloading dance instead of simply flushing the TLB. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-25 20:00:50 +03:00
Avi Kivity	99f85a28a7	KVM: SVM: Remove port 80 passthrough KVM optimizes guest port 80 accesses by passthing them through to the host. Some AMD machines die on port 80 writes, allowing the guest to hard-lock the host. Remove the port passthrough to avoid the problem. Cc: stable@kernel.org Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Tested-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-11 14:40:51 +03:00
Avi Kivity	e286e86e6d	KVM: Make EFER reads safe when EFER does not exist Some processors don't have EFER; don't oops if userspace wants us to read EFER when we check NX. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-11 11:19:00 +03:00
Avi Kivity	334b8ad7b1	KVM: Fix NX support reporting NX support is bit 20, not bit 1. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-11 11:18:48 +03:00
Andre Przywara	19bca6ab75	KVM: SVM: Fix cross vendor migration issue with unusable bit AMDs VMCB does not have an explicit unusable segment descriptor field, so we emulate it by using "not present". This has to be setup before the fixups, because this field is used there. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-11 11:18:04 +03:00
Jan Kiszka	888d256e9c	KVM: Unregister cpufreq notifier on unload Properly unregister cpufreq notifier on onload if it was registered during init. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-04-22 13:54:33 +03:00
Joerg Roedel	7f1ea20896	KVM: x86: release time_page on vcpu destruction Not releasing the time_page causes a leak of that page or the compound page it is situated in. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-04-22 13:52:10 +03:00
Marcelo Tosatti	bf47a760f6	KVM: MMU: disable global page optimization Complexity to fix it not worthwhile the gains, as discussed in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-04-22 13:52:09 +03:00
Ingo Molnar	8302294f43	Merge branch 'tracing/core-v2' into tracing-for-linus Conflicts: include/linux/slub_def.h lib/Kconfig.debug mm/slob.c mm/slub.c	2009-04-02 00:49:02 +02:00
Avi Kivity	16175a796d	KVM: VMX: Don't allow uninhibited access to EFER on i386 vmx_set_msr() does not allow i386 guests to touch EFER, but they can still do so through the default: label in the switch. If they set EFER_LME, they can oops the host. Fix by having EFER access through the normal channel (which will check for EFER_LME) even on i386. Reported-and-tested-by: Benjamin Gilbert <bgilbert@cs.cmu.edu> Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:15 +02:00
Andrea Arcangeli	4539b35881	KVM: Fix missing smp tlb flush in invlpg When kvm emulates an invlpg instruction, it can drop a shadow pte, but leaves the guest tlbs intact. This can cause memory corruption when swapping out. Without this the other cpu can still write to a freed host physical page. tlb smp flush must happen if rmap_remove is called always before mmu_lock is released because the VM will take the mmu_lock before it can finally add the page to the freelist after swapout. mmu notifier makes it safe to flush the tlb after freeing the page (otherwise it would never be safe) so we can do a single flush for multiple sptes invalidated. Cc: stable@kernel.org Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:14 +02:00
Hannes Eder	cded19f396	KVM: fix sparse warnings: Should it be static? Impact: Make symbols static. Fix this sparse warnings: arch/x86/kvm/mmu.c:992:5: warning: symbol 'mmu_pages_add' was not declared. Should it be static? arch/x86/kvm/mmu.c:1124:5: warning: symbol 'mmu_pages_next' was not declared. Should it be static? arch/x86/kvm/mmu.c:1144:6: warning: symbol 'mmu_pages_clear_parents' was not declared. Should it be static? arch/x86/kvm/x86.c:2037:5: warning: symbol 'kvm_read_guest_virt' was not declared. Should it be static? arch/x86/kvm/x86.c:2067:5: warning: symbol 'kvm_write_guest_virt' was not declared. Should it be static? virt/kvm/irq_comm.c:220:5: warning: symbol 'setup_routing_entry' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:14 +02:00
Hannes Eder	d7364a29b3	KVM: fix sparse warnings: context imbalance Impact: Attribute function with __acquires(...) resp. __releases(...). Fix this sparse warnings: arch/x86/kvm/i8259.c:34:13: warning: context imbalance in 'pic_lock' - wrong count at exit arch/x86/kvm/i8259.c:39:13: warning: context imbalance in 'pic_unlock' - unexpected unlock Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:13 +02:00
Amit Shah	41d6af1192	KVM: is_long_mode() should check for EFER.LMA is_long_mode currently checks the LongModeEnable bit in EFER instead of the LongModeActive bit. This is wrong, but we survived this till now since it wasn't triggered. This breaks guests that go from long mode to compatibility mode. This is noticed on a solaris guest and fixes bug #1842160 Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2009-03-24 11:03:13 +02:00
Amit Shah	401d10dee0	KVM: VMX: Update necessary state when guest enters long mode setup_msrs() should be called when entering long mode to save the shadow state for the 64-bit guest state. Using vmx_set_efer() in enter_lmode() removes some duplicated code and also ensures we call setup_msrs(). We can safely pass the value of shadow_efer to vmx_set_efer() as no other bits in the efer change while enabling long mode (guest first sets EFER.LME, then sets CR0.PG which causes a vmexit where we activate long mode). With this fix, is_long_mode() can check for EFER.LMA set instead of EFER.LME and 5e23049e86dd298b72e206b420513dbc3a240cd9 can be reverted. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:13 +02:00
Joerg Roedel	c5bc224240	KVM: MMU: Fix another largepage memory leak In the paging_fetch function rmap_remove is called after setting a large pte to non-present. This causes rmap_remove to not drop the reference to the large page. The result is a memory leak of that page. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:11 +02:00
Andre Przywara	1fbdc7a585	KVM: SVM: set accessed bit for VMCB segment selectors In the segment descriptor _cache_ the accessed bit is always set (although it can be cleared in the descriptor itself). Since Intel checks for this condition on a VMENTRY, set this bit in the AMD path to enable cross vendor migration. Cc: stable@kernel.org Signed-off-by: Andre Przywara <andre.przywara@amd.com> Acked-By: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:11 +02:00
Gleb Natapov	4925663a07	KVM: Report IRQ injection status to userspace. IRQ injection status is either -1 (if there was no CPU found that should except the interrupt because IRQ was masked or ioapic was misconfigured or ...) or >= 0 in that case the number indicates to how many CPUs interrupt was injected. If the value is 0 it means that the interrupt was coalesced and probably should be reinjected. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:11 +02:00
Joerg Roedel	452425dbaa	KVM: MMU: remove assertion in kvm_mmu_alloc_page The assertion no longer makes sense since we don't clear page tables on allocation; instead we clear them during prefetch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:10 +02:00
Joerg Roedel	6bed6b9e84	KVM: MMU: remove redundant check in mmu_set_spte The following code flow is unnecessary: if (largepage) was_rmapped = is_large_pte(*shadow_pte); else was_rmapped = 1; The is_large_pte() function will always evaluate to one here because the (largepage && !is_large_pte) case is already handled in the first if-clause. So we can remove this check and set was_rmapped to one always here. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:10 +02:00
Gerd Hoffmann	c807660407	KVM: Fix kvmclock on !constant_tsc boxes kvmclock currently falls apart on machines without constant tsc. This patch fixes it. Changes: * keep tsc frequency in a per-cpu variable. * handle kvmclock update using a new request flag, thus checking whenever we need an update each time we enter guest context. * use a cpufreq notifier to track frequency changes and force kvmclock updates. * send ipis to kick cpu out of guest context if needed to make sure the guest doesn't see stale values. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:09 +02:00
Sheng Yang	49cd7d2238	KVM: VMX: Use kvm_mmu_page_fault() handle EPT violation mmio Removed duplicated code. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:09 +02:00
Jan Kiszka	34c33d163f	KVM: Drop unused evaluations from string pio handlers Looks like neither the direction nor the rep prefix are used anymore. Drop related evaluations from SVM's and VMX's I/O exit handlers. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:08 +02:00
Alexander Graf	1b2fd70c4e	KVM: Add FFXSR support AMD K10 CPUs implement the FFXSR feature that gets enabled using EFER. Let's check if the virtual CPU description includes that CPUID feature bit and allow enabling it then. This is required for Windows Server 2008 in Hyper-V mode. v2 adds CPUID capability exposure Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:08 +02:00
Marcelo Tosatti	44882eed2e	KVM: make irq ack notifications aware of routing table IRQ ack notifications assume an identity mapping between pin->gsi, which might not be the case with, for example, HPET. Translate before acking. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Gleb Natapov <gleb@redhat.com>	2009-03-24 11:03:08 +02:00
Avi Kivity	399ec807dd	KVM: Userspace controlled irq routing Currently KVM has a static routing from GSI numbers to interrupts (namely, 0-15 are mapped 1:1 to both PIC and IOAPIC, and 16:23 are mapped 1:1 to the IOAPIC). This is insufficient for several reasons: - HPET requires non 1:1 mapping for the timer interrupt - MSIs need a new method to assign interrupt numbers and dispatch them - ACPI APIC mode needs to be able to reassign the PCI LINK interrupts to the ioapics This patch implements an interrupt routing table (as a linked list, but this can be easily changed) and a userspace interface to replace the table. The routing table is initialized according to the current hardwired mapping. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:06 +02:00
Amit Shah	1935547504	KVM: x86: Fix typos and whitespace errors Some typos, comments, whitespace errors corrected in the cpuid code Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:05 +02:00
Avi Kivity	5a41accd3f	KVM: MMU: Only enable cr4_pge role in shadow mode Two dimensional paging is only confused by it. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:04 +02:00
Avi Kivity	f6e2c02b6d	KVM: MMU: Rename "metaphysical" attribute to "direct" This actually describes what is going on, rather than alerting the reader that something strange is going on. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:04 +02:00
Marcelo Tosatti	9903a927a4	KVM: MMU: drop zeroing on mmu_memory_cache_alloc Zeroing on mmu_memory_cache_alloc is unnecessary since: - Smaller areas are pre-allocated with kmem_cache_zalloc. - Page pointed by ->spt is overwritten with prefetch_page and entries in page pointed by ->gfns are initialized before reading. [avi: zeroing pages is unnecessary] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:04 +02:00
Joe Perches	ff81ff10b4	KVM: SVM: Fix typo in has_svm() Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:04 +02:00
Avi Kivity	4780c65904	KVM: Reset PIT irq injection logic when the PIT IRQ is unmasked While the PIT is masked the guest cannot ack the irq, so the reinject logic will never allow the interrupt to be injected. Fix by resetting the reinjection counters on unmask. Unbreaks Xen. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:03 +02:00
Avi Kivity	5d9b8e30f5	KVM: Add CONFIG_HAVE_KVM_IRQCHIP Two KVM archs support irqchips and two don't. Add a Kconfig item to make selecting between the two models easier. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:02 +02:00
Avi Kivity	4677a3b693	KVM: MMU: Optimize page unshadowing Using kvm_mmu_lookup_page() will result in multiple scans of the hash chains; use hlist_for_each_entry_safe() to achieve a single scan instead. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:02 +02:00
Alexander Graf	c8a73f186b	KVM: SVM: Add microcode patch level dummy VMware ESX checks if the microcode level is correct when using a barcelona CPU, in order to see if it actually can use SVM. Let's tell it we're on the safe side... Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:02 +02:00
Avi Kivity	269e05e485	KVM: Properly lock PIT creation Otherwise, two threads can create a PIT in parallel and cause a memory leak. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:01 +02:00
Avi Kivity	a77ab5ead5	KVM: x86 emulator: implement 'ret far' instruction (opcode 0xcb) Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:01 +02:00
Avi Kivity	8b3079a5c0	KVM: VMX: When emulating on invalid vmx state, don't return to userspace unnecessarily If we aren't doing mmio there's no need to exit to userspace (which will just be confused). Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:00 +02:00
Avi Kivity	350f69dcd1	KVM: x86 emulator: Make emulate_pop() a little more generic Allow emulate_pop() to read into arbitrary memory rather than just the source operand. Needed for complicated instructions like far returns. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:00 +02:00
Avi Kivity	10f32d84c7	KVM: VMX: Prevent exit handler from running if emulating due to invalid state If we've just emulated an instruction, we won't have any valid exit reason and associated information. Fix by moving the clearing of the emulation_required flag to the exit handler. This way the exit handler can notice that we've been emulating and abort early. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:00 +02:00
Avi Kivity	9fd4a3b7a4	KVM: VMX: don't clobber segment AR if emulating invalid state The ususable bit is important for determining state validity; don't clobber it. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:03:00 +02:00
Avi Kivity	1872a3f411	KVM: VMX: Fix guest state validity checks The vmx guest state validity checks are full of bugs. Make them conform to the manual. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:59 +02:00
Marcelo Tosatti	52d939a0bf	KVM: PIT: provide an option to disable interrupt reinjection Certain clocks (such as TSC) in older 2.6 guests overaccount for lost ticks, causing severe time drift. Interrupt reinjection magnifies the problem. Provide an option to disable it. [avi: allow room for expansion in case we want to disable reinjection of other timers] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:55 +02:00
Avi Kivity	61a6bd672b	KVM: Fallback support for MSR_VM_HSAVE_PA Since we advertise MSR_VM_HSAVE_PA, userspace will attempt to read it even on Intel. Implement fake support for this MSR to avoid the warnings. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:54 +02:00
Izik Eidus	0f34607440	KVM: remove the vmap usage vmap() on guest pages hides those pages from the Linux mm for an extended (userspace determined) amount of time. Get rid of it. Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:54 +02:00
Izik Eidus	77c2002e7c	KVM: introduce kvm_read_guest_virt, kvm_write_guest_virt This commit change the name of emulator_read_std into kvm_read_guest_virt, and add new function name kvm_write_guest_virt that allow writing into a guest virtual address. Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:54 +02:00
Marcelo Tosatti	53f658b3c3	KVM: VMX: initialize TSC offset relative to vm creation time VMX initializes the TSC offset for each vcpu at different times, and also reinitializes it for vcpus other than 0 on APIC SIPI message. This bug causes the TSC's to appear unsynchronized in the guest, even if the host is good. Older Linux kernels don't handle the situation very well, so gettimeofday is likely to go backwards in time: http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831 Fix it by initializating the offset of each vcpu relative to vm creation time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the APIC MP init path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:53 +02:00
Avi Kivity	e8c4a4e8a7	KVM: MMU: Drop walk_shadow() No longer used. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:53 +02:00
Avi Kivity	a461930bc3	KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in invlpg() Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:53 +02:00
Avi Kivity	e7a04c99b5	KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in fetch() Effectively reverting to the pre walk_shadow() version -- but now with the reusable for_each(). Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:52 +02:00
Avi Kivity	9f652d21c3	KVM: MMU: Use for_each_shadow_entry() in __direct_map() Eliminating a callback and a useless structure. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:52 +02:00
Avi Kivity	2d11123a77	KVM: MMU: Add for_each_shadow_entry(), a simpler alternative to walk_shadow() Using a for_each loop style removes the need to write callback and nasty casts. Implement the walk_shadow() using the for_each_shadow_entry(). Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:52 +02:00
Avi Kivity	2b3d2a2060	KVM: Fix vmload and friends misinterpreted as lidt The AMD SVM instruction family all overload the 0f 01 /3 opcode, further multiplexing on the three r/m bits. But the code decided that anything that isn't a vmmcall must be an lidt (which shares the 0f 01 /3 opcode, for the case that mod = 3). Fix by aborting emulation if this isn't a vmmcall. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:51 +02:00
Avi Kivity	e207831804	KVM: MMU: Initialize a shadow page's global attribute from cr4.pge If cr4.pge is cleared, we ought to treat any ptes in the page as non-global. This allows us to remove the check from set_spte(). Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:51 +02:00
Avi Kivity	2f0b3d60b2	KVM: MMU: Segregate mmu pages created with different cr4.pge settings Don't allow a vcpu with cr4.pge cleared to use a shadow page created with cr4.pge set; this might cause a cr3 switch not to sync ptes that have the global bit set (the global bit has no effect if !cr4.pge). This can only occur on smp with different cr4.pge settings for different vcpus (since a cr4 change will resync the shadow ptes), but there's no cost to being correct here. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:51 +02:00
Avi Kivity	a770f6f28b	KVM: MMU: Inherit a shadow page's guest level count from vcpu setup Instead of "calculating" it on every shadow page allocation, set it once when switching modes, and copy it when allocating pages. This doesn't buy us much, but sets up the stage for inheriting more information related to the mmu setup. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:51 +02:00
Jan Kiszka	ae675ef01c	KVM: x86: Wire-up hardware breakpoints for guest debugging Add the remaining bits to make use of debug registers also for guest debugging, thus enabling the use of hardware breakpoints and watchpoints. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:50 +02:00
Jan Kiszka	42dbaa5a05	KVM: x86: Virtualize debug registers So far KVM only had basic x86 debug register support, once introduced to realize guest debugging that way. The guest itself was not able to use those registers. This patch now adds (almost) full support for guest self-debugging via hardware registers. It refactors the code, moving generic parts out of SVM (VMX was already cleaned up by the KVM_SET_GUEST_DEBUG patches), and it ensures that the registers are properly switched between host and guest. This patch also prepares debug register usage by the host. The latter will (once wired-up by the following patch) allow for hardware breakpoints/watchpoints in guest code. If this is enabled, the guest will only see faked debug registers without functionality, but with content reflecting the guest's modifications. Tested on Intel only, but SVM /should/ work as well, but who knows... Known limitations: Trapping on tss switch won't work - most probably on Intel. Credits also go to Joerg Roedel - I used his once posted debugging series as platform for this patch. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:49 +02:00
Jan Kiszka	55934c0bd3	KVM: VMX: Allow single-stepping when uninterruptible When single-stepping over STI and MOV SS, we must clear the corresponding interruptibility bits in the guest state. Otherwise vmentry fails as it then expects bit 14 (BS) in pending debug exceptions being set, but that's not correct for the guest debugging case. Note that clearing those bits is safe as we check for interruptibility based on the original state and do not inject interrupts or NMIs if guest interruptibility was blocked. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:49 +02:00
Jan Kiszka	d0bfb940ec	KVM: New guest debug interface This rips out the support for KVM_DEBUG_GUEST and introduces a new IOCTL instead: KVM_SET_GUEST_DEBUG. The IOCTL payload consists of a generic part, controlling the "main switch" and the single-step feature. The arch specific part adds an x86 interface for intercepting both types of debug exceptions separately and re-injecting them when the host was not interested. Moveover, the foundation for guest debugging via debug registers is layed. To signal breakpoint events properly back to userland, an arch-specific data block is now returned along KVM_EXIT_DEBUG. For x86, the arch block contains the PC, the debug exception, and relevant debug registers to tell debug events properly apart. The availability of this new interface is signaled by KVM_CAP_SET_GUEST_DEBUG. Empty stubs for not yet supported archs are provided. Note that both SVM and VTX are supported, but only the latter was tested yet. Based on the experience with all those VTX corner case, I would be fairly surprised if SVM will work out of the box. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:49 +02:00
Jan Kiszka	8ab2d2e231	KVM: VMX: Support for injecting software exceptions VMX differentiates between processor and software generated exceptions when injecting them into the guest. Extend vmx_queue_exception accordingly (and refactor related constants) so that we can use this service reliably for the new guest debugging framework. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:48 +02:00
Alexander Graf	d80174745b	KVM: SVM: Only allow setting of EFER_SVME when CPUID SVM is set Userspace has to tell the kernel module somehow that nested SVM should be used. The easiest way that doesn't break anything I could think of is to implement if (cpuid & svm) allow write to efer else deny write to efer Old userspaces mask the SVM capability bit, so they don't break. In order to find out that the SVM capability is set, I had to split the kvm_emulate_cpuid into a finding and an emulating part. (introduced in v6) Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:48 +02:00
Alexander Graf	236de05553	KVM: SVM: Allow setting the SVME bit Normally setting the SVME bit in EFER is not allowed, as we did not support SVM. Not since we do, we should also allow enabling SVM mode. v2 comes as last patch, so we don't enable half-ready code v4 introduces a module option to enable SVM v6 warns that nesting is enabled Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:48 +02:00
Joerg Roedel	eb6f302edf	KVM: SVM: Allow read access to MSR_VM_VR KVM tries to read the VM_CR MSR to find out if SVM was disabled by the BIOS. So implement read support for this MSR to make nested SVM running. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:47 +02:00
Alexander Graf	cf74a78b22	KVM: SVM: Add VMEXIT handler and intercepts This adds the #VMEXIT intercept, so we return to the level 1 guest when something happens in the level 2 guest that should return to the level 1 guest. v2 implements HIF handling and cleans up exception interception v3 adds support for V_INTR_MASKING_MASK v4 uses the host page hsave v5 removes IOPM merging code v6 moves mmu code out of the atomic section Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:47 +02:00
Alexander Graf	3d6368ef58	KVM: SVM: Add VMRUN handler This patch implements VMRUN. VMRUN enters a virtual CPU and runs that in the same context as the normal guest CPU would run. So basically it is implemented the same way, a normal CPU would do it. We also prepare all intercepts that get OR'ed with the original intercepts, as we do not allow a level 2 guest to be intercepted less than the first level guest. v2 implements the following improvements: - fixes the CPL check - does not allocate iopm when not used - remembers the host's IF in the HIF bit in the hflags v3: - make use of the new permission checking - add support for V_INTR_MASKING_MASK v4: - use host page backed hsave v5: - remove IOPM merging code v6: - save cr4 so PAE l1 guests work v7: - return 0 on vmrun so we check the MSRs too - fix MSR check to use the correct variable Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:47 +02:00
Alexander Graf	5542675baa	KVM: SVM: Add VMLOAD and VMSAVE handlers This implements the VMLOAD and VMSAVE instructions, that usually surround the VMRUN instructions. Both instructions load / restore the same elements, so we only need to implement them once. v2 fixes CPL checking and replaces memcpy by assignments v3 makes use of the new permission checking Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:47 +02:00
Alexander Graf	b286d5d8b0	KVM: SVM: Implement hsave Implement the hsave MSR, that gives the VCPU a GPA to save the old guest state in. v2 allows userspace to save/restore hsave v4 dummys out the hsave MSR, so we use a host page v6 remembers the guest's hsave and exports the MSR Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:46 +02:00
Alexander Graf	1371d90460	KVM: SVM: Implement GIF, clgi and stgi This patch implements the GIF flag and the clgi and stgi instructions that set this flag. Only if the flag is set (default), interrupts can be received by the CPU. To keep the information about that somewhere, this patch adds a new hidden flags vector. that is used to store information that does not go into the vmcb, but is SVM specific. I tried to write some code to make -no-kvm-irqchip work too, but the first level guest won't even boot with that atm, so I ditched it. v2 moves the hflags to x86 generic code v3 makes use of the new permission helper v6 only enables interrupt_window if GIF=1 Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:46 +02:00
Alexander Graf	c0725420cf	KVM: SVM: Add helper functions for nested SVM These are helpers for the nested SVM implementation. - nsvm_printk implements a debug printk variant - nested_svm_do calls a handler that can accesses gpa-based memory v3 makes use of the new permission checker v6 changes: - streamline nsvm_debug() - remove printk(KERN_ERR) - SVME check before CPL check - give GP error code - use new EFER constant Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:46 +02:00
Alexander Graf	9962d032bb	KVM: SVM: Move EFER and MSR constants to generic x86 code MSR_EFER_SVME_MASK, MSR_VM_CR and MSR_VM_HSAVE_PA are set in KVM specific headers. Linux does have nice header files to collect EFER bits and MSR IDs, so IMHO we should put them there. While at it, I also changed the naming scheme to match that of the other defines. (introduced in v6) Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:46 +02:00
Alexander Graf	f0b85051d0	KVM: SVM: Clean up VINTR setting The current VINTR intercept setters don't look clean to me. To make the code easier to read and enable the possibilty to trap on a VINTR set, this uses a helper function to set the VINTR intercept. v2 uses two distinct functions for setting and clearing the bit Acked-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-03-24 11:02:45 +02:00
Ingo Molnar	72c26c9a26	Merge branch 'linus' into tracing/blktrace Conflicts: block/blktrace.c Semantic merge: kernel/trace/blktrace.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-02-19 09:00:35 +01:00
Avi Kivity	516a1a7e9d	KVM: VMX: Flush volatile msrs before emulating rdmsr Some msrs (notable MSR_KERNEL_GS_BASE) are held in the processor registers and need to be flushed to the vcpu struture before they can be read. This fixes cygwin longjmp() failure on Windows x64. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:39 +02:00
Marcelo Tosatti	b682b814e3	KVM: x86: fix LAPIC pending count calculation Simplify LAPIC TMCCT calculation by using hrtimer provided function to query remaining time until expiration. Fixes host hang with nested ESX. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:38 +02:00
Sheng Yang	2aaf69dcee	KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:37 +02:00
Marcelo Tosatti	abe6655dd6	KVM: x86: disable kvmclock on non constant TSC hosts This is better. Currently, this code path is posing us big troubles, and we won't have a decent patch in time. So, temporarily disable it. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:36 +02:00
Marcelo Tosatti	d2a8284e8f	KVM: PIT: fix i8254 pending count read count_load_time assignment is bogus: its supposed to contain what it means, not the expiration time. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:36 +02:00
Sheng Yang	ba4cef31d5	KVM: Fix racy in kvm_free_assigned_irq In the past, kvm_get_kvm() and kvm_put_kvm() was called in assigned device irq handler and interrupt_work, in order to prevent cancel_work_sync() in kvm_free_assigned_irq got a illegal state when waiting for interrupt_work done. But it's tricky and still got two problems: 1. A bug ignored two conditions that cancel_work_sync() would return true result in a additional kvm_put_kvm(). 2. If interrupt type is MSI, we would got a window between cancel_work_sync() and free_irq(), which interrupt would be injected again... This patch discard the reference count used for irq handler and interrupt_work, and ensure the legal state by moving the free function at the very beginning of kvm_destroy_vm(). And the patch fix the second bug by disable irq before cancel_work_sync(), which may result in nested disable of irq but OK for we are going to free it. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:36 +02:00
Sheng Yang	ad8ba2cd44	KVM: Add kvm_arch_sync_events to sync with asynchronize events kvm_arch_sync_events is introduced to quiet down all other events may happen contemporary with VM destroy process, like IRQ handler and work struct for assigned device. For kvm_arch_sync_events is called at the very beginning of kvm_destroy_vm(), so the state of KVM here is legal and can provide a environment to quiet down other events. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-02-15 02:47:36 +02:00
Ingo Molnar	3d7a96f5a4	Merge branch 'linus' into tracing/kmemtrace2	2009-01-06 09:53:05 +01:00
Joerg Roedel	19de40a847	KVM: change KVM to use IOMMU API Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-01-03 14:11:07 +01:00
Joerg Roedel	c4fa386428	KVM: rename vtd.c to iommu.c Impact: file renamed The code in the vtd.c file can be reused for other IOMMUs as well. So rename it to make it clear that it handle more than VT-d. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-01-03 14:10:09 +01:00
Marcelo Tosatti	8791723920	KVM: MMU: handle large host sptes on invlpg/resync The invlpg and sync walkers lack knowledge of large host sptes, descending to non-existant pagetable level. Stop at directory level in such case. Fixes SMP Windows XP with hugepages. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:49 +02:00
Avi Kivity	3f353858c9	KVM: Add locking to virtual i8259 interrupt controller While most accesses to the i8259 are with the kvm mutex taken, the call to kvm_pic_read_irq() is not. We can't easily take the kvm mutex there since the function is called with interrupts disabled. Fix by adding a spinlock to the virtual interrupt controller. Since we can't send an IPI under the spinlock (we also take the same spinlock in an irq disabled context), we defer the IPI until the spinlock is released. Similarly, we defer irq ack notifications until after spinlock release to avoid lock recursion. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:48 +02:00
Avi Kivity	25e2343246	KVM: MMU: Don't treat a global pte as such if cr4.pge is cleared The pte.g bit is meaningless if global pages are disabled; deferring mmu page synchronization on these ptes will lead to the guest using stale shadow ptes. Fixes Vista x86 smp bootloader failure. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:48 +02:00
Jan Kiszka	4531220b71	KVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI There is no point in doing the ready_for_nmi_injection/ request_nmi_window dance with user space. First, we don't do this for in-kernel irqchip anyway, while the code path is the same as for user space irqchip mode. And second, there is nothing to loose if a pending NMI is overwritten by another one (in contrast to IRQs where we have to save the number). Actually, there is even the risk of raising spurious NMIs this way because the reason for the held-back NMI might already be handled while processing the first one. Therefore this patch creates a simplified user space NMI injection interface, exporting it under KVM_CAP_USER_NMI and dropping the old KVM_CAP_NMI capability. And this time we also take care to provide the interface only on archs supporting NMIs via KVM (right now only x86). Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:47 +02:00
Jan Kiszka	264ff01d55	KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip As with the kernel irqchip, don't allow an NMI to stomp over an already injected IRQ; instead wait for the IRQ injection to be completed. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:47 +02:00
Marcelo Tosatti	eb64f1e8cd	KVM: MMU: check for present pdptr shadow page in walk_shadow walk_shadow assumes the caller verified validity of the pdptr pointer in question, which is not the case for the invlpg handler. Fixes oops during Solaris 10 install. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:46 +02:00
Avi Kivity	ca9edaee1a	KVM: Consolidate userspace memory capability reporting into common code Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:46 +02:00
Marcelo Tosatti	ad218f85e3	KVM: MMU: prepopulate the shadow on invlpg If the guest executes invlpg, peek into the pagetable and attempt to prepopulate the shadow entry. Also stop dirty fault updates from interfering with the fork detector. 2% improvement on RHEL3/AIM7. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:44 +02:00
Marcelo Tosatti	6cffe8ca4a	KVM: MMU: skip global pgtables on sync due to cr3 switch Skip syncing global pages on cr3 switch (but not on cr4/cr0). This is important for Linux 32-bit guests with PAE, where the kmap page is marked as global. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:44 +02:00
Marcelo Tosatti	b1a368218a	KVM: MMU: collapse remote TLB flushes on root sync Collapse remote TLB flushes on root sync. kernbench is 2.7% faster on 4-way guest. Improvements have been seen with other loads such as AIM7. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:43 +02:00
Marcelo Tosatti	60c8aec6e2	KVM: MMU: use page array in unsync walk Instead of invoking the handler directly collect pages into an array so the caller can work with it. Simplifies TLB flush collapsing. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:43 +02:00
Amit Shah	fbce554e94	KVM: x86 emulator: Fix handling of VMMCALL instruction The VMMCALL instruction doesn't get recognised and isn't processed by the emulator. This is seen on an Intel host that tries to execute the VMMCALL instruction after a guest live migrates from an AMD host. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:43 +02:00
Guillaume Thouvenin	9bf8ea42fe	KVM: x86 emulator: add the emulation of shld and shrd instructions Add emulation of shld and shrd instructions Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:43 +02:00
Guillaume Thouvenin	d175226a5f	KVM: x86 emulator: add the assembler code for three operands Add the assembler code for instruction with three operands and one operand is stored in ECX register Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:42 +02:00
Guillaume Thouvenin	bfcadf83ec	KVM: x86 emulator: add a new "implied 1" Src decode type Add SrcOne operand type when we need to decode an implied '1' like with regular shift instruction Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:42 +02:00
Guillaume Thouvenin	0dc8d10f7d	KVM: x86 emulator: add Src2 decode set Instruction like shld has three operands, so we need to add a Src2 decode set. We start with Src2None, Src2CL, and Src2ImmByte, Src2One to support shld/shrd and we will expand it later. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:42 +02:00
Guillaume Thouvenin	45ed60b371	KVM: x86 emulator: Extend the opcode descriptor Extend the opcode descriptor to 32 bits. This is needed by the introduction of a new Src2 operand type. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:41 +02:00
Hannes Eder	efff9e538f	KVM: VMX: fix sparse warning Impact: make global function static arch/x86/kvm/vmx.c:134:3: warning: symbol 'vmx_capability' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:06 +02:00
Avi Kivity	f3fd92fbdb	KVM: Remove extraneous semicolon after do/while Notices by Guillaume Thouvenin. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:05 +02:00
Avi Kivity	2b48cc75b2	KVM: x86 emulator: fix popf emulation Set operand type and size to get correct writeback behavior. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:05 +02:00
Avi Kivity	cf5de4f886	KVM: x86 emulator: fix ret emulation 'ret' did not set the operand type or size for the destination, so writeback ignored it. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:05 +02:00
Avi Kivity	8a09b6877f	KVM: x86 emulator: switch 'pop reg' instruction to emulate_pop() Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:04 +02:00
Avi Kivity	781d0edc5f	KVM: x86 emulator: allow pop from mmio Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:04 +02:00
Avi Kivity	faa5a3ae39	KVM: x86 emulator: Extract 'pop' sequence into a function Switch 'pop r/m' instruction to use the new function. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:04 +02:00
Avi Kivity	6b7ad61ffb	KVM: x86 emulator: consolidate emulation of two operand instructions No need to repeat the same assembly block over and over. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:03 +02:00
Avi Kivity	dda96d8f1b	KVM: x86 emulator: reduce duplication in one operand emulation thunks Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:03 +02:00
Marcelo Tosatti	ecc5589f19	KVM: MMU: optimize set_spte for page sync The write protect verification in set_spte is unnecessary for page sync. Its guaranteed that, if the unsync spte was writable, the target page does not have a write protected shadow (if it had, the spte would have been write protected under mmu_lock by rmap_write_protect before). Same reasoning applies to mark_page_dirty: the gfn has been marked as dirty via the pagefault path. The cost of hash table and memslot lookups are quite significant if the workload is pagetable write intensive resulting in increased mmu_lock contention. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:02 +02:00
Avi Kivity	df203ec9a7	KVM: VMX: Conditionally request interrupt window after injecting irq If we're injecting an interrupt, and another one is pending, request an interrupt window notification so we don't have excess latency on the second interrupt. This shouldn't happen in practice since an EOI will be issued, giving a second chance to request an interrupt window, but... Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:55:00 +02:00
Eduardo Habkost	2c8dceebb2	KVM: SVM: move svm_hardware_disable() code to asm/virtext.h Create cpu_svm_disable() function. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:30 +02:00
Eduardo Habkost	63d1142f8f	KVM: SVM: move has_svm() code to asm/virtext.h Use a trick to keep the printk()s on has_svm() working as before. gcc will take care of not generating code for the 'msg' stuff when the function is called with a NULL msg argument. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:29 +02:00
Eduardo Habkost	710ff4a855	KVM: VMX: extract kvm_cpu_vmxoff() from hardware_disable() Along with some comments on why it is different from the core cpu_vmxoff() function. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:29 +02:00
Eduardo Habkost	6210e37b12	KVM: VMX: move cpu_has_kvm_support() to an inline on asm/virtext.h It will be used by core code on kdump and reboot, to disable vmx if needed. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:28 +02:00
Eduardo Habkost	c2cedf7be2	KVM: SVM: move svm.h to include/asm svm.h will be used by core code that is independent of KVM, so I am moving it outside the arch/x86/kvm directory. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:28 +02:00
Eduardo Habkost	13673a90f1	KVM: VMX: move vmx.h to include/asm vmx.h will be used by core code that is independent of KVM, so I am moving it outside the arch/x86/kvm directory. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:27 +02:00
Nitin A Kamble	0fdf8e59fa	KVM: Fix cpuid iteration on multiple leaves per eac The code to traverse the cpuid data array list for counting type of leaves is currently broken. This patches fixes the 2 things in it. 1. Set the 1st counting entry's flag KVM_CPUID_FLAG_STATE_READ_NEXT. Without it the code will never find a valid entry. 2. Also the stop condition in the for loop while looking for the next unflaged entry is broken. It needs to stop when it find one matching entry; and in the case of count of 1, it will be the same entry found in this iteration. Signed-Off-By: Nitin A Kamble <nitin.a.kamble@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:24 +02:00
Nitin A Kamble	0853d2c1d8	KVM: Fix cpuid leaf 0xb loop termination For cpuid leaf 0xb the bits 8-15 in ECX register define the end of counting leaf. The previous code was using bits 0-7 for this purpose, which is a bug. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:52:24 +02:00
Izik Eidus	2843099fee	KVM: MMU: Fix aliased gfns treated as unaliased Some areas of kvm x86 mmu are using gfn offset inside a slot without unaliasing the gfn first. This patch makes sure that the gfn will be unaliased and add gfn_to_memslot_unaliased() to save the calculating of the gfn unaliasing in case we have it unaliased already. Signed-off-by: Izik Eidus <ieidus@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:50 +02:00
Sheng Yang	6eb55818c0	KVM: Enable Function Level Reset for assigned device Ideally, every assigned device should in a clear condition before and after assignment, so that the former state of device won't affect later work. Some devices provide a mechanism named Function Level Reset, which is defined in PCI/PCI-e document. We should execute it before and after device assignment. (But sadly, the feature is new, and most device on the market now don't support it. We are considering using D0/D3hot transmit to emulate it later, but not that elegant and reliable as FLR itself.) [Update: Reminded by Xiantao, execute FLR after we ensure that the device can be assigned to the guest.] Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:49 +02:00
Guillaume Thouvenin	1d5a4d9b92	KVM: VMX: Handle mmio emulation when guest state is invalid If emulate_invalid_guest_state is enabled, the emulator is called when guest state is invalid. Until now, we reported an mmio failure when emulate_instruction() returned EMULATE_DO_MMIO. This patch adds the case where emulate_instruction() failed and an MMIO emulation is needed. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:48 +02:00
Guillaume Thouvenin	e93f36bcfa	KVM: allow emulator to adjust rip for emulated pio instructions If we call the emulator we shouldn't call skip_emulated_instruction() in the first place, since the emulator already computes the next rip for us. Thus we move ->skip_emulated_instruction() out of kvm_emulate_pio() and into handle_io() (and the svm equivalent). We also replaced "return 0" by "break" in the "do_io:" case because now the shadow register state needs to be committed. Otherwise eip will never be updated. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:48 +02:00
Amit Shah	c0d09828c8	KVM: SVM: Set the 'busy' flag of the TR selector The busy flag of the TR selector is not set by the hardware. This breaks migration from amd hosts to intel hosts. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:48 +02:00
Amit Shah	25022acc3d	KVM: SVM: Set the 'g' bit of the cs selector for cross-vendor migration The hardware does not set the 'g' bit of the cs selector and this breaks migration from amd hosts to intel hosts. Set this bit if the segment limit is beyond 1 MB. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:48 +02:00
Amit Shah	b8222ad2e5	KVM: x86: Fix typo in function name get_segment_descritptor_dtable() contains an obvious type. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:47 +02:00
Jan Kiszka	cc6e462cd5	KVM: x86: Optimize NMI watchdog delivery As suggested by Avi, this patch introduces a counter of VCPUs that have LVT0 set to NMI mode. Only if the counter > 0, we push the PIT ticks via all LAPIC LVT0 lines to enable NMI watchdog support. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:47 +02:00
Jan Kiszka	8fdb2351d5	KVM: x86: Fix and refactor NMI watchdog emulation This patch refactors the NMI watchdog delivery patch, consolidating tests and providing a proper API for delivering watchdog events. An included micro-optimization is to check only for apic_hw_enabled in kvm_apic_local_deliver (the test for LVT mask is covering the soft-disabled case already). Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:46 +02:00
Guillaume Thouvenin	291fd39bfc	KVM: x86 emulator: Add decode entries for 0x04 and 0x05 opcodes (add acc, imm) Add decode entries for 0x04 and 0x05 (ADD) opcodes, execution is already implemented. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:46 +02:00
Sheng Yang	6fe639792c	KVM: VMX: Move private memory slot position PCI device assignment would map guest MMIO spaces as separate slot, so it is possible that the device has more than 2 MMIO spaces and overwrite current private memslot. The patch move private memory slot to the top of userspace visible memory slots. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:46 +02:00
Sheng Yang	291f26bc0f	KVM: MMU: Extend kvm_mmu_page->slot_bitmap size Otherwise set_bit() for private memory slot(above KVM_MEMORY_SLOTS) would corrupted memory in 32bit host. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:45 +02:00
Sheng Yang	64d4d52175	KVM: Enable MTRR for EPT The effective memory type of EPT is the mixture of MSR_IA32_CR_PAT and memory type field of EPT entry. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:45 +02:00
Sheng Yang	74be52e3e6	KVM: Add local get_mtrr_type() to support MTRR For EPT memory type support. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:45 +02:00
Sheng Yang	468d472f3f	KVM: VMX: Add PAT support for EPT GUEST_PAT support is a new feature introduced by Intel Core i7 architecture. With this, cpu would save/load guest and host PAT automatically, for EPT memory type in guest depends on MSR_IA32_CR_PAT. Also add save/restore for MSR_IA32_CR_PAT. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:44 +02:00
Sheng Yang	0bed3b568b	KVM: Improve MTRR structure As well as reset mmu context when set MTRR. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:44 +02:00
Gleb Natapov	5f179287fa	KVM: call kvm_arch_vcpu_reset() instead of the kvm_x86_ops callback Call kvm_arch_vcpu_reset() instead of directly using arch callback. The function does additional things. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:43 +02:00
Jan Kiszka	3b86cd9967	KVM: VMX: work around lacking VNMI support Older VMX supporting CPUs do not provide the "Virtual NMI" feature for tracking the NMI-blocked state after injecting such events. For now KVM is unable to inject NMIs on those CPUs. Derived from Sheng Yang's suggestion to use the IRQ window notification for detecting the end of NMI handlers, this patch implements virtual NMI support without impact on the host's ability to receive real NMIs. The downside is that the given approach requires some heuristics that can cause NMI nesting in vary rare corner cases. The approach works as follows: - inject NMI and set a software-based NMI-blocked flag - arm the IRQ window start notification whenever an NMI window is requested - if the guest exits due to an opening IRQ window, clear the emulated NMI-blocked flag - if the guest net execution time with NMI-blocked but without an IRQ window exceeds 1 second, force NMI-blocked reset and inject anyway This approach covers most practical scenarios: - succeeding NMIs are seperated by at least one open IRQ window - the guest may spin with IRQs disabled (e.g. due to a bug), but leaving the NMI handler takes much less time than one second - the guest does not rely on strict ordering or timing of NMIs (would be problematic in virtualized environments anyway) Successfully tested with the 'nmi n' monitor command, the kgdbts testsuite on smp guests (additional patches required to add debug register support to kvm) + the kernel's nmi_watchdog=1, and a Siemens- specific board emulation (+ guest) that comes with its own NMI watchdog mechanism. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:43 +02:00
Jan Kiszka	487b391d6e	KVM: VMX: Provide support for user space injected NMIs This patch adds the required bits to the VMX side for user space injected NMIs. As with the preexisting in-kernel irqchip support, the CPU must provide the "virtual NMI" feature for proper tracking of the NMI blocking state. Based on the original patch by Sheng Yang. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:43 +02:00
Jan Kiszka	c4abb7c9cd	KVM: x86: Support for user space injected NMIs Introduces the KVM_NMI IOCTL to the generic x86 part of KVM for injecting NMIs from user space and also extends the statistic report accordingly. Based on the original patch by Sheng Yang. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:42 +02:00
Jan Kiszka	26df99c6c5	KVM: Kick NMI receiving VCPU Kick the NMI receiving VCPU in case the triggering caller runs in a different context. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:42 +02:00
Jan Kiszka	0496fbb973	KVM: x86: VCPU with pending NMI is runnabled Ensure that a VCPU with pending NMIs is considered runnable. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:42 +02:00
Jan Kiszka	23930f9521	KVM: x86: Enable NMI Watchdog via in-kernel PIT source LINT0 of the LAPIC can be used to route PIT events as NMI watchdog ticks into the guest. This patch aligns the in-kernel irqchip emulation with the user space irqchip with already supports this feature. The trick is to route PIT interrupts to all LAPIC's LVT0 lines. Rebased and slightly polished patch originally posted by Sheng Yang. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:41 +02:00
Jan Kiszka	66a5a347c2	KVM: VMX: fix real-mode NMI support Fix NMI injection in real-mode with the same pattern we perform IRQ injection. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:41 +02:00
Jan Kiszka	f460ee43e2	KVM: VMX: refactor IRQ and NMI window enabling do_interrupt_requests and vmx_intr_assist go different way for achieving the same: enabling the nmi/irq window start notification. Unify their code over enable_{irq\|nmi}_window, get rid of a redundant call to enable_intr_window instead of direct enable_nmi_window invocation and unroll enable_intr_window for both in-kernel and user space irq injection accordingly. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:41 +02:00
Jan Kiszka	33f089ca5a	KVM: VMX: refactor/fix IRQ and NMI injectability determination There are currently two ways in VMX to check if an IRQ or NMI can be injected: - vmx_{nmi\|irq}_enabled and - vcpu.arch.{nmi\|interrupt}_window_open. Even worse, one test (at the end of vmx_vcpu_run) uses an inconsistent, likely incorrect logic. This patch consolidates and unifies the tests over {nmi\|interrupt}_window_open as cache + vmx_update_window_states for updating the cache content. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:40 +02:00
Jan Kiszka	448fa4a9c5	KVM: x86: Reset pending/inject NMI state on CPU reset CPU reset invalidates pending or already injected NMIs, therefore reset the related state variables. Based on original patch by Gleb Natapov. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:40 +02:00
Jan Kiszka	60637aacfd	KVM: VMX: Support for NMI task gates Properly set GUEST_INTR_STATE_NMI and reset nmi_injected when a task-switch vmexit happened due to a task gate being used for handling NMIs. Also avoid the false warning about valid vectoring info in kvm_handle_exit. Based on original patch by Gleb Natapov. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:40 +02:00
Jan Kiszka	e4a41889ec	KVM: VMX: Use INTR_TYPE_NMI_INTR instead of magic value Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:40 +02:00
Jan Kiszka	a26bf12afb	KVM: VMX: include all IRQ window exits in statistics irq_window_exits only tracks IRQ window exits due to user space requests, nmi_window_exits include all exits. The latter makes more sense, so let's adjust irq_window_exits accounting. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:39 +02:00
Guillaume Thouvenin	2786b014ec	KVM: x86 emulator: consolidate push reg This patch consolidate the emulation of push reg instruction. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-12-31 16:51:39 +02:00
Ingo Molnar	b6ab4afee4	tracing, kvm: change MARKERS to select instead of depends on Impact: build fix fix: kernel/trace/Kconfig:42:error: found recursive dependency: TRACING -> TRACEPOINTS -> MARKERS -> KVM_TRACE -> RELAY -> KMEMTRACE -> TRACING markers is a facility that should be selected - not depended on by an interactive Kconfig entry. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-30 09:41:04 +01:00
Marcelo Tosatti	6c475352e8	KVM: MMU: avoid creation of unreachable pages in the shadow It is possible for a shadow page to have a parent link pointing to a freed page. When zapping a high level table, kvm_mmu_page_unlink_children fails to remove the parent_pte link. For that to happen, the child must be unreachable via the shadow tree, which can happen in shadow_walk_entry if the guest pte was modified in between walk() and fetch(). Remove the parent pte reference in such case. Possible cause for oops in bug #2217430. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-11-26 12:34:27 +02:00
Marcelo Tosatti	0c0f40bdbe	KVM: MMU: fix sync of ptes addressed at owner pagetable During page sync, if a pagetable contains a self referencing pte (that points to the pagetable), the corresponding spte may be marked as writable even though all mappings are supposed to be write protected. Fix by clearing page unsync before syncing individual sptes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-11-23 15:24:19 +02:00
Avi Kivity	bd2b3ca768	KVM: VMX: Fix interrupt loss during race with NMI If an interrupt cannot be injected for some reason (say, page fault when fetching the IDT descriptor), the interrupt is marked for reinjection. However, if an NMI is queued at this time, the NMI will be injected instead and the NMI will be lost. Fix by deferring the NMI injection until the interrupt has been injected successfully. Analyzed by Jan Kiszka. Signed-off-by: Avi Kivity <avi@redhat.com>	2008-11-23 14:52:29 +02:00
Avi Kivity	e17d1dc086	KVM: Fix pit memory leak if unable to allocate irq source id Reported-By: Daniel Marjamäki <danielm77@spray.se> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-11-11 21:01:51 +02:00
Sheng Yang	928d4bf747	KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-11-11 21:00:37 +02:00
Avi Kivity	ca93e992fd	KVM: Require the PCI subsystem PCI device assignment makes calls to pci code, so require it to be built into the kernel. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-11-11 20:56:13 +02:00
Marcelo Tosatti	c41ef344de	KVM: MMU: increase per-vcpu rmap cache alloc size The page fault path can use two rmap_desc structures, if: - walk_addr's dirty pte update allocates one rmap_desc. - mmu_lock is dropped, sptes are zapped resulting in rmap_desc being freed. - fetch->mmu_set_spte allocates another rmap_desc. Increase to 4 for safety. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-11-11 20:53:34 +02:00
Sheng Yang	5550af4df1	KVM: Fix guest shared interrupt with in-kernel irqchip Every call of kvm_set_irq() should offer an irq_source_id, which is allocated by kvm_request_irq_source_id(). Based on irq_source_id, we identify the irq source and implement logical OR for shared level interrupts. The allocated irq_source_id can be freed by kvm_free_irq_source_id(). Currently, we support at most sizeof(unsigned long) different irq sources. [Amit: - rebase to kvm.git HEAD - move definition of KVM_USERSPACE_IRQ_SOURCE_ID to common file - move kvm_request_irq_source_id to the update_irq ioctl] [Xiantao: - Add kvm/ia64 stuff and make it work for kvm/ia64 guests] Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-28 14:21:34 +02:00
Marcelo Tosatti	6ad9f15c94	KVM: MMU: sync root on paravirt TLB flush The pvmmu TLB flush handler should request a root sync, similarly to a native read-write CR3. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-28 14:09:27 +02:00
Arjan van de Ven	651dab4264	Merge commit 'linus/master' into merge-linus Conflicts: arch/x86/kvm/i8254.c	2008-10-17 09:20:26 -07:00
Linus Torvalds	08d19f51f0	Merge branch 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (134 commits) KVM: ia64: Add intel iommu support for guests. KVM: ia64: add directed mmio range support for kvm guests KVM: ia64: Make pmt table be able to hold physical mmio entries. KVM: Move irqchip_in_kernel() from ioapic.h to irq.h KVM: Separate irq ack notification out of arch/x86/kvm/irq.c KVM: Change is_mmio_pfn to kvm_is_mmio_pfn, and make it common for all archs KVM: Move device assignment logic to common code KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/ KVM: VMX: enable invlpg exiting if EPT is disabled KVM: x86: Silence various LAPIC-related host kernel messages KVM: Device Assignment: Map mmio pages into VT-d page table KVM: PIC: enhance IPI avoidance KVM: MMU: add "oos_shadow" parameter to disable oos KVM: MMU: speed up mmu_unsync_walk KVM: MMU: out of sync shadow core KVM: MMU: mmu_convert_notrap helper KVM: MMU: awareness of new kvm_mmu_zap_page behaviour KVM: MMU: mmu_parent_walk KVM: x86: trap invlpg KVM: MMU: sync roots on mmu reload ...	2008-10-16 15:36:00 -07:00
Harvey Harrison	80a914dc05	misc: replace __FUNCTION__ with __func__ __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:30 -07:00
Xiantao Zhang	3de42dc094	KVM: Separate irq ack notification out of arch/x86/kvm/irq.c Moving irq ack notification logic as common, and make it shared with ia64 side. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 14:25:35 +02:00
Xiantao Zhang	8a98f6648a	KVM: Move device assignment logic to common code To share with other archs, this patch moves device assignment logic to common parts. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 14:25:33 +02:00
Zhang xiantao	371c01b28e	KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/ Preparation for kvm/ia64 VT-d support. Signed-off-by: Zhang xiantao <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 14:25:32 +02:00
Marcelo Tosatti	83dbc83a0d	KVM: VMX: enable invlpg exiting if EPT is disabled Manually disabling EPT via module option fails to re-enable INVLPG exiting. Reported-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 14:25:31 +02:00
Jan Kiszka	1b10bf31a5	KVM: x86: Silence various LAPIC-related host kernel messages KVM-x86 dumps a lot of debug messages that have no meaning for normal operation: - INIT de-assertion is ignored - SIPIs are sent and received - APIC writes are unaligned or < 4 byte long (Windows Server 2003 triggers this on SMP) Degrade them to true debug messages, keeping the host kernel log clean for real problems. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:30 +02:00
Weidong Han	e5fcfc821a	KVM: Device Assignment: Map mmio pages into VT-d page table Assigned device could DMA to mmio pages, so also need to map mmio pages into VT-d page table. Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:29 +02:00
Marcelo Tosatti	e48258009d	KVM: PIC: enhance IPI avoidance The PIC code makes little effort to avoid kvm_vcpu_kick(), resulting in unnecessary guest exits in some conditions. For example, if the timer interrupt is routed through the IOAPIC, IRR for IRQ 0 will get set but not cleared, since the APIC is handling the acks. This means that everytime an interrupt < 16 is triggered, the priority logic will find IRQ0 pending and send an IPI to vcpu0 (in case IRQ0 is not masked, which is Linux's case). Introduce a new variable isr_ack to represent the IRQ's for which the guest has been signalled / cleared the ISR. Use it to avoid more than one IPI per trigger-ack cycle, in addition to the avoidance when ISR is set in get_priority(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:28 +02:00
Marcelo Tosatti	582801a95d	KVM: MMU: add "oos_shadow" parameter to disable oos Subject says it all. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:27 +02:00
Marcelo Tosatti	0074ff63eb	KVM: MMU: speed up mmu_unsync_walk Cache the unsynced children information in a per-page bitmap. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:26 +02:00
Marcelo Tosatti	4731d4c7a0	KVM: MMU: out of sync shadow core Allow guest pagetables to go out of sync. Instead of emulating write accesses to guest pagetables, or unshadowing them, we un-write-protect the page table and allow the guest to modify it at will. We rely on invlpg executions to synchronize individual ptes, and will synchronize the entire pagetable on tlb flushes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:25 +02:00
Marcelo Tosatti	6844dec694	KVM: MMU: mmu_convert_notrap helper Need to convert shadow_notrap_nonpresent -> shadow_trap_nonpresent when unsyncing pages. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:24 +02:00
Marcelo Tosatti	0738541396	KVM: MMU: awareness of new kvm_mmu_zap_page behaviour kvm_mmu_zap_page will soon zap the unsynced children of a page. Restart list walk in such case. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:23 +02:00
Marcelo Tosatti	ad8cfbe3ff	KVM: MMU: mmu_parent_walk Introduce a function to walk all parents of a given page, invoking a handler. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:22 +02:00
Marcelo Tosatti	a7052897b3	KVM: x86: trap invlpg With pages out of sync invlpg needs to be trapped. For now simply nuke the entry. Untested on AMD. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:21 +02:00
Marcelo Tosatti	0ba73cdadb	KVM: MMU: sync roots on mmu reload Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:20 +02:00
Marcelo Tosatti	e8bc217aef	KVM: MMU: mode specific sync_page Examine guest pagetable and bring the shadow back in sync. Caller is responsible for local TLB flush before re-entering guest mode. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:19 +02:00
Marcelo Tosatti	38187c830c	KVM: MMU: do not write-protect large mappings There is not much point in write protecting large mappings. This can only happen when a page is shadowed during the window between is_largepage_backed and mmu_lock acquision. Zap the entry instead, so the next pagefault will find a shadowed page via is_largepage_backed and fallback to 4k translations. Simplifies out of sync shadow. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:18 +02:00
Marcelo Tosatti	a378b4e64c	KVM: MMU: move local TLB flush to mmu_set_spte Since the sync page path can collapse flushes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:17 +02:00
Marcelo Tosatti	1e73f9dd88	KVM: MMU: split mmu_set_spte Split the spte entry creation code into a new set_spte function. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:16 +02:00
Marcelo Tosatti	93a423e704	KVM: MMU: flush remote TLBs on large->normal entry overwrite It is necessary to flush all TLB's when a large spte entry is overwritten with a normal page directory pointer. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:15 +02:00
Gleb Natapov	af2152f545	KVM: don't enter guest after SIPI was received by a CPU The vcpu should process pending SIPI message before entering guest mode again. kvm_arch_vcpu_runnable() returns true if the vcpu is in SIPI state, so we can't call it here. Signed-off-by: Gleb Natapov <gleb@qumranet.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:09 +02:00
Harvey Harrison	2259e3a7a6	KVM: x86.c make kvm_load_realmode_segment static Noticed by sparse: arch/x86/kvm/x86.c:3591:5: warning: symbol 'kvm_load_realmode_segment' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:07 +02:00
Marcelo Tosatti	4c2155ce81	KVM: switch to get_user_pages_fast Convert gfn_to_pfn to use get_user_pages_fast, which can do lockless pagetable lookups on x86. Kernel compilation on 4-way guest is 3.7% faster on VMX. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:06 +02:00
Amit Shah	bfadaded0d	KVM: Device Assignment: Free device structures if IRQ allocation fails When an IRQ allocation fails, we free up the device structures and disable the device so that we can unregister the device in the userspace and not expose it to the guest at all. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:04 +02:00
Ben-Ami Yassour	62c476c7c7	KVM: Device Assignment with VT-d Based on a patch by: Kay, Allen M <allen.m.kay@intel.com> This patch enables PCI device assignment based on VT-d support. When a device is assigned to the guest, the guest memory is pinned and the mapping is updated in the VT-d IOMMU. [Amit: Expose KVM_CAP_IOMMU so we can check if an IOMMU is present and also control enable/disable from userspace] Signed-off-by: Kay, Allen M <allen.m.kay@intel.com> Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Amit Shah <amit.shah@qumranet.com> Acked-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 14:25:04 +02:00
Guillaume Thouvenin	aa3a816b6d	KVM: x86 emulator: Use DstAcc for 'and' For instruction 'and al,imm' we use DstAcc instead of doing the emulation directly into the instruction's opcode. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Guillaume Thouvenin	8a9fee67fb	KVM: x86 emulator: Add cmp al, imm and cmp ax, imm instructions (ocodes 3c, 3d) Add decode entries for these opcodes; execution is already implemented. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Guillaume Thouvenin	9c9fddd0e7	KVM: x86 emulator: Add DstAcc operand type Add DstAcc operand type. That means that there are 4 bits now for DstMask. "In the good old days cpus would have only one register that was able to fully participate in arithmetic operations, typically called A for Accumulator. The x86 retains this tradition by having special, shorter encodings for the A register (like the cmp opcode), and even some instructions that only operate on A (like mul). SrcAcc and DstAcc would accommodate these instructions by decoding A into the corresponding 'struct operand'." -- Avi Kivity Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Sheng Yang	defed7ed92	x86: Move FEATURE_CONTROL bits to msr-index.h For MSR_IA32_FEATURE_CONTROL is already there. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Sheng Yang	9ea542facb	KVM: VMX: Rename IA32_FEATURE_CONTROL bits Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Avi Kivity	ef46f18ea0	KVM: x86 emulator: fix jmp r/m64 instruction jmp r/m64 doesn't require the rex.w prefix to indicate the operand size is 64 bits. Set the Stack attribute (even though it doesn't involve the stack, really) to indicate this. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:27 +02:00
Jan Kiszka	4b92fe0c9d	KVM: VMX: Cleanup stalled INTR_INFO read Commit 1c0f4f5011829dac96347b5f84ba37c2252e1e08 left a useless access of VM_ENTRY_INTR_INFO_FIELD in vmx_intr_assist behind. Clean this up. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Marcelo Tosatti	9c3e4aab5a	KVM: x86: unhalt vcpu0 on reset Since "KVM: x86: do not execute halted vcpus", HLT by vcpu0 before system reset by the IO thread will hang the guest. Mark vcpu as runnable in such case. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Mohammed Gamal	d19292e457	KVM: x86 emulator: Add call near absolute instruction (opcode 0xff/2) Add call near absolute instruction. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Marcelo Tosatti	d76901750a	KVM: x86: do not execute halted vcpus Offline or uninitialized vcpu's can be executed if requested to perform userspace work. Follow Avi's suggestion to handle halted vcpu's in the main loop, simplifying kvm_emulate_halt(). Introduce a new vcpu->requests bit to indicate events that promote state from halted to running. Also standardize vcpu wake sites. Signed-off-by: Marcelo Tosatti <mtosatti <at> redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Mohammed Gamal	a6a3034cb9	KVM: x86 emulator: Add in/out instructions (opcodes 0xe4-0xe7, 0xec-0xef) The patch adds in/out instructions to the x86 emulator. The instruction was encountered while running the BIOS while using the invalid guest state emulation patch. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Avi Kivity	fa89a81766	KVM: Add statistics for guest irq injections These can help show whether a guest is making progress or not. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Sheng Yang	d40a1ee485	KVM: MMU: Modify kvm_shadow_walk.entry to accept u64 addr EPT is 4 level by default in 32pae(48 bits), but the addr parameter of kvm_shadow_walk->entry() only accept unsigned long as virtual address, which is 32bit in 32pae. This result in SHADOW_PT_INDEX() overflow when try to fetch level 4 index. Fix it by extend kvm_shadow_walk->entry() to accept 64bit addr in parameter. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Mohammed Gamal	fb4616f431	KVM: x86 emulator: Add std and cld instructions (opcodes 0xfc-0xfd) This adds the std and cld instructions to the emulator. Encountered while running the BIOS with invalid guest state emulation enabled. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Joerg Roedel	a89c1ad270	KVM: add MC5_MISC msr read support Currently KVM implements MC0-MC4_MISC read support. When booting Linux this results in KVM warnings in the kernel log when the guest tries to read MC5_MISC. Fix this warnings with this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	48d1503949	KVM: SVM: No need to unprotect memory during event injection when using npt No memory is protected anyway. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	3201b5d9f0	KVM: MMU: Fix setting the accessed bit on non-speculative sptes The accessed bit was accidentally turned on in a random flag word, rather than, the spte itself, which was lucky, since it used the non-EPT compatible PT_ACCESSED_MASK. Fix by turning the bit on in the spte and changing it to use the portable accessed mask. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	171d595d3b	KVM: MMU: Flush tlbs after clearing write permission when accessing dirty log Otherwise, the cpu may allow writes to the tracked pages, and we lose some display bits or fail to migrate correctly. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	2245a28fe2	KVM: MMU: Add locking around kvm_mmu_slot_remove_write_access() It was generally safe due to slots_lock being held for write, but it wasn't very nice. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	bc2d429979	KVM: MMU: Account for npt/ept/realmode page faults Now that two-dimensional paging is becoming common, account for tdp page faults. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Mohammed Gamal	a5e2e82b8b	KVM: x86 emulator: Add mov r, imm instructions (opcodes 0xb0-0xbf) The emulator only supported one instance of mov r, imm instruction (opcode 0xb8), this adds the rest of these instructions. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	acee3c04e8	KVM: Allocate guest memory as MAP_PRIVATE, not MAP_SHARED There is no reason to share internal memory slots with fork()ed instances. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	abb9e0b8e3	KVM: MMU: Convert the paging mode shadow walk to use the generic walker Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	140754bc80	KVM: MMU: Convert direct maps to use the generic shadow walker Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	3d000db568	KVM: MMU: Add generic shadow walker We currently walk the shadow page tables in two places: direct map (for real mode and two dimensional paging) and paging mode shadow. Since we anticipate requiring a third walk (for invlpg), it makes sense to have a generic facility for shadow walk. This patch adds such a shadow walker, walks the page tables and calls a method for every spte encountered. The method can examine the spte, modify it, or even instantiate it. The walk can be aborted by returning nonzero from the method. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	6c41f428b7	KVM: MMU: Infer shadow root level in direct_map() In all cases the shadow root level is available in mmu.shadow_root_level, so there is no need to pass it as a parameter. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
Avi Kivity	6e37d3dc3e	KVM: MMU: Unify direct map 4K and large page paths The two paths are equivalent except for one argument, which is already available. Merge the two codepaths. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
Avi Kivity	135f8c2b07	KVM: MMU: Move SHADOW_PT_INDEX to mmu.c It is not specific to the paging mode, so can be made global (and reusable). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
Avi Kivity	6eb06cb286	KVM: x86 emulator: remove bad ByteOp specifier from NEG descriptor Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
roel kluin	41afa02587	KVM: x86 emulator: remove duplicate SrcImm Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Avi Kivity	f4bbd9aaaa	KVM: Load real mode segments correctly Real mode segments to not reference the GDT or LDT; they simply compute base = selector * 16. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Avi Kivity	a16b20da87	KVM: VMX: Change segment dpl at reset to 3 This is more emulation friendly, if not 100% correct. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Avi Kivity	5706be0daf	KVM: VMX: Change cs reset state to be a data segment Real mode cs is a data segment, not a code segment. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Harvey Harrison	ee032c993e	KVM: make irq ack notifier functions static sparse says: arch/x86/kvm/x86.c:107:32: warning: symbol 'kvm_find_assigned_dev' was not declared. Should it be static? arch/x86/kvm/i8254.c:225:6: warning: symbol 'kvm_pit_ack_irq' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Amit Shah	29c8fa32c5	KVM: Use kvm_set_irq to inject interrupts ... instead of using the pic and ioapic variants Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Amit Shah	94c935a1ee	KVM: SVM: Fix typo Fix typo in as-yet unused macro definition. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	a89a8fb93b	KVM: VMX: Modify mode switching and vmentry functions This patch modifies mode switching and vmentry function in order to drive invalid guest state emulation. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	ea953ef0ca	KVM: VMX: Add invalid guest state handler This adds the invalid guest state handler function which invokes the x86 emulator until getting the guest to a VMX-friendly state. [avi: leave atomic context if scheduling] [guillaume: return to atomic context correctly] Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	04fa4d3211	KVM: VMX: Add module parameter and emulation flag. The patch adds the module parameter required to enable emulating invalid guest state, as well as the emulation_required flag used to drive emulation whenever needed. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	648dfaa7df	KVM: VMX: Add Guest State Validity Checks This patch adds functions to check whether guest state is VMX compliant. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Amit Shah	6762b7299a	KVM: Device assignment: Check for privileges before assigning irq Even though we don't share irqs at the moment, we should ensure regular user processes don't try to allocate system resources. We check for capability to access IO devices (CAP_SYS_RAWIO) before we request_irq on behalf of the guest. Noticed by Avi. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Avi Kivity	dc7404cea3	KVM: Handle spurious acks for PIT interrupts Spurious acks can be generated, for example if the PIC is being reset. Handle those acks gracefully rather than flooding the log with warnings. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Marcelo Tosatti	85428ac7c3	KVM: fix i8259 reset irq acking The irq ack during pic reset has three problems: - Ignores slave/master PIC, using gsi 0-8 for both. - Generates an ACK even if the APIC is in control. - Depends upon IMR being clear, which is broken if the irq was masked at the time it was generated. The last one causes the BIOS to hang after the first reboot of Windows installation, since PIT interrupts stop. [avi: fix check whether pic interrupts are seen by cpu] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Avi Kivity	ecfc79c700	KVM: VMX: Use interrupt queue for !irqchip_in_kernel Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Marcelo Tosatti	29415c37f0	KVM: set debug registers after "schedulable" section The vcpu thread can be preempted after the guest_debug_pre() callback, resulting in invalid debug registers on the new vcpu. Move it inside the non-preemptable section. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Sheng Yang	464d17c8b7	KVM: VMX: Clean up magic number 0x66 in init_rmode_tss Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Dave Hansen	6ad18fba05	KVM: Reduce stack usage in kvm_pv_mmu_op() We're in a hot path. We can't use kmalloc() because it might impact performance. So, we just stick the buffer that we need into the kvm_vcpu_arch structure. This is used very often, so it is not really a waste. We also have to move the buffer structure's definition to the arch-specific x86 kvm header. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Dave Hansen	b772ff362e	KVM: Reduce stack usage in kvm_arch_vcpu_ioctl() [sheng: fix KVM_GET_LAPIC using wrong size] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Dave Hansen	f0d662759a	KVM: Reduce kvm stack usage in kvm_arch_vm_ioctl() On my machine with gcc 3.4, kvm uses ~2k of stack in a few select functions. This is mostly because gcc fails to notice that the different case: statements could have their stack usage combined. It overflows very nicely if interrupts happen during one of these large uses. This patch uses two methods for reducing stack usage. 1. dynamically allocate large objects instead of putting on the stack. 2. Use a union{} member for all of the case variables. This tricks gcc into combining them all into a single stack allocation. (There's also a comment on this) Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Ben-Ami Yassour	4d5c5d0fe8	KVM: pci device assignment Based on a patch from: Amit Shah <amit.shah@qumranet.com> This patch adds support for handling PCI devices that are assigned to the guest. The device to be assigned to the guest is registered in the host kernel and interrupt delivery is handled. If a device is already assigned, or the device driver for it is still loaded on the host, the device assignment is failed by conveying a -EBUSY reply to the userspace. Devices that share their interrupt line are not supported at the moment. By itself, this patch will not make devices work within the guest. The VT-d extension is required to enable the device to perform DMA. Another alternative is PVDMA. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Marcelo Tosatti	3cf57fed21	KVM: PIT: fix injection logic and count The PIT injection logic is problematic under the following cases: 1) If there is a higher priority vector to be delivered by the time kvm_pit_timer_intr_post is invoked ps->inject_pending won't be set. This opens the possibility for missing many PIT event injections (say if guest executes hlt at this point). 2) ps->inject_pending is racy with more than two vcpus. Since there's no locking around read/dec of pt->pending, two vcpu's can inject two interrupts for a single pt->pending count. Fix 1 by using an irq ack notifier: only reinject when the previous irq has been acked. Fix 2 with appropriate locking around manipulation of pending count and irq_ack by the injection / ack paths. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:17 +02:00
Marcelo Tosatti	f52447261b	KVM: irq ack notification Based on a patch from: Ben-Ami Yassour <benami@il.ibm.com> which was based on a patch from: Amit Shah <amit.shah@qumranet.com> Notify IRQ acking on PIC/APIC emulation. The previous patch missed two things: - Edge triggered interrupts on IOAPIC - PIC reset with IRR/ISR set should be equivalent to ack (LAPIC probably needs something similar). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> CC: Amit Shah <amit.shah@qumranet.com> CC: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:16 +02:00
Avi Kivity	564f15378f	KVM: Add irq ack notifier list This can be used by kvm subsystems that are interested in when interrupts are acked, for example time drift compensation. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:16 +02:00
Alexander Graf	b5e2fec0eb	KVM: Ignore DEBUGCTL MSRs with no effect Netware writes to DEBUGCTL and reads from the DEBUGCTL and LAST*IP MSRs without further checks and is really confused to receive a #GP during that. To make it happy we should just make them stubs, which is exactly what SVM already does. Writes to DEBUGCTL that are vendor-specific are resembled to behave as if the virtual CPU does not know them. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:15 +02:00
Avi Kivity	313dbd49dc	KVM: VMX: Avoid vmwrite(HOST_RSP) when possible Usually HOST_RSP retains its value across guest entries. Take advantage of this and avoid a vmwrite() when this is so. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:15 +02:00
Avi Kivity	80e31d4f61	KVM: SVM: Unify register save/restore across 32 and 64 bit hosts Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Avi Kivity	c801949ddf	KVM: VMX: Unify register save/restore across 32 and 64 bit hosts Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Jan Kiszka	77ab6db0a1	KVM: VMX: Reinject real mode exception As we execute real mode guests in VM86 mode, exception have to be reinjected appropriately when the guest triggered them. For this purpose the patch adopts the real-mode injection pattern used in vmx_inject_irq to vmx_queue_exception, additionally taking care that the IP is set correctly for #BP exceptions. Furthermore it extends handle_rmode_exception to reinject all those exceptions that can be raised in real mode. This fixes the execution of himem.exe from FreeDOS and also makes its debug.com work properly. Note that guest debugging in real mode is broken now. This has to be fixed by the scheduled debugging infrastructure rework (will be done once base patches for QEMU have been accepted). Signed-off-by: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Jan Kiszka	19bd8afdc4	KVM: Consolidate XX_VECTOR defines Signed-off-by: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Avi Kivity	7edd0ce058	KVM: Consolidate PIC isr clearing into a function Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Mohammed Gamal	60bd83a125	KVM: VMX: Remove redundant check in handle_rmode_exception Since checking for vcpu->arch.rmode.active is already done whenever we call handle_rmode_exception(), checking it inside the function is redundant. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	f7d9238f5d	KVM: VMX: Move interrupt post-processing to vmx_complete_interrupts() Instead of looking at failed injections in the vm entry path, move processing to the exit path in vmx_complete_interrupts(). This simplifes the logic and removes any state that is hidden in vmx registers. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	937a7eaef9	KVM: Add a pending interrupt queue Similar to the exception queue, this hold interrupts that have been accepted by the virtual processor core but not yet injected. Not yet used. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	35920a3569	KVM: VMX: Fix pending exception processing The vmx code assumes that IDT-Vectoring can only be set when an exception is injected due to the exception in question. That's not true, however: if the exception is injected correctly, and later another exception occurs but its delivery is blocked due to a fault, then we will incorrectly assume the first exception was not delivered. Fix by unconditionally dequeuing the pending exception, and requeuing it (or the second exception) if we see it in the IDT-Vectoring field. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	26eef70c3e	KVM: Clear exception queue before emulating an instruction If we're emulating an instruction, either it will succeed, in which case any previously queued exception will be spurious, or we will requeue the same exception. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	668f612fa0	KVM: VMX: Move nmi injection failure processing to vm exit path Instead of processing nmi injection failure in the vm entry path, move it to the vm exit path (vm_complete_interrupts()). This separates nmi injection from nmi post-processing, and moves the nmi state from the VT state into vcpu state (new variable nmi_injected specifying an injection in progress). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	cf393f7566	KVM: Move NMI IRET fault processing to new vmx_complete_interrupts() Currently most interrupt exit processing is handled on the entry path, which is confusing. Move the NMI IRET fault processing to a new function, vmx_complete_interrupts(), which is called on the vmexit path. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Avi Kivity	5b5c6a5a60	KVM: MMU: Simplify kvm_mmu_zap_page() The twisty maze of conditionals can be reduced. [joerg: fix tlb flushing] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Avi Kivity	31aa2b44af	KVM: MMU: Separate the code for unlinking a shadow page from its parents Place into own function, in preparation for further cleanups. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Amit Shah	867767a365	KVM: Introduce kvm_set_irq to inject interrupts in guests This function injects an interrupt into the guest given the kvm struct, the (guest) irq number and the interrupt level. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Marcelo Tosatti	5fdbf9765b	KVM: x86: accessors for guest registers As suggested by Avi, introduce accessors to read/write guest registers. This simplifies the ->cache_regs/->decache_regs interface, and improves register caching which is important for VMX, where the cost of vmcs_read/vmcs_write is significant. [avi: fix warnings] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:13:57 +02:00
Sheng Yang	ca60dfbb69	KVM: VMX: Rename misnamed msr bits MSR_IA32_FEATURE_LOCKED is just a bit in fact, which shouldn't be prefixed with MSR_. So is MSR_IA32_FEATURE_VMXON_ENABLED. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:13:57 +02:00
Ingo Molnar	0afe2db213	Merge branch 'x86/unify-cpu-detect' into x86-v28-for-linus-phase4-D Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/signal_64.c include/asm-x86/cpufeature.h	2008-10-11 20:23:20 +02:00
Sheng Yang	534e38b447	KVM: VMX: Always return old for clear_flush_young() when using EPT As well as discard fake accessed bit and dirty bit of EPT. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-09-11 11:48:19 +03:00
Joerg Roedel	e5eab0cede	KVM: SVM: fix guest global tlb flushes with NPT Accesses to CR4 are intercepted even with Nested Paging enabled. But the code does not check if the guest wants to do a global TLB flush. So this flush gets lost. This patch adds the check and the flush to svm_set_cr4. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-09-11 11:39:25 +03:00
Joerg Roedel	44874f8491	KVM: SVM: fix random segfaults with NPT enabled This patch introduces a guest TLB flush on every NPF exit in KVM. This fixes random segfaults and #UD exceptions in the guest seen under some workloads (e.g. long running compile workloads or tbench). A kernbench run with and without that fix showed that it has a slowdown lower than 0.5% Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-09-11 11:31:53 +03:00
Sheng Yang	315a6558f3	x86: move VMX MSRs to msr-index.h They are hardware specific MSRs, and we would use them in virtualization feature detection later. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 14:00:55 +02:00
Arjan van de Ven	beb20d52d0	hrtimer: convert kvm to the new hrtimer apis In order to be able to do range hrtimers we need to use accessor functions to the "expire" member of the hrtimer struct. This patch converts KVM to these accessors. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-05 21:35:07 -07:00
Avi Kivity	cd5998ebfb	KVM: MMU: Fix torn shadow pte The shadow code assigns a pte directly in one place, which is nonatomic on i386 can can cause random memory references. Fix by using an atomic setter. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-08-25 17:24:27 +03:00
Avi Kivity	ed84862433	KVM: Advertise synchronized mmu support to userspace Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:34:02 +03:00
Andrea Arcangeli	e930bffe95	KVM: Synchronize guest physical memory map to host virtual memory map Synchronize changes to host virtual addresses which are part of a KVM memory slot to the KVM shadow mmu. This allows pte operations like swapping, page migration, and madvise() to transparently work with KVM. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:33:53 +03:00
Andrea Arcangeli	604b38ac03	KVM: Allow browsing memslots with mmu_lock This allows reading memslots with only the mmu_lock hold for mmu notifiers that runs in atomic context and with mmu_lock held. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:33:50 +03:00
Andrea Arcangeli	a1708ce8a3	KVM: Allow reading aliases with mmu_lock This allows the mmu notifier code to run unalias_gfn with only the mmu_lock held. Only alias writes need the mmu_lock held. Readers will either take the slots_lock in read mode or the mmu_lock. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:33:40 +03:00
Andrea Arcangeli	cddb8a5c14	mmu-notifiers: core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm kvm_arch_create_vm(void) { struct kvm kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-28 16:30:21 -07:00
Sheng Yang	5fdbcb9dd1	KVM: VMX: Fix undefined beaviour of EPT after reload kvm-intel.ko As well as move set base/mask ptes to vmx_init(). Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Sheng Yang	5ec5726a16	KVM: VMX: Fix bypass_guest_pf enabling when disable EPT in module parameter Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Marcelo Tosatti	c93cd3a588	KVM: task switch: translate guest segment limit to virt-extension byte granular field If 'g' is one then limit is 4kb granular. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Avi Kivity	577bdc4966	KVM: Avoid instruction emulation when event delivery is pending When an event (such as an interrupt) is injected, and the stack is shadowed (and therefore write protected), the guest will exit. The current code will see that the stack is shadowed and emulate a few instructions, each time postponing the injection. Eventually the injection may succeed, but at that time the guest may be unwilling to accept the interrupt (for example, the TPR may have changed). This occurs every once in a while during a Windows 2008 boot. Fix by unshadowing the fault address if the fault was due to an event injection. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Marcelo Tosatti	34198bf842	KVM: task switch: use seg regs provided by subarch instead of reading from GDT There is no guarantee that the old TSS descriptor in the GDT contains the proper base address. This is the case for Windows installation's reboot-via-triplefault. Use guest registers instead. Also translate the address properly. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:09 +03:00
Marcelo Tosatti	98899aa0e0	KVM: task switch: segment base is linear address The segment base is always a linear address, so translate before accessing guest memory. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:09 +03:00
Joerg Roedel	5f4cb662a0	KVM: SVM: allow enabling/disabling NPT by reloading only the architecture module If NPT is enabled after loading both KVM modules on AMD and it should be disabled, both KVM modules must be reloaded. If only the architecture module is reloaded the behavior is undefined. With this patch it is possible to disable NPT only by reloading the kvm_amd module. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:09 +03:00
Avi Kivity	722c05f219	KVM: MMU: Fix potential race setting upper shadow ptes on nonpae hosts The direct mapped shadow code (used for real mode and two dimensional paging) sets upper-level ptes using direct assignment rather than calling set_shadow_pte(). A nonpae host will split this into two writes, which opens up a race if another vcpu accesses the same memory area. Fix by calling set_shadow_pte() instead of assigning directly. Noticed by Izik Eidus. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:40 +03:00
Glauber Costa	2a7c5b8b55	KVM: x86 emulator: emulate clflush If the guest issues a clflush in a mmio address, the instruction can trap into the hypervisor. Currently, we do not decode clflush properly, causing the guest to hang. This patch fixes this emulating clflush (opcode 0f ae). Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:40 +03:00
Marcelo Tosatti	376c53c2b3	KVM: MMU: improve invalid shadow root page handling Harden kvm_mmu_zap_page() against invalid root pages that had been shadowed from memslots that are gone. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:40 +03:00
Marcelo Tosatti	34d4cb8fca	KVM: MMU: nuke shadowed pgtable pages and ptes on memslot destruction Flush the shadow mmu before removing regions to avoid stale entries. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:40 +03:00
Avi Kivity	d6e88aec07	KVM: Prefix some x86 low level function with kvm_, to avoid namespace issues Fixes compilation with CONFIG_VMI enabled. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:39 +03:00
Ben-Ami Yassour	c65bbfa1d6	KVM: check injected pic irq within valid pic irqs Check that an injected pic irq is between 0 and 15. Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:39 +03:00
Mohammed Gamal	19fdfa0d13	KVM: x86 emulator: Fix HLT instruction This patch fixes issue encountered with HLT instruction under FreeDOS's HIMEM XMS Driver. The HLT instruction jumped directly to the done label and skips updating the EIP value, therefore causing the guest to spin endlessly on the same instruction. The patch changes the instruction so that it writes back the updated EIP value. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:38 +03:00
Avi Kivity	ac9f6dc0db	KVM: Apply the kernel sigmask to vcpus blocked due to being uninitialized Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:38 +03:00
Sheng Yang	4e1096d27f	KVM: VMX: Add ept_sync_context in flush_tlb Fix a potention issue caused by kvm_mmu_slot_remove_write_access(). The old behavior don't sync EPT TLB with modified EPT entry, which result in inconsistent content of EPT TLB and EPT table. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:38 +03:00
Marcelo Tosatti	5a4c928804	KVM: mmu_shrink: kvm_mmu_zap_page requires slots_lock to be held kvm_mmu_zap_page() needs slots lock held (rmap_remove->gfn_to_memslot, for example). Since kvm_lock spinlock is held in mmu_shrink(), do a non-blocking down_read_trylock(). Untested. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:38 +03:00
Joerg Roedel	0da1db75a2	KVM: SVM: fix suspend/resume support On suspend the svm_hardware_disable function is called which frees all svm_data variables. On resume they are not re-allocated. This patch removes the deallocation of svm_data from the hardware_disable function to the hardware_unsetup function which is not called on suspend. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:37 +03:00
Marcelo Tosatti	f8b78fa3d4	KVM: move slots_lock acquision down to vapic_exit There is no need to grab slots_lock if the vapic_page will not be touched. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:36 +03:00
Chris Lalancette	efa67e0d1f	KVM: VMX: Fake emulate Intel perfctr MSRs Older linux guests (in this case, 2.6.9) can attempt to access the performance counter MSRs without a fixup section, and injecting a GPF kills the guest. Work around by allowing the guest to write those MSRs. Tested by me on RHEL-4 i386 and x86_64 guests, as well as F-9 guests. Signed-off-by: Chris Lalancette <clalance@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:36 +03:00
Sheng Yang	65267ea1b3	KVM: VMX: Fix a wrong usage of vmcs_config The function ept_update_paging_mode_cr0() write to CPU_BASED_VM_EXEC_CONTROL based on vmcs_config.cpu_based_exec_ctrl. That's wrong because the variable may not consistent with the content in the CPU_BASE_VM_EXEC_CONTROL MSR. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:36 +03:00
Avi Kivity	db475c39ec	KVM: MMU: Fix printk format Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:35 +03:00
Avi Kivity	6ada8cca79	KVM: MMU: When debug is enabled, make it a run-time parameter Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:35 +03:00
Avi Kivity	7a5b56dfd3	KVM: x86 emulator: lazily evaluate segment registers Instead of prefetching all segment bases before emulation, read them at the last moment. Since most of them are unneeded, we save some cycles on Intel machines where this is a bit expensive. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:35 +03:00
Avi Kivity	0adc8675d6	KVM: x86 emulator: avoid segment base adjust for lea Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:34 +03:00
Avi Kivity	f5b4edcd52	KVM: x86 emulator: simplify rip relative decoding rip relative decoding is relative to the instruction pointer of the next instruction; by moving address adjustment until after decoding is complete, we remove the need to determine the instruction size. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:34 +03:00
Avi Kivity	84411d85da	KVM: x86 emulator: simplify r/m decoding Consolidate the duplicated code when not in any special case. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:33 +03:00
Avi Kivity	dc71d0f162	KVM: x86 emulator: simplify sib decoding Instead of using sparse switches, use simpler if/else sequences. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:33 +03:00
Avi Kivity	8684c0af0b	KVM: x86 emulator: handle undecoded rex.b with r/m = 5 in certain cases x86_64 does not decode rex.b in certain cases, where the r/m field = 5. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:33 +03:00
Mohammed Gamal	b13354f8f0	KVM: x86 emulator: emulate nop and xchg reg, acc (opcodes 0x90 - 0x97) Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:33 +03:00
Avi Kivity	f76c710d75	KVM: Use printk_rlimit() instead of reporting emulation failures just once Emulation failure reports are useful, so allow more than one per the lifetime of the module. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:32 +03:00
Glauber Costa	25be46080f	KVM: Do not calculate linear rip in emulation failure report If we're not gonna do anything (case in which failure is already reported), we do not need to even bother with calculating the linear rip. Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:32 +03:00
Marcelo Tosatti	622395a9e6	KVM: only abort guest entry if timer count goes from 0->1 Only abort guest entry if the timer count went from 0->1, since for 1->2 or larger the bit will either be set already or a timer irq will have been injected. Using atomic_inc_and_test() for it also introduces an SMP barrier to the LAPIC version (thought it was unecessary because of timer migration, but guest can be scheduled to a different pCPU between exit and kvm_vcpu_block(), so there is the possibility for a race). Noticed by Avi. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:32 +03:00
Laurent Vivier	542472b53e	KVM: Add coalesced MMIO support (x86 part) This patch enables coalesced MMIO for x86 architecture. It defines KVM_MMIO_PAGE_OFFSET and KVM_CAP_COALESCED_MMIO. It enables the compilation of coalesced_mmio.c. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:31 +03:00
Laurent Vivier	92760499d0	KVM: kvm_io_device: extend in_range() to manage len and write attribute Modify member in_range() of structure kvm_io_device to pass length and the type of the I/O (write or read). This modification allows to use kvm_io_device with coalesced MMIO. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:30 +03:00
Avi Kivity	131d82791b	KVM: MMU: Avoid page prefetch on SVM SVM cannot benefit from page prefetching since guest page fault bypass cannot by made to work there. Avoid accessing the guest page table in this case. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:30 +03:00
Avi Kivity	d761a501cf	KVM: MMU: Move nonpaging_prefetch_page() In preparation for next patch. No code change. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:30 +03:00
Avi Kivity	91ed7a0e15	KVM: x86 emulator: implement 'push imm' (opcode 0x68) Encountered in FC6 boot sequence, now that we don't force ss.rpl = 0 during the protected mode transition. Not really necessary, but nice to have. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:29 +03:00
Avi Kivity	19e43636b5	KVM: x86 emulator: simplify push imm8 emulation Instead of fetching the data explicitly, use SrcImmByte. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:29 +03:00
Avi Kivity	eab9f71feb	KVM: MMU: Optimize prefetch_page() Instead of reading each pte individually, read 256 bytes worth of ptes and batch process them. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:28 +03:00
Guillaume Thouvenin	38d5bc6d50	KVM: x86 emulator: Add support for mov r, sreg (0x8c) instruction Add support for mov r, sreg (0x8c) instruction Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:28 +03:00
Guillaume Thouvenin	4257198ae2	KVM: x86 emulator: Add support for mov seg, r (0x8e) instruction Add support for mov r, sreg (0x8c) instruction. [avi: drop the sreg decoding table in favor of 1:1 encoding] Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:28 +03:00
Guillaume Thouvenin	615ac12561	KVM: x86 emulator: adds support to mov r,imm (opcode 0xb8) instruction Add support to mov r, imm (0xb8) instruction. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:27 +03:00
Guillaume Thouvenin	954cd36f76	KVM: x86 emulator: add support for jmp far 0xea Add support for jmp far (opcode 0xea) instruction. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:27 +03:00
Guillaume Thouvenin	89c696383d	KVM: x86 emulator: Update c->dst.bytes in decode instruction Update c->dst.bytes in decode instruction instead of instruction itself. It's needed because if c->dst.bytes is equal to 0, the instruction is not emulated. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:27 +03:00
Guillaume Thouvenin	3e6e0aab1b	KVM: Prefixes segment functions that will be exported with "kvm_" Prefixes functions that will be exported with kvm_. We also prefixed set_segment() even if it still static to be coherent. signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:27 +03:00
Avi Kivity	9ba075a664	KVM: MTRR support Add emulation for the memory type range registers, needed by VMware esx 3.5, and by pci device assignment. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:26 +03:00
Sheng Yang	f08864b42a	KVM: VMX: Enable NMI with in-kernel irqchip Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:26 +03:00
Sheng Yang	3419ffc8e4	KVM: IOAPIC/LAPIC: Enable NMI support [avi: fix ia64 build breakage] Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:25 +03:00
Avi Kivity	50d40d7fb9	KVM: Remove unnecessary ->decache_regs() call Since we aren't modifying any register, there's no need to decache the register state. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:25 +03:00
Avi Kivity	7cc8883074	KVM: Remove decache_vcpus_on_cpu() and related callbacks Obsoleted by the vmx-specific per-cpu list. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:25 +03:00
Avi Kivity	543e424366	KVM: VMX: Add list of potentially locally cached vcpus VMX hardware can cache the contents of a vcpu's vmcs. This cache needs to be flushed when migrating a vcpu to another cpu, or (which is the case that interests us here) when disabling hardware virtualization on a cpu. The current implementation of decaching iterates over the list of all vcpus, picks the ones that are potentially cached on the cpu that is being offlined, and flushes the cache. The problem is that it uses mutex_trylock() to gain exclusive access to the vcpu, which fires off a (benign) warning about using the mutex in an interrupt context. To avoid this, and to make things generally nicer, add a new per-cpu list of potentially cached vcus. This makes the decaching code much simpler. The list is vmx-specific since other hardware doesn't have this issue. [andrea: fix crash on suspend/resume] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:42:24 +03:00
Avi Kivity	4ecac3fd6d	KVM: Handle virtualization instruction #UD faults during reboot KVM turns off hardware virtualization extensions during reboot, in order to disassociate the memory used by the virtualization extensions from the processor, and in order to have the system in a consistent state. Unfortunately virtual machines may still be running while this goes on, and once virtualization extensions are turned off, any virtulization instruction will #UD on execution. Fix by adding an exception handler to virtualization instructions; if we get an exception during reboot, we simply spin waiting for the reset to complete. If it's a true exception, BUG() so we can have our stack trace. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:41:43 +03:00
Avi Kivity	1b7fcd3263	KVM: MMU: Fix false flooding when a pte points to page table The KVM MMU tries to detect when a speculative pte update is not actually used by demand fault, by checking the accessed bit of the shadow pte. If the shadow pte has not been accessed, we deem that page table flooded and remove the shadow page table, allowing further pte updates to proceed without emulation. However, if the pte itself points at a page table and only used for write operations, the accessed bit will never be set since all access will happen through the emulator. This is exactly what happens with kscand on old (2.4.x) HIGHMEM kernels. The kernel points a kmap_atomic() pte at a page table, and then proceeds with read-modify-write operations to look at the dirty and accessed bits. We get a false flood trigger on the kmap ptes, which results in the mmu spending all its time setting up and tearing down shadows. Fix by setting the shadow accessed bit on emulated accesses. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:50 +03:00
Avi Kivity	7682f2d0dd	KVM: VMX: Trivial vmcs_write64() code simplification Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:50 +03:00
Chris Lalancette	14ae51b6c0	KVM: SVM: Fake MSR_K7 performance counters Attached is a patch that fixes a guest crash when booting older Linux kernels. The problem stems from the fact that we are currently emulating MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3]. Because of this, setup_k7_watchdog() in the Linux kernel receives a GPF when it attempts to write into MSR_K7_PERFCTR, which causes an OOPs. The patch fixes it by just "fake" emulating the appropriate MSRs, throwing away the data in the process. This causes the NMI watchdog to not actually work, but it's not such a big deal in a virtualized environment. When we get a write to one of these counters, we printk_ratelimit() a warning. I decided to print it out for all writes, even if the data is 0; it doesn't seem to make sense to me to special case when data == 0. Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP 64-bit guest. Signed-off-by: Chris Lalancette <clalance@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:49 +03:00
Aurelien Jarno	f697554515	KVM: PIT: support mode 3 The in-kernel PIT emulation ignores pending timers if operating under mode 3, which for example Hurd uses. This mode should output a square wave, high for (N+1)/2 counts and low for (N-1)/2 counts. As we only care about the resulting interrupts, the period is N, and mode 3 is the same as mode 2 with regard to interrupts. Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:49 +03:00
Joerg Roedel	d2ebb4103f	KVM: SVM: add tracing support for TDP page faults To distinguish between real page faults and nested page faults they should be traced as different events. This is implemented by this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:48 +03:00
Joerg Roedel	af9ca2d703	KVM: SVM: add missing kvmtrace markers This patch adds the missing kvmtrace markers to the svm module of kvm. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:48 +03:00
Joerg Roedel	54e445ca84	KVM: add missing kvmtrace bits This patch adds some kvmtrace bits to the generic x86 code where it is instrumented from SVM. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:48 +03:00
Joerg Roedel	a069805579	KVM: SVM: implement dedicated INTR exit handler With an exit handler for INTR intercepts its possible to account them using kvmtrace. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:47 +03:00
Joerg Roedel	c47f098d69	KVM: SVM: implement dedicated NMI exit handler With an exit handler for NMI intercepts its possible to account them using kvmtrace. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:47 +03:00
Joerg Roedel	c7bf23babc	KVM: VMX: move APIC_ACCESS trace entry to generic code This patch moves the trace entry for APIC accesses from the VMX code to the generic lapic code. This way APIC accesses from SVM will also be traced. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:47 +03:00
Harvey Harrison	8b2cf73cc1	KVM: add statics were possible, function definition in lapic.h Noticed by sparse: arch/x86/kvm/vmx.c:1583:6: warning: symbol 'vmx_disable_intercept_for_msr' was not declared. Should it be static? arch/x86/kvm/x86.c:3406:5: warning: symbol 'kvm_task_switch_16' was not declared. Should it be static? arch/x86/kvm/x86.c:3429:5: warning: symbol 'kvm_task_switch_32' was not declared. Should it be static? arch/x86/kvm/mmu.c:1968:6: warning: symbol 'kvm_mmu_remove_one_alloc_mmu_page' was not declared. Should it be static? arch/x86/kvm/mmu.c:2014:6: warning: symbol 'mmu_destroy_caches' was not declared. Should it be static? arch/x86/kvm/lapic.c:862:5: warning: symbol 'kvm_lapic_get_base' was not declared. Should it be static? arch/x86/kvm/i8254.c:94:5: warning: symbol 'pit_get_gate' was not declared. Should it be static? arch/x86/kvm/i8254.c:196:5: warning: symbol '__pit_timer_fn' was not declared. Should it be static? arch/x86/kvm/i8254.c:561:6: warning: symbol '__inject_pit_timer_intr' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-20 12:40:46 +03:00
Jens Axboe	15c8b6c1aa	on_each_cpu(): kill unused 'retry' parameter It's not even passed on to smp_call_function() anymore, since that was removed. So kill it. Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-06-26 11:24:38 +02:00
Jens Axboe	8691e5a8f6	smp_call_function: get rid of the unused nonatomic/retry argument It's never used and the comments refer to nonatomic and retry interchangably. So get rid of it. Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-06-26 11:24:35 +02:00
Gerd Hoffmann	50d0a0f987	KVM: Make kvm host use the paravirt clocksource structs This patch updates the kvm host code to use the pvclock structs. It also makes the paravirt clock compatible with Xen. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 21:02:32 +03:00
Avi Kivity	a9b21b6229	KVM: VMX: Fix host msr corruption with preemption enabled Switching msrs can occur either synchronously as a result of calls to the msr management functions (usually in response to the guest touching virtualized msrs), or asynchronously when preempting a kvm thread that has guest state loaded. If we're unlucky enough to have the two at the same time, host msrs are corrupted and the machine goes kaput on the next syscall. Most easily triggered by Windows Server 2008, as it does a lot of msr switching during bootup. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 12:26:17 +03:00
Avi Kivity	6bf6a9532f	KVM: MMU: Fix oops on guest userspace access to guest pagetable KVM has a heuristic to unshadow guest pagetables when userspace accesses them, on the assumption that most guests do not allow userspace to access pagetables directly. Unfortunately, in addition to unshadowing the pagetables, it also oopses. This never triggers on ordinary guests since sane OSes will clear the pagetables before assigning them to userspace, which will trigger the flood heuristic, unshadowing the pagetables before the first userspace access. One particular guest, though (Xenner) will run the kernel in userspace, triggering the oops. Since the heuristic is incorrect in this case, we can simply remove it. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 12:20:12 +03:00
Marcelo Tosatti	3094538739	KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend) kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed guests properly. It will instantiate two 2MB sptes pointing to the same physical 2MB page when a guest large pte update is trapped. Instead of duplicating code to handle this, disallow directory level updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes emulating one guest 4MB pte can be correctly created by the page fault handling path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 12:18:18 +03:00
Marcelo Tosatti	6597ca09e6	KVM: MMU: Fix rmap_write_protect() hugepage iteration bug rmap_next() does not work correctly after rmap_remove(), as it expects the rmap chains not to change during iteration. Fix (for now) by restarting iteration from the beginning. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 12:17:10 +03:00
Marcelo Tosatti	06e0564566	KVM: close timer injection race window in __vcpu_run If a timer fires after kvm_inject_pending_timer_irqs() but before local_irq_disable() the code will enter guest mode and only inject such timer interrupt the next time an unrelated event causes an exit. It would be simpler if the timer->pending irq conversion could be done with IRQ's disabled, so that the above problem cannot happen. For now introduce a new vcpu requests bit to cancel guest entry. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 12:16:59 +03:00
Marcelo Tosatti	d4acf7e7ab	KVM: Fix race between timer migration and vcpu migration A guest vcpu instance can be scheduled to a different physical CPU between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable(). If that happens, the timer will only be migrated to the current pCPU on the next exit, meaning that guest LAPIC timer event can be delayed until a host interrupt is triggered. Fix it by cancelling guest entry if any vcpu request is pending. This has the side effect of nicely consolidating vcpu->requests checks. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-24 12:16:52 +03:00
Avi Kivity	3c9155106d	KVM: MMU: Fix is_empty_shadow_page() check The check is only looking at one of two possible empty ptes. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-06 21:36:33 +03:00
Avi Kivity	ebb0e6264c	KVM: MMU: Fix printk() format string Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-06 21:36:20 +03:00
Avi Kivity	8d2d73b9a5	KVM: MMU: reschedule during shadow teardown Shadows for large guests can take a long time to tear down, so reschedule occasionally to avoid softlockup warnings. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-06 21:32:20 +03:00
Eli Collins	e693d71b46	KVM: VMX: Clear CR4.VMXE in hardware_disable Clear CR4.VMXE in hardware_disable. There's no reason to leave it set after doing a VMXOFF. VMware Workstation 6.5 checks CR4.VMXE as a proxy for whether the CPU is in VMX mode, so leaving VMXE set means we'll refuse to power on. With this change the user can power on after unloading the kvm-intel module. I tested on kvm-67 and kvm-69. Signed-off-by: Eli Collins <ecollins@vmware.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-06 21:30:20 +03:00
Marcelo Tosatti	2f5997140f	KVM: migrate PIT timer Migrate the PIT timer to the physical CPU which vcpu0 is scheduled on, similarly to what is done for the LAPIC timers, otherwise PIT interrupts will be delayed until an unrelated event causes an exit. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-06 21:25:51 +03:00
Avi Kivity	33e3885de2	KVM: x86 emulator: fix hypercall return value on AMD The hypercall instructions on Intel and AMD are different. KVM allows the guest to choose one or the other (the default is Intel), and if the guest chooses incorrectly, KVM will patch it at runtime to select the correct instruction. This allows live migration between Intel and AMD machines. This patching occurs in the x86 emulator. The current code also executes the hypercall. Unfortunately, the tail end of the x86 emulator code also executes, overwriting the return value of the hypercall with the original contents of rax (which happens to be the hypercall number). Fix not by executing the hypercall in the emulator context; instead let the guest reissue the patched instruction and execute the hypercall via the normal path. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-06-06 21:08:25 +03:00
Ingo Molnar	2ddfd20e7c	namespacecheck: automated fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-23 14:08:06 +02:00
Marcelo Tosatti	54aaacee35	KVM: LAPIC: ignore pending timers if LVTT is disabled Only use the APIC pending timers count to break out of HLT emulation if the timer vector is enabled. Certain configurations of Windows simply mask out the vector without disabling the timer. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-18 14:39:39 +03:00
Marcelo Tosatti	eedaa4e2af	KVM: PIT: take inject_pending into account when emulating hlt Otherwise hlt emulation fails if PIT is not injecting IRQ's. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-18 14:34:15 +03:00
Avi Kivity	107d6d2efa	KVM: x86 emulator: fix writes to registers with modrm encodings A register destination encoded with a mod=3 encoding left dst.ptr NULL. Normally we don't trap writes to registers, but in the case of smsw, we do. Fix by pointing dst.ptr at the destination register. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-18 14:34:14 +03:00
Avi Kivity	93df766322	KVM: MMU: Allow more than PAGES_PER_HPAGE write protections per large page nonpae guests can call rmap_write_protect twice per page (for page tables) or four times per page (for page directories), triggering a bogus warning. Remove the warning. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:49 +03:00
Andrea Arcangeli	bc1a34f1bf	KVM: avoid fx_init() schedule in atomic This make sure not to schedule in atomic during fx_init. I also changed the name of fpu_init to fx_finit to avoid duplicating the name with fpu_init that is already used in the kernel, this makes grep simpler if nothing else. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:48 +03:00
Jan Kiszka	b4f14abd95	KVM: Avoid spurious execeptions after setting registers Clear pending exceptions when setting new register values. This avoids spurious exceptions after restoring a vcpu state or after reset-on-triple-fault. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:47 +03:00
Marcelo Tosatti	ece15babfa	KVM: PIT: support mode 4 The in-kernel PIT emulation ignores pending timers if operating under mode 4, which for example DragonFlyBSD uses (and Plan9 too, apparently). Mode 4 seems to be similar to one-shot mode, other than the fact that it starts counting after the next CLK pulse once programmed, while mode 1 starts counting immediately, so add a FIXME to enhance precision. Fixes sourceforge bug `1952988`. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:46 +03:00
Avi Kivity	dc7457ea52	KVM: x86 emulator: disable writeback on lmsw The recent changes allowing memory operands with lmsw and smsw left lmsw with writeback enabled. Since lmsw has no oridinary destination operand, the dst pointer was not initialized, resulting in an oops. Close the hole by disabling writeback for lmsw. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:45 +03:00
Izik Eidus	3fe913e7c5	KVM: x86: task switch: fix wrong bit setting for the busy flag The busy bit is bit 1 of the type field, not bit 8. Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:43 +03:00
Sheng Yang	1439442c7b	KVM: VMX: Enable EPT feature for KVM Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:42 +03:00
Sheng Yang	b7ebfb0509	KVM: VMX: Prepare an identity page table for EPT in real mode [aliguory: plug leak] Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:41 +03:00
Sheng Yang	1ac593c97e	KVM: MMU: Remove #ifdef CONFIG_X86_64 to support 4 level EPT Currently EPT level is 4 for both pae and x86_64. The patch remove the #ifdef for alloc root_hpa and free root_hpa to support EPT. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:39 +03:00
Sheng Yang	7b52345e2c	KVM: MMU: Add EPT support Enable kvm_set_spte() to generate EPT entries. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:38 +03:00
Sheng Yang	67253af52e	KVM: Add kvm_x86_ops get_tdp_level() The function get_tdp_level() provided the number of tdp level for EPT and NPT rather than the NPT specific macro. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 14:44:34 +03:00
Sheng Yang	8c6d6adc6b	KVM: MMU: Move some definitions to a header file Move some definitions to mmu.h in order to allow building common table entries between EPT and non-EPT. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 12:26:38 +03:00
Sheng Yang	d56f546db9	KVM: VMX: EPT Feature Detection Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-05-04 12:26:38 +03:00
Roman Zippel	6f6d6a1a6a	rename div64_64 to div64_u64 Rename div64_64 to div64_u64 to make it consistent with the other divide functions, so it clearly includes the type of the divide. Move its definition to math64.h as currently no architecture overrides the generic implementation. They can still override it of course, but the duplicated declarations are avoided. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Avi Kivity <avi@qumranet.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Marcelo Tosatti	960b399169	KVM: MMU: kvm_pv_mmu_op should not take mmap_sem kvm_pv_mmu_op should not take mmap_sem. All gfn_to_page() callers down in the MMU processing will take it if necessary, so as it is it can deadlock. Apparently a leftover from the days before slots_lock. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:45 +03:00
Joerg Roedel	1336028b9a	KVM: SVM: remove selective CR0 comment There is not selective cr0 intercept bug. The code in the comment sets the CR0.PG bit. But KVM sets the CR4.PG bit for SVM always to implement the paged real mode. So the 'mov %eax,%cr0' instruction does not change the CR0.PG bit. Selective CR0 intercepts only occur when a bit is actually changed. So its the right behavior that there is no intercept on this instruction. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:44 +03:00
Joerg Roedel	aaf697e4e0	KVM: SVM: remove now obsolete FIXME comment With the usage of the V_TPR field this comment is now obsolete. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:43 +03:00
Joerg Roedel	aaacfc9ae2	KVM: SVM: disable CR8 intercept when tpr is not masking interrupts This patch disables the intercept of CR8 writes if the TPR is not masking interrupts. This reduces the total number CR8 intercepts to below 1 percent of what we have without this patch using Windows 64 bit guests. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:43 +03:00
Joerg Roedel	d7bf8221a3	KVM: SVM: sync V_TPR with LAPIC.TPR if CR8 write intercept is disabled If the CR8 write intercept is disabled the V_TPR field of the VMCB needs to be synced with the TPR field in the local apic. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:42 +03:00
Joerg Roedel	ec7cf6903f	KVM: export kvm_lapic_set_tpr() to modules This patch exports the kvm_lapic_set_tpr() function from the lapic code to modules. It is required in the kvm-amd module to optimize CR8 intercepts. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:41 +03:00
Joerg Roedel	649d68643e	KVM: SVM: sync TPR value to V_TPR field in the VMCB This patch adds syncing of the lapic.tpr field to the V_TPR field of the VMCB. With this change we can safely remove the CR8 read intercept. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:40 +03:00
Avi Kivity	f9b7aab35c	KVM: x86 emulator: fix lea to really get the effective address We never hit this, since there is currently no reason to emulate lea. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:35 +03:00
Avi Kivity	16286d082d	KVM: x86 emulator: fix smsw and lmsw with a memory operand lmsw and smsw were implemented only with a register operand. Extend them to support a memory operand as well. Fixes Windows running some display compatibility test on AMD hosts. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:34 +03:00
Avi Kivity	66b8550573	KVM: x86 emulator: initialize src.val and dst.val for register operands This lets us treat the case where mod == 3 in the same manner as other cases. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:33 +03:00
Avi Kivity	a79d2f1805	KVM: SVM: force a new asid when initializing the vmcb Shutdown interception clears the vmcb, leaving the asid at zero (which is illegal. so force a new asid on vmcb initialization. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:32 +03:00
Marcelo Tosatti	e9571ed54b	KVM: fix kvm_vcpu_kick vs __vcpu_run race There is a window open between testing of pending IRQ's and assignment of guest_mode in __vcpu_run. Injection of IRQ's can race with __vcpu_run as follows: CPU0 CPU1 kvm_x86_ops->run() vcpu->guest_mode = 0 SET_IRQ_LINE ioctl .. kvm_x86_ops->inject_pending_irq kvm_cpu_has_interrupt() apic_test_and_set_irr() kvm_vcpu_kick if (vcpu->guest_mode) send_ipi() vcpu->guest_mode = 1 So move guest_mode=1 assignment before ->inject_pending_irq, and make sure that it won't reorder after it. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:32 +03:00
Marcelo Tosatti	62d9f0dbc9	KVM: add ioctls to save/store mpstate So userspace can save/restore the mpstate during migration. [avi: export the #define constants describing the value] [christian: add s390 stubs] [avi: ditto for ia64] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 18:21:16 +03:00
Avi Kivity	a45352908b	KVM: Rename VCPU_MP_STATE_* to KVM_MP_STATE_* We wish to export it to userspace, so move it into the kvm namespace. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:04:13 +03:00
Marcelo Tosatti	3d80840d96	KVM: hlt emulation should take in-kernel APIC/PIT timers into account Timers that fire between guest hlt and vcpu_block's add_wait_queue() are ignored, possibly resulting in hangs. Also make sure that atomic_inc and waitqueue_active tests happen in the specified order, otherwise the following race is open: CPU0 CPU1 if (waitqueue_active(wq)) add_wait_queue() if (!atomic_read(pit_timer->pending)) schedule() atomic_inc(pit_timer->pending) Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:04:11 +03:00
Joerg Roedel	3564990af1	KVM: SVM: do not intercept task switch with NPT When KVM uses NPT there is no reason to intercept task switches. This patch removes the intercept for it in that case. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:23 +03:00
Feng(Eric) Liu	d4c9ff2d1b	KVM: Add kvm trace userspace interface This interface allows user a space application to read the trace of kvm related events through relayfs. Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:22 +03:00
Feng (Eric) Liu	2714d1d3d6	KVM: Add trace markers Trace markers allow userspace to trace execution of a virtual machine in order to monitor its performance. Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:19 +03:00
Joerg Roedel	53371b5098	KVM: SVM: add intercept for machine check exception To properly forward a MCE occured while the guest is running to the host, we have to intercept this exception and call the host handler by hand. This is implemented by this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:18 +03:00
Joerg Roedel	6394b6494c	KVM: SVM: align shadow CR4.MCE with host This patch aligns the host version of the CR4.MCE bit with the CR4 active in the guest. This is necessary to get MCE exceptions when the guest is running. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:18 +03:00
Joerg Roedel	ec077263b2	KVM: SVM: indent svm_set_cr4 with tabs instead of spaces The svm_set_cr4 function is indented with spaces. This patch replaces them with tabs. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:17 +03:00
Anthony Liguori	35149e2129	KVM: MMU: Don't assume struct page for x86 This patch introduces a gfn_to_pfn() function and corresponding functions like kvm_release_pfn_dirty(). Using these new functions, we can modify the x86 MMU to no longer assume that it can always get a struct page for any given gfn. We don't want to eliminate gfn_to_page() entirely because a number of places assume they can do gfn_to_page() and then kmap() the results. When we support IO memory, gfn_to_page() will fail for IO pages although gfn_to_pfn() will succeed. This does not implement support for avoiding reference counting for reserved RAM or for IO memory. However, it should make those things pretty straight forward. Since we're only introducing new common symbols, I don't think it will break the non-x86 architectures but I haven't tested those. I've tested Intel, AMD, NPT, and hugetlbfs with Windows and Linux guests. [avi: fix overflow when shifting left pfns by adding casts] Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:01:15 +03:00
Marcelo Tosatti	bed1d1dfc4	KVM: MMU: prepopulate guest pages after write-protecting Zdenek reported a bug where a looping "dmsetup status" eventually hangs on SMP guests. The problem is that kvm_mmu_get_page() prepopulates the shadow MMU before write protecting the guest page tables. By doing so, it leaves a window open where the guest can mark a pte as present while the host has shadow cached such pte as "notrap". Accesses to such address will fault in the guest without the host having a chance to fix the situation. Fix by moving the write protection before the pte prefetch. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:58 +03:00
Avi Kivity	fcd6dbac92	KVM: MMU: Only mark_page_accessed() if the page was accessed by the guest If the accessed bit is not set, the guest has never accessed this page (at least through this spte), so there's no need to mark the page accessed. This provides more accurate data for the eviction algortithm. Noted by Andrea Arcangeli. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:57 +03:00
Avi Kivity	3d45830c2b	KVM: Free apic access page on vm destruction Noticed by Marcelo Tosatti. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:54 +03:00
Izik Eidus	3ee16c8145	KVM: MMU: allow the vm to shrink the kvm mmu shadow caches Allow the Linux memory manager to reclaim memory in the kvm shadow cache. Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:53 +03:00
Marcelo Tosatti	3200f405a1	KVM: MMU: unify slots_lock usage Unify slots_lock acquision around vcpu_run(). This is simpler and less error-prone. Also fix some callsites that were not grabbing the lock properly. [avi: drop slots_lock while in guest mode to avoid holding the lock for indefinite periods] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:52 +03:00
Sheng Yang	25c5f225be	KVM: VMX: Enable MSR Bitmap feature MSR Bitmap controls whether the accessing of an MSR causes VM Exit. Eliminating exits on automatically saved and restored MSRs yields a small performance gain. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:52 +03:00
Izik Eidus	37817f2982	KVM: x86: hardware task switching support This emulates the x86 hardware task switch mechanism in software, as it is unsupported by either vmx or svm. It allows operating systems which use it, like freedos, to run as kvm guests. Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:39 +03:00
Izik Eidus	2e4d265349	KVM: x86: add functions to get the cpl of vcpu Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:38 +03:00
Avi Kivity	4c9fc8ef50	KVM: VMX: Add module option to disable flexpriority Useful for debugging. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:37 +03:00
Avi Kivity	268fe02ae0	KVM: no longer EXPERIMENTAL Long overdue. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:36 +03:00
Avi Kivity	0b49ea8659	KVM: MMU: Introduce and use spte_to_page() Encapsulate the pte mask'n'shift in a function. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:35 +03:00
Izik Eidus	855149aaa9	KVM: MMU: fix dirty bit setting when removing write permissions When mmu_set_spte() checks if a page related to spte should be release as dirty or clean, it check if the shadow pte was writeble, but in case rmap_write_protect() is called called it is possible for shadow ptes that were writeble to become readonly and therefor mmu_set_spte will release the pages as clean. This patch fix this issue by marking the page as dirty inside rmap_write_protect(). Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:34 +03:00
Avi Kivity	947da53830	KVM: MMU: Set the accessed bit on non-speculative shadow ptes If we populate a shadow pte due to a fault (and not speculatively due to a pte write) then we can set the accessed bit on it, as we know it will be set immediately on the next guest instruction. This saves a read-modify-write operation. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:33 +03:00
Marcelo Tosatti	2f333bcb4e	KVM: MMU: hypercall based pte updates and TLB flushes Hypercall based pte updates are faster than faults, and also allow use of the lazy MMU mode to batch operations. Don't report the feature if two dimensional paging is enabled. [avi: - one mmu_op hypercall instead of one per op - allow 64-bit gpa on hypercall - don't pass host errors (-ENOMEM) to guest] [akpm: warning fix on i386] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:27 +03:00
Avi Kivity	9f81128591	KVM: Provide unlocked version of emulator_write_phys() Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:26 +03:00
Marcelo Tosatti	a28e4f5a62	KVM: add basic paravirt support Add basic KVM paravirt support. Avoid vm-exits on IO delays. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:24 +03:00
Sheng Yang	308b0f239e	KVM: Add reset support for in kernel PIT Separate the reset part and prepare for reset support. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:23 +03:00
Sheng Yang	e0f63cb927	KVM: Add save/restore supporting of in kernel PIT Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:22 +03:00
Sheng Yang	7837699fa6	KVM: In kernel PIT model The patch moves the PIT model from userspace to kernel, and increases the timer accuracy greatly. [marcelo: make last_injected_time per-guest] Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Tested-and-Acked-by: Alex Davis <alex14641@yahoo.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:21 +03:00
Avi Kivity	4fcaa98267	KVM: Remove pointless desc_ptr #ifdef The desc_struct changes left an unnecessary #ifdef; remove it. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:27 +03:00
Avi Kivity	019960ae99	KVM: VMX: Don't adjust tsc offset forward Most Intel hosts have a stable tsc, and playing with the offset only reduces accuracy. By limiting tsc offset adjustment only to forward updates, we effectively disable tsc offset adjustment on these hosts. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:27 +03:00
Harvey Harrison	b8688d51bb	KVM: replace remaining __FUNCTION__ occurances __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:27 +03:00
Joerg Roedel	71c4dfafc0	KVM: detect if VCPU triple faults In the current inject_page_fault path KVM only checks if there is another PF pending and injects a DF then. But it has to check for a pending DF too to detect a shutdown condition in the VCPU. If this is not detected the VCPU goes to a PF -> DF -> PF loop when it should triple fault. This patch detects this condition and handles it with an KVM_SHUTDOWN exit to userspace. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:27 +03:00
Avi Kivity	2d3ad1f40c	KVM: Prefix control register accessors with kvm_ to avoid namespace pollution Names like 'set_cr3()' look dangerously close to affecting the host. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:26 +03:00
Marcelo Tosatti	05da45583d	KVM: MMU: large page support Create large pages mappings if the guest PTE's are marked as such and the underlying memory is hugetlbfs backed. If the largepage contains write-protected pages, a large pte is not used. Gives a consistent 2% improvement for data copies on ram mounted filesystem, without NPT/EPT. Anthony measures a 4% improvement on 4-way kernbench, with NPT. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:25 +03:00
Marcelo Tosatti	2e53d63acb	KVM: MMU: ignore zapped root pagetables Mark zapped root pagetables as invalid and ignore such pages during lookup. This is a problem with the cr3-target feature, where a zapped root table fools the faulting code into creating a read-only mapping. The result is a lockup if the instruction can't be emulated. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:25 +03:00
Alexander Graf	847f0ad8cb	KVM: Implement dummy values for MSR_PERF_STATUS Darwin relies on this and ceases to work without. Signed-off-by: Alexander Graf <alex@csgraf.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:25 +03:00
Harvey Harrison	14af3f3c56	KVM: sparse fixes for kvm/x86.c In two case statements, use the ever popular 'i' instead of index: arch/x86/kvm/x86.c:1063:7: warning: symbol 'index' shadows an earlier one arch/x86/kvm/x86.c:1000:9: originally declared here arch/x86/kvm/x86.c:1079:7: warning: symbol 'index' shadows an earlier one arch/x86/kvm/x86.c:1000:9: originally declared here Make it static. arch/x86/kvm/x86.c:1945:24: warning: symbol 'emulate_ops' was not declared. Should it be static? Drop the return statements. arch/x86/kvm/x86.c:2878:2: warning: returning void-valued expression arch/x86/kvm/x86.c:2944:2: warning: returning void-valued expression Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:24 +03:00
Harvey Harrison	4866d5e3d5	KVM: SVM: make iopm_base static Fixes sparse warning as well. arch/x86/kvm/svm.c:69:15: warning: symbol 'iopm_base' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:24 +03:00
Harvey Harrison	77cd337f22	KVM: x86 emulator: fix sparse warnings in x86_emulate.c Nesting __emulate_2op_nobyte inside__emulate_2op produces many shadowed variable warnings on the internal variable _tmp used by both macros. Change the outer macro to use __tmp. Avoids a sparse warning like the following at every call site of __emulate_2op arch/x86/kvm/x86_emulate.c:1091:3: warning: symbol '_tmp' shadows an earlier one arch/x86/kvm/x86_emulate.c:1091:3: originally declared here [18 more warnings suppressed] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:24 +03:00
Amit Shah	f11c3a8d84	KVM: Add stat counter for hypercalls Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:24 +03:00
Avi Kivity	a5f61300c4	KVM: Use x86's segment descriptor struct instead of private definition The x86 desc_struct unification allows us to remove segment_descriptor.h. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:24 +03:00
Avi Kivity	a988b910ef	KVM: Add API for determining the number of supported memory slots Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:23 +03:00
Avi Kivity	f725230af9	KVM: Add API to retrieve the number of supported vcpus per vm Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:23 +03:00
Harvey Harrison	7a95727567	KVM: x86 emulator: make register_address_increment and JMP_REL static inlines Change jmp_rel() to a function as well. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:23 +03:00
Harvey Harrison	e4706772ea	KVM: x86 emulator: make register_address, address_mask static inlines Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:22 +03:00
Harvey Harrison	ddcb2885e2	KVM: x86 emulator: add ad_mask static inline Replaces open-coded mask calculation in macros. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:22 +03:00
Glauber de Oliveira Costa	18068523d3	KVM: paravirtualized clocksource: host part This is the host part of kvm clocksource implementation. As it does not include clockevents, it is a fairly simple implementation. We only have to register a per-vcpu area, and start writing to it periodically. The area is binary compatible with xen, as we use the same shadow_info structure. [marcelo: fix bad_page on MSR_KVM_SYSTEM_TIME] [avi: save full value of the msr, even if enable bit is clear] [avi: clear previous value of time_page] Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:22 +03:00
Joerg Roedel	24e09cbf48	KVM: SVM: enable LBR virtualization This patch implements the Last Branch Record Virtualization (LBRV) feature of the AMD Barcelona and Phenom processors into the kvm-amd module. It will only be enabled if the guest enables last branch recording in the DEBUG_CTL MSR. So there is no increased world switch overhead when the guest doesn't use these MSRs. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:21 +03:00
Joerg Roedel	f65c229c3e	KVM: SVM: allocate the MSR permission map per VCPU This patch changes the kvm-amd module to allocate the SVM MSR permission map per VCPU instead of a global map for all VCPUs. With this we have more flexibility allowing specific guests to access virtualized MSRs. This is required for LBR virtualization. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:21 +03:00
Joerg Roedel	e6101a96c9	KVM: SVM: let init_vmcb() take struct vcpu_svm as parameter Change the parameter of the init_vmcb() function in the kvm-amd module from struct vmcb to struct vcpu_svm. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:21 +03:00
Ryan Harper	2e11384c2c	KVM: VMX: fix typo in VMX header define Looking at Intel Volume 3b, page 148, table 20-11 and noticed that the field name is 'Deliver' not 'Deliever'. Attached patch changes the define name and its user in vmx.c Signed-off-by: Ryan Harper <ryanh@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:21 +03:00
Joerg Roedel	709ddebf81	KVM: SVM: add support for Nested Paging This patch contains the SVM architecture dependent changes for KVM to enable support for the Nested Paging feature of AMD Barcelona and Phenom processors. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:21 +03:00
Joerg Roedel	fb72d1674d	KVM: MMU: add TDP support to the KVM MMU This patch contains the changes to the KVM MMU necessary for support of the Nested Paging feature in AMD Barcelona and Phenom Processors. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:20 +03:00
Joerg Roedel	cc4b6871e7	KVM: export the load_pdptrs() function to modules The load_pdptrs() function is required in the SVM module for NPT support. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:20 +03:00
Joerg Roedel	4d9976bbdc	KVM: MMU: make the __nonpaging_map function generic The mapping function for the nonpaging case in the softmmu does basically the same as required for Nested Paging. Make this function generic so it can be used for both. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:20 +03:00
Joerg Roedel	1855267210	KVM: export information about NPT to generic x86 code The generic x86 code has to know if the specific implementation uses Nested Paging. In the generic code Nested Paging is called Two Dimensional Paging (TDP) to avoid confusion with (future) TDP implementations of other vendors. This patch exports the availability of TDP to the generic x86 code. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:19 +03:00
Joerg Roedel	6c7dac72d5	KVM: SVM: add module parameter to disable Nested Paging To disable the use of the Nested Paging feature even if it is available in hardware this patch adds a module parameter. Nested Paging can be disabled by passing npt=0 to the kvm_amd module. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:19 +03:00
Joerg Roedel	e3da3acdb3	KVM: SVM: add detection of Nested Paging feature Let SVM detect if the Nested Paging feature is available on the hardware. Disable it to keep this patch series bisectable. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:19 +03:00
Joerg Roedel	33bd6a0b3e	KVM: SVM: move feature detection to hardware setup code By moving the SVM feature detection from the each_cpu code to the hardware setup code it runs only once. As an additional advance the feature check is now available earlier in the module setup process. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:19 +03:00
Joerg Roedel	9457a712a2	KVM: allow access to EFER in 32bit KVM This patch makes the EFER register accessible on a 32bit KVM host. This is necessary to boot 32 bit PAE guests under SVM. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:19 +03:00
Joerg Roedel	9f62e19a11	KVM: VMX: unifdef the EFER specific code To allow access to the EFER register in 32bit KVM the EFER specific code has to be exported to the x86 generic code. This patch does this in a backwards compatible manner. [avi: add check for EFER-less hosts] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:18 +03:00
Joerg Roedel	50a37eb4e0	KVM: align valid EFER bits with the features of the host system This patch aligns the bits the guest can set in the EFER register with the features in the host processor. Currently it lets EFER.NX disabled if the processor does not support it and enables EFER.LME and EFER.LMA only for KVM on 64 bit hosts. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:18 +03:00
Joerg Roedel	f2b4b7ddf6	KVM: make EFER_RESERVED_BITS configurable for architecture code This patch give the SVM and VMX implementations the ability to add some bits the guest can set in its EFER register. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:18 +03:00
Sheng Yang	2384d2b326	KVM: VMX: Enable Virtual Processor Identification (VPID) To allow TLB entries to be retained across VM entry and VM exit, the VMM can now identify distinct address spaces through a new virtual-processor ID (VPID) field of the VMCS. [avi: drop vpid_sync_all()] [avi: add "cc" to asm constraints] Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:17 +03:00
Avi Kivity	d196e34336	KVM: MMU: Decouple mmio from shadow page tables Currently an mmio guest pte is encoded in the shadow pagetable as a not-present trapping pte, with the SHADOW_IO_MARK bit set. However nothing is ever done with this information, so maintaining it is a useless complication. This patch moves the check for mmio to before shadow ptes are instantiated, so the shadow code is never invoked for ptes that reference mmio. The code is simpler, and with future work, can be made to handle mmio concurrently. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:17 +03:00
Avi Kivity	1d6ad2073e	KVM: x86 emulator: group decoding for group 1 instructions Opcodes 0x80-0x83 Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:16 +03:00
Avi Kivity	d95058a1a7	KVM: x86 emulator: add group 7 decoding This adds group decoding for opcode 0x0f 0x01 (group 7). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:15 +03:00
Avi Kivity	fd60754e4f	KVM: x86 emulator: Group decoding for groups 4 and 5 Add group decoding support for opcode 0xfe (group 4) and 0xff (group 5). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:15 +03:00
Avi Kivity	7d858a19ef	KVM: x86 emulator: Group decoding for group 3 This adds group decoding support for opcodes 0xf6, 0xf7 (group 3). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:14 +03:00
Avi Kivity	43bb19cd33	KVM: x86 emulator: group decoding for group 1A This adds group decode support for opcode 0x8f. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:14 +03:00
Avi Kivity	e09d082c03	KVM: x86 emulator: add support for group decoding Certain x86 instructions use bits 3:5 of the byte following the opcode as an opcode extension, with the decode sometimes depending on bits 6:7 as well. Add support for this in the main decoding table rather than an ad-hock adaptation per opcode. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:14 +03:00
Dong, Eddie	1ae0a13def	KVM: MMU: Simplify hash table indexing Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:14 +03:00
Dong, Eddie	489f1d6526	KVM: MMU: Update shadow ptes on partial guest pte writes A guest partial guest pte write will leave shadow_trap_nonpresent_pte in spte, which generates a vmexit at the next guest access through that pte. This patch improves this by reading the full guest pte in advance and thus being able to update the spte and eliminate the vmexit. This helps pae guests which use two 32-bit writes to set a single 64-bit pte. [truncation fix by Eric] Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 11:53:13 +03:00
Avi Kivity	e48bb497b9	KVM: MMU: Fix memory leak on guest demand faults While backporting `72dc67a696`, a gfn_to_page() call was duplicated instead of moved (due to an unrelated patch not being present in mainline). This caused a page reference leak, resulting in a fairly massive memory leak. Fix by removing the extraneous gfn_to_page() call. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-25 10:22:17 +02:00
Marcelo Tosatti	707a18a51d	KVM: VMX: convert init_rmode_tss() to slots_lock init_rmode_tss was forgotten during the conversion from mmap_sem to slots_lock. INFO: task qemu-system-x86:3748 blocked for more than 120 seconds. Call Trace: [<ffffffff8053d100>] __down_read+0x86/0x9e [<ffffffff8053fb43>] do_page_fault+0x346/0x78e [<ffffffff8053d235>] trace_hardirqs_on_thunk+0x35/0x3a [<ffffffff8053dcad>] error_exit+0x0/0xa9 [<ffffffff8035a7a7>] copy_user_generic_string+0x17/0x40 [<ffffffff88099a8a>] :kvm:kvm_write_guest_page+0x3e/0x5f [<ffffffff880b661a>] :kvm_intel:init_rmode_tss+0xa7/0xf9 [<ffffffff880b7d7e>] :kvm_intel:vmx_vcpu_reset+0x10/0x38a [<ffffffff8809b9a5>] :kvm:kvm_arch_vcpu_setup+0x20/0x53 [<ffffffff8809a1e4>] :kvm:kvm_vm_ioctl+0xad/0x1cf [<ffffffff80249dea>] __lock_acquire+0x4f7/0xc28 [<ffffffff8028fad9>] vfs_ioctl+0x21/0x6b [<ffffffff8028fd75>] do_vfs_ioctl+0x252/0x26b [<ffffffff8028fdca>] sys_ioctl+0x3c/0x5e [<ffffffff8020b01b>] system_call_after_swapgs+0x7b/0x80 Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-25 10:22:17 +02:00
Marcelo Tosatti	15aaa819e2	KVM: MMU: handle page removal with shadow mapping Do not assume that a shadow mapping will always point to the same host frame number. Fixes crash with madvise(MADV_DONTNEED). [avi: move after first printk(), add another printk()] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-25 10:22:17 +02:00
Avi Kivity	4b1a80fa65	KVM: MMU: Fix is_rmap_pte() with io ptes is_rmap_pte() doesn't take into account io ptes, which have the avail bit set. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-25 10:22:16 +02:00
Avi Kivity	5dc8326282	KVM: VMX: Restore tss even on x86_64 The vmx hardware state restore restores the tss selector and base address, but not its length. Usually, this does not matter since most of the tss contents is within the default length of 0x67. However, if a process is using ioperm() to grant itself I/O port permissions, an additional bitmap within the tss, but outside the default length is consulted. The effect is that the process will receive a SIGSEGV instead of transparently accessing the port. Fix by restoring the tss length. Note that i386 had this working already. Closes bugzilla 10246. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-25 10:22:16 +02:00
Avi Kivity	33f9c505ed	KVM: VMX: Avoid rearranging switched guest msrs while they are loaded KVM tries to run as much as possible with the guest msrs loaded instead of host msrs, since switching msrs is very expensive. It also tries to minimize the number of msrs switched according to the guest mode; for example, MSR_LSTAR is needed only by long mode guests. This optimization is done by setup_msrs(). However, we must not change which msrs are switched while we are running with guest msr state: - switch to guest msr state - call setup_msrs(), removing some msrs from the list - switch to host msr state, leaving a few guest msrs loaded An easy way to trigger this is to kexec an x86_64 linux guest. Early during setup, the guest will switch EFER to not include SCE. KVM will stop saving MSR_LSTAR, and on the next msr switch it will leave the guest LSTAR loaded. The next host syscall will end up in a random location in the kernel. Fix by reloading the host msrs before changing the msr list. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:50 +02:00
Avi Kivity	f7d9c7b7b9	KVM: MMU: Fix race when instantiating a shadow pte For improved concurrency, the guest walk is performed concurrently with other vcpus. This means that we need to revalidate the guest ptes once we have write-protected the guest page tables, at which point they can no longer be modified. The current code attempts to avoid this check if the shadow page table is not new, on the assumption that if it has existed before, the guest could not have modified the pte without the shadow lock. However the assumption is incorrect, as the racing vcpu could have modified the pte, then instantiated the shadow page, before our vcpu regains control: vcpu0 vcpu1 fault walk pte modify pte fault in same pagetable instantiate shadow page lookup shadow page conclude it is old instantiate spte based on stale guest pte We could do something clever with generation counters, but a test run by Marcelo suggests this is unnecessary and we can just do the revalidation unconditionally. The pte will be in the processor cache and the check can be quite fast. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:49 +02:00
Avi Kivity	0b975a3c2d	KVM: Avoid infinite-frequency local apic timer If the local apic initial count is zero, don't start a an hrtimer with infinite frequency, locking up the host. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:48 +02:00
Marcelo Tosatti	24993d5349	KVM: make MMU_DEBUG compile again the cr3 variable is now inside the vcpu->arch structure. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:47 +02:00
Marcelo Tosatti	5e4a0b3c1b	KVM: move alloc_apic_access_page() outside of non-preemptable region alloc_apic_access_page() can sleep, while vmx_vcpu_setup is called inside a non preemptable region. Move it after put_cpu(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:46 +02:00
Joerg Roedel	a2938c8070	KVM: SVM: fix Windows XP 64 bit installation crash While installing Windows XP 64 bit wants to access the DEBUGCTL and the last branch record (LBR) MSRs. Don't allowing this in KVM causes the installation to crash. This patch allow the access to these MSRs and fixes the issue. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:45 +02:00
Izik Eidus	72dc67a696	KVM: remove the usage of the mmap_sem for the protection of the memory slots. This patch replaces the mmap_sem lock for the memory slots with a new kvm private lock, it is needed beacuse untill now there were cases where kvm accesses user memory while holding the mmap semaphore. Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-04 15:19:40 +02:00
Joerg Roedel	c7ac679c16	KVM: emulate access to MSR_IA32_MCG_CTL Injecting an GP when accessing this MSR lets Windows crash when running some stress test tools in KVM. So this patch emulates access to this MSR. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-03 11:22:37 +02:00
Avi Kivity	674eea0fc4	KVM: Make the supported cpuid list a host property rather than a vm property One of the use cases for the supported cpuid list is to create a "greatest common denominator" of cpu capabilities in a server farm. As such, it is useful to be able to get the list without creating a virtual machine first. Since the code does not depend on the vm in any way, all that is needed is to move it to the device ioctl handler. The capability identifier is also changed so that binaries made against -rc1 will fail gracefully. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-03 11:22:25 +02:00
Paul Knowles	d730616384	KVM: Fix kvm_arch_vcpu_ioctl_set_sregs so that set_cr0 works properly Whilst working on getting a VM to initialize in to IA32e mode I found this issue. set_cr0 relies on comparing the old cr0 to the new one to work correctly. Move the assignment below so the compare can work. Signed-off-by: Paul Knowles <paul@transitive.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-03 11:22:14 +02:00
Joerg Roedel	6b390b6392	KVM: SVM: set NM intercept when enabling CR0.TS in the guest Explicitly enable the NM intercept in svm_set_cr0 if we enable TS in the guest copy of CR0 for lazy FPU switching. This fixes guest SMP with Linux under SVM. Without that patch Linux deadlocks or panics right after trying to boot the other CPUs. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-03 11:20:21 +02:00
Joerg Roedel	334df50a86	KVM: SVM: Fix lazy FPU switching If the guest writes to cr0 and leaves the TS flag at 0 while vcpu->fpu_active is also 0, the TS flag in the guest's cr0 gets lost. This leads to corrupt FPU state an causes Windows Vista 64bit to crash very soon after boot. This patch fixes this bug. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-03-03 11:18:18 +02:00
Andrew Morton	c0b49b0d16	kvm: i386 fix arch/x86/kvm/x86.c: In function 'emulator_cmpxchg_emulated': arch/x86/kvm/x86.c:1746: warning: passing argument 2 of 'vcpu->arch.mmu.gva_to_gpa' makes integer from pointer without a cast arch/x86/kvm/x86.c:1746: warning: 'addr' is used uninitialized in this function Is true. Local variable `addr' shadows incoming arg `addr'. Avi is on vacation for a while, so... Cc: Avi Kivity <avi@qumranet.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:06 -08:00
Anthony Liguori	0ad07ec1fd	virtio: Put the virtio under the virtualization menu This patch moves virtio under the virtualization menu and changes virtio devices to not claim to only be for lguest. Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-02-04 23:50:05 +11:00
Avi Kivity	2f52d58c92	KVM: Move apic timer migration away from critical section Migrating the apic timer in the critical section is not very nice, and is absolutely horrible with the real-time port. Move migration to the regular vcpu execution path, triggered by a new bitflag. Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:22 +02:00
Avi Kivity	6c14280125	KVM: Fix unbounded preemption latency When preparing to enter the guest, if an interrupt comes in while preemption is disabled but interrupts are still enabled, we miss a preemption point. Fix by explicitly checking whether we need to reschedule. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:22 +02:00
Avi Kivity	97db56ce6c	KVM: Initialize the mmu caches only after verifying cpu support Otherwise we re-initialize the mmu caches, which will fail since the caches are already registered, which will cause us to deinitialize said caches. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:22 +02:00
Izik Eidus	75e68e6078	KVM: MMU: Fix dirty page setting for pages removed from rmap Right now rmap_remove won't set the page as dirty if the shadow pte pointed to this page had write access and then it became readonly. This patches fixes that, by setting the page as dirty for spte changes from write to readonly access. Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:22 +02:00
Sheng Yang	571008dacc	KVM: x86 emulator: Only allow VMCALL/VMMCALL trapped by #UD When executing a test program called "crashme", we found the KVM guest cannot survive more than ten seconds, then encounterd kernel panic. The basic concept of "crashme" is generating random assembly code and trying to execute it. After some fixes on emulator insn validity judgment, we found it's hard to get the current emulator handle the invalid instructions correctly, for the #UD trap for hypercall patching caused troubles. The problem is, if the opcode itself was OK, but combination of opcode and modrm_reg was invalid, and one operand of the opcode was memory (SrcMem or DstMem), the emulator will fetch the memory operand first rather than checking the validity, and may encounter an error there. For example, ".byte 0xfe, 0x34, 0xcd" has this problem. In the patch, we simply check that if the invalid opcode wasn't vmcall/vmmcall, then return from emulate_instruction() and inject a #UD to guest. With the patch, the guest had been running for more than 12 hours. Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:21 +02:00
Dong, Eddie	5882842f9b	KVM: MMU: Merge shadow level check in FNAME(fetch) Remove the redundant level check when fetching shadow pte for present & non-present spte. Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:21 +02:00
Avi Kivity	eb787d10af	KVM: MMU: Move kvm_free_some_pages() into critical section If some other cpu steals mmu pages between our check and an attempt to allocate, we can run out of mmu pages. Fix by moving the check into the same critical section as the allocation. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:21 +02:00
Marcelo Tosatti	aaee2c94f7	KVM: MMU: Switch to mmu spinlock Convert the synchronization of the shadow handling to a separate mmu_lock spinlock. Also guard fetch() by mmap_sem in read-mode to protect against alias and memslot changes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:21 +02:00
Avi Kivity	d7824fff89	KVM: MMU: Avoid calling gfn_to_page() in mmu_set_spte() Since gfn_to_page() is a sleeping function, and we want to make the core mmu spinlocked, we need to pass the page from the walker context (which can sleep) to the shadow context (which cannot). [marcelo: avoid recursive locking of mmap_sem] Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:21 +02:00
Marcelo Tosatti	7ec5458821	KVM: Add kvm_read_guest_atomic() In preparation for a mmu spinlock, add kvm_read_guest_atomic() and use it in fetch() and prefetch_page(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:20 +02:00
Marcelo Tosatti	10589a4699	KVM: MMU: Concurrent guest walkers Do not hold kvm->lock mutex across the entire pagefault code, only acquire it in places where it is necessary, such as mmu hash list, active list, rmap and parent pte handling. Allow concurrent guest walkers by switching walk_addr() to use mmap_sem in read-mode. And get rid of the lockless __gfn_to_page. [avi: move kvm_mmu_pte_write() locking inside the function] [avi: add locking for real mode] [avi: fix cmpxchg locking] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:20 +02:00
Avi Kivity	774ead3ad9	KVM: Disable vapic support on Intel machines with FlexPriority FlexPriority accelerates the tpr without any patching. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:20 +02:00
Avi Kivity	b93463aa59	KVM: Accelerated apic support This adds a mechanism for exposing the virtual apic tpr to the guest, and a protocol for letting the guest update the tpr without causing a vmexit if conditions allow (e.g. there is no interrupt pending with a higher priority than the new tpr). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:20 +02:00
Avi Kivity	b209749f52	KVM: local APIC TPR access reporting facility Add a facility to report on accesses to the local apic tpr even if the local apic is emulated in the kernel. This is basically a hack that allows userspace to patch Windows which tends to bang on the tpr a lot. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:20 +02:00
Avi Kivity	565f1fbd9d	KVM: Print data for unimplemented wrmsr This can help diagnosing what the guest is trying to do. In many cases we can get away with partial emulation of msrs. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:20 +02:00
Avi Kivity	dfc5aa00cb	KVM: MMU: Add cache miss statistic Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:19 +02:00
Eddie Dong	caa5b8a5ed	KVM: MMU: Coalesce remote tlb flushes Host side TLB flush can be merged together if multiple spte need to be write-protected. Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:19 +02:00
Zhang Xiantao	5736199afb	KVM: Move kvm_vcpu_kick() to x86.c Moving kvm_vcpu_kick() to x86.c. Since it should be common for all archs, put its declarations in <linux/kvm_host.h> Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:19 +02:00
Zhang Xiantao	0eb8f49848	KVM: Move ioapic code to common directory. Move ioapic code to common, since IA64 also needs it. Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:19 +02:00
Zhang Xiantao	82470196fa	KVM: Move irqchip declarations into new ioapic.h and lapic.h This allows reuse of ioapic in ia64. Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:19 +02:00
Avi Kivity	0fce5623ba	KVM: Move drivers/kvm/* to virt/kvm/ Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:18 +02:00
Avi Kivity	edf884172e	KVM: Move arch dependent files to new directory arch/x86/kvm/ This paves the way for multiple architecture support. Note that while ioapic.c could potentially be shared with ia64, it is also moved. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-01-30 18:01:18 +02:00

... 16 17 18 19 20 ...

1738 Commits