commit fc5b7f3bf1 upstream.
An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
- the guest's xcr0 register is loaded on the cpu
- the guest's fpu context is not loaded
- the host is using eagerfpu
Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".
Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:
if (irq_fpu_usable()) {
kernel_fpu_begin();
[... code that uses the fpu ...]
kernel_fpu_end();
}
As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.
kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.
kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.
Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David Matlack <dmatlack@google.com>
[Move load after goto cancel_injection. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 321c5658c5 upstream.
Non maskable interrupts (NMI) are preferred to interrupts in current
implementation. If a NMI is pending and NMI is blocked by the result
of nmi_allowed(), pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupts injection is
allowed.
In old kernel (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to execute iret that intends end
of NMI. The flag of blocking new NMI is not cleared until the guest
execute the iret, and interrupts are blocked by pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until block is cleared by some events (e.g. canceling injection).
This patch injects pending interrupts, when it's allowed, even if NMI
is blocked. And, If an interrupts is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
NMI pending counter.
Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ef697a712a upstream.
Old KVM guests invoke single-context invvpid without actually checking
whether it is supported. This was fixed by commit 518c8ae ("KVM: VMX:
Make sure single type invvpid is supported before issuing invvpid
instruction", 2010-08-01) and the patch after, but pre-2.6.36
kernels lack it including RHEL 6.
Reported-by: jmontleo@redhat.com
Tested-by: jmontleo@redhat.com
Fixes: 99b83ac893
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7dd0fdff14 upstream.
Discard policy uses ack_notifiers to prevent injection of PIT interrupts
before EOI from the last one.
This patch changes the policy to always try to deliver the interrupt,
which makes a difference when its vector is in ISR.
Old implementation would drop the interrupt, but proposed one injects to
IRR, like real hardware would.
The old policy breaks legacy NMI watchdogs, where PIT is used through
virtual wire (LVT0): PIT never sends an interrupt before receiving EOI,
thus a guest deadlock with disabled interrupts will stop NMIs.
Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt
through IOAPIC. (KVM's PIT is deeply rotten and luckily not used much
in modern systems.)
Even though there is a chance of regressions, I think we can fix the
LVT0 NMI bug without introducing a new tick policy.
Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5f0b819995 upstream.
KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
CR0.WP=1. These pages' SPTEs flip continuously between two states:
U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
When SMEP is in effect, however, U=0 will enable kernel execution of
this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
When guest EFER has the NX bit cleared, the reserved bit check thinks
that the latter state is invalid; teach it that the smep_andnot_wp case
will also use the NX bit of SPTEs.
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.inel.com>
Fixes: c258b62b26
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 844a5fe219 upstream.
Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.
KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
specially when pte.u=1/pte.w=0/CR0.WP=0. Such writes cause a fault
when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
restarts execution. This will still cause a user write to fault, while
supervisor writes will succeed. User reads will fault spuriously now,
and KVM will then flip U and W again in the SPTE (U=1, W=0). User reads
will be enabled and supervisor writes disabled, going back to the
originary situation where supervisor writes fault spuriously.
When SMEP is in effect, however, U=0 will enable kernel execution of
this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0. If the guest has not enabled NX, the result is a continuous
stream of page faults due to the NX bit being reserved.
The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
switch. (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
control, so they do not use user-return notifiers for EFER---if they did,
EFER.NX would be forced to the same value as the host).
There is another bug in the reserved bit check, which I've split to a
separate patch for easier application to stable kernels.
Cc: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: f6577a5fa1
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7099e2e1f4 upstream.
Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
would crash if you decided to run a host command that uses PEBS, like
perf record -e 'cpu/mem-stores/pp' -a
This happens because KVM is using VMX MSR switching to disable PEBS, but
SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
isn't safe:
When software needs to reconfigure PEBS facilities, it should allow a
quiescent period between stopping the prior event counting and setting
up a new PEBS event. The quiescent period is to allow any latent
residual PEBS records to complete its capture at their previously
specified buffer address (provided by IA32_DS_AREA).
There might not be a quiescent period after the MSR switch, so a CPU
ends up using host's MSR_IA32_DS_AREA to access an area in guest's
memory. (Or MSR switching is just buggy on some models.)
The guest can learn something about the host this way:
If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results
in #PF where we leak host's MSR_IA32_DS_AREA through CR2.
After that, a malicious guest can map and configure memory where
MSR_IA32_DS_AREA is pointing and can therefore get an output from
host's tracing.
This is not a critical leak as the host must initiate with PEBS tracing
and I have not been able to get a record from more than one instruction
before vmentry in vmx_vcpu_run() (that place has most registers already
overwritten with guest's).
We could disable PEBS just few instructions before vmentry, but
disabling it earlier shouldn't affect host tracing too much.
We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
optimization isn't worth its code, IMO.
(If you are implementing PEBS for guests, be sure to handle the case
where both host and guest enable PEBS, because this patch doesn't.)
Fixes: 26a4f3c08d ("perf/x86: disable PEBS on a guest entry.")
Reported-by: Jiří Olša <jolsa@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 70e4da7a8f upstream.
Commit 172b2386ed ("KVM: x86: fix missed hardware breakpoints",
2016-02-10) worked around a case where the debug registers are not loaded
correctly on preemption and on the first entry to KVM_RUN.
However, Xiao Guangrong pointed out that the root cause must be that
KVM_DEBUGREG_BP_ENABLED is not being set correctly. This can indeed
happen due to the lazy debug exit mechanism, which does not call
kvm_update_dr7. Fix it by replacing the existing loop (more or less
equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr*
functions.
Fixes: 172b2386ed
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2680d6da45 upstream.
vmx.c writes the TSC_MULTIPLIER field in vmx_vcpu_load, but only when a
vcpu has migrated physical cpus. Record the last value written and
update in vmx_vcpu_load on any change, otherwise a cpu migration must
occur for TSC frequency scaling to take effect.
Fixes: ff2c3a1803
Signed-off-by: Owen Hofmann <osh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0c1d77f4ba upstream.
Commit e8dd2d2d64 ("Silence compiler warning in arch/x86/kvm/emulate.c",
2015-09-06) broke boot of the Hurd. The bug is that the "default:"
case actually could modify "la", but after the patch this change is
not reflected in *linear.
The bug is visible whenever a non-zero segment base causes the linear
address to wrap around the 4GB mark.
Fixes: e8dd2d2d64
Reported-by: Aurelien Jarno <aurelien@aurel32.net>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 45bdbcfdf2 upstream.
vmx_cpuid_tries to update SECONDARY_VM_EXEC_CONTROL in the VMCS, but
it will cause a vmwrite error on older CPUs because the code does not
check for the presence of CPU_BASED_ACTIVATE_SECONDARY_CONTROLS.
This will get rid of the following trace on e.g. Core2 6600:
vmwrite error: reg 401e value 10 (err 12)
Call Trace:
[<ffffffff8116e2b9>] dump_stack+0x40/0x57
[<ffffffffa020b88d>] vmx_cpuid_update+0x5d/0x150 [kvm_intel]
[<ffffffffa01d8fdc>] kvm_vcpu_ioctl_set_cpuid2+0x4c/0x70 [kvm]
[<ffffffffa01b8363>] kvm_arch_vcpu_ioctl+0x903/0xfa0 [kvm]
Fixes: feda805fe7
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aba2f06c07 upstream.
Poor #AC was so unimportant until a few days ago that we were
not even tracing its name correctly. But now it's all over
the place.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9dbe6cf941 upstream.
If we do not do this, it is not properly saved and restored across
migration. Windows notices due to its self-protection mechanisms,
and is very upset about it (blue screen of death).
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While setting the KVM PIT counters in 'kvm_pit_load_count', if
'hpet_legacy_start' is set, the function disables the timer on
channel[0], instead of the respective index 'channel'. This is
because channels 1-3 are not linked to the HPET. Fix the caller
to only activate the special HPET processing for channel 0.
Reported-by: P J P <pjp@fedoraproject.org>
Fixes: 0185604c2d
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently if userspace restores the pit counters with a count of 0
on channels 1 or 2 and the guest attempts to read the count on those
channels, then KVM will perform a mod of 0 and crash. This will ensure
that 0 values are converted to 65536 as per the spec.
This is CVE-2015-7513.
Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Virtual machines can be run with CPUID such that there are no MTRRs.
In that case, the firmware will never enable MTRRs and it is obviously
undesirable to run the guest entirely with UC memory. Check out guest
CPUID, and use WB memory if MTRR do not exist.
Cc: qemu-stable@nongnu.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conversion of MTRRs to ranges used the maxphyaddr from the boot CPU.
This is wrong, because var_mtrr_range's mask variable then is discontiguous
(like FF00FFFF000, where the first run of 0s corresponds to the bits
between host and guest maxphyaddr). Instead always set up the masks
to be full 64-bit values---we know that the reserved bits at the top
are zero, and we can restore them when reading the MSR. This way
var_mtrr_range gets a mask that just works.
Fixes: a13842dc66
Cc: qemu-stable@nongnu.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current handling of accesses to guest MSR_TSC_AUX returns error if
vcpu does not support rdtscp, though those accesses are initiated by
host. This can result in the reboot failure of some versions of
QEMU. This patch fixes this issue by passing those host initiated
accesses for further handling instead.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch removes the vpid check when emulating nested invvpid
instruction of type all-contexts invalidation. The existing code is
incorrect because:
(1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
Translations Based on VPID", invvpid instruction does not check
vpid in the invvpid descriptor when its type is all-contexts
invalidation.
(2) According to the same document, invvpid of type all-contexts
invalidation does not require there is an active VMCS, so/and
get_vmcs12() in the existing code may result in a NULL-pointer
dereference. In practice, it can crash both KVM itself and L1
hypervisors that use invvpid (e.g. Xen).
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before this patch, we incorrectly enter the guest without requesting an
interrupt window if the IRQ chip is split between user space and the
kernel.
Because lapic_in_kernel no longer implies the PIC is in the kernel, this
patch tests pic_in_kernel to determining whether an interrupt window
should be requested when entering the guest.
If the APIC is in the kernel and we request an interrupt window the
guest will return immediately. If the APIC is masked the guest will not
not make forward progress and unmask it, leading to a loop when KVM
reenters and requests again. This patch adds a check to ensure the APIC
is ready to accept an interrupt before requesting a window.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
[Use the other newly introduced functions. - Paolo]
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set KVM_REQ_EVENT when a PIC in user space injects a local interrupt.
Currently a request is only made when neither the PIC nor the APIC is in
the kernel, which is not sufficient in the split IRQ chip case.
This addresses a problem in QEMU where interrupts are delayed until
another path invokes the event loop.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch breaks out a new function kvm_vcpu_ready_for_interrupt_injection.
This routine encapsulates the logic required to determine whether a vcpu
is ready to accept an interrupt injection, which is now required on
multiple paths.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch ensures that dm_request_for_irq_injection and
post_kvm_run_save are in sync, avoiding that an endless ping-pong
between userspace (who correctly notices that IF=0) and
the kernel (who insists that userspace handles its request
for the interrupt window).
To synchronize them, it also adds checks for kvm_arch_interrupt_allowed
and !kvm_event_needs_reinjection. These are always needed, not
just for in-kernel LAPIC.
Signed-off-by: Matt Gingell <gingell@google.com>
[A collage of two patches from Matt. - Paolo]
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull second batch of kvm updates from Paolo Bonzini:
"Four changes:
- x86: work around two nasty cases where a benign exception occurs
while another is being delivered. The endless stream of exceptions
causes an infinite loop in the processor, which not even NMIs or
SMIs can interrupt; in the virt case, there is no possibility to
exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by
AMD, the patches mostly move things from there to common
arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and
exit paths when context tracking is enabled. The patches are a few
months old, but we discussed them again at kernel summit. Andy
will pick up from here and, in 4.5, try to remove it from the user
entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: x86: rename update_db_bp_intercept to update_bp_intercept
KVM: svm: unconditionally intercept #DB
KVM: x86: work around infinite loop in microcode when #AC is delivered
context_tracking: avoid irq_save/irq_restore on guest entry and exit
context_tracking: remove duplicate enabled check
KVM: VMX: Dump TSC multiplier in dump_vmcs()
KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
KVM: VMX: Enable and initialize VMX TSC scaling
KVM: x86: Use the correct vcpu's TSC rate to compute time scale
KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
KVM: x86: Replace call-back compute_tsc_offset() with a common function
KVM: x86: Replace call-back set_tsc_khz() with a common function
KVM: x86: Add a common TSC scaling function
KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
KVM: x86: Collect information for setting TSC scaling ratio
KVM: x86: declare a few variables as __read_mostly
KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
KVM: PPC: Book3S HV: Don't dynamically split core when already split
...
Because #DB is now intercepted unconditionally, this callback
only operates on #BP for both VMX and SVM.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is needed to avoid the possibility that the guest triggers
an infinite stream of #DB exceptions (CVE-2015-8104).
VMX is not affected: because it does not save DR6 in the VMCS,
it already intercepts #DB unconditionally.
Reported-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It was found that a guest can DoS a host by triggering an infinite
stream of "alignment check" (#AC) exceptions. This causes the
microcode to enter an infinite loop where the core never receives
another interrupt. The host kernel panics pretty quickly due to the
effects (CVE-2015-5307).
Signed-off-by: Eric Northup <digitaleric@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch enhances dump_vmcs() to dump the value of TSC multiplier
field in VMCS.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch makes kvm-intel to return a scaled host TSC plus the TSC
offset when handling guest readings to MSR_IA32_TSC.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch makes kvm-intel module to load TSC scaling ratio into TSC
multiplier field of VMCS when a vcpu is loaded, so that TSC scaling
ratio can take effect if VMX TSC scaling is enabled.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch exhances kvm-intel module to enable VMX TSC scaling and
collects information of TSC scaling ratio during initialization.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch makes KVM use virtual_tsc_khz rather than the host TSC rate
as vcpu's TSC rate to compute the time scale if TSC scaling is enabled.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM scales the host TSC in the same way in call-back
read_l1_tsc(), so this patch moves the scaling logic from call-back
read_l1_tsc() to a common function kvm_read_l1_tsc().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For both VMX and SVM, if the 2nd argument of call-back
adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale
it first. This patch moves this common TSC scaling logic to its caller
adjust_tsc_offset_host() and rename the call-back adjust_tsc_offset() to
adjust_tsc_offset_guest().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM calculate the tsc-offset in the same way, so this
patch removes the call-back compute_tsc_offset() and replaces it with a
common function kvm_compute_tsc_offset().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM propagate virtual_tsc_khz in the same way, so this
patch removes the call-back set_tsc_khz() and replaces it with a common
function.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX and SVM calculate the TSC scaling ratio in a similar logic, so this
patch generalizes it to a common TSC scaling function.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
[Inline the multiplication and shift steps into mul_u64_u64_shr. Remove
BUG_ON. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves the field of TSC scaling ratio from the architecture
struct vcpu_svm to the common struct kvm_vcpu_arch.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The number of bits of the fractional part of the 64-bit TSC scaling
ratio in VMX and SVM is different. This patch makes the architecture
code to collect the number of fractional bits and other related
information into variables that can be accessed in the common code.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They are exactly the same, except that handle_mmio_page_fault
has an unused argument and a call to WARN_ON. Remove the unused
argument from the callers, and move the warning to (the former)
handle_mmio_page_fault_common.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.
s390:
A bunch of fixes and optimizations for interrupt and time handling.
PPC:
Mostly bug fixes.
ARM:
No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86:
Quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic
interrupt controller will have to wait for 4.5. These will let
KVM expose Hyper-V devices.
- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...
I found PML was broken since below commit:
commit feda805fe7
Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Date: Wed Sep 9 14:05:55 2015 +0800
KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update
Unify the update in vmx_cpuid_update()
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The reason is in above commit vmx_cpuid_update calls vmx_secondary_exec_control,
in which currently SECONDARY_EXEC_ENABLE_PML bit is cleared unconditionally (as
PML is enabled in creating vcpu). Therefore if vcpu_cpuid_update is called after
vcpu is created, PML will be disabled unexpectedly while log-dirty code still
thinks PML is used.
Fix this by clearing SECONDARY_EXEC_ENABLE_PML in vmx_secondary_exec_control
only when PML is not supported or not enabled (!enable_pml). This is more
reasonable as PML is currently either always enabled or disabled. With this
explicit updating SECONDARY_EXEC_ENABLE_PML in vmx_enable{disable}_pml is not
needed so also rename vmx_enable{disable}_pml to vmx_create{destroy}_pml_buffer.
Fixes: feda805fe7
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
[While at it, change a wrong ASSERT to an "if". The condition can happen
if creating the VCPU fails with ENOMEM. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>