Commit Graph

785 Commits

Author SHA1 Message Date
Sean Christopherson
60fc3d02d5 KVM: x86: Remove emulation_result enums, EMULATE_{DONE,FAIL,USER_EXIT}
Deferring emulation failure handling (in some cases) to the caller of
x86_emulate_instruction() has proven fragile, e.g. multiple instances of
KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it
being difficult to discern what emulation types can return what result,
and which combination of types and results are handled where.

Now that x86_emulate_instruction() always handles emulation failure,
i.e. EMULATION_FAIL is only referenced in callers, remove the
emulation_result enums entirely.  Per KVM's existing exit handling
conventions, return '0' and '1' for "exit to userspace" and "resume
guest" respectively.  Doing so cleans up many callers, e.g. they can
return kvm_emulate_instruction() directly instead of having to interpret
its result.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 14:34:00 +02:00
Sean Christopherson
1051778f6e KVM: x86: Handle emulation failure directly in kvm_task_switch()
Consolidate the reporting of emulation failure into kvm_task_switch()
so that it can return EMULATE_USER_EXIT.  This helps pave the way for
removing EMULATE_FAIL altogether.

This also fixes a theoretical bug where task switch interception could
suppress an EMULATE_USER_EXIT return.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 14:31:13 +02:00
Sean Christopherson
738fece46d KVM: x86: Exit to userspace on emulation skip failure
Kill a few birds with one stone by forcing an exit to userspace on skip
emulation failure.  This removes a reference to EMULATE_FAIL, fixes a
bug in handle_ept_misconfig() where it would exit to userspace without
setting run->exit_reason, and fixes a theoretical bug in SVM's
task_switch_interception() where it would overwrite run->exit_reason on
a return of EMULATE_USER_EXIT.

Note, this technically doesn't fully fix task_switch_interception()
as it now incorrectly handles EMULATE_FAIL, but in practice there is no
bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 14:31:07 +02:00
Sean Christopherson
42cbf06872 KVM: x86: Move #GP injection for VMware into x86_emulate_instruction()
Immediately inject a #GP when VMware emulation fails and return
EMULATE_DONE instead of propagating EMULATE_FAIL up the stack.  This
helps pave the way for removing EMULATE_FAIL altogether.

Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86
emulator is called to handle VMware #GP interception, e.g. why a #GP
is injected on emulation failure for EMULTYPE_VMWARE_GP.

Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type.  The "no #UD on fail"
is used only in the VMWare case and is obsoleted by having the emulator
itself reinject #GP.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 14:30:47 +02:00
Sean Christopherson
a6c6ed1e81 KVM: x86: Don't attempt VMWare emulation on #GP with non-zero error code
The VMware backdoor hooks #GP faults on IN{S}, OUT{S}, and RDPMC, none
of which generate a non-zero error code for their #GP.  Re-injecting #GP
instead of attempting emulation on a non-zero error code will allow a
future patch to move #GP injection (for emulation failure) into
kvm_emulate_instruction() without having to plumb in the error code.

Reviewed-and-tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 14:30:42 +02:00
Vitaly Kuznetsov
956e255c59 KVM: x86: svm: remove unneeded nested_enable_evmcs() hook
Since commit 5158917c7b ("KVM: x86: nVMX: Allow nested_enable_evmcs to
be NULL") the code in x86.c is prepared to see nested_enable_evmcs being
NULL and in VMX case it actually is when nesting is disabled. Remove the
unneeded stub from SVM code.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 13:37:16 +02:00
Linus Torvalds
fe38bd6862 * s390: ioctl hardening, selftests
* ARM: ITS translation cache; support for 512 vCPUs, various cleanups
 and bugfixes
 
 * PPC: various minor fixes and preparation
 
 * x86: bugfixes all over the place (posted interrupts, SVM, emulation
 corner cases, blocked INIT), some IPI optimizations
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdf7fdAAoJEL/70l94x66DJzkIAKDcuWXJB4Qtoto6yUvPiHZm
 LYkY/Dn1zulb/DhzrBoXFey/jZXwl9kxMYkVTefnrAl0fRwFGX+G1UYnQrtAL6Gr
 ifdTYdy3kZhXCnnp99QAantWDswJHo1THwbmHrlmkxS4MdisEaTHwgjaHrDRZ4/d
 FAEwW2isSonP3YJfTtsKFFjL9k2D4iMnwZ/R2B7UOaWvgnerZ1GLmOkilvnzGGEV
 IQ89IIkWlkKd4SKgq8RkDKlfW5JrLrSdTK2Uf0DvAxV+J0EFkEaR+WlLsqumra0z
 Eg3KwNScfQj0DyT0TzurcOxObcQPoMNSFYXLRbUu1+i0CGgm90XpF1IosiuihgU=
 =w6I3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "s390:
   - ioctl hardening
   - selftests

  ARM:
   - ITS translation cache
   - support for 512 vCPUs
   - various cleanups and bugfixes

  PPC:
   - various minor fixes and preparation

  x86:
   - bugfixes all over the place (posted interrupts, SVM, emulation
     corner cases, blocked INIT)
   - some IPI optimizations"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (75 commits)
  KVM: X86: Use IPI shorthands in kvm guest when support
  KVM: x86: Fix INIT signal handling in various CPU states
  KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode
  KVM: VMX: Stop the preemption timer during vCPU reset
  KVM: LAPIC: Micro optimize IPI latency
  kvm: Nested KVM MMUs need PAE root too
  KVM: x86: set ctxt->have_exception in x86_decode_insn()
  KVM: x86: always stop emulation on page fault
  KVM: nVMX: trace nested VM-Enter failures detected by H/W
  KVM: nVMX: add tracepoint for failed nested VM-Enter
  x86: KVM: svm: Fix a check in nested_svm_vmrun()
  KVM: x86: Return to userspace with internal error on unexpected exit reason
  KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code
  KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers
  doc: kvm: Fix return description of KVM_SET_MSRS
  KVM: X86: Tune PLE Window tracepoint
  KVM: VMX: Change ple_window type to unsigned int
  KVM: X86: Remove tailing newline for tracepoints
  KVM: X86: Trace vcpu_id for vmexit
  KVM: x86: Manually calculate reserved bits when loading PDPTRS
  ...
2019-09-18 09:49:13 -07:00
Liran Alon
4b9852f4f3 KVM: x86: Fix INIT signal handling in various CPU states
Commit cd7764fe9f ("KVM: x86: latch INITs while in system management mode")
changed code to latch INIT while vCPU is in SMM and process latched INIT
when leaving SMM. It left a subtle remark in commit message that similar
treatment should also be done while vCPU is in VMX non-root-mode.

However, INIT signals should actually be latched in various vCPU states:
(*) For both Intel and AMD, INIT signals should be latched while vCPU
is in SMM.
(*) For Intel, INIT should also be latched while vCPU is in VMX
operation and later processed when vCPU leaves VMX operation by
executing VMXOFF.
(*) For AMD, INIT should also be latched while vCPU runs with GIF=0
or in guest-mode with intercept defined on INIT signal.

To fix this:
1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
can define the various CPU states in which INIT signals should be
blocked and modify kvm_apic_accept_events() to use it.
2) Modify vmx_check_nested_events() to check for pending INIT signal
while vCPU in guest-mode. If so, emualte vmexit on
EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
but is currently left as a TODO comment to implement in the future
because nSVM don't yet implement svm_check_nested_events().

Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
guest (See nested_vmx_setup_ctls_msrs()).
If and when support for this activity state will be implemented,
kvm_check_nested_events() would need to avoid emulating vmexit on
INIT signal in case activity-state is wait-for-SIPI. In addition,
kvm_apic_accept_events() would need to be modified to avoid discarding
SIPI in case VMX activity-state is wait-for-SIPI but instead delay
SIPI processing to vmx_check_nested_events() that would clear
pending APIC events and emulate vmexit on SIPI.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11 18:11:45 +02:00
Dan Carpenter
a061985b81 x86: KVM: svm: Fix a check in nested_svm_vmrun()
We refactored this code a bit and accidentally deleted the "-" character
from "-EINVAL".  The kvm_vcpu_map() function never returns positive
EINVAL.

Fixes: c8e16b78c6 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11 17:28:01 +02:00
Liran Alon
7396d337cf KVM: x86: Return to userspace with internal error on unexpected exit reason
Receiving an unexpected exit reason from hardware should be considered
as a severe bug in KVM. Therefore, instead of just injecting #UD to
guest and ignore it, exit to userspace on internal error so that
it could handle it properly (probably by terminating guest).

In addition, prefer to use vcpu_unimpl() instead of WARN_ONCE()
as handling unexpected exit reason should be a rare unexpected
event (that was expected to never happen) and we prefer to print
a message on it every time it occurs to guest.

Furthermore, dump VMCS/VMCB to dmesg to assist diagnosing such cases.

Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11 15:42:45 +02:00
Sean Christopherson
1edce0a9eb KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code
Move RDMSR and WRMSR emulation into common x86 code to consolidate
nearly identical SVM and VMX code.

Note, consolidating RDMSR introduces an extra indirect call, i.e.
retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a
guest kernel likely has bigger problems if increasing the latency of
RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM
performance.  E.g. the only recurring RDMSR VM-Exits (after booting) on
my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via
arch_cpu_idle_enter().

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10 19:18:29 +02:00
Sean Christopherson
f20935d85a KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers
Refactor the top-level MSR accessors to take/return the index and value
directly instead of requiring the caller to dump them into a msr_data
struct.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10 19:18:14 +02:00
Peter Xu
4f75bcc332 KVM: X86: Tune PLE Window tracepoint
The PLE window tracepoint triggers even if the window is not changed,
and the wording can be a bit confusing too.  One example line:

  kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096)

It easily let people think of "the window now is 4096 which is
shrinked", but the truth is the value actually didn't change (4096).

Let's only dump this message if the value really changed, and we make
the message even simpler like:

  kvm_ple_window: vcpu 4 old 4096 new 8192 (growed)

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10 19:13:21 +02:00
Alexander Graf
fdcf756213 KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes
We can easily route hardware interrupts directly into VM context when
they target the "Fixed" or "LowPriority" delivery modes.

However, on modes such as "SMI" or "Init", we need to go via KVM code
to actually put the vCPU into a different mode of operation, so we can
not post the interrupt

Add code in the VMX and SVM PI logic to explicitly refuse to establish
posted mappings for advanced IRQ deliver modes. This reflects the logic
in __apic_accept_irq() which also only ever passes Fixed and LowPriority
interrupts as posted interrupts into the guest.

This fixes a bug I have with code which configures real hardware to
inject virtual SMIs into my guest.

Signed-off-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10 16:39:34 +02:00
Vitaly Kuznetsov
ea1529873a KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when kvm_intel.nested is disabled
If kvm_intel is loaded with nested=0 parameter an attempt to perform
KVM_GET_SUPPORTED_HV_CPUID results in OOPS as nested_get_evmcs_version hook
in kvm_x86_ops is NULL (we assign it in nested_vmx_hardware_setup() and
this only happens in case nested is enabled).

Check that kvm_x86_ops->nested_get_evmcs_version is not NULL before
calling it. With this, we can remove the stub from svm as it is no
longer needed.

Cc: <stable@vger.kernel.org>
Fixes: e2e871ab2f ("x86/kvm/hyper-v: Introduce nested_get_evmcs_version() helper")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2019-08-27 20:59:04 +02:00
Sean Christopherson
12b58f4ed2 KVM: Assert that struct kvm_vcpu is always as offset zero
KVM implementations that wrap struct kvm_vcpu with a vendor specific
struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
will instead overlap random chunks of the vendor specific struct.
E.g. padding a large number of bytes before struct kvm_vcpu triggers
a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:27 +02:00
Vitaly Kuznetsov
c8e16b78c6 x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()
Just like we do with other intercepts, in vmrun_interception() we should be
doing kvm_skip_emulated_instruction() and not just RIP += 3. Also, it is
wrong to increment RIP before nested_svm_vmrun() as it can result in
kvm_inject_gp().

We can't call kvm_skip_emulated_instruction() after nested_svm_vmrun() so
move it inside.

Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:22 +02:00
Vitaly Kuznetsov
e7134c1bb5 x86: KVM: svm: eliminate weird goto from vmrun_interception()
Regardless of whether or not nested_svm_vmrun_msrpm() fails, we return 1
from vmrun_interception() so there's no point in doing goto. Also,
nested_svm_vmrun_msrpm() call can be made from nested_svm_vmrun() where
other nested launch issues are handled.

nested_svm_vmrun() returns a bool, however, its result is ignored in
vmrun_interception() as we always return '1'. As a preparatory change
to putting kvm_skip_emulated_instruction() inside nested_svm_vmrun()
make nested_svm_vmrun() return an int (always '1' for now).

Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:22 +02:00
Vitaly Kuznetsov
c4762fdab5 x86: KVM: svm: remove hardcoded instruction length from intercepts
Various intercepts hard-code the respective instruction lengths to optimize
skip_emulated_instruction(): when next_rip is pre-set we skip
kvm_emulate_instruction(vcpu, EMULTYPE_SKIP). The optimization is, however,
incorrect: different (redundant) prefixes could be used to enlarge the
instruction. We can't really avoid decoding.

svm->next_rip is not used when CPU supports 'nrips' (X86_FEATURE_NRIPS)
feature: next RIP is provided in VMCB. The feature is not really new
(Opteron G3s had it already) and the change should have zero affect.

Remove manual svm->next_rip setting with hard-coded instruction lengths.
The only case where we now use svm->next_rip is EXIT_IOIO: the instruction
length is provided to us by hardware.

Hardcoded RIP advancement remains in vmrun_interception(), this is going to
be taken care of separately.

Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:21 +02:00
Vitaly Kuznetsov
02d4160fbd x86: KVM: add xsetbv to the emulator
To avoid hardcoding xsetbv length to '3' we need to support decoding it in
the emulator.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:20 +02:00
Vitaly Kuznetsov
f8ea7c6049 x86: kvm: svm: propagate errors from skip_emulated_instruction()
On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
Currently, we only do printk(KERN_DEBUG) when this happens and this
is not ideal. Propagate the error up the stack.

On VMX, skip_emulated_instruction() doesn't fail, we have two call
sites calling it explicitly: handle_exception_nmi() and
handle_task_switch(), we can just ignore the result.

On SVM, we also have two explicit call sites:
svm_queue_exception() and it seems we don't need to do anything there as
we check if RIP was advanced or not. In task_switch_interception(),
however, we are better off not proceeding to kvm_task_switch() in case
skip_emulated_instruction() failed.

Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:19 +02:00
Vitaly Kuznetsov
05402f6454 x86: KVM: svm: don't pretend to advance RIP in case wrmsr_interception() results in #GP
svm->next_rip is only used by skip_emulated_instruction() and in case
kvm_set_msr() fails we rightfully don't do that. Move svm->next_rip
advancement to 'else' branch to avoid creating false impression that
it's always advanced (and make it look like rdmsr_interception()).

This is a preparatory change to removing hardcoded RIP advancement
from instruction intercepts, no functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:18 +02:00
Paolo Bonzini
50896de4be KVM: x86: always expose VIRT_SSBD to guests
Even though it is preferrable to use SPEC_CTRL (represented by
X86_FEATURE_AMD_SSBD) instead of VIRT_SPEC, VIRT_SPEC is always
supported anyway because otherwise it would be impossible to
migrate from old to new CPUs.  Make this apparent in the
result of KVM_GET_SUPPORTED_CPUID as well.

However, we need to hide the bit on Intel processors, so move
the setting to svm_set_supported_cpuid.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:07 +02:00
Miaohe Lin
c8e174b398 KVM: x86: svm: remove redundant assignment of var new_entry
new_entry is reassigned a new value next line. So
it's redundant and remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-14 16:28:36 +02:00
Wanpeng Li
17e433b543 KVM: Fix leak vCPU's VMCS value into other pCPU
After commit d73eb57b80 (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
in the VMs after stress testing:

 INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
 Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21

swait_active() sustains to be true before finish_swait() is called in
kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
by kvm_vcpu_on_spin() loop greatly increases the probability condition
kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
VMCS.

This patch fixes it by checking conservatively a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05 12:55:47 +02:00
Wanpeng Li
d9a710e5fc KVM: X86: Dynamically allocate user_fpu
After reverting commit 240c35a378 (kvm: x86: Use task structs fpu field
for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3)
is the order at which allocations are deemed costly to service. In serveless
scenario, one host can service hundreds/thoudands firecracker/kata-container
instances, howerver, new instance will fail to launch after memory is too
fragmented to allocate kvm_vcpu struct on host, this was observed in some
cloud provider product environments.

This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
Skylake server.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22 13:55:48 +02:00
Liran Alon
118154bdf5 KVM: SVM: Fix detection of AMD Errata 1096
When CPU raise #NPF on guest data access and guest CR4.SMAP=1, it is
possible that CPU microcode implementing DecodeAssist will fail
to read bytes of instruction which caused #NPF. This is AMD errata
1096 and it happens because CPU microcode reading instruction bytes
incorrectly attempts to read code as implicit supervisor-mode data
accesses (that is, just like it would read e.g. a TSS), which are
susceptible to SMAP faults. The microcode reads CS:RIP and if it is
a user-mode address according to the page tables, the processor
gives up and returns no instruction bytes.  In this case,
GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly
return 0 instead of the correct guest instruction bytes.

Current KVM code attemps to detect and workaround this errata, but it
has multiple issues:

1) It mistakenly checks if guest CR4.SMAP=0 instead of guest CR4.SMAP=1,
which is required for encountering a SMAP fault.

2) It assumes SMAP faults can only occur when guest CPL==3.
However, in case guest CR4.SMEP=0, the guest can execute an instruction
which reside in a user-accessible page with CPL<3 priviledge. If this
instruction raise a #NPF on it's data access, then CPU DecodeAssist
microcode will still encounter a SMAP violation.  Even though no sane
OS will do so (as it's an obvious priviledge escalation vulnerability),
we still need to handle this semanticly correct in KVM side.

Note that (2) *is* a useful optimization, because CR4.SMAP=1 is an easy
triggerable condition and guests usually enable SMAP together with SMEP.
If the vCPU has CR4.SMEP=1, the errata could indeed be encountered onlt
at guest CPL==3; otherwise, the CPU would raise a SMEP fault to guest
instead of #NPF.  We keep this condition to avoid false positives in
the detection of the errata.

In addition, to avoid future confusion and improve code readbility,
include details of the errata in code and not just in commit message.

Fixes: 05d5a48635 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
Cc: Singh Brijesh <brijesh.singh@amd.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 09:00:44 +02:00
Paolo Bonzini
a45ff5994c KVM/arm updates for 5.3
- Add support for chained PMU counters in guests
 - Improve SError handling
 - Handle Neoverse N1 erratum #1349291
 - Allow side-channel mitigation status to be migrated
 - Standardise most AArch64 system register accesses to msr_s/mrs_s
 - Fix host MPIDR corruption on 32bit
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl0kge4VHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDYyQP/3XY5tFcLKkp/h9rnGaCXwAxhNzn
 TyF/IZEFBKFTSoDMXKLLc8KllvoPQ7aUl03heYbuayYpyKR1+LCx7lDwu1MYyEf+
 aSSuOKlbG//tLUEGp09pTRCgjs2mhhZYqOj5GF2mZ7xpovFVSNOPzTazbXDNQ7tw
 zUAs43YNg+bUMwj+SLWpBlizjrLr7T34utIr6daKJE/GSfmIrcYXhGbZqUh0zbO0
 z5LNasebws8/pHyeGI7+/yoMIKaQ8foMgywTpsRpBsx6YI+AbOLjEmCk2IBOPcEK
 pm9KkSIBZEO2CSxZKl3NQiEow/Qd/lnz2xLMCSfh4XrYoI2Th4gNcsbJpiBDWP5a
 0eZ5jSiexxKngIbM+to7jR3m0yc9RgcuzceJg3Uly7Ya0vb5RqKwOX4Ge4XP4VDT
 DzIVFdQjxDKdVIf3EvGp1cj4P7dRUU3xbZcbzyuRPEmT3vgjEnbxawmPLs3QMAl1
 31Wd2wIsPB86kSxzSMel27Vs5VgMhgyHE26zN91R745CvhDXaDKydIWjGjdVMHsB
 GuX/h2kL+ohx+N/OpZPgwsVUAGLSOQFP3pE/EcGtqc2kkfqa+bx12DKcZ3zdmJvy
 +cu5ixU8q5thPH/pZob/C3hKUY/eLy02emS34RK0Jh2sZHbQgAOtMsiqUxNHEjUm
 6TkpdWa5SRd7CtGV
 =yfCs
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm updates for 5.3

- Add support for chained PMU counters in guests
- Improve SError handling
- Handle Neoverse N1 erratum #1349291
- Allow side-channel mitigation status to be migrated
- Standardise most AArch64 system register accesses to msr_s/mrs_s
- Fix host MPIDR corruption on 32bit
2019-07-11 15:14:16 +02:00
Sean Christopherson
d7a08882a0 KVM: x86: Unconditionally enable irqs in guest context
On VMX, KVM currently does not re-enable irqs until after it has exited
the guest context.  As a result, a tick that fires in the window between
VM-Exit and guest_exit_irqoff() will be accounted as system time.  While
said window is relatively small, it's large enough to be problematic in
some configurations, e.g. if VM-Exits are consistently occurring a hair
earlier than the tick irq.

Intentionally toggle irqs back off so that guest_exit_irqoff() can be
used in lieu of guest_exit() in order to avoid the save/restore of flags
in guest_exit().  On my Haswell system, "nop; cli; sti" is ~6 cycles,
versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf".

Fixes: f2485b3e0c ("KVM: x86: use guest_exit_irqoff")
Reported-by: Wei Yang <w90p710@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-11 15:10:22 +02:00
Paolo Bonzini
d647eb63e6 KVM: svm: add nrips module parameter
Allow testing code for old processors that lack the next RIP save
feature, by disabling usage of the next_rip field.

Nested hypervisors however get the feature unconditionally.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-03 16:14:38 +02:00
Paolo Bonzini
95c5c7c77c KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST
This allows userspace to know which MSRs are supported by the hypervisor.
Unfortunately userspace must resort to tricks for everything except
MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
One possibility is to use the feature control MSR, which is tied to nested
VMX as well and is present on all KVM versions that support feature MSRs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 17:36:18 +02:00
Thomas Gleixner
20c8ccb197 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2 see
  the copying file in the top level directory

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 35 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:53 +02:00
Sean Christopherson
95b5a48c4f KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn
Per commit 1b6269db3f ("KVM: VMX: Handle NMIs before enabling
interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
to "make sure we handle NMI on the current cpu, and that we don't
service maskable interrupts before non-maskable ones".  The other
exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
have similar requirements, and are located there to avoid extra VMREADs
since VMX bins hardware exceptions and NMIs into a single exit reason.

Clean up the code and eliminate the vaguely named complete_atomic_exit()
by moving the interrupts-disabled exception and NMI handling into the
existing handle_external_intrs() callback, and rename the callback to
a more appropriate name.  Rename VMexit handlers throughout so that the
atomic and non-atomic counterparts have similar names.

In addition to improving code readability, this also ensures the NMI
handler is run with the host's debug registers loaded in the unlikely
event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
field is also improved when handling NMIs (and #MCs) as the handler
will run after updating said field.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Naming cleanups. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:04 +02:00
Sean Christopherson
165072b089 KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code
VMX can conditionally call kvm_{before,after}_interrupt() since KVM
always uses "ack interrupt on exit" and therefore explicitly handles
interrupts as opposed to blindly enabling irqs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:03 +02:00
Sean Christopherson
f257d6dcda KVM: Directly return result from kvm_arch_check_processor_compat()
Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
boilerplate ugliness of checking virtualization support on all CPUs is
hidden from the arch specific code.  x86's implementation in particular
is quite heinous, as it unnecessarily propagates the out-param pattern
into kvm_x86_ops.

While the x86 specific issue could be resolved solely by changing
kvm_x86_ops, make the change for all architectures as returning a value
directly is prettier and technically more robust, e.g. s390 doesn't set
the out param, which could lead to subtle breakage in the (highly
unlikely) scenario where the out-param was not pre-initialized by the
caller.

Opportunistically annotate svm_check_processor_compat() with __init.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:32 +02:00
Suthikulpanit, Suravee
0532dd52df kvm: svm/avic: Do not send AVIC doorbell to self
AVIC doorbell is used to notify a running vCPU that interrupts
has been injected into the vCPU AVIC backing page. Current logic
checks only if a VCPU is running before sending a doorbell.
However, the doorbell is not necessary if the destination
CPU is itself.

Add logic to check currently running CPU before sending doorbell.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:31 +02:00
Wanpeng Li
b6c4bc659c KVM: LAPIC: Optimize timer latency further
Advance lapic timer tries to hidden the hypervisor overhead between the
host emulated timer fires and the guest awares the timer is fired. However,
it just hidden the time between apic_timer_fn/handle_preemption_timer ->
wait_lapic_expire, instead of the real position of vmentry which is
mentioned in the orignial commit d0659d946b ("KVM: x86: add option to
advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
the end of wait_lapic_expire and before world switch on my haswell desktop.

This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
it takes the real overhead time between apic_timer_fn/handle_preemption_timer
and before world switch into consideration when adaptively tuning timer
advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
busy waits.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:30 +02:00
Vitaly Kuznetsov
8f38302c0b KVM/nSVM: properly map nested VMCB
Commit 8c5fbf1a72 ("KVM/nSVM: Use the new mapping API for mapping guest
memory") broke nested SVM completely: kvm_vcpu_map()'s second parameter is
GFN so vmcb_gpa needs to be converted with gpa_to_gfn(), not the other way
around.

Fixes: 8c5fbf1a72 ("KVM/nSVM: Use the new mapping API for mapping guest memory")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:27 +02:00
Paolo Bonzini
6f2f84532c KVM: x86: do not spam dmesg with VMCS/VMCB dumps
Userspace can easily set up invalid processor state in such a way that
dmesg will be filled with VMCS or VMCB dumps.  Disable this by default
using a module parameter.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:12 +02:00
Suthikulpanit, Suravee
c9bcd3e333 kvm: svm/avic: fix off-by-one in checking host APIC ID
Current logic does not allow VCPU to be loaded onto CPU with
APIC ID 255. This should be allowed since the host physical APIC ID
field in the AVIC Physical APIC table entry is an 8-bit value,
and APIC ID 255 is valid in system with x2APIC enabled.
Instead, do not allow VCPU load if the host APIC ID cannot be
represented by an 8-bit value.

Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
instead of AVIC_MAX_PHYSICAL_ID_COUNT.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:11 +02:00
Linus Torvalds
0ef0fd3515 * ARM: support for SVE and Pointer Authentication in guests, PMU improvements
* POWER: support for direct access to the POWER9 XIVE interrupt controller,
 memory and performance optimizations.
 
 * x86: support for accessing memory not backed by struct page, fixes and refactoring
 
 * Generic: dirty page tracking improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg
 PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE
 nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy
 Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7
 eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n
 hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I=
 =D0+p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - support for SVE and Pointer Authentication in guests
   - PMU improvements

  POWER:
   - support for direct access to the POWER9 XIVE interrupt controller
   - memory and performance optimizations

  x86:
   - support for accessing memory not backed by struct page
   - fixes and refactoring

  Generic:
   - dirty page tracking improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
  kvm: fix compilation on aarch64
  Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
  kvm: x86: Fix L1TF mitigation for shadow MMU
  KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
  KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
  KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
  KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
  kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
  tests: kvm: Add tests for KVM_SET_NESTED_STATE
  KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
  tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
  tests: kvm: Add tests to .gitignore
  KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
  KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
  KVM: Fix the bitmap range to copy during clear dirty
  KVM: arm64: Fix ptrauth ID register masking logic
  KVM: x86: use direct accessors for RIP and RSP
  KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
  KVM: x86: Omit caching logic for always-available GPRs
  kvm, x86: Properly check whether a pfn is an MMIO or not
  ...
2019-05-17 10:33:30 -07:00
Ira Weiny
73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Paolo Bonzini
e9c16c7850 KVM: x86: use direct accessors for RIP and RSP
Use specific inline functions for RIP and RSP instead of
going through kvm_register_read and kvm_register_write,
which are quite a mouthful.  kvm_rsp_read and kvm_rsp_write
did not exist, so add them.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 22:07:26 +02:00
Sean Christopherson
de3cd117ed KVM: x86: Omit caching logic for always-available GPRs
Except for RSP and RIP, which are held in VMX's VMCS, GPRs are always
treated "available and dirtly" on both VMX and SVM, i.e. are
unconditionally loaded/saved immediately before/after VM-Enter/VM-Exit.

Eliminating the unnecessary caching code reduces the size of KVM by a
non-trivial amount, much of which comes from the most common code paths.
E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and
kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM
dropping by ~1000 bytes.  With CONFIG_RETPOLINE=y, the numbers are even
more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:56:12 +02:00
KarimAllah Ahmed
8c5fbf1a72 KVM/nSVM: Use the new mapping API for mapping guest memory
Use the new mapping API for mapping guest memory to avoid depending on
"struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:34 +02:00
Sean Christopherson
9ec19493fb KVM: x86: clear SMM flags before loading state while leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
loading SMSTATE into architectural state, e.g. by toggling it for
problematic flows, and simply clear HF_SMM_MASK prior to loading
architectural state (from SMRAM save state area).

Reported-by: Jon Doron <arilou@gmail.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 5bea5123cb ("KVM: VMX: check nested state and CR4.VMXE against SMM")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:36 +02:00
Sean Christopherson
ed19321fb6 KVM: x86: Load SMRAM in a single shot when leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
when loading SMSTATE into architectural state, ideally RSM emulation
itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
architectural state.

Ostensibly, the only motivation for having HF_SMM_MASK set throughout
the loading of state from the SMRAM save state area is so that the
memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
all of the SMRAM save state area from guest memory at the beginning of
RSM emulation, and load state from the buffer instead of reading guest
memory one-by-one.

This paves the way for clearing HF_SMM_MASK prior to loading state,
and also aligns RSM with the enter_smm() behavior, which fills a
buffer and writes SMRAM save state in a single go.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:35 +02:00
WANG Chao
1811d979c7 x86/kvm: move kvm_load/put_guest_xcr0 into atomic context
guest xcr0 could leak into host when MCE happens in guest mode. Because
do_machine_check() could schedule out at a few places.

For example:

kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu) {
  vmx_vcpu_run
    vmx_complete_atomic_exit
      kvm_machine_check
        do_machine_check
          do_memory_failure
            memory_failure
              lock_page

In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
out, host cpu has guest xcr0 loaded (0xff).

In __switch_to {
     switch_fpu_finish
       copy_kernel_to_fpregs
         XRSTORS

If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
and tries to reinitialize fpu by restoring init fpu state. Same story as
last #GP, except we get DOUBLE FAULT this time.

Cc: stable@vger.kernel.org
Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:33 +02:00
Vitaly Kuznetsov
99c221796a KVM: x86: svm: make sure NMI is injected after nmi_singlestep
I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
shows that we're sometimes able to deliver a few but never all.

When we're trying to inject an NMI we may fail to do so immediately for
various reasons, however, we still need to inject it so enable_nmi_window()
arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
for pending NMIs before entering the guest and unless there's a different
event to process, the NMI will never get delivered.

Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
pending NMIs are checked and possibly injected.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:32 +02:00
Suthikulpanit, Suravee
e44e3eaccc svm/avic: Fix invalidate logical APIC id entry
Only clear the valid bit when invalidate logical APIC id entry.
The current logic clear the valid bit, but also set the rest of
the bits (including reserved bits) to 1.

Fixes: 98d90582be ('svm: Fix AVIC DFR and LDR handling')
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:32 +02:00
Suthikulpanit, Suravee
4a58038b9e Revert "svm: Fix AVIC incomplete IPI emulation"
This reverts commit bb218fbcfa.

As Oren Twaig pointed out the old discussion:

  https://patchwork.kernel.org/patch/8292231/

that the change coud potentially cause an extra IPI to be sent to
the destination vcpu because the AVIC hardware already set the IRR bit
before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
Since writting to ICR and ICR2 will also set the IRR. If something triggers
the destination vcpu to get scheduled before the emulation finishes, then
this could result in an additional IPI.

Also, the issue mentioned in the commit bb218fbcfa was misdiagnosed.

Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Oren Twaig <oren@scalemp.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:31 +02:00
David Rientjes
b86bc2858b KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow
This ensures that the address and length provided to DBG_DECRYPT and
DBG_ENCRYPT do not cause an overflow.

At the same time, pass the actual number of pages pinned in memory to
sev_unpin_memory() as a cleanup.

Reported-by: Cfir Cohen <cfir@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05 20:49:42 +02:00
David Rientjes
ede885ecb2 kvm: svm: fix potential get_num_contig_pages overflow
get_num_contig_pages() could potentially overflow int so make its type
consistent with its usage.

Reported-by: Cfir Cohen <cfir@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05 20:48:59 +02:00
Singh, Brijesh
05d5a48635 KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)
Errata#1096:

On a nested data page fault when CR.SMAP=1 and the guest data read
generates a SMAP violation, GuestInstrBytes field of the VMCB on a
VMEXIT will incorrectly return 0h instead the correct guest
instruction bytes .

Recommend Workaround:

To determine what instruction the guest was executing the hypervisor
will have to decode the instruction at the instruction pointer.

The recommended workaround can not be implemented for the SEV
guest because guest memory is encrypted with the guest specific key,
and instruction decoder will not be able to decode the instruction
bytes. If we hit this errata in the SEV guest then log the message
and request a guest shutdown.

Reported-by: Venkatesh Srinivas <venkateshs@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28 17:27:17 +01:00
Ben Gardon
1ec696470c kvm: svm: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:31 +01:00
Suthikulpanit, Suravee
c57cd3c89e svm: Fix improper check when deactivate AVIC
The function svm_refresh_apicv_exec_ctrl() always returning prematurely
as kvm_vcpu_apicv_active() always return false when calling from
the function arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv().
This is because the apicv_active is set to false just before calling
refresh_apicv_exec_ctrl().

Also, we need to mark VMCB_AVIC bit as dirty instead of VMCB_INTR.

So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.

Fixes: 67034bb9dd ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:22 +01:00
Suthikulpanit, Suravee
98d90582be svm: Fix AVIC DFR and LDR handling
Current SVM AVIC driver makes two incorrect assumptions:
  1. APIC LDR register cannot be zero
  2. APIC DFR for all vCPUs must be the same

LDR=0 means the local APIC does not support logical destination mode.
Therefore, the driver should mark any previously assigned logical APIC ID
table entry as invalid, and return success.  Also, DFR is specific to
a particular local APIC, and can be different among all vCPUs
(as observed on Windows 10).

These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
to boot with AVIC enabled. So, instead of flush the whole logical APIC ID
table, handle DFR and LDR for each vCPU independently.

Fixes: 18f40c53e1 ('svm: Add VMEXIT handlers for AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:20 +01:00
Gustavo A. R. Silva
b2869f28e1 KVM: x86: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

This patch fixes the following warnings:

arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-25 19:29:36 +01:00
Vitaly Kuznetsov
619ad846fc KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1
kvm-unit-tests' eventinj "NMI failing on IDT" test results in NMI being
delivered to the host (L1) when it's running nested. The problem seems to
be: svm_complete_interrupts() raises 'nmi_injected' flag but later we
decide to reflect EXIT_NPF to L1. The flag remains pending and we do NMI
injection upon entry so it got delivered to L1 instead of L2.

It seems that VMX code solves the same issue in prepare_vmcs12(), this was
introduced with code refactoring in commit 5f3d579997 ("KVM: nVMX: Rework
event injection and recovery").

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-25 19:11:35 +01:00
Suravee Suthikulpanit
bb218fbcfa svm: Fix AVIC incomplete IPI emulation
In case of incomplete IPI with invalid interrupt type, the current
SVM driver does not properly emulate the IPI, and fails to boot
FreeBSD guests with multiple vcpus when enabling AVIC.

Fix this by update APIC ICR high/low registers, which also
emulate sending the IPI.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-25 19:11:34 +01:00
Suravee Suthikulpanit
37ef0c4414 svm: Add warning message for AVIC IPI invalid target
Print warning message when IPI target ID is invalid due to one of
the following reasons:
  * In logical mode: cluster > max_cluster (64)
  * In physical mode: target > max_physical (512)
  * Address is not present in the physical or logical ID tables

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-25 19:11:34 +01:00
David Rientjes
3f14a89d11 kvm: sev: Fail KVM_SEV_INIT if already initialized
By code inspection, it was found that multiple calls to KVM_SEV_INIT
could deplete asid bits and overwrite kvm_sev_info's regions_list.

Multiple calls to KVM_SVM_INIT is not likely to occur with QEMU, but this
should likely be fixed anyway.

This code is serialized by kvm->lock.

Fixes: 1654efcbc4 ("KVM: SVM: Add KVM_SEV_INIT command")
Reported-by: Cfir Cohen <cfir@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2019-01-11 18:38:07 +01:00
Linus Torvalds
42b00f122c * ARM: selftests improvements, large PUD support for HugeTLB,
single-stepping fixes, improved tracing, various timer and vGIC
 fixes
 
 * x86: Processor Tracing virtualization, STIBP support, some correctness fixes,
 refactorings and splitting of vmx.c, use the Hyper-V range TLB flush hypercall,
 reduce order of vcpu struct, WBNOINVD support, do not use -ftrace for __noclone
 functions, nested guest support for PAUSE filtering on AMD, more Hyper-V
 enlightenments (direct mode for synthetic timers)
 
 * PPC: nested VFIO
 
 * s390: bugfixes only this time
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJcH0vFAAoJEL/70l94x66Dw/wH/2FZp1YOM5OgiJzgqnXyDbyf
 dNEfWo472MtNiLsuf+ZAfJojVIu9cv7wtBfXNzW+75XZDfh/J88geHWNSiZDm3Fe
 aM4MOnGG0yF3hQrRQyEHe4IFhGFNERax8Ccv+OL44md9CjYrIrsGkRD08qwb+gNh
 P8T/3wJEKwUcVHA/1VHEIM8MlirxNENc78p6JKd/C7zb0emjGavdIpWFUMr3SNfs
 CemabhJUuwOYtwjRInyx1y34FzYwW3Ejuc9a9UoZ+COahUfkuxHE8u+EQS7vLVF6
 2VGVu5SA0PqgmLlGhHthxLqVgQYo+dB22cRnsLtXlUChtVAq8q9uu5sKzvqEzuE=
 =b4Jx
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - selftests improvements
   - large PUD support for HugeTLB
   - single-stepping fixes
   - improved tracing
   - various timer and vGIC fixes

  x86:
   - Processor Tracing virtualization
   - STIBP support
   - some correctness fixes
   - refactorings and splitting of vmx.c
   - use the Hyper-V range TLB flush hypercall
   - reduce order of vcpu struct
   - WBNOINVD support
   - do not use -ftrace for __noclone functions
   - nested guest support for PAUSE filtering on AMD
   - more Hyper-V enlightenments (direct mode for synthetic timers)

  PPC:
   -  nested VFIO

  s390:
   - bugfixes only this time"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
  KVM: x86: Add CPUID support for new instruction WBNOINVD
  kvm: selftests: ucall: fix exit mmio address guessing
  Revert "compiler-gcc: disable -ftracer for __noclone functions"
  KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines
  KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
  KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
  MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry
  KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
  KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
  KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp()
  KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte()
  KVM: Make kvm_set_spte_hva() return int
  KVM: Replace old tlb flush function with new one to flush a specified range.
  KVM/MMU: Add tlb flush with range helper function
  KVM/VMX: Add hv tlb range flush support
  x86/hyper-v: Add HvFlushGuestAddressList hypercall support
  KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
  KVM: x86: Disable Intel PT when VMXON in L1 guest
  KVM: x86: Set intercept for Intel PT MSRs read/write
  KVM: x86: Implement Intel PT MSRs read/write emulation
  ...
2018-12-26 11:46:28 -08:00
Uros Bizjak
ac5ffda244 KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
Recently the minimum required version of binutils was changed to 2.20,
which supports all SVM instruction mnemonics. The patch removes
all .byte #defines and uses real instruction mnemonics instead.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:28:44 +01:00
Chao Peng
86f5201df0 KVM: x86: Add Intel Processor Trace cpuid emulation
Expose Intel Processor Trace to guest only when
the PT works in Host-Guest mode.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:28:35 +01:00
Tambe, William
e081354d6a KVM: nSVM: Fix nested guest support for PAUSE filtering.
Currently, the nested guest's PAUSE intercept intentions are not being
honored.  Instead, since the L0 hypervisor's pause_filter_count and
pause_filter_thresh values are still in place, these values are used
instead of those programmed in the VMCB by the L1 hypervisor.

To honor the desired PAUSE intercept support of the L1 hypervisor, the L0
hypervisor must use the PAUSE filtering fields of the L1 hypervisor. This
requires saving and restoring of both the L0 and L1 hypervisor's PAUSE
filtering fields.

Signed-off-by: William Tambe <william.tambe@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-12-21 11:28:23 +01:00
Vitaly Kuznetsov
e87555e550 KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported
AMD doesn't seem to implement MSR_IA32_MCG_EXT_CTL and svm code in kvm
knows nothing about it, however, this MSR is among emulated_msrs and
thus returned with KVM_GET_MSR_INDEX_LIST. The consequent KVM_GET_MSRS,
of course, fails.

Report the MSR as unsupported to not confuse userspace.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-12-21 11:28:20 +01:00
Vitaly Kuznetsov
3cf85f9f6b KVM: x86: nSVM: fix switch to guest mmu
Recent optimizations in MMU code broke nested SVM with NPT in L1
completely: when we do nested_svm_{,un}init_mmu_context() we want
to switch from TDP MMU to shadow MMU, both init_kvm_tdp_mmu() and
kvm_init_shadow_mmu() check if re-configuration is needed by looking
at cache source data. The data, however, doesn't change - it's only
the type of the MMU which changes. We end up not re-initializing
guest MMU as shadow and everything goes off the rails.

The issue could have been fixed by putting MMU type into extended MMU
role but this is not really needed. We can just split root and guest MMUs
the exact same way we did for nVMX, their types never change in the
lifetime of a vCPU.

There is still room for improvement: currently, we reset all MMU roots
when switching from L1 to L2 and back and this is not needed.

Fixes: 7dcd575520 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-19 22:19:22 +01:00
Marc Orr
b666a4b697 kvm: x86: Dynamically allocate guest_fpu
Previously, the guest_fpu field was embedded in the kvm_vcpu_arch
struct. Unfortunately, the field is quite large, (e.g., 4352 bytes on my
current setup). This bloats the kvm_vcpu_arch struct for x86 into an
order 3 memory allocation, which can become a problem on overcommitted
machines. Thus, this patch moves the fpu state outside of the
kvm_vcpu_arch struct.

With this patch applied, the kvm_vcpu_arch struct is reduced to 15168
bytes for vmx on my setup when building the kernel with kvmconfig.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 18:00:08 +01:00
Vitaly Kuznetsov
e2e871ab2f x86/kvm/hyper-v: Introduce nested_get_evmcs_version() helper
The upcoming KVM_GET_SUPPORTED_HV_CPUID ioctl will need to return
Enlightened VMCS version in HYPERV_CPUID_NESTED_FEATURES.EAX when
it was enabled.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 17:59:54 +01:00
Peng Hao
b2227ddec1 kvm: svm: remove unused struct definition
structure svm_init_data is never used. So remove it.

Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 17:59:50 +01:00
Paolo Bonzini
45c3af974e KVM: x86: Trace changes to active TSC offset regardless if vCPU in guest-mode
For some reason, kvm_x86_ops->write_l1_tsc_offset() skipped trace
of change to active TSC offset in case vCPU is in guest-mode.
This patch changes write_l1_tsc_offset() behavior to trace any change
to active TSC offset to aid debugging.  The VMX code is changed to
look more similar to SVM, which is in my opinion nicer.

Based on a patch by Liran Alon.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:53:43 +01:00
Jim Mattson
fd65d3142f kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb
Previously, we only called indirect_branch_prediction_barrier on the
logical CPU that freed a vmcb. This function should be called on all
logical CPUs that last loaded the vmcb in question.

Fixes: 15d4507152 ("KVM/x86: Add IBPB support")
Reported-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:42 +01:00
Leonid Shatz
326e742533 KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset
Since commit e79f245dde ("X86/KVM: Properly update 'tsc_offset' to
represent the running guest"), vcpu->arch.tsc_offset meaning was
changed to always reflect the tsc_offset value set on active VMCS.
Regardless if vCPU is currently running L1 or L2.

However, above mentioned commit failed to also change
kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
This is because vmx_write_tsc_offset() could set the tsc_offset value
in active VMCS to given offset parameter *plus vmcs12->tsc_offset*.
However, kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset
to given offset parameter. Without taking into account the possible
addition of vmcs12->tsc_offset. (Same is true for SVM case).

Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
actually set tsc_offset in active VMCS and modify
kvm_vcpu_write_tsc_offset() to set returned value in
vcpu->arch.tsc_offset.
In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
to make it clear that it is meant to set L1 TSC offset.

Fixes: e79f245dde ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:10 +01:00
Wei Wang
30510387a5 svm: Add mutex_lock to protect apic_access_page_done on AMD systems
There is a race condition when accessing kvm->arch.apic_access_page_done.
Due to it, x86_set_memory_region will fail when creating the second vcpu
for a svm guest.

Add a mutex_lock to serialize the accesses to apic_access_page_done.
This lock is also used by vmx for the same purpose.

Signed-off-by: Wei Wang <wawei@amazon.de>
Signed-off-by: Amadeusz Juskowiak <ajusk@amazon.de>
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:49:39 +01:00
Linus Torvalds
0d1e8b8d2b KVM updates for v4.20
ARM:
  - Improved guest IPA space support (32 to 52 bits)
 
  - RAS event delivery for 32bit
 
  - PMU fixes
 
  - Guest entry hardening
 
  - Various cleanups
 
  - Port of dirty_log_test selftest
 
 PPC:
  - Nested HV KVM support for radix guests on POWER9.  The performance is
    much better than with PR KVM.  Migration and arbitrary level of
    nesting is supported.
 
  - Disable nested HV-KVM on early POWER9 chips that need a particular hardware
    bug workaround
 
  - One VM per core mode to prevent potential data leaks
 
  - PCI pass-through optimization
 
  - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base
 
 s390:
  - Initial version of AP crypto virtualization via vfio-mdev
 
  - Improvement for vfio-ap
 
  - Set the host program identifier
 
  - Optimize page table locking
 
 x86:
  - Enable nested virtualization by default
 
  - Implement Hyper-V IPI hypercalls
 
  - Improve #PF and #DB handling
 
  - Allow guests to use Enlightened VMCS
 
  - Add migration selftests for VMCS and Enlightened VMCS
 
  - Allow coalesced PIO accesses
 
  - Add an option to perform nested VMCS host state consistency check
    through hardware
 
  - Automatic tuning of lapic_timer_advance_ns
 
  - Many fixes, minor improvements, and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJb0FINAAoJEED/6hsPKofoI60IAJRS3vOAQ9Fav8cJsO1oBHcX
 3+NexfnBke1bzrjIR3SUcHKGZbdnVPNZc+Q4JjIbPpPmmOMU5jc9BC1dmd5f4Vzh
 BMnQ0yCvgFv3A3fy/Icx1Z8NJppxosdmqdQLrQrNo8aD3cjnqY2yQixdXrAfzLzw
 XEgKdIFCCz8oVN/C9TT4wwJn6l9OE7BM5bMKGFy5VNXzMu7t64UDOLbbjZxNgi1g
 teYvfVGdt5mH0N7b2GPPWRbJmgnz5ygVVpVNQUEFrdKZoCm6r5u9d19N+RRXAwan
 ZYFj10W2T8pJOUf3tryev4V33X7MRQitfJBo4tP5hZfi9uRX89np5zP1CFE7AtY=
 =yEPW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:
   - Improved guest IPA space support (32 to 52 bits)

   - RAS event delivery for 32bit

   - PMU fixes

   - Guest entry hardening

   - Various cleanups

   - Port of dirty_log_test selftest

  PPC:
   - Nested HV KVM support for radix guests on POWER9. The performance
     is much better than with PR KVM. Migration and arbitrary level of
     nesting is supported.

   - Disable nested HV-KVM on early POWER9 chips that need a particular
     hardware bug workaround

   - One VM per core mode to prevent potential data leaks

   - PCI pass-through optimization

   - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

  s390:
   - Initial version of AP crypto virtualization via vfio-mdev

   - Improvement for vfio-ap

   - Set the host program identifier

   - Optimize page table locking

  x86:
   - Enable nested virtualization by default

   - Implement Hyper-V IPI hypercalls

   - Improve #PF and #DB handling

   - Allow guests to use Enlightened VMCS

   - Add migration selftests for VMCS and Enlightened VMCS

   - Allow coalesced PIO accesses

   - Add an option to perform nested VMCS host state consistency check
     through hardware

   - Automatic tuning of lapic_timer_advance_ns

   - Many fixes, minor improvements, and cleanups"

* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
  Revert "kvm: x86: optimize dr6 restore"
  KVM: PPC: Optimize clearing TCEs for sparse tables
  x86/kvm/nVMX: tweak shadow fields
  selftests/kvm: add missing executables to .gitignore
  KVM: arm64: Safety check PSTATE when entering guest and handle IL
  KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
  arm/arm64: KVM: Enable 32 bits kvm vcpu events support
  arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
  KVM: arm64: Fix caching of host MDCR_EL2 value
  KVM: VMX: enable nested virtualization by default
  KVM/x86: Use 32bit xor to clear registers in svm.c
  kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
  kvm: vmx: Defer setting of DR6 until #DB delivery
  kvm: x86: Defer setting of CR2 until #PF delivery
  kvm: x86: Add payload operands to kvm_multiple_exception
  kvm: x86: Add exception payload fields to kvm_vcpu_events
  kvm: x86: Add has_payload and payload to kvm_queued_exception
  KVM: Documentation: Fix omission in struct kvm_vcpu_events
  KVM: selftests: add Enlightened VMCS test
  ...
2018-10-25 17:57:35 -07:00
Uros Bizjak
43ce76ce73 KVM/x86: Use 32bit xor to clear registers in svm.c
x86_64 zero-extends 32bit xor operation to a full 64bit register.

Also add a comment and remove unnecessary instruction suffix in vmx.c

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:08:21 +02:00
Jim Mattson
da998b46d2 kvm: x86: Defer setting of CR2 until #PF delivery
When exception payloads are enabled by userspace (which is not yet
possible) and a #PF is raised in L2, defer the setting of CR2 until
the #PF is delivered. This allows the L1 hypervisor to intercept the
fault before CR2 is modified.

For backwards compatibility, when exception payloads are not enabled
by userspace, kvm_multiple_exception modifies CR2 when the #PF
exception is raised.

Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:07:43 +02:00
Vitaly Kuznetsov
57b119da35 KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability
Enlightened VMCS is opt-in. The current version does not contain all
fields supported by nested VMX so we must not advertise the
corresponding VMX features if enlightened VMCS is enabled.

Userspace is given the enlightened VMCS version supported by KVM as
part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to
be advertised to the nested hypervisor, currently done via a cpuid
leaf for Hyper-V.

Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:14 +02:00
Vitaly Kuznetsov
44dd3ffa7b x86/kvm/mmu: make vcpu->mmu a pointer to the current MMU
As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu
a pointer to the currently used mmu. For now, this is always
vcpu->arch.root_mmu. No functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:02 +02:00
Paolo Bonzini
853c110982 KVM: x86: support CONFIG_KVM_AMD=y with CONFIG_CRYPTO_DEV_CCP_DD=m
SEV requires access to the AMD cryptographic device APIs, and this
does not work when KVM is builtin and the crypto driver is a module.
Actually the Kconfig conditions for CONFIG_KVM_AMD_SEV try to disable
SEV in that case, but it does not work because the actual crypto
calls are not culled, only sev_hardware_setup() is.

This patch adds two CONFIG_KVM_AMD_SEV checks that gate all the remaining
SEV code; it fixes this particular configuration, and drops 5 KiB of
code when CONFIG_KVM_AMD_SEV=n.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-09 18:38:42 +02:00
Sean Christopherson
d264ee0c2e KVM: VMX: use preemption timer to force immediate VMExit
A VMX preemption timer value of '0' is guaranteed to cause a VMExit
prior to the CPU executing any instructions in the guest.  Use the
preemption timer (if it's supported) to trigger immediate VMExit
in place of the current method of sending a self-IPI.  This ensures
that pending VMExit injection to L1 occurs prior to executing any
instructions in the guest (regardless of nesting level).

When deferring VMExit injection, KVM generates an immediate VMExit
from the (possibly nested) guest by sending itself an IPI.  Because
hardware interrupts are blocked prior to VMEnter and are unblocked
(in hardware) after VMEnter, this results in taking a VMExit(INTR)
before any guest instruction is executed.  But, as this approach
relies on the IPI being received before VMEnter executes, it only
works as intended when KVM is running as L0.  Because there are no
architectural guarantees regarding when IPIs are delivered, when
running nested the INTR may "arrive" long after L2 is running e.g.
L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

For the most part, this unintended delay is not an issue since the
events being injected to L1 also do not have architectural guarantees
regarding their timing.  The notable exception is the VMX preemption
timer[1], which is architecturally guaranteed to cause a VMExit prior
to executing any instructions in the guest if the timer value is '0'
at VMEnter.  Specifically, the delay in injecting the VMExit causes
the preemption timer KVM unit test to fail when run in a nested guest.

Note: this approach is viable even on CPUs with a broken preemption
timer, as broken in this context only means the timer counts at the
wrong rate.  There are no known errata affecting timer value of '0'.

[1] I/O SMIs also have guarantees on when they arrive, but I have
    no idea if/how those are emulated in KVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Use a hook for SVM instead of leaving the default in x86.c - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20 00:51:42 +02:00
Andy Shevchenko
a101c9d63e KVM: SVM: Switch to bitmap_zalloc()
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20 00:26:45 +02:00
Sean Christopherson
0ce97a2b62 KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction()
Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
specific function, and there's no conflict that prevents adding the
kvm_ prefix.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:44 +02:00
Sean Christopherson
35be0aded7 KVM: x86: SVM: Set EMULTYPE_NO_REEXECUTE for RSM emulation
Re-execution after an emulation decode failure is only intended to
handle a case where two or vCPUs race to write a shadowed page, i.e.
we should never re-execute an instruction as part of RSM emulation.

Add a new helper, kvm_emulate_instruction_from_buffer(), to support
emulating from a pre-defined buffer.  This eliminates the last direct
call to x86_emulate_instruction() outside of kvm_mmu_page_fault(),
which means x86_emulate_instruction() can be unexported in a future
patch.

Fixes: 7607b71744 ("KVM: SVM: install RSM intercept")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:43 +02:00
Colin Ian King
0186ec8232 KVM: SVM: remove unused variable dst_vaddr_end
Variable dst_vaddr_end is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
variable 'dst_vaddr_end' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:42 +02:00
Thomas Gleixner
024d83cadc KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
Mikhail reported the following lockdep splat:

WARNING: possible irq lock inversion dependency detected
CPU 0/KVM/10284 just changed the state of lock:
  000000000d538a88 (&st->lock){+...}, at:
  speculative_store_bypass_update+0x10b/0x170

but this lock was taken by another, HARDIRQ-safe lock
in the past:

(&(&sighand->siglock)->rlock){-.-.}

   and interrupts could create inverse lock ordering between them.

Possible interrupt unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
   lock(&st->lock);
                           local_irq_disable();
                           lock(&(&sighand->siglock)->rlock);
                           lock(&st->lock);
    <Interrupt>
     lock(&(&sighand->siglock)->rlock);
     *** DEADLOCK ***

The code path which connects those locks is:

   speculative_store_bypass_update()
   ssb_prctl_set()
   do_seccomp()
   do_syscall_64()

In svm_vcpu_run() speculative_store_bypass_update() is called with
interupts enabled via x86_virt_spec_ctrl_set_guest/host().

This is actually a false positive, because GIF=0 so interrupts are
disabled even if IF=1; however, we can easily move the invocations of
x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to
cure it, and it's a good idea to keep the GIF=0/IF=1 area as small
and self-contained as possible.

Fixes: 1f50ddb4f4 ("x86/speculation: Handle HT correctly on AMD")
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-22 16:48:36 +02:00
Junaid Shahid
faff87588d kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg*
This needs a minor bug fix. The updated patch is as follows.

Thanks,
Junaid

------------------------------------------------------------------------------

kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:01 +02:00
Junaid Shahid
afe828d1de kvm: x86: Add ability to skip TLB flush when switching CR3
Remove the implicit flush from the set_cr3 handlers, so that the
callers are able to decide whether to flush the TLB or not.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:55 +02:00
Linus Torvalds
b08fc5277a - Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
 - Explicitly reported overflow fixes (Silvio, Kees)
 - Add missing kvcalloc() function (Kees)
 - Treewide conversions of allocators to use either 2-factor argument
   variant when available, or array_size() and array3_size() as needed (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsgVtMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJhsJEACLYe2EbwLFJz7emOT1KUGK5R1b
 oVxJog0893WyMqgk9XBlA2lvTBRBYzR3tzsadfYo87L3VOBzazUv0YZaweJb65sF
 bAvxW3nY06brhKKwTRed1PrMa1iG9R63WISnNAuZAq7+79mN6YgW4G6YSAEF9lW7
 oPJoPw93YxcI8JcG+dA8BC9w7pJFKooZH4gvLUSUNl5XKr8Ru5YnWcV8F+8M4vZI
 EJtXFmdlmxAledUPxTSCIojO8m/tNOjYTreBJt9K1DXKY6UcgAdhk75TRLEsp38P
 fPvMigYQpBDnYz2pi9ourTgvZLkffK1OBZ46PPt8BgUZVf70D6CBg10vK47KO6N2
 zreloxkMTrz5XohyjfNjYFRkyyuwV2sSVrRJqF4dpyJ4NJQRjvyywxIP4Myifwlb
 ONipCM1EjvQjaEUbdcqKgvlooMdhcyxfshqJWjHzXB6BL22uPzq5jHXXugz8/ol8
 tOSM2FuJ2sBLQso+szhisxtMd11PihzIZK9BfxEG3du+/hlI+2XgN7hnmlXuA2k3
 BUW6BSDhab41HNd6pp50bDJnL0uKPWyFC6hqSNZw+GOIb46jfFcQqnCB3VZGCwj3
 LH53Be1XlUrttc/NrtkvVhm4bdxtfsp4F7nsPFNDuHvYNkalAVoC3An0BzOibtkh
 AtfvEeaPHaOyD8/h2Q==
 =zUUp
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull more overflow updates from Kees Cook:
 "The rest of the overflow changes for v4.18-rc1.

  This includes the explicit overflow fixes from Silvio, further
  struct_size() conversions from Matthew, and a bug fix from Dan.

  But the bulk of it is the treewide conversions to use either the
  2-factor argument allocators (e.g. kmalloc(a * b, ...) into
  kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
  b) into vmalloc(array_size(a, b)).

  Coccinelle was fighting me on several fronts, so I've done a bunch of
  manual whitespace updates in the patches as well.

  Summary:

   - Error path bug fix for overflow tests (Dan)

   - Additional struct_size() conversions (Matthew, Kees)

   - Explicitly reported overflow fixes (Silvio, Kees)

   - Add missing kvcalloc() function (Kees)

   - Treewide conversions of allocators to use either 2-factor argument
     variant when available, or array_size() and array3_size() as needed
     (Kees)"

* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
  treewide: Use array_size in f2fs_kvzalloc()
  treewide: Use array_size() in f2fs_kzalloc()
  treewide: Use array_size() in f2fs_kmalloc()
  treewide: Use array_size() in sock_kmalloc()
  treewide: Use array_size() in kvzalloc_node()
  treewide: Use array_size() in vzalloc_node()
  treewide: Use array_size() in vzalloc()
  treewide: Use array_size() in vmalloc()
  treewide: devm_kzalloc() -> devm_kcalloc()
  treewide: devm_kmalloc() -> devm_kmalloc_array()
  treewide: kvzalloc() -> kvcalloc()
  treewide: kvmalloc() -> kvmalloc_array()
  treewide: kzalloc_node() -> kcalloc_node()
  treewide: kzalloc() -> kcalloc()
  treewide: kmalloc() -> kmalloc_array()
  mm: Introduce kvcalloc()
  video: uvesafb: Fix integer overflow in allocation
  UBIFS: Fix potential integer overflow in allocation
  leds: Use struct_size() in allocation
  Convert intel uncore to struct_size
  ...
2018-06-12 18:28:00 -07:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
b357bf6023 Small update for KVM.
* ARM: lazy context-switching of FPSIMD registers on arm64, "split"
 regions for vGIC redistributor
 
 * s390: cleanups for nested, clock handling, crypto, storage keys and
 control register bits
 
 * x86: many bugfixes, implement more Hyper-V super powers,
 implement lapic_timer_advance_ns even when the LAPIC timer
 is emulated using the processor's VMX preemption timer.  Two
 security-related bugfixes at the top of the branch.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJbH8Z/AAoJEL/70l94x66DF+UIAJeOuTp6LGasT/9uAb2OovaN
 +5kGmOPGFwkTcmg8BQHI2fXT4vhxMXWPFcQnyig9eXJVxhuwluXDOH4P9IMay0yw
 VDCBsWRdMvZDQad2hn6Z5zR4Jx01XrSaG/KqvXbbDKDCy96mWG7SYAY2m3ZwmeQi
 3Pa3O3BTijr7hBYnMhdXGkSn4ZyU8uPaAgIJ8795YKeOJ2JmioGYk6fj6y2WCxA3
 ztJymBjTmIoZ/F8bjuVouIyP64xH4q9roAyw4rpu7vnbWGqx1fjPYJoB8yddluWF
 JqCPsPzhKDO7mjZJy+lfaxIlzz2BN7tKBNCm88s5GefGXgZwk3ByAq/0GQ2M3rk=
 =H5zI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small update for KVM:

  ARM:
   - lazy context-switching of FPSIMD registers on arm64
   - "split" regions for vGIC redistributor

  s390:
   - cleanups for nested
   - clock handling
   - crypto
   - storage keys
   - control register bits

  x86:
   - many bugfixes
   - implement more Hyper-V super powers
   - implement lapic_timer_advance_ns even when the LAPIC timer is
     emulated using the processor's VMX preemption timer.
   - two security-related bugfixes at the top of the branch"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits)
  kvm: fix typo in flag name
  kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
  KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
  KVM: x86: introduce linear_{read,write}_system
  kvm: nVMX: Enforce cpl=0 for VMX instructions
  kvm: nVMX: Add support for "VMWRITE to any supported field"
  kvm: nVMX: Restrict VMX capability MSR changes
  KVM: VMX: Optimize tscdeadline timer latency
  KVM: docs: nVMX: Remove known limitations as they do not exist now
  KVM: docs: mmu: KVM support exposing SLAT to guests
  kvm: no need to check return value of debugfs_create functions
  kvm: Make VM ioctl do valloc for some archs
  kvm: Change return type to vm_fault_t
  KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
  kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
  KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
  KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
  KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
  KVM: introduce kvm_make_vcpus_request_mask() API
  KVM: x86: hyperv: do rep check for each hypercall separately
  ...
2018-06-12 11:34:04 -07:00
Konrad Rzeszutek Wilk
6ac2f49edb x86/bugs: Add AMD's SPEC_CTRL MSR usage
The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that if CPUID 8000_0008.EBX[24] is set we should be using
the SPEC_CTRL MSR (0x48) over the VIRT SPEC_CTRL MSR (0xC001_011f)
for speculative store bypass disable.

This in effect means we should clear the X86_FEATURE_VIRT_SSBD
flag so that we would prefer the SPEC_CTRL MSR.

See the document titled:
   124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf

A copy of this document is available at
   https://bugzilla.kernel.org/show_bug.cgi?id=199889

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: andrew.cooper3@citrix.com
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20180601145921.9500-3-konrad.wilk@oracle.com
2018-06-06 14:13:16 +02:00
Marc Orr
d1e5b0e98e kvm: Make VM ioctl do valloc for some archs
The kvm struct has been bloating. For example, it's tens of kilo-bytes
for x86, which turns out to be a large amount of memory to allocate
contiguously via kzalloc. Thus, this patch does the following:
1. Uses architecture-specific routines to allocate the kvm struct via
   vzalloc for x86.
2. Switches arm to __KVM_HAVE_ARCH_VM_ALLOC so that it can use vzalloc
   when has_vhe() is true.

Other architectures continue to default to kalloc, as they have a
dependency on kalloc or have a small-enough struct kvm.

Signed-off-by: Marc Orr <marcorr@google.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-01 19:18:26 +02:00
Dan Carpenter
86bf20cb57 KVM: x86: prevent integer overflows in KVM_MEMORY_ENCRYPT_REG_REGION
This is a fix from reviewing the code, but it looks like it might be
able to lead to an Oops.  It affects 32bit systems.

The KVM_MEMORY_ENCRYPT_REG_REGION ioctl uses a u64 for range->addr and
range->size but the high 32 bits would be truncated away on a 32 bit
system.  This is harmless but it's also harmless to prevent it.

Then in sev_pin_memory() the "uaddr + ulen" calculation can wrap around.
The wrap around can happen on 32 bit or 64 bit systems, but I was only
able to figure out a problem for 32 bit systems.  We would pick a number
which results in "npages" being zero.  The sev_pin_memory() would then
return ZERO_SIZE_PTR without allocating anything.

I made it illegal to call sev_pin_memory() with "ulen" set to zero.
Hopefully, that doesn't cause any problems.  I also changed the type of
"first" and "last" to long, just for cosmetic reasons.  Otherwise on a
64 bit system you're saving "uaddr >> 12" in an int and it truncates the
high 20 bits away.  The math works in the current code so far as I can
see but it's just weird.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[Brijesh noted that the code is only reachable on X86_64.]
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-24 19:32:20 +02:00
Tom Lendacky
bc226f07dc KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using
speculative store bypass disable (SSBD) under SVM.  This will allow guests
to use SSBD on hardware that uses non-architectural mechanisms for enabling
SSBD.

[ tglx: Folded the migration fixup from Paolo Bonzini ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-17 17:09:21 +02:00
Thomas Gleixner
ccbcd26744 x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store
Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care
about the bit position of the SSBD bit and thus facilitate migration.
Also, the sibling coordination on Family 17H CPUs can only be done on
the host.

Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an
extra argument for the VIRT_SPEC_CTRL MSR.

Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU
data structure which is going to be used in later patches for the actual
implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-05-17 17:09:18 +02:00
Borislav Petkov
e7c587da12 x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
Intel and AMD have different CPUID bits hence for those use synthetic bits
which get set on the respective vendor's in init_speculation_control(). So
that debacles like what the commit message of

  c65732e4f7 ("x86/cpu: Restore CPUID_8000_0008_EBX reload")

talks about don't happen anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Jörg Otte <jrg.otte@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20180504161815.GG9257@pd.tnic
2018-05-17 17:09:16 +02:00
Thomas Gleixner
15e6c22fd8 KVM: SVM: Move spec control call after restore of GS
svm_vcpu_run() invokes x86_spec_ctrl_restore_host() after VMEXIT, but
before the host GS is restored. x86_spec_ctrl_restore_host() uses 'current'
to determine the host SSBD state of the thread. 'current' is GS based, but
host GS is not yet restored and the access causes a triple fault.

Move the call after the host GS restore.

Fixes: 885f82bfbc x86/process: Allow runtime control of Speculative Store Bypass
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-17 17:09:16 +02:00
Jim Mattson
8d860bbeed kvm: vmx: Basic APIC virtualization controls have three settings
Previously, we toggled between SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE
and SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES, depending on whether or
not the EXTD bit was set in MSR_IA32_APICBASE. However, if the local
APIC is disabled, we should not set either of these APIC
virtualization control bits.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-14 18:24:24 +02:00