Commit Graph

2305 Commits

Author SHA1 Message Date
Sean Christopherson
edce46548b KVM: x86: Replace .set_hflags() with dedicated .exiting_smm() helper
Replace the .set_hflags() emulator hook with a dedicated .exiting_smm(),
moving the SMM and SMM_INSIDE_NMI flag handling out of the emulator in
the process.  This is a step towards consolidating much of the logic in
kvm_smm_changed(), including the SMM hflags updates.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:34 -04:00
Sean Christopherson
25b17226cd KVM: x86: Emulate triple fault shutdown if RSM emulation fails
Use the recently introduced KVM_REQ_TRIPLE_FAULT to properly emulate
shutdown if RSM from SMM fails.

Note, entering shutdown after clearing the SMM flag and restoring NMI
blocking is architecturally correct with respect to AMD's APM, which KVM
also uses for SMRAM layout and RSM NMI blocking behavior.  The APM says:

  An RSM causes a processor shutdown if an invalid-state condition is
  found in the SMRAM state-save area. Only an external reset, external
  processor-initialization, or non-maskable external interrupt (NMI) can
  cause the processor to leave the shutdown state.

Of note is processor-initialization (INIT) as a valid shutdown wake
event, as INIT is blocked by SMM, implying that entering shutdown also
forces the CPU out of SMM.

For recent Intel CPUs, restoring NMI blocking is technically wrong, but
so is restoring NMI blocking in the first place, and Intel's RSM
"architecture" is such a mess that just about anything is allowed and can
be justified as micro-architectural behavior.

Per the SDM:

  On Pentium 4 and later processors, shutdown will inhibit INTR and A20M
  but will not change any of the other inhibits. On these processors,
  NMIs will be inhibited if no action is taken in the SMI handler to
  uninhibit them (see Section 34.8).

where Section 34.8 says:

  When the processor enters SMM while executing an NMI handler, the
  processor saves the SMRAM state save map but does not save the
  attribute to keep NMI interrupts disabled. Potentially, an NMI could be
  latched (while in SMM or upon exit) and serviced upon exit of SMM even
  though the previous NMI handler has still not completed.

I.e. RSM unconditionally unblocks NMI, but shutdown on RSM does not,
which is in direct contradiction of KVM's behavior.  But, as mentioned
above, KVM follows AMD architecture and restores NMI blocking on RSM, so
that micro-architectural detail is already lost.

And for Pentium era CPUs, SMI# can break shutdown, meaning that at least
some Intel CPUs fully leave SMM when entering shutdown:

  In the shutdown state, Intel processors stop executing instructions
  until a RESET#, INIT# or NMI# is asserted.  While Pentium family
  processors recognize the SMI# signal in shutdown state, P6 family and
  Intel486 processors do not.

In other words, the fact that Intel CPUs have implemented the two
extremes gives KVM carte blanche when it comes to honoring Intel's
architecture for handling shutdown during RSM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-3-seanjc@google.com>
[Return X86EMUL_CONTINUE after triple fault. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov
4651fc56ba KVM: x86: Drop vendor specific functions for APICv/AVIC enablement
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov
fdf513e37a KVM: x86: Use common 'enable_apicv' variable for both APICv and AVIC
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Sergey Senozhatsky
7d62874f69 kvm: x86: implement KVM PM-notifier
Implement PM hibernation/suspend prepare notifiers so that KVM
can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly
suspend VMs.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Message-Id: <20210606021045.14159-2-senozhatsky@chromium.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Jim Mattson
4fe09bcf14 KVM: x86: Add a return code to kvm_apic_accept_events
No functional change intended. At present, the only negative value
returned by kvm_check_nested_events is -EBUSY.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-6-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:31 -04:00
Jim Mattson
a5f6909a71 KVM: x86: Add a return code to inject_pending_event
No functional change intended. At present, 'r' will always be -EBUSY
on a control transfer to the 'out' label.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-5-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:30 -04:00
Jim Mattson
d82ee28195 KVM: x86: Remove guest mode check from kvm_check_nested_events
A survey of the callsites reveals that they all ensure the vCPU is in
guest mode before calling kvm_check_nested_events. Remove this dead
code so that the only negative value this function returns (at the
moment) is -EBUSY.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-2-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:30 -04:00
Ilias Stamatis
1ab9287add KVM: X86: Add vendor callbacks for writing the TSC multiplier
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.

Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
edcfe54058 KVM: X86: Move write_l1_tsc_offset() logic to common code and rename it
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.

Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
83150f2932 KVM: X86: Add functions that calculate the nested TSC fields
When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 and 12 together.

The merging is done using the following equations:
  offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
  mult_02 = (mult_01 * mult_12) >> shift_bits

Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-8-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
fe3eb50418 KVM: X86: Add a ratio parameter to kvm_scale_tsc()
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-5-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
9b399dfd4c KVM: X86: Rename kvm_compute_tsc_offset() to kvm_compute_l1_tsc_offset()
All existing code uses kvm_compute_tsc_offset() passing L1 TSC values to
it. Let's document this by renaming it to kvm_compute_l1_tsc_offset().

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-4-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
805d705ff8 KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Ben Gardon
d501f747ef KVM: x86/mmu: Lazily allocate memslot rmaps
If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used. (i.e. a nested VM is launched.) This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Ben Gardon
e2209710cc KVM: x86/mmu: Skip rmap operations if rmaps not allocated
If only the TDP MMU is being used to manage the memory mappings for a VM,
then many rmap operations can be skipped as they are guaranteed to be
no-ops. This saves some time which would be spent on the rmap operation.
It also avoids acquiring the MMU lock in write mode for many operations.

This makes it safe to run the VM without rmaps allocated, when only
using the TDP MMU and sets the stage for waiting to allocate the rmaps
until they're needed.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Ben Gardon
a255740876 KVM: x86/mmu: Add a field to control memslot rmap allocation
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
oprtation.

No functional change expected.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Ben Gardon
56dd1019c8 KVM: x86/mmu: Factor out allocating memslot rmap
Small refactor to facilitate allocating rmaps for all memslots at once.

No functional change expected.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-3-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:25 -04:00
Ben Gardon
c9b929b3fa KVM: x86/mmu: Deduplicate rmap freeing
Small code deduplication. No functional change expected.

Reviewed-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:25 -04:00
Keqian Zhu
8921291980 KVM: x86: Do not write protect huge page in initially-all-set mode
Currently, when dirty logging is started in initially-all-set mode,
we write protect huge pages to prepare for splitting them into
4K pages, and leave normal pages untouched as the logging will
be enabled lazily as dirty bits are cleared.

However, enabling dirty logging lazily is also feasible for huge pages.
This not only reduces the time of start dirty logging, but it also
greatly reduces side-effect on guest when there is high dirty rate.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210429034115.35560-3-zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:25 -04:00
Wanpeng Li
dfdc0a714d KVM: X86: Fix x86_emulator slab cache leak
Commit c9b8b07cde (KVM: x86: Dynamically allocate per-vCPU emulation context)
tries to allocate per-vCPU emulation context dynamically, however, the
x86_emulator slab cache is still exiting after the kvm module is unload
as below after destroying the VM and unloading the kvm module.

grep x86_emulator /proc/slabinfo
x86_emulator          36     36   2672   12    8 : tunables    0    0    0 : slabdata      3      3      0

This patch fixes this slab cache leak by destroying the x86_emulator slab cache
when the kvm module is unloaded.

Fixes: c9b8b07cde (KVM: x86: Dynamically allocate per-vCPU emulation context)
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-11 11:53:48 -04:00
Sean Christopherson
78fcb2c91a KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
Immediately reset the MMU context when the vCPU's SMM flag is cleared so
that the SMM flag in the MMU role is always synchronized with the vCPU's
flag.  If RSM fails (which isn't correctly emulated), KVM will bail
without calling post_leave_smm() and leave the MMU in a bad state.

The bad MMU role can lead to a NULL pointer dereference when grabbing a
shadow page's rmap for a page fault as the initial lookups for the gfn
will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
an entirely different set of memslots, and so the initial lookup can find
a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).

  general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
  RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
  Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
  RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
  RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
  R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
  R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
  FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
   mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
   __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
   direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
   kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
   kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
   vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
   vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
   vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
   kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
   kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:1069 [inline]
   __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
   do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x440ce9

Cc: stable@vger.kernel.org
Reported-by: syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com
Fixes: 9ec19493fb ("KVM: x86: clear SMM flags before loading state while leaving SMM")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-10 09:21:12 -04:00
Linus Torvalds
2f673816b2 Bugfixes, including a TLB flush fix that affects processors
without nested page tables.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDAVpQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkOgf9F97eFxAdod3/wbW9EbsUPR5bMTLE
 +R6Hmvw+yCm/W2cycVGdCSh1BEKNuZN/XfHln2cYVfVr6ndog58A4Y0urFAhTROv
 IHs8TCA5biQitoZ716l88ExOitnqJiSmMhGex969+zm1Lb9MQo1KA/zxERlqCi3s
 Pfcxb6I8VbD9LEb6NaQdDgQoslJo1tzhe9gGYAYrpMOZujpj1RPeIOZIfeII0MP/
 g14/JSar8cXc9QJ6zbiKn8HhpmzGJnaIsyFFL2RMIBlKvxsnpOU6VmisLTL9407o
 P246Vq59BM8pdRCVUW9W9hLr2ho8lmi+ZYXASCm+qfn8cLaHyRCqSK56ZQ==
 =nW43
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Bugfixes, including a TLB flush fix that affects processors without
  nested page tables"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: fix previous commit for 32-bit builds
  kvm: avoid speculation-based attacks from out-of-range memslot accesses
  KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
  KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
  selftests: kvm: Add support for customized slot0 memory size
  KVM: selftests: introduce P47V64 for s390x
  KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
  KVM: X86: MMU: Use the correct inherited permissions to get shadow page
  KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer
  KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca821c
2021-06-09 13:09:57 -07:00
Lai Jiangshan
b53e84eed0 KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
When using shadow paging, unload the guest MMU when emulating a guest TLB
flush to ensure all roots are synchronized.  From the guest's perspective,
flushing the TLB ensures any and all modifications to its PTEs will be
recognized by the CPU.

Note, unloading the MMU is overkill, but is done to mirror KVM's existing
handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
can be done to more precisely synchronize roots when servicing a guest
TLB flush.

If TDP is enabled, synchronizing the MMU is unnecessary even if nested
TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
guest-physical mappings; for NPT, guest mappings are always tagged with
an ASID and thus can only be invalidated via the VMCB's ASID control.

This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
It was only recently exposed after Linux guests stopped flushing the
local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac,
"x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
visible in Windows 10 guests.

Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: f38a7b7526 ("KVM: X86: support paravirtualized help for TLB shootdowns")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
[sean: massaged comment and changelog]
Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 17:10:21 -04:00
Lai Jiangshan
af3511ff7f KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
In record_steal_time(), st->preempted is read twice, and
trace_kvm_pv_tlb_flush() might output result inconsistent if
kvm_vcpu_flush_tlb_guest() see a different st->preempted later.

It is a very trivial problem and hardly has actual harm and can be
avoided by reseting and reading st->preempted in atomic way via xchg().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 13:15:20 -04:00
Linus Torvalds
224478289c ARM fixes:
* Another state update on exit to userspace fix
 
 * Prevent the creation of mixed 32/64 VMs
 
 * Fix regression with irqbypass not restarting the guest on failed connect
 
 * Fix regression with debug register decoding resulting in overlapping access
 
 * Commit exception state on exit to usrspace
 
 * Fix the MMU notifier return values
 
 * Add missing 'static' qualifiers in the new host stage-2 code
 
 x86 fixes:
 * fix guest missed wakeup with assigned devices
 
 * fix WARN reported by syzkaller
 
 * do not use BIT() in UAPI headers
 
 * make the kvm_amd.avic parameter bool
 
 PPC fixes:
 * make halt polling heuristics consistent with other architectures
 
 selftests:
 * various fixes
 
 * new performance selftest memslot_perf_test
 
 * test UFFD minor faults in demand_paging_test
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCyF0MUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOHSgf/Q4Hm5e12Bj2xJy6A+iShnrbbT8PW
 hcIIOA7zGWXfjVYcBV7anbj7CcpzfIz0otcRBABa5mkhj+fb3YmPEb0EzCPi4Hru
 zxpcpB2w7W7WtUOIKe2EmaT+4Pk6/iLcfr8UMHMqx460akE9OmIg10QNWai3My/3
 RIOeakSckBI9e/1TQZbxH66dsLwCT0lLco7i7AWHdFxkzUQyoA34HX5pczOCBsO5
 3nXH+/txnRVhqlcyzWLVVGVzFqmpHtBqkIInDOXfUqIoxo/gOhOgF1QdMUEKomxn
 5ZFXlL5IXNtr+7yiI67iHX7CWkGZE9oJ04TgPHn6LR6wRnVvc3JInzcB5Q==
 =ollO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM fixes:

   - Another state update on exit to userspace fix

   - Prevent the creation of mixed 32/64 VMs

   - Fix regression with irqbypass not restarting the guest on failed
     connect

   - Fix regression with debug register decoding resulting in
     overlapping access

   - Commit exception state on exit to usrspace

   - Fix the MMU notifier return values

   - Add missing 'static' qualifiers in the new host stage-2 code

  x86 fixes:

   - fix guest missed wakeup with assigned devices

   - fix WARN reported by syzkaller

   - do not use BIT() in UAPI headers

   - make the kvm_amd.avic parameter bool

  PPC fixes:

   - make halt polling heuristics consistent with other architectures

  selftests:

   - various fixes

   - new performance selftest memslot_perf_test

   - test UFFD minor faults in demand_paging_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
  selftests: kvm: fix overlapping addresses in memslot_perf_test
  KVM: X86: Kill off ctxt->ud
  KVM: X86: Fix warning caused by stale emulation context
  KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
  KVM: x86/mmu: Fix comment mentioning skip_4k
  KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
  KVM: x86: add start_assignment hook to kvm_x86_ops
  KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
  selftests: kvm: do only 1 memslot_perf_test run by default
  KVM: X86: Use _BITUL() macro in UAPI headers
  KVM: selftests: add shared hugetlbfs backing source type
  KVM: selftests: allow using UFFD minor faults for demand paging
  KVM: selftests: create alias mappings when using shared memory
  KVM: selftests: add shmem backing source type
  KVM: selftests: refactor vm_mem_backing_src_type flags
  KVM: selftests: allow different backing source types
  KVM: selftests: compute correct demand paging size
  KVM: selftests: simplify setup_demand_paging error handling
  KVM: selftests: Print a message if /dev/kvm is missing
  ...
2021-05-29 06:02:25 -10:00
Wanpeng Li
b35491e66c KVM: X86: Kill off ctxt->ud
ctxt->ud is consumed only by x86_decode_insn(), we can kill it off by
passing emulation_type to x86_decode_insn() and dropping ctxt->ud
altogether. Tracking that info in ctxt for literally one call is silly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
2021-05-28 12:59:10 -04:00
Wanpeng Li
da6393cdd8 KVM: X86: Fix warning caused by stale emulation context
Reported by syzkaller:

  WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
  RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
  Call Trace:
   kvm_mmu_page_fault+0x126/0x8f0 [kvm]
   vmx_handle_exit+0x11e/0x680 [kvm_intel]
   vcpu_enter_guest+0xd95/0x1b40 [kvm]
   kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
   kvm_vcpu_ioctl+0x389/0x630 [kvm]
   __x64_sys_ioctl+0x8e/0xd0
   do_syscall_64+0x3c/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Commit 4a1e10d5b5 ("KVM: x86: handle hardware breakpoints during emulation())
adds hardware breakpoints check before emulation the instruction and parts of
emulation context initialization, actually we don't have the EMULTYPE_NO_DECODE flag
here and the emulation context will not be reused. Commit c8848cee74 ("KVM: x86:
set ctxt->have_exception in x86_decode_insn()) triggers the warning because it
catches the stale emulation context has #UD, however, it is not during instruction
decoding which should result in EMULATION_FAILED. This patch fixes it by moving
the second part emulation context initialization into init_emulate_ctxt() and
before hardware breakpoints check. The ctxt->ud will be dropped by a follow-up
patch.

syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000

Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com
Fixes: 4a1e10d5b5 (KVM: x86: handle hardware breakpoints during emulation)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
2021-05-28 12:59:09 -04:00
Yuan Yao
e87e46d5f3 KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
The kvm_get_linear_rip() handles x86/long mode cases well and has
better readability, __kvm_set_rflags() also use the paired
function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip
set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the
"CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and
handle_exception_nmi() to this one.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-28 12:57:53 -04:00
Marcelo Tosatti
084071d5e9 KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
KVM_REQ_UNBLOCK will be used to exit a vcpu from
its inner vcpu halt emulation loop.

Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
PowerPC to arch specific request bit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:57:38 -04:00
Marcelo Tosatti
57ab87947a KVM: x86: add start_assignment hook to kvm_x86_ops
Add a start_assignment hook to kvm_x86_ops, which is called when
kvm_arch_start_assignment is done.

The hook is required to update the wakeup vector of a sleeping vCPU
when a device is assigned to the guest.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20210525134321.254128742@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:50:13 -04:00
Wanpeng Li
1eff0ada88 KVM: X86: Fix vCPU preempted state from guest's point of view
Commit 66570e966d (kvm: x86: only provide PV features if enabled in guest's
CPUID) avoids to access pv tlb shootdown host side logic when this pv feature
is not exposed to guest, however, kvm_steal_time.preempted not only leveraged
by pv tlb shootdown logic but also mitigate the lock holder preemption issue.
From guest's point of view, vCPU is always preempted since we lose the reset
of kvm_steal_time.preempted before vmentry if pv tlb shootdown feature is not
exposed. This patch fixes it by clearing kvm_steal_time.preempted before
vmentry.

Fixes: 66570e966d (kvm: x86: only provide PV features if enabled in guest's CPUID)
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:45:53 -04:00
Wanpeng Li
72b268a8e9 KVM: X86: Bail out of direct yield in case of under-committed scenarios
In case of under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to adds extra overhead, and it is also common to see
when vcpu->ready is true but yield later failing due to p->state is
TASK_RUNNING.

Let's bail out in such scenarios by checking the length of current cpu
runqueue, which can be treated as a hint of under-committed instead of
guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
the expensive lookups in kvm_sched_yield() in an under-committed scenario.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:45:52 -04:00
Paolo Bonzini
a4345a7cec KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
 - Fix regression with debug register decoding resulting in overlapping access
 - Commit exception state on exit to usrspace
 - Fix the MMU notifier return values
 - Add missing 'static' qualifiers in the new host stage-2 code
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
 1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
 51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
 9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
 RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
 lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
 oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
 kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
 ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
 0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
 lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
 sP2LC4iv
 =57tg
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.13, take #1

- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
2021-05-17 09:55:12 +02:00
Linus Torvalds
ccb013c29d - Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCg1XQACgkQEsHwGGHe
 VUpRKA//dwzDD1QU16JucfhgFlv/9OTm48ukSwAb9lZjDEy4H1CtVL3xEHFd7L3G
 LJp0LTW+OQf0/0aGlQp/cP6sBF6G9Bf4mydx70Id4SyCQt8eZDodB+ZOOWbeteWq
 p92fJPbX8CzAglutbE+3v/MD8CCAllTiLZnJZPVj4Kux2/wF6EryDgF1+rb5q8jp
 ObTT9817mHVwWVUYzbgceZtd43IocOlKZRmF1qivwScMGylQTe1wfMjunpD5pVt8
 Zg4UDNknNfYduqpaG546E6e1zerGNaJK7SHnsuzHRUVU5icNqtgBk061CehP9Ksq
 DvYXLUl4xF16j6xJAqIZPNrBkJGdQf4q1g5x2FiBm7rSQU5owzqh5rkVk4EBFFzn
 UtzeXpqbStbsZHXycyxBNdq2HXxkFPf2NXZ+bkripPg+DifOGots1uwvAft+6iAE
 GudK6qxAvr8phR1cRyy6BahGtgOStXbZYEz0ZdU6t7qFfZMz+DomD5Jimj0kAe6B
 s6ras5xm8q3/Py87N/KNjKtSEpgsHv/7F+idde7ODtHhpRL5HCBqhkZOSRkMMZqI
 ptX1oSTvBXwRKyi5x9YhkKHUFqfFSUTfJhiRFCWK+IEAv3Y7SipJtfkqxRbI6fEV
 FfCeueKDDdViBtseaRceVLJ8Tlr6Qjy27fkPPTqJpthqPpCdoZ0=
 =ENfF
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "The three SEV commits are not really urgent material. But we figured
  since getting them in now will avoid a huge amount of conflicts
  between future SEV changes touching tip, the kvm and probably other
  trees, sending them to you now would be best.

  The idea is that the tip, kvm etc branches for 5.14 will all base
  ontop of -rc2 and thus everything will be peachy. What is more, those
  changes are purely mechanical and defines movement so they should be
  fine to go now (famous last words).

  Summary:

   - Enable -Wundef for the compressed kernel build stage

   - Reorganize SEV code to streamline and simplify future development"

* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/compressed: Enable -Wundef
  x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
  x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
  x86/sev-es: Rename sev-es.{ch} to sev.{ch}
2021-05-16 09:31:06 -07:00
Linus Torvalds
0aa099a312 * Lots of bug fixes.
* Fix virtualization of RDPID
 
 * Virtualization of DR6_BUS_LOCK, which on bare metal is new in
   the 5.13 merge window
 
 * More nested virtualization migration fixes (nSVM and eVMCS)
 
 * Fix for KVM guest hibernation
 
 * Fix for warning in SEV-ES SRCU usage
 
 * Block KVM from loading on AMD machines with 5-level page tables,
   due to the APM not mentioning how host CR4.LA57 exactly impacts
   the guest.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCZWwgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE9wgAk7Io8cuvnhC9ogVqzZWrPweWqFg8
 fJcPMB584JRnMqYHBVYbkTPGe8SsCHKR2MKsNdc4cEP111cyr3suWsxOdmjJn58i
 7ahy6PcKx7wWeWwEt7O599l6CeoX5XB9ExvA6eiXAv7iZeOJHFa+Ny2GlWgauy6Y
 DELryEomx1r4IUkZaSR+2fYjzvOWTXQixwU/jwx8NcTJz0DrzknzLE7XOciPBfn0
 t0Q2rCXdL2nF1uPksZbntx8Qoa6t6GDVIyrH/ZCPQYJtAX6cjxNAh3zwCe+hMnOd
 fW8ntBH1nZRiNnberA4IICAzqnUokgPWdKBrZT2ntWHBK+aqxXHznrlPJA==
 =e+gD
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Lots of bug fixes.

 - Fix virtualization of RDPID

 - Virtualization of DR6_BUS_LOCK, which on bare metal is new to this
   release

 - More nested virtualization migration fixes (nSVM and eVMCS)

 - Fix for KVM guest hibernation

 - Fix for warning in SEV-ES SRCU usage

 - Block KVM from loading on AMD machines with 5-level page tables, due
   to the APM not mentioning how host CR4.LA57 exactly impacts the
   guest.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (48 commits)
  KVM: SVM: Move GHCB unmapping to fix RCU warning
  KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers
  kvm: Cap halt polling at kvm->max_halt_poll_ns
  tools/kvm_stat: Fix documentation typo
  KVM: x86: Prevent deadlock against tk_core.seq
  KVM: x86: Cancel pvclock_gtod_work on module removal
  KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
  KVM: X86: Expose bus lock debug exception to guest
  KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
  KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
  KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed
  KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
  KVM: x86: Move uret MSR slot management to common x86
  KVM: x86: Export the number of uret MSRs to vendor modules
  KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
  KVM: VMX: Use common x86's uret MSR list as the one true list
  KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
  KVM: VMX: Configure list of user return MSRs at module init
  KVM: x86: Add support for RDPID without RDTSCP
  KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in host
  ...
2021-05-10 12:30:45 -07:00
Brijesh Singh
059e5c321a x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
2021-05-10 07:51:38 +02:00
Thomas Gleixner
3f804f6d20 KVM: x86: Prevent deadlock against tk_core.seq
syzbot reported a possible deadlock in pvclock_gtod_notify():

CPU 0  		  	   	    	    CPU 1
write_seqcount_begin(&tk_core.seq);
  pvclock_gtod_notify()			    spin_lock(&pool->lock);
    queue_work(..., &pvclock_gtod_work)	    ktime_get()
     spin_lock(&pool->lock);		      do {
     						seq = read_seqcount_begin(tk_core.seq)
						...
				              } while (read_seqcount_retry(&tk_core.seq, seq);

While this is unlikely to happen, it's possible.

Delegate queue_work() to irq_work() which postpones it until the
tk_core.seq write held region is left and interrupts are reenabled.

Fixes: 16e8d74d2d ("KVM: x86: notifier for clocksource changes")
Reported-by: syzbot+6beae4000559d41d80f8@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87h7jgm1zy.ffs@nanos.tec.linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:21 -04:00
Thomas Gleixner
594b27e677 KVM: x86: Cancel pvclock_gtod_work on module removal
Nothing prevents the following:

  pvclock_gtod_notify()
    queue_work(system_long_wq, &pvclock_gtod_work);
  ...
  remove_module(kvm);
  ...
  work_queue_run()
    pvclock_gtod_work()	<- UAF

Ditto for any other operation on that workqueue list head which touches
pvclock_gtod_work after module removal.

Cancel the work in kvm_arch_exit() to prevent that.

Fixes: 16e8d74d2d ("KVM: x86: notifier for clocksource changes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87czu4onry.ffs@nanos.tec.linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:21 -04:00
Chenyi Qiang
e8ea85fb28 KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
Bus lock debug exception introduces a new bit DR6_BUS_LOCK (bit 11 of
DR6) to indicate that bus lock #DB exception is generated. The set/clear
of DR6_BUS_LOCK is similar to the DR6_RTM. The processor clears
DR6_BUS_LOCK when the exception is generated. For all other #DB, the
processor sets this bit to 1. Software #DB handler should set this bit
before returning to the interrupted task.

In VMM, to avoid breaking the CPUs without bus lock #DB exception
support, activate the DR6_BUS_LOCK conditionally in DR6_FIXED_1 bits.
When intercepting the #DB exception caused by bus locks, bit 11 of the
exit qualification is set to identify it. The VMM should emulate the
exception by clearing the bit 11 of the guest DR6.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:20 -04:00
Sean Christopherson
61a05d444d KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior.  While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.

Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX.  On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:19 -04:00
Sean Christopherson
e5fda4bbad KVM: x86: Move uret MSR slot management to common x86
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:19 -04:00
Sean Christopherson
9cc39a5a43 KVM: x86: Export the number of uret MSRs to vendor modules
Split out and export the number of configured user return MSRs so that
VMX can iterate over the set of MSRs without having to do its own tracking.
Keep the list itself internal to x86 so that vendor code still has to go
through the "official" APIs to add/modify entries.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
Sean Christopherson
8ea8b8d6f8 KVM: VMX: Use common x86's uret MSR list as the one true list
Drop VMX's global list of user return MSRs now that VMX doesn't resort said
list to isolate "active" MSRs, i.e. now that VMX's list and x86's list have
the same MSRs in the same order.

In addition to eliminating the redundant list, this will also allow moving
more of the list management into common x86.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
Sean Christopherson
36fa06f9ff KVM: x86: Add support for RDPID without RDTSCP
Allow userspace to enable RDPID for a guest without also enabling RDTSCP.
Aside from checking for RDPID support in the obvious flows, VMX also needs
to set ENABLE_RDTSCP=1 when RDPID is exposed.

For the record, there is no known scenario where enabling RDPID without
RDTSCP is desirable.  But, both AMD and Intel architectures allow for the
condition, i.e. this is purely to make KVM more architecturally accurate.

Fixes: 41cd02c6f7 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID")
Cc: stable@vger.kernel.org
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:17 -04:00
Sean Christopherson
5104d7ffcf KVM: VMX: Disable preemption when probing user return MSRs
Disable preemption when probing a user return MSR via RDSMR/WRMSR.  If
the MSR holds a different value per logical CPU, the WRMSR could corrupt
the host's value if KVM is preempted between the RDMSR and WRMSR, and
then rescheduled on a different CPU.

Opportunistically land the helper in common x86, SVM will use the helper
in a future commit.

Fixes: 4be5341026 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation")
Cc: stable@vger.kernel.org
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-6-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:16 -04:00
Siddharth Chandrasekaran
aca352886e KVM: x86: Hoist input checks in kvm_add_msr_filter()
In ioctl KVM_X86_SET_MSR_FILTER, input from user space is validated
after a memdup_user(). For invalid inputs we'd memdup and then call
kfree unnecessarily. Hoist input validation to avoid kfree altogether.

Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <20210503122111.13775-1-sidcha@amazon.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:14 -04:00
Wanpeng Li
1604571401 KVM: x86: Defer vtime accounting 'til after IRQ handling
Defer the call to account guest time until after servicing any IRQ(s)
that happened in the guest or immediately after VM-Exit.  Tick-based
accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
handler runs, and IRQs are blocked throughout the main sequence of
vcpu_enter_guest(), including the call into vendor code to actually
enter and exit the guest.

This fixes a bug where reported guest time remains '0', even when
running an infinite loop in the guest:

  https://bugzilla.kernel.org/show_bug.cgi?id=209831

Fixes: 87fa7f3e98 ("x86/kvm: Move context tracking where it belongs")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210505002735.1684165-4-seanjc@google.com
2021-05-05 22:54:11 +02:00
Linus Torvalds
152d32aa84 ARM:
- Stage-2 isolation for the host kernel when running in protected mode
 
 - Guest SVE support when running in nVHE mode
 
 - Force W^X hypervisor mappings in nVHE mode
 
 - ITS save/restore for guests using direct injection with GICv4.1
 
 - nVHE panics now produce readable backtraces
 
 - Guest support for PTP using the ptp_kvm driver
 
 - Performance improvements in the S2 fault handler
 
 x86:
 
 - Optimizations and cleanup of nested SVM code
 
 - AMD: Support for virtual SPEC_CTRL
 
 - Optimizations of the new MMU code: fast invalidation,
   zap under read lock, enable/disably dirty page logging under
   read lock
 
 - /dev/kvm API for AMD SEV live migration (guest API coming soon)
 
 - support SEV virtual machines sharing the same encryption context
 
 - support SGX in virtual machines
 
 - add a few more statistics
 
 - improved directed yield heuristics
 
 - Lots and lots of cleanups
 
 Generic:
 
 - Rework of MMU notifier interface, simplifying and optimizing
 the architecture-specific code
 
 - Some selftests improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
 y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
 c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
 Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
 +2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
 M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
 =AXUi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This is a large update by KVM standards, including AMD PSP (Platform
  Security Processor, aka "AMD Secure Technology") and ARM CoreSight
  (debug and trace) changes.

  ARM:

   - CoreSight: Add support for ETE and TRBE

   - Stage-2 isolation for the host kernel when running in protected
     mode

   - Guest SVE support when running in nVHE mode

   - Force W^X hypervisor mappings in nVHE mode

   - ITS save/restore for guests using direct injection with GICv4.1

   - nVHE panics now produce readable backtraces

   - Guest support for PTP using the ptp_kvm driver

   - Performance improvements in the S2 fault handler

  x86:

   - AMD PSP driver changes

   - Optimizations and cleanup of nested SVM code

   - AMD: Support for virtual SPEC_CTRL

   - Optimizations of the new MMU code: fast invalidation, zap under
     read lock, enable/disably dirty page logging under read lock

   - /dev/kvm API for AMD SEV live migration (guest API coming soon)

   - support SEV virtual machines sharing the same encryption context

   - support SGX in virtual machines

   - add a few more statistics

   - improved directed yield heuristics

   - Lots and lots of cleanups

  Generic:

   - Rework of MMU notifier interface, simplifying and optimizing the
     architecture-specific code

   - a handful of "Get rid of oprofile leftovers" patches

   - Some selftests improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
  KVM: selftests: Speed up set_memory_region_test
  selftests: kvm: Fix the check of return value
  KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
  KVM: SVM: Skip SEV cache flush if no ASIDs have been used
  KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
  KVM: SVM: Drop redundant svm_sev_enabled() helper
  KVM: SVM: Move SEV VMCB tracking allocation to sev.c
  KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
  KVM: SVM: Unconditionally invoke sev_hardware_teardown()
  KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
  KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
  KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
  KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
  KVM: SVM: Move SEV module params/variables to sev.c
  KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
  KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
  KVM: SVM: Zero out the VMCB array used to track SEV ASID association
  x86/sev: Drop redundant and potentially misleading 'sev_enabled'
  KVM: x86: Move reverse CPUID helpers to separate header file
  KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
  ...
2021-05-01 10:14:08 -07:00
Linus Torvalds
ea5bc7b977 Trivial cleanups and fixes all over the place.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
 VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
 +HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
 /uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
 4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
 lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
 LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
 sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
 kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
 dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
 wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
 Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
 =bO1i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 cleanups from Borislav Petkov:
 "Trivial cleanups and fixes all over the place"

* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Remove me from IDE/ATAPI section
  x86/pat: Do not compile stubbed functions when X86_PAT is off
  x86/asm: Ensure asm/proto.h can be included stand-alone
  x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
  x86/msr: Make locally used functions static
  x86/cacheinfo: Remove unneeded dead-store initialization
  x86/process/64: Move cpu_current_top_of_stack out of TSS
  tools/turbostat: Unmark non-kernel-doc comment
  x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
  x86/fpu/math-emu: Fix function cast warning
  x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
  x86: Fix various typos in comments, take #2
  x86: Remove unusual Unicode characters from comments
  x86/kaslr: Return boolean values from a function returning bool
  x86: Fix various typos in comments
  x86/setup: Remove unused RESERVE_BRK_ARRAY()
  stacktrace: Move documentation for arch_stack_walk_reliable() to header
  x86: Remove duplicate TSC DEADLINE MSR definitions
2021-04-26 09:25:47 -07:00
Haiwei Li
10dbdf98ac KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
`kvm_arch_dy_runnable` checks the pending_interrupt as the code in
`kvm_arch_dy_has_pending_interrupt`. So take advantage of it.

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20210421032513.1921-1-lihaiwei.kernel@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:18 -04:00
Sean Christopherson
27b4a9c454 KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
Append raw to the direct variants of kvm_register_read/write(), and
drop the "l" from the mode-aware variants.  I.e. make the mode-aware
variants the default, and make the direct variants scary sounding so as
to discourage use.  Accessing the full 64-bit values irrespective of
mode is rarely the desired behavior.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:13 -04:00
Sean Christopherson
886bbcc7a5 KVM: x86: Check CR3 GPA for validity regardless of vCPU mode
Check CR3 for an invalid GPA even if the vCPU isn't in long mode.  For
bigger emulation flows, notably RSM, the vCPU mode may not be accurate
if CR0/CR4 are loaded after CR3.  For MOV CR3 and similar flows, the
caller is responsible for truncating the value.

Fixes: 660a5d517a ("KVM: x86: save/load state on SMM switch")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:11 -04:00
Wanpeng Li
b86bb11e3a KVM: X86: Fix failure to boost kernel lock holder candidate in SEV-ES guests
Commit f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under
SEV-ES") prevents hypervisor accesses guest register state when the guest is
running under SEV-ES. The initial value of vcpu->arch.guest_state_protected
is false, it will not be updated in preemption notifiers after this commit which
means that the kernel spinlock lock holder will always be skipped to boost. Let's
fix it by always treating preempted is in the guest kernel mode, false positive
is better than skip completely.

Fixes: f1c6366e30 (KVM: SVM: Add required changes to support intercepts under SEV-ES)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1619080459-30032-1-git-send-email-wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:19:28 -04:00
Vitaly Kuznetsov
2f15d027c0 KVM: x86: Properly handle APF vs disabled LAPIC situation
Async PF 'page ready' event may happen when LAPIC is (temporary) disabled.
In particular, Sebastien reports that when Linux kernel is directly booted
by Cloud Hypervisor, LAPIC is 'software disabled' when APF mechanism is
initialized. On initialization KVM tries to inject 'wakeup all' event and
puts the corresponding token to the slot. It is, however, failing to inject
an interrupt (kvm_apic_set_irq() -> __apic_accept_irq() -> !apic_enabled())
so the guest never gets notified and the whole APF mechanism gets stuck.
The same issue is likely to happen if the guest temporary disables LAPIC
and a previously unavailable page becomes available.

Do two things to resolve the issue:
- Avoid dequeuing 'page ready' events from APF queue when LAPIC is
  disabled.
- Trigger an attempt to deliver pending 'page ready' events when LAPIC
  becomes enabled (SPIV or MSR_IA32_APICBASE).

Reported-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210422092948.568327-1-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:19:27 -04:00
Wanpeng Li
9c1a07442c KVM: x86/xen: Take srcu lock when accessing kvm_memslots()
kvm_memslots() will be called by kvm_write_guest_offset_cached() so we should
take the srcu lock. Let's pull the srcu lock operation from kvm_steal_time_set_preempted()
again to fix xen part.

Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1619166200-9215-1-git-send-email-wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-23 17:00:50 -04:00
Paolo Bonzini
fd49e8ee70 Merge branch 'kvm-sev-cgroup' into HEAD 2021-04-22 13:19:01 -04:00
Wanpeng Li
52acd22faa KVM: Boost vCPU candidate in user mode which is delivering interrupt
Both lock holder vCPU and IPI receiver that has halted are condidate for
boost. However, the PLE handler was originally designed to deal with the
lock holder preemption problem. The Intel PLE occurs when the spinlock
waiter is in kernel mode. This assumption doesn't hold for IPI receiver,
they can be in either kernel or user mode. the vCPU candidate in user mode
will not be boosted even if they should respond to IPIs. Some benchmarks
like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most
of the time they are running in user mode. It can lead to a large number
of continuous PLE events because the IPI sender causes PLE events
repeatedly until the receiver is scheduled while the receiver is not
candidate for a boost.

This patch boosts the vCPU candidiate in user mode which is delivery
interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs
VM in over-subscribe scenario (The host machine is 2 socket, 48 cores,
96 HTs Intel CLX box). There is no performance regression for other
benchmarks like Unixbench spawn (most of the time contend read/write
lock in kernel mode), ebizzy (most of the time contend read/write sem
and TLB shoodtdown in kernel mode).

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-21 12:20:03 -04:00
Nathan Tempelman
54526d1fd5 KVM: x86: Support KVM VMs sharing SEV context
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.

The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).

This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.

This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.

For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.

Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-21 12:20:02 -04:00
Sean Christopherson
fe7e948837 KVM: x86: Add capability to grant VM access to privileged SGX attribute
Add a capability, KVM_CAP_SGX_ATTRIBUTE, that can be used by userspace
to grant a VM access to a priveleged attribute, with args[0] holding a
file handle to a valid SGX attribute file.

The SGX subsystem restricts access to a subset of enclave attributes to
provide additional security for an uncompromised kernel, e.g. to prevent
malware from using the PROVISIONKEY to ensure its nodes are running
inside a geniune SGX enclave and/or to obtain a stable fingerprint.

To prevent userspace from circumventing such restrictions by running an
enclave in a VM, KVM restricts guest access to privileged attributes by
default.

Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <0b099d65e933e068e3ea934b0523bab070cb8cea.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:56 -04:00
Sean Christopherson
54f958cdaa KVM: x86: Export kvm_mmu_gva_to_gpa_{read,write}() for SGX (VMX)
Export the gva_to_gpa() helpers for use by SGX virtualization when
executing ENCLS[ECREATE] and ENCLS[EINIT] on behalf of the guest.
To execute ECREATE and EINIT, KVM must obtain the GPA of the target
Secure Enclave Control Structure (SECS) in order to get its
corresponding HVA.

Because the SECS must reside in the Enclave Page Cache (EPC), copying
the SECS's data to a host-controlled buffer via existing exported
helpers is not a viable option as the EPC is not readable or writable
by the kernel.

SGX virtualization will also use gva_to_gpa() to obtain HVAs for
non-EPC pages in order to pass user pointers directly to ECREATE and
EINIT, which avoids having to copy pages worth of data into the kernel.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <02f37708321bcdfaa2f9d41c8478affa6e84b04d.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:53 -04:00
Wanpeng Li
a1fa4cbd53 KVM: X86: Do not yield to self
If the target is self we do not need to yield, we can avoid malicious
guest to play this.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-3-git-send-email-wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:46 -04:00
Wanpeng Li
4a7132efff KVM: X86: Count attempted/successful directed yield
To analyze some performance issues with lock contention and scheduling,
it is nice to know when directed yield are successful or failing.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:46 -04:00
Maxim Levitsky
7e582ccbbd KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
Store the supported bits into KVM_GUESTDBG_VALID_MASK
macro, similar to how arm does this.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:02 -04:00
Vitaly Kuznetsov
c28fa560c5 KVM: x86/vPMU: Forbid reading from MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE
MSR_F15H_PERF_CTL0-5, MSR_F15H_PERF_CTR0-5 MSRs have a CPUID bit assigned
to them (X86_FEATURE_PERFCTR_CORE) and when it wasn't exposed to the guest
the correct behavior is to inject #GP an not just return zero.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210329124804.170173-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:53 -04:00
Vitaly Kuznetsov
77fcbe823f KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update()
When guest time is reset with KVM_SET_CLOCK(0), it is possible for
'hv_clock->system_time' to become a small negative number. This happens
because in KVM_SET_CLOCK handling we set 'kvm->arch.kvmclock_offset' based
on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
kvm_guest_time_update() does (masterclock in use case):

hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;

And 'master_kernel_ns' represents the last time when masterclock
got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
problem, the difference is very small, e.g. I'm observing
hv_clock.system_time = -70 ns. The issue comes from the fact that
'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
compute_tsc_page_parameters() becomes a very big number.

Use 'master_kernel_ns' instead of get_kvmclock_ns() when masterclock is in
use and get_kvmclock_base_ns() when it's not to prevent 'system_time' from
going negative.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210331124130.337992-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:14:19 -04:00
Paolo Bonzini
a83829f56c KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken
pvclock_gtod_sync_lock can be taken with interrupts disabled if the
preempt notifier calls get_kvmclock_ns to update the Xen
runstate information:

   spin_lock include/linux/spinlock.h:354 [inline]
   get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
   kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
   kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
   kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
   kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062

So change the users of the spinlock to spin_lock_irqsave and
spin_unlock_irqrestore.

Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:14:19 -04:00
Paolo Bonzini
c2c647f91a KVM: x86: reduce pvclock_gtod_sync_lock critical sections
There is no need to include changes to vcpu->requests into
the pvclock_gtod_sync_lock critical section.  The changes to
the shared data structures (in pvclock_update_vm_gtod_copy)
already occur under the lock.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:14:19 -04:00
Haiwei Li
d632826f26 KVM: clean up the unused argument
kvm_msr_ignored_check function never uses vcpu argument. Clean up the
function and invokers.

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20210313051032.4171-1-lihaiwei.kernel@gmail.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:07:09 -04:00
Ingo Molnar
ca8778c45e Merge branch 'linus' into x86/cleanups, to resolve conflict
Conflicts:
	arch/x86/kernel/kprobes/ftrace.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-21 22:16:08 +01:00
Wanpeng Li
c2162e13d6 KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs
In order to deal with noncoherent DMA, we should execute wbinvd on
all dirty pCPUs when guest wbinvd exits to maintain data consistency.
smp_call_function_many() does not execute the provided function on the
local core, therefore replace it by on_each_cpu_mask().

Reported-by: Nadav Amit <namit@vmware.com>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 13:55:34 -04:00
Sean Christopherson
b318e8decf KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
Fix a plethora of issues with MSR filtering by installing the resulting
filter as an atomic bundle instead of updating the live filter one range
at a time.  The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
the hardware MSR bitmaps won't be updated until the next VM-Enter, but
the relevant software struct is atomically updated, which is what KVM
really needs.

Similar to the approach used for modifying memslots, make arch.msr_filter
a SRCU-protected pointer, do all the work configuring the new filter
outside of kvm->lock, and then acquire kvm->lock only when the new filter
has been vetted and created.  That way vCPU readers either see the old
filter or the new filter in their entirety, not some half-baked state.

Yuan Yao pointed out a use-after-free in ksm_msr_allowed() due to a
TOCTOU bug, but that's just the tip of the iceberg...

  - Nothing is __rcu annotated, making it nigh impossible to audit the
    code for correctness.
  - kvm_add_msr_filter() has an unpaired smp_wmb().  Violation of kernel
    coding style aside, the lack of a smb_rmb() anywhere casts all code
    into doubt.
  - kvm_clear_msr_filter() has a double free TOCTOU bug, as it grabs
    count before taking the lock.
  - kvm_clear_msr_filter() also has memory leak due to the same TOCTOU bug.

The entire approach of updating the live filter is also flawed.  While
installing a new filter is inherently racy if vCPUs are running, fixing
the above issues also makes it trivial to ensure certain behavior is
deterministic, e.g. KVM can provide deterministic behavior for MSRs with
identical settings in the old and new filters.  An atomic update of the
filter also prevents KVM from getting into a half-baked state, e.g. if
installing a filter fails, the existing approach would leave the filter
in a half-baked state, having already committed whatever bits of the
filter were already processed.

[*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com

Fixes: 1a155254ff ("KVM: x86: Introduce MSR filtering")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210316184436.2544875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 13:55:14 -04:00
Ingo Molnar
d9f6e12fb0 x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments.

Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-18 15:31:53 +01:00
Ingo Molnar
14ff3ed86e Linux 5.12-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBOgu4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGUd0H/3Ey8aWjVAig9Pe+
 VQVZKwG+LXWH6UmUx5qyaTxophhmGnWLvkigJMn63qIg4eQtfp2gNFHK+T4OJNIP
 ybnkjFZ337x4J9zD6m8mt4Wmelq9iW2wNOS+3YZAyYiGlXfMGM7SlYRCQRQznTED
 2O/JCMsOoP+Z8tr5ah/bzs0dANsXmTZ3QqRP2uzb6irKTgFR3/weOhj+Ht1oJ4Aq
 V+bgdcwhtk20hJhlvVeqws+o74LR789tTDCknlz/YNMv9e6VPfyIQ5vJAcFmZATE
 Ezj9yzkZ4IU+Ux6ikAyaFyBU8d1a4Wqye3eHCZBsEo6tcSAhbTZ90eoU86vh6ajS
 LZjwkNw=
 =6y1u
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc3' into x86/cleanups, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-18 15:27:03 +01:00
Vitaly Kuznetsov
e880c6ea55 KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs
When KVM_REQ_MASTERCLOCK_UPDATE request is issued (e.g. after migration)
we need to make sure no vCPU sees stale values in PV clock structures and
thus all vCPUs are kicked with KVM_REQ_CLOCK_UPDATE. Hyper-V TSC page
clocksource is global and kvm_guest_time_update() only updates in on vCPU0
but this is not entirely correct: nothing blocks some other vCPU from
entering the guest before we finish the update on CPU0 and it can read
stale values from the page.

Invalidate TSC page in kvm_gen_update_masterclock() to switch all vCPUs
to using MSR based clocksource (HV_X64_MSR_TIME_REF_COUNT).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-17 04:03:22 -04:00
Sean Christopherson
ec761cfd35 KVM: x86/mmu: Move initial kvm_mmu_set_mask_ptes() call into MMU proper
Move kvm_mmu_set_mask_ptes() into mmu.c as prep for future cleanup of the
mask initialization code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:46 -04:00
Maxim Levitsky
b97f074583 KVM: x86: determine if an exception has an error code only when injecting it.
A page fault can be queued while vCPU is in real paged mode on AMD, and
AMD manual asks the user to always intercept it
(otherwise result is undefined).
The resulting VM exit, does have an error code.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210225154135.405125-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:27 -04:00
Sean Christopherson
c483c45471 KVM: x86: Move RDPMC emulation to common code
Move the entirety of the accelerated RDPMC emulation to x86.c, and assign
the common handler directly to the exit handler array for VMX.  SVM has
bizarre nrips behavior that prevents it from directly invoking the common
handler.  The nrips goofiness will be addressed in a future patch.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:20 -04:00
Sean Christopherson
5ff3a351f6 KVM: x86: Move trivial instruction-based exit handlers to common code
Move the trivial exit handlers, e.g. for instructions that KVM
"emulates" as nops, to common x86 code.  Assign the common handlers
directly to the exit handler arrays and drop the vendor trampolines.

Opportunistically use pr_warn_once() where appropriate.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:19 -04:00
Sean Christopherson
92f9895c14 KVM: x86: Move XSETBV emulation to common code
Move the entirety of XSETBV emulation to x86.c, and assign the
function directly to both VMX's and SVM's exit handlers, i.e. drop the
unnecessary trampolines.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:18 -04:00
Sean Christopherson
cb6a32c2b8 KVM: x86: Handle triple fault in L2 without killing L1
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1.  Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_intercepted by L1.  E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.

Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:15 -04:00
Sean Christopherson
f66c53b3b9 KVM: x86: Defer the MMU unload to the normal path on an global INVPCID
Defer unloading the MMU after a INVPCID until the instruction emulation
has completed, i.e. until after RIP has been updated.

On VMX, this is a benign bug as VMX doesn't touch the MMU when skipping
an emulated instruction.  However, on SVM, if nrip is disabled, the
emulator is used to skip an instruction, which would lead to fireworks
if the emulator were invoked without a valid MMU.

Fixes: eb4b248e15 ("kvm: vmx: Support INVPCID in shadow paging mode")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:22 -04:00
Dongli Zhang
43c11d91fb KVM: x86: to track if L1 is running L2 VM
The new per-cpu stat 'nested_run' is introduced in order to track if L1 VM
is running or used to run L2 VM.

An example of the usage of 'nested_run' is to help the host administrator
to easily track if any L1 VM is used to run L2 VM. Suppose there is issue
that may happen with nested virtualization, the administrator will be able
to easily narrow down and confirm if the issue is due to nested
virtualization via 'nested_run'. For example, whether the fix like
commit 88dddc11a8 ("KVM: nVMX: do not use dangling shadow VMCS after
guest reset") is required.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210305225747.7682-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:28:02 -04:00
Linus Torvalds
9d0c8e793f More fixes for ARM and x86.
-----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmBLsyoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMpYgf/Zu1Byif+XZVdwm52wJN38ppUUVmn
 4u8HvQ8Ht+P0cGg1IaNx9D5QXGRgdn72qEpWUF5aH03ahTANAuf6zXw+evKmiub/
 RtJfxZWEcWeLdugLVHUSrR4MOox7uvFmCdcdht4sEPdjFdH/9JeceC3A1pZ/DYTR
 +eS+E3dMWQjXnd2Omo/5f5H1LTZjNLEditnkcHT5unwKKukc008V/avgs8xOAKJB
 xf3oqJF960IO+NYf8rRQb8WtyGeo0grrWjgeqvZ37gwGUaFB9ldVxchsVLsL66OR
 bJRIoSiTgL+TUYSMQ5mKG4tmmBnPHUHfgfNoOXlWMoJHIjFeQ9oM6eTHhA==
 =QTFW
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "More fixes for ARM and x86"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: LAPIC: Advancing the timer expiration on guest initiated write
  KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode
  KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged
  kvm: x86: annotate RCU pointers
  KVM: arm64: Fix exclusive limit for IPA size
  KVM: arm64: Reject VM creation when the default IPA size is unsupported
  KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
  KVM: arm64: Don't use cbz/adr with external symbols
  KVM: arm64: Fix range alignment when walking page tables
  KVM: arm64: Workaround firmware wrongly advertising GICv2-on-v3 compatibility
  KVM: arm64: Rename __vgic_v3_get_ich_vtr_el2() to __vgic_v3_get_gic_config()
  KVM: arm64: Don't access PMSELR_EL0/PMUSERENR_EL0 when no PMU is available
  KVM: arm64: Turn kvm_arm_support_pmu_v3() into a static key
  KVM: arm64: Fix nVHE hyp panic host context restore
  KVM: arm64: Avoid corrupting vCPU context register in guest exit
  KVM: arm64: nvhe: Save the SPE context early
  kvm: x86: use NULL instead of using plain integer as pointer
  KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled'
  KVM: x86: Ensure deadline timer has truly expired before posting its IRQ
2021-03-14 12:35:02 -07:00
Dave Hansen
09141ec0e4 x86: Remove duplicate TSC DEADLINE MSR definitions
There are two definitions for the TSC deadline MSR in msr-index.h,
one with an underscore and one without.  Axe one of them and move
all the references over to the other one.

 [ bp: Fixup the MSR define in handle_fastpath_set_msr_irqoff() too. ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200305174706.0D6B8EE4@viggo.jf.intel.com
2021-03-08 11:05:20 +01:00
Muhammad Usama Anjum
4691453406 kvm: x86: use NULL instead of using plain integer as pointer
Sparse warnings removed:
warning: Using plain integer as NULL pointer

Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Message-Id: <20210305180816.GA488770@LEGION>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-06 04:18:39 -05:00
Linus Torvalds
cee407c5cc * Doc fixes
* selftests fixes
 * Add runstate information to the new Xen support
 * Allow compiling out the Xen interface
 * 32-bit PAE without EPT bugfix
 * NULL pointer dereference bugfix
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmA+lGcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMaMQf/Q8bQr5vVAeNk+1MyRmzNqFEbLqbe
 h50f4Wd2N+svZ6XinQH1vvuQm1WYj/g616Q3nCeYwCJyY34g5tf60XcuAMnVRIzw
 qc2IUvSAJ3faVElMrSA5thN3bkPzJpRrdIpQGBgOd+rT+eQkPSsJlTy34JJmvbmh
 xFGjoVj49tYEkFfpxEbtytW6QiYtPz/ai8SARRXbEUWO/pVzdkgK5XWshRhE9vpB
 GLCEXUngdPokJMblRMuK4YOSFQXXHobAJAgPwSzguDV41qezXaKOGYOLe7+V+0kH
 z607RnQc1wGgsLanT13okYMQr09/XCjpvFkZ9CK2bIJPsyWP+ihA/37hVQ==
 =1GNo
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - Doc fixes

 - selftests fixes

 - Add runstate information to the new Xen support

 - Allow compiling out the Xen interface

 - 32-bit PAE without EPT bugfix

 - NULL pointer dereference bugfix

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Clear the CR4 register on reset
  KVM: x86/xen: Add support for vCPU runstate information
  KVM: x86/xen: Fix return code when clearing vcpu_info and vcpu_time_info
  selftests: kvm: Mmap the entire vcpu mmap area
  KVM: Documentation: Fix index for KVM_CAP_PPC_DAWR1
  KVM: x86: allow compiling out the Xen hypercall interface
  KVM: xen: flush deferred static key before checking it
  KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled
  KVM: x86: hyper-v: Fix Hyper-V context null-ptr-deref
  KVM: x86: remove misplaced comment on active_mmu_pages
  KVM: Documentation: rectify rst markup in kvm_run->flags
  Documentation: kvm: fix messy conversion from .txt to .rst
2021-03-04 11:26:17 -08:00
David Woodhouse
30b5c851af KVM: x86/xen: Add support for vCPU runstate information
This is how Xen guests do steal time accounting. The hypervisor records
the amount of time spent in each of running/runnable/blocked/offline
states.

In the Xen accounting, a vCPU is still in state RUNSTATE_running while
in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules
does the state become RUNSTATE_blocked. In KVM this means that even when
the vCPU exits the kvm_run loop, the state remains RUNSTATE_running.

The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the
KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use
KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given
amount of time to the blocked state and subtract it from the running
state.

The state_entry_time corresponds to get_kvmclock_ns() at the time the
vCPU entered the current state, and the total times of all four states
should always add up to state_entry_time.

Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20210301125309.874953-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-02 14:30:54 -05:00
Paolo Bonzini
b59b153d10 KVM: x86: allow compiling out the Xen hypercall interface
The Xen hypercall interface adds to the attack surface of the hypervisor
and will be used quite rarely.  Allow compiling it out.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-02 14:30:45 -05:00
Linus Torvalds
d94d14008e x86:
- take into account HVA before retrying on MMU notifier race
 - fixes for nested AMD guests without NPT
 - allow INVPCID in guest without PCID
 - disable PML in hardware when not in use
 - MMU code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmA3eMQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP6TQf5ARpUyq3oo+13albwg+zNca6hzR8i
 Vl7dpoR3bSJCN3sTYFnlL9eXw5TxgeUL2nqKqma6ddZDNDEBLT2Bq8rcFkbi4pUf
 n7av76EEq74HW/jlUhKVug7Q5Dm5DiKC6BOH3RVuKHbr6iZseyF3jXZSX0Ppf0yF
 gvoy6cGyMW60NVLN5tuGeOjVQ1fxziE0SqB90fXuiWgZ5rzIBfbqJV7EOOZsGO67
 /LHSaEpvKutsc2a+Hx76yQNJjAbb2/O+4Bo5/RqfdqS5tRLGBzYggdJjLvAPvd6P
 pTNtDCnErvBZQfMedEQyHYuBL2Ca59fOp6i/ekOM2I+m7816+kSkdTMt2g==
 =iMHY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "x86:

   - take into account HVA before retrying on MMU notifier race

   - fixes for nested AMD guests without NPT

   - allow INVPCID in guest without PCID

   - disable PML in hardware when not in use

   - MMU code cleanups:

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
  KVM: SVM: Fix nested VM-Exit on #GP interception handling
  KVM: vmx/pmu: Fix dummy check if lbr_desc->event is created
  KVM: x86/mmu: Consider the hva in mmu_notifier retry
  KVM: x86/mmu: Skip mmu_notifier check when handling MMIO page fault
  KVM: Documentation: rectify rst markup in KVM_GET_SUPPORTED_HV_CPUID
  KVM: nSVM: prepare guest save area while is_guest_mode is true
  KVM: x86/mmu: Remove a variety of unnecessary exports
  KVM: x86: Fold "write-protect large" use case into generic write-protect
  KVM: x86/mmu: Don't set dirty bits when disabling dirty logging w/ PML
  KVM: VMX: Dynamically enable/disable PML based on memslot dirty logging
  KVM: x86: Further clarify the logic and comments for toggling log dirty
  KVM: x86: Move MMU's PML logic to common code
  KVM: x86/mmu: Make dirty log size hook (PML) a value, not a function
  KVM: x86/mmu: Expand on the comment in kvm_vcpu_ad_need_write_protect()
  KVM: nVMX: Disable PML in hardware when running L2
  KVM: x86/mmu: Consult max mapping level when zapping collapsible SPTEs
  KVM: x86/mmu: Pass the memslot to the rmap callbacks
  KVM: x86/mmu: Split out max mapping level calculation to helper
  KVM: x86/mmu: Expand collapsible SPTE zap for TDP MMU to ZONE_DEVICE and HugeTLB pages
  KVM: nVMX: no need to undo inject_page_fault change on nested vmexit
  ...
2021-02-26 10:00:12 -08:00
Paolo Bonzini
c462f859f8 KVM: xen: flush deferred static key before checking it
A missing flush would cause the static branch to trigger incorrectly.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-26 05:13:02 -05:00
Linus Torvalds
3e10585335 x86:
- Support for userspace to emulate Xen hypercalls
 - Raise the maximum number of user memslots
 - Scalability improvements for the new MMU.  Instead of the complex
   "fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an
   rwlock so that page faults are concurrent, but the code that can run
   against page faults is limited.  Right now only page faults take the
   lock for reading; in the future this will be extended to some
   cases of page table destruction.  I hope to switch the default MMU
   around 5.12-rc3 (some testing was delayed due to Chinese New Year).
 - Cleanups for MAXPHYADDR checks
 - Use static calls for vendor-specific callbacks
 - On AMD, use VMLOAD/VMSAVE to save and restore host state
 - Stop using deprecated jump label APIs
 - Workaround for AMD erratum that made nested virtualization unreliable
 - Support for LBR emulation in the guest
 - Support for communicating bus lock vmexits to userspace
 - Add support for SEV attestation command
 - Miscellaneous cleanups
 
 PPC:
 - Support for second data watchpoint on POWER10
 - Remove some complex workarounds for buggy early versions of POWER9
 - Guest entry/exit fixes
 
 ARM64
 - Make the nVHE EL2 object relocatable
 - Cleanups for concurrent translation faults hitting the same page
 - Support for the standard TRNG hypervisor call
 - A bunch of small PMU/Debug fixes
 - Simplification of the early init hypercall handling
 
 Non-KVM changes (with acks):
 - Detection of contended rwlocks (implemented only for qrwlocks,
   because KVM only needs it for x86)
 - Allow __DISABLE_EXPORTS from assembly code
 - Provide a saner follow_pfn replacements for modules
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmApSRgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc7wf9FnlinKoTFaSk7oeuuhF/CoCVwSFs
 Z9+A2sNI99tWHQxFR6dyDkEFeQoXnqSxfLHtUVIdH/JnTg0FkEvFz3NK+0PzY1PF
 PnGNbSoyhP58mSBG4gbBAxdF3ZJZMB8GBgYPeR62PvMX2dYbcHqVBNhlf6W4MQK4
 5mAUuAnbf19O5N267sND+sIg3wwJYwOZpRZB7PlwvfKAGKf18gdBz5dQ/6Ej+apf
 P7GODZITjqM5Iho7SDm/sYJlZprFZT81KqffwJQHWFMEcxFgwzrnYPx7J3gFwRTR
 eeh9E61eCBDyCTPpHROLuNTVBqrAioCqXLdKOtO5gKvZI3zmomvAsZ8uXQ==
 =uFZU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "x86:

   - Support for userspace to emulate Xen hypercalls

   - Raise the maximum number of user memslots

   - Scalability improvements for the new MMU.

     Instead of the complex "fast page fault" logic that is used in
     mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
     but the code that can run against page faults is limited. Right now
     only page faults take the lock for reading; in the future this will
     be extended to some cases of page table destruction. I hope to
     switch the default MMU around 5.12-rc3 (some testing was delayed
     due to Chinese New Year).

   - Cleanups for MAXPHYADDR checks

   - Use static calls for vendor-specific callbacks

   - On AMD, use VMLOAD/VMSAVE to save and restore host state

   - Stop using deprecated jump label APIs

   - Workaround for AMD erratum that made nested virtualization
     unreliable

   - Support for LBR emulation in the guest

   - Support for communicating bus lock vmexits to userspace

   - Add support for SEV attestation command

   - Miscellaneous cleanups

  PPC:

   - Support for second data watchpoint on POWER10

   - Remove some complex workarounds for buggy early versions of POWER9

   - Guest entry/exit fixes

  ARM64:

   - Make the nVHE EL2 object relocatable

   - Cleanups for concurrent translation faults hitting the same page

   - Support for the standard TRNG hypervisor call

   - A bunch of small PMU/Debug fixes

   - Simplification of the early init hypercall handling

  Non-KVM changes (with acks):

   - Detection of contended rwlocks (implemented only for qrwlocks,
     because KVM only needs it for x86)

   - Allow __DISABLE_EXPORTS from assembly code

   - Provide a saner follow_pfn replacements for modules"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
  KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
  KVM: selftests: Don't bother mapping GVA for Xen shinfo test
  KVM: selftests: Fix hex vs. decimal snafu in Xen test
  KVM: selftests: Fix size of memslots created by Xen tests
  KVM: selftests: Ignore recently added Xen tests' build output
  KVM: selftests: Add missing header file needed by xAPIC IPI tests
  KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
  KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
  locking/arch: Move qrwlock.h include after qspinlock.h
  KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
  KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
  KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
  KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
  KVM: PPC: remove unneeded semicolon
  KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
  KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
  KVM: PPC: Book3S HV: Fix radix guest SLB side channel
  KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
  KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
  KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
  ...
2021-02-21 13:31:43 -08:00
Sean Christopherson
a1419f8b5b KVM: x86: Fold "write-protect large" use case into generic write-protect
Drop kvm_mmu_slot_largepage_remove_write_access() and refactor its sole
caller to use kvm_mmu_slot_remove_write_access().  Remove the now-unused
slot_handle_large_level() and slot_handle_all_level() helpers.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:35 -05:00
Sean Christopherson
b6e16ae5d9 KVM: x86/mmu: Don't set dirty bits when disabling dirty logging w/ PML
Stop setting dirty bits for MMU pages when dirty logging is disabled for
a memslot, as PML is now completely disabled when there are no memslots
with dirty logging enabled.

This means that spurious PML entries will be created for memslots with
dirty logging disabled if at least one other memslot has dirty logging
enabled.  However, spurious PML entries are already possible since
dirty bits are set only when a dirty logging is turned off, i.e. memslots
that are never dirty logged will have dirty bits cleared.

In the end, it's faster overall to eat a few spurious PML entries in the
window where dirty logging is being disabled across all memslots.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:35 -05:00
Makarand Sonare
a85863c2ec KVM: VMX: Dynamically enable/disable PML based on memslot dirty logging
Currently, if enable_pml=1 PML remains enabled for the entire lifetime
of the VM irrespective of whether dirty logging is enable or disabled.
When dirty logging is disabled, all the pages of the VM are manually
marked dirty, so that PML is effectively non-operational.  Setting
the dirty bits is an expensive operation which can cause severe MMU
lock contention in a performance sensitive path when dirty logging is
disabled after a failed or canceled live migration.

Manually setting dirty bits also fails to prevent PML activity if some
code path clears dirty bits, which can incur unnecessary VM-Exits.

In order to avoid this extra overhead, dynamically enable/disable PML
when dirty logging gets turned on/off for the first/last memslot.

Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:34 -05:00
Sean Christopherson
52f4607940 KVM: x86: Further clarify the logic and comments for toggling log dirty
Add a sanity check in kvm_mmu_slot_apply_flags to assert that the
LOG_DIRTY_PAGES flag is indeed being toggled, and explicitly rely on
that holding true when zapping collapsible SPTEs.  Manipulating the
CPU dirty log (PML) and write-protection also relies on this assertion,
but that's not obvious in the current code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:34 -05:00
Sean Christopherson
a018eba538 KVM: x86: Move MMU's PML logic to common code
Drop the facade of KVM's PML logic being vendor specific and move the
bits that aren't truly VMX specific into common x86 code.  The MMU logic
for dealing with PML is tightly coupled to the feature and to VMX's
implementation, bouncing through kvm_x86_ops obfuscates the code without
providing any meaningful separation of concerns or encapsulation.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:34 -05:00
Frederic Weisbecker
4ae7dc97f7 entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point upon resuming to guest mode. This
way we can avoid to do it from rcu_user_enter() with the last resort
self-IPI hack that enforces rescheduling.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-6-frederic@kernel.org
2021-02-17 14:12:43 +01:00
Vitaly Kuznetsov
fc08b628d7 KVM: x86: hyper-v: Allocate Hyper-V context lazily
Hyper-V context is only needed for guests which use Hyper-V emulation in
KVM (e.g. Windows/Hyper-V guests) so we don't actually need to allocate
it in kvm_arch_vcpu_create(), we can postpone the action until Hyper-V
specific MSRs are accessed or SynIC is enabled.

Once allocated, let's keep the context alive for the lifetime of the vCPU
as an attempt to free it would require additional synchronization with
other vCPUs and normally it is not supposed to happen.

Note, Hyper-V style hypercall enablement is done by writing to
HV_X64_MSR_GUEST_OS_ID so we don't need to worry about allocating Hyper-V
context from kvm_hv_hypercall().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-15-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:40:50 -05:00
Vitaly Kuznetsov
8f014550df KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional
Hyper-V emulation is enabled in KVM unconditionally. This is bad at least
from security standpoint as it is an extra attack surface. Ideally, there
should be a per-VM capability explicitly enabled by VMM but currently it
is not the case and we can't mandate one without breaking backwards
compatibility. We can, however, check guest visible CPUIDs and only enable
Hyper-V emulation when "Hv#1" interface was exposed in
HYPERV_CPUID_INTERFACE.

Note, VMMs are free to act in any sequence they like, e.g. they can try
to set MSRs first and CPUIDs later so we still need to allow the host
to read/write Hyper-V specific MSRs unconditionally.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-14-vkuznets@redhat.com>
[Add selftest vcpu_set_hv_cpuid API to avoid breaking xen_vmcall_test. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:39:56 -05:00
Vitaly Kuznetsov
4592b7eaa8 KVM: x86: hyper-v: Allocate 'struct kvm_vcpu_hv' dynamically
Hyper-V context is only needed for guests which use Hyper-V emulation in
KVM (e.g. Windows/Hyper-V guests). 'struct kvm_vcpu_hv' is, however, quite
big, it accounts for more than 1/4 of the total 'struct kvm_vcpu_arch'
which is also quite big already. This all looks like a waste.

Allocate 'struct kvm_vcpu_hv' dynamically. This patch does not bring any
(intentional) functional change as we still allocate the context
unconditionally but it paves the way to doing that only when needed.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-13-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:15 -05:00
Vitaly Kuznetsov
f2bc14b69c KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context
Currently, Hyper-V context is part of 'struct kvm_vcpu_arch' and is always
available. As a preparation to allocating it dynamically, check that it is
not NULL at call sites which can normally proceed without it i.e. the
behavior is identical to the situation when Hyper-V emulation is not being
used by the guest.

When Hyper-V context for a particular vCPU is not allocated, we may still
need to get 'vp_index' from there. E.g. in a hypothetical situation when
Hyper-V emulation was enabled on one CPU and wasn't on another, Hyper-V
style send-IPI hypercall may still be used. Luckily, vp_index is always
initialized to kvm_vcpu_get_idx() and can only be changed when Hyper-V
context is present. Introduce kvm_hv_get_vpindex() helper for
simplification.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-12-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:14 -05:00
Vitaly Kuznetsov
9ff5e0304e KVM: x86: hyper-v: Always use to_hv_vcpu() accessor to get to 'struct kvm_vcpu_hv'
As a preparation to allocating Hyper-V context dynamically, make it clear
who's the user of the said context.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:13 -05:00
Vitaly Kuznetsov
05f04ae4ff KVM: x86: hyper-v: Introduce to_kvm_hv() helper
Spelling '&kvm->arch.hyperv' correctly is hard. Also, this makes the code
more consistent with vmx/svm where to_kvm_vmx()/to_kvm_svm() are already
being used.

Opportunistically change kvm_hv_msr_{get,set}_crash_{data,ctl}() and
kvm_hv_msr_set_crash_data() to take 'kvm' instead of 'vcpu' as these
MSRs are partition wide.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:12 -05:00
Vitaly Kuznetsov
e0121fa29a KVM: x86: hyper-v: Rename vcpu_to_synic()/synic_to_vcpu()
vcpu_to_synic()'s argument is almost always 'vcpu' so there's no need to
have an additional prefix. Also, as this is used outside of hyper-v
emulation code, add '_hv_' part to make it clear what this s. This makes
the naming more consistent with to_hv_vcpu().

Rename synic_to_vcpu() to hv_synic_to_vcpu() for consistency.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:11 -05:00
Paolo Bonzini
996ff5429e KVM: x86: move kvm_inject_gp up from kvm_set_dr to callers
Push the injection of #GP up to the callers, so that they can just use
kvm_complete_insn_gp. __kvm_set_dr is pretty much what the callers can use
together with kvm_complete_insn_gp, so rename it to kvm_set_dr and drop
the old kvm_set_dr wrapper.

This also allows nested VMX code, which really wanted to use __kvm_set_dr,
to use the right function.

While at it, remove the kvm_require_dr() check from the SVM interception.
The APM states:

  All normal exception checks take precedence over the SVM intercepts.

which includes the CR4.DE=1 #UD.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:07 -05:00
Paolo Bonzini
29d6ca4199 KVM: x86: reading DR cannot fail
kvm_get_dr and emulator_get_dr except an in-range value for the register
number so they cannot fail.  Change the return type to void.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 08:17:07 -05:00
Sean Christopherson
031b91a5fe KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset
Set cr3_lm_rsvd_bits, which is effectively an invalid GPA mask, at vCPU
reset.  The reserved bits check needs to be done even if userspace never
configures the guest's CPUID model.

Cc: stable@vger.kernel.org
Fixes: 0107973a80 ("KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 12:59:28 -05:00
Sean Christopherson
a8ac864a7d KVM: x86: Add helper to consolidate "raw" reserved GPA mask calculations
Add a helper to generate the mask of reserved GPA bits _without_ any
adjustments for repurposed bits, and use it to replace a variety of
open coded variants in the MTRR and APIC_BASE flows.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 09:27:30 -05:00
Sean Christopherson
5b7f575ccd KVM: x86: Use reserved_gpa_bits to calculate reserved PxE bits
Use reserved_gpa_bits, which accounts for exceptions to the maxphyaddr
rule, e.g. SEV's C-bit, for the page {table,directory,etc...} entry (PxE)
reserved bits checks.  For SEV, the C-bit is ignored by hardware when
walking pages tables, e.g. the APM states:

  Note that while the guest may choose to set the C-bit explicitly on
  instruction pages and page table addresses, the value of this bit is a
  don't-care in such situations as hardware always performs these as
  private accesses.

Such behavior is expected to hold true for other features that repurpose
GPA bits, e.g. KVM could theoretically emulate SME or MKTME, which both
allow non-zero repurposed bits in the page tables.  Conceptually, KVM
should apply reserved GPA checks universally, and any features that do
not adhere to the basic rule should be explicitly handled, i.e. if a GPA
bit is repurposed but not allowed in page tables for whatever reason.

Refactor __reset_rsvds_bits_mask() to take the pre-generated reserved
bits mask, and opportunistically clean up its code, e.g. to align lines
and comments.

Practically speaking, this is change is a likely a glorified nop given
the current KVM code base.  SEV's C-bit is the only repurposed GPA bit,
and KVM doesn't support shadowing encrypted page tables (which is
theoretically possible via SEV debug APIs).

Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 09:27:29 -05:00
Sean Christopherson
ca29e14506 KVM: x86: SEV: Treat C-bit as legal GPA bit regardless of vCPU mode
Rename cr3_lm_rsvd_bits to reserved_gpa_bits, and use it for all GPA
legality checks.  AMD's APM states:

  If the C-bit is an address bit, this bit is masked from the guest
  physical address when it is translated through the nested page tables.

Thus, any access that can conceivably be run through NPT should ignore
the C-bit when checking for validity.

For features that KVM emulates in software, e.g. MTRRs, there is no
clear direction in the APM for how the C-bit should be handled.  For
such cases, follow the SME behavior inasmuch as possible, since SEV is
is essentially a VM-specific variant of SME.  For SME, the APM states:

  In this case the upper physical address bits are treated as reserved
  when the feature is enabled except where otherwise indicated.

Collecting the various relavant SME snippets in the APM and cross-
referencing the omissions with Linux kernel code, this leaves MTTRs and
APIC_BASE as the only flows that KVM emulates that should _not_ ignore
the C-bit.

Note, this means the reserved bit checks in the page tables are
technically broken.  This will be remedied in a future patch.

Although the page table checks are technically broken, in practice, it's
all but guaranteed to be irrelevant.  NPT is required for SEV, i.e.
shadowing page tables isn't needed in the common case.  Theoretically,
the checks could be in play for nested NPT, but it's extremely unlikely
that anyone is running nested VMs on SEV, as doing so would require L1
to expose sensitive data to L0, e.g. the entire VMCB.  And if anyone is
running nested VMs, L0 can't read the guest's encrypted memory, i.e. L1
would need to put its NPT in shared memory, in which case the C-bit will
never be set.  Or, L1 could use shadow paging, but again, if L0 needs to
read page tables, e.g. to load PDPTRs, the memory can't be encrypted if
L1 has any expectation of L0 doing the right thing.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 09:27:29 -05:00
Sean Christopherson
f156abec72 KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset
Set cr3_lm_rsvd_bits, which is effectively an invalid GPA mask, at vCPU
reset.  The reserved bits check needs to be done even if userspace never
configures the guest's CPUID model.

Cc: stable@vger.kernel.org
Fixes: 0107973a80 ("KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 09:27:26 -05:00
David Woodhouse
8d4e7e8083 KVM: x86: declare Xen HVM shared info capability and add test case
Instead of adding a plethora of new KVM_CAP_XEN_FOO capabilities, just
add bits to the return value of KVM_CAP_XEN_HVM.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:39 +00:00
David Woodhouse
40da8ccd72 KVM: x86/xen: Add event channel interrupt vector upcall
It turns out that we can't handle event channels *entirely* in userspace
by delivering them as ExtINT, because KVM is a bit picky about when it
accepts ExtINT interrupts from a legacy PIC. The in-kernel local APIC
has to have LVT0 configured in APIC_MODE_EXTINT and unmasked, which
isn't necessarily the case for Xen guests especially on secondary CPUs.

To cope with this, add kvm_xen_get_interrupt() which checks the
evtchn_pending_upcall field in the Xen vcpu_info, and delivers the Xen
upcall vector (configured by KVM_XEN_ATTR_TYPE_UPCALL_VECTOR) if it's
set regardless of LAPIC LVT0 configuration. This gives us the minimum
support we need for completely userspace-based implementation of event
channels.

This does mean that vcpu_enter_guest() needs to check for the
evtchn_pending_upcall flag being set, because it can't rely on someone
having set KVM_REQ_EVENT unless we were to add some way for userspace to
do so manually.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:39 +00:00
Joao Martins
f2340cd9e4 KVM: x86/xen: register vcpu time info region
Allow the Xen emulated guest the ability to register secondary
vcpu time information. On Xen guests this is used in order to be
mapped to userspace and hence allow vdso gettimeofday to work.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:39 +00:00
Joao Martins
aa096aa0a0 KVM: x86/xen: setup pvclock updates
Parameterise kvm_setup_pvclock_page() a little bit so that it can be
invoked for different gfn_to_hva_cache structures, and with different
offsets. Then we can invoke it for the normal KVM pvclock and also for
the Xen one in the vcpu_info.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:39 +00:00
David Woodhouse
3e32461588 KVM: x86/xen: Add KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR
This will be used for per-vCPU setup such as runstate and vcpu_info.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:39 +00:00
Joao Martins
629b534884 KVM: x86/xen: update wallclock region
Wallclock on Xen is written in the shared_info page.

To that purpose, export kvm_write_wall_clock() and pass on the GPA of
its location to populate the shared_info wall clock data.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:39 +00:00
Joao Martins
a76b9641ad KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTR
This will be used to set up shared info pages etc.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:38 +00:00
David Woodhouse
7d6bbebb7b KVM: x86/xen: Add kvm_xen_enabled static key
The code paths for Xen support are all fairly lightweight but if we hide
them behind this, they're even *more* lightweight for any system which
isn't actually hosting Xen guests.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:38 +00:00
David Woodhouse
78e9878cb3 KVM: x86/xen: Move KVM_XEN_HVM_CONFIG handling to xen.c
This is already more complex than the simple memcpy it originally had.
Move it to xen.c with the rest of the Xen support.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:19:38 +00:00
Joao Martins
23200b7a30 KVM: x86/xen: intercept xen hypercalls if enabled
Add a new exit reason for emulator to handle Xen hypercalls.

Since this means KVM owns the ABI, dispense with the facility for the
VMM to provide its own copy of the hypercall pages; just fill them in
directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page.

This behaviour is enabled by a new INTERCEPT_HCALL flag in the
KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag
being returned from the KVM_CAP_XEN_HVM check.

Rename xen_hvm_config() to kvm_xen_write_hypercall_page() and move it
to the nascent xen.c while we're at it, and add a test case.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 14:18:45 +00:00
David Woodhouse
92f4d400a4 KVM: x86/xen: Fix __user pointer handling for hypercall page installation
The address we give to memdup_user() isn't correctly tagged as __user.
This is harmless enough as it's a one-off use and we're doing exactly
the right thing, but fix it anyway to shut the checker up. Otherwise
it'll whine when the (now legacy) code gets moved around in a later
patch.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 13:58:34 +00:00
Joao Martins
1232f8e6c9 KVM: x86/xen: fix Xen hypercall page msr handling
Xen usually places its MSR at 0x40000000 or 0x40000200 depending on
whether it is running in viridian mode or not. Note that this is not
ABI guaranteed, so it is possible for Xen to advertise the MSR some
place else.

Given the way xen_hvm_config() is handled, if the former address is
selected, this will conflict with Hyper-V's MSR
(HV_X64_MSR_GUEST_OS_ID) which unconditionally uses the same address.

Given that the MSR location is arbitrary, move the xen_hvm_config()
handling to the top of kvm_set_msr_common() before falling through.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04 13:58:34 +00:00
Ben Gardon
531810caa9 KVM: x86/mmu: Use an rwlock for the x86 MMU
Add a read / write lock to be used in place of the MMU spinlock on x86.
The rwlock will enable the TDP MMU to handle page faults, and other
operations in parallel in future commits.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

Message-Id: <20210202185734.1680553-19-bgardon@google.com>
[Introduce virt/kvm/mmu_lock.h - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:43 -05:00
Paolo Bonzini
bbefd4fc8f KVM: x86: move kvm_inject_gp up from kvm_set_xcr to callers
Push the injection of #GP up to the callers, so that they can just use
kvm_complete_insn_gp.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:37 -05:00
Paolo Bonzini
fd23800261 KVM: cleanup DR6/DR7 reserved bits checks
kvm_dr6_valid and kvm_dr7_valid check that bits 63:32 are zero.  Using
them makes it easier to review the code for inconsistencies.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:37 -05:00
Paolo Bonzini
d89d04ab60 KVM: move EXIT_FASTPATH_REENTER_GUEST to common code
Now that KVM is using static calls, calling vmx_vcpu_run and
vmx_sync_pir_to_irr does not incur anymore the cost of a
retpoline.

Therefore there is no need anymore to handle EXIT_FASTPATH_REENTER_GUEST
in vendor code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:37 -05:00
Jason Baron
b3646477d4 KVM: x86: use static calls to reduce kvm_x86_ops overhead
Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are
covered here except for 'pmu_ops and 'nested ops'.

Here are some numbers running cpuid in a loop of 1 million calls averaged
over 5 runs, measured in the vm (lower is better).

Intel Xeon 3000MHz:

           |default    |mitigations=off
-------------------------------------
vanilla    |.671s      |.486s
static call|.573s(-15%)|.458s(-6%)

AMD EPYC 2500MHz:

           |default    |mitigations=off
-------------------------------------
vanilla    |.710s      |.609s
static call|.664s(-6%) |.609s(0%)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:30 -05:00
Jason Baron
9af5471bdb KVM: x86: introduce definitions to support static calls for kvm_x86_ops
Use static calls to improve kvm_x86_ops performance. Introduce the
definitions that will be used by a subsequent patch to actualize the
savings. Add a new kvm-x86-ops.h header that can be used for the
definition of static calls. This header is also intended to be
used to simplify the defition of svm_kvm_ops and vmx_x86_ops.

Note that all functions in kvm_x86_ops are covered here except for
'pmu_ops' and 'nested ops'. I think they can be covered by static
calls in a simlilar manner, but were omitted from this series to
reduce scope and because I don't think they have as large of a
performance impact.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <e5cc82ead7ab37b2dceb0837a514f3f8bea4f8d1.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:29 -05:00
Cun Li
6e4e3b4df4 KVM: Stop using deprecated jump label APIs
The use of 'struct static_key' and 'static_key_false' is
deprecated. Use the new API.

Signed-off-by: Cun Li <cun.jia.li@gmail.com>
Message-Id: <20210111152435.50275-1-cun.jia.li@gmail.com>
[Make it compile.  While at it, rename kvm_no_apic_vcpu to
 kvm_has_noapic_vcpu; the former reads too much like "true if
 no vCPU has an APIC". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:29 -05:00
Wei Huang
4aa2691dcb KVM: x86: Factor out x86 instruction emulation with decoding
Move the instruction decode part out of x86_emulate_instruction() for it
to be used in other places. Also kvm_clear_exception_queue() is moved
inside the if-statement as it doesn't apply when KVM are coming back from
userspace.

Co-developed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210126081831.570253-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:27 -05:00
Chenyi Qiang
9a3ecd5e2a KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW
DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
to 0 when the condition (e.g. RTM) happens. The value can be used to
initialize dr6 and also be the XOR mask between the #DB exit
qualification (or payload) and DR6.

Concerning that DR6_INIT is used as initial value only once, rename it
to DR6_ACTIVE_LOW and apply it in other places, which would make the
incoming changes for bus lock debug exception more simple.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
[Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:27 -05:00
Like Xu
d855066f81 KVM: VMX: read/write MSR_IA32_DEBUGCTLMSR from GUEST_IA32_DEBUGCTL
SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the
svm_get/set_msr, so the x86 common part can be safely moved to VMX.
This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL.

Add vmx_supported_debugctl() to refactor the throwing logic of #GP.

Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com>
[Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception
 to guest". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:23 -05:00
Chenyi Qiang
fe6b6bc802 KVM: VMX: Enable bus lock VM exit
Virtual Machine can exploit bus locks to degrade the performance of
system. Bus lock can be caused by split locked access to writeback(WB)
memory or by using locks on uncacheable(UC) memory. The bus lock is
typically >1000 cycles slower than an atomic operation within a cache
line. It also disrupts performance on other cores (which must wait for
the bus lock to be released before their memory operations can
complete).

To address the threat, bus lock VM exit is introduced to notify the VMM
when a bus lock was acquired, allowing it to enforce throttling or other
policy based mitigations.

A VMM can enable VM exit due to bus locks by setting a new "Bus Lock
Detection" VM-execution control(bit 30 of Secondary Processor-based VM
execution controls). If delivery of this VM exit was preempted by a
higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC
access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of
exit reason in vmcs field is set to 1.

In current implementation, the KVM exposes this capability through
KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap
(i.e. off and exit) and enable it explicitly (disabled by default). If
bus locks in guest are detected by KVM, exit to user space even when
current exit reason is handled by KVM internally. Set a new field
KVM_RUN_BUS_LOCK in vcpu->run->flags to inform the user space that there
is a bus lock detected in guest.

Document for Bus Lock VM exit is now available at the latest "Intel
Architecture Instruction Set Extensions Programming Reference".

Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:21 -05:00
Chenyi Qiang
15aad3be9a KVM: X86: Reset the vcpu->run->flags at the beginning of vcpu_run
Reset the vcpu->run->flags at the beginning of kvm_arch_vcpu_ioctl_run.
It can avoid every thunk of code that needs to set the flag clear it,
which increases the odds of missing a case and ending up with a flag in
an undefined state.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:21 -05:00
Sean Christopherson
c5e2184d15 KVM: x86/mmu: Remove the defunct update_pte() paging hook
Remove the update_pte() shadow paging logic, which was obsoleted by
commit 4731d4c7a0 ("KVM: MMU: out of sync shadow core"), but never
removed.  As pointed out by Yu, KVM never write protects leaf page
tables for the purposes of shadow paging, and instead marks their
associated shadow page as unsync so that the guest can write PTEs at
will.

The update_pte() path, which predates the unsync logic, optimizes COW
scenarios by refreshing leaf SPTEs when they are written, as opposed to
zapping the SPTE, restarting the guest, and installing the new SPTE on
the subsequent fault.  Since KVM no longer write-protects leaf page
tables, update_pte() is unreachable and can be dropped.

Reported-by: Yu Zhang <yu.c.zhang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115004051.4099250-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:17 -05:00
YANG LI
7ca7f3b944 x86: kvm: style: Simplify bool comparison
Fix the following coccicheck warning:
./arch/x86/kvm/x86.c:8012:5-48: WARNING: Comparison to bool

Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Message-Id: <1610357578-66081-1-git-send-email-abaci-bugfix@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:16 -05:00
Sean Christopherson
15b51dc08a KVM: x86: Take KVM's SRCU lock only if steal time update is needed
Enter a SRCU critical section for a memslots lookup during steal time
update if and only if a steal time update is actually needed.  Taking
the lock can be avoided if steal time is disabled by the guest, or if
KVM knows it has already flagged the vCPU as being preempted.

Reword the comment to be more precise as to exactly why memslots will
be queried.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210123000334.3123628-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:14 -05:00
Sean Christopherson
19979fba9b KVM: x86: Remove obsolete disabling of page faults in kvm_arch_vcpu_put()
Remove the disabling of page faults across kvm_steal_time_set_preempted()
as KVM now accesses the steal time struct (shared with the guest) via a
cached mapping (see commit b043138246, "x86/KVM: Make sure
KVM_VCPU_FLUSH_TLB flag is not missed".)  The cache lookup is flagged as
atomic, thus it would be a bug if KVM tried to resolve a new pfn, i.e.
we want the splat that would be reached via might_fault().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210123000334.3123628-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:14 -05:00
Paolo Bonzini
c1c35cf78b KVM: x86: cleanup CR3 reserved bits checks
If not in long mode, the low bits of CR3 are reserved but not enforced to
be zero, so remove those checks.  If in long mode, however, the MBZ bits
extend down to the highest physical address bit of the guest, excluding
the encryption bit.

Make the checks consistent with the above, and match them between
nested_vmcb_checks and KVM_SET_SREGS.

Cc: stable@vger.kernel.org
Fixes: 761e416934 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests")
Fixes: a780a3ea62 ("KVM: X86: Fix reserved bits check for MOV to CR3")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-03 04:30:38 -05:00
Zheng Zhan Liang
b66f9bab12 KVM/x86: assign hva with the right value to vm_munmap the pages
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Zheng Zhan Liang <zhengzhanliang@huorong.cn>
Message-Id: <20210201055310.267029-1-zhengzhanliang@huorong.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-01 12:43:01 -05:00
Paolo Bonzini
7131636e7e KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=off
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST
will generally use the default value for MSR_IA32_ARCH_CAPABILITIES.
When this happens and the host has tsx=on, it is possible to end up with
virtual machines that have HLE and RTM disabled, but TSX_CTRL available.

If the fleet is then switched to tsx=off, kvm_get_arch_capabilities()
will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to
use the tsx=off hosts as migration destinations, even though the guests
do not have TSX enabled.

To allow this migration, allow guests to write to their TSX_CTRL MSR,
while keeping the host MSR unchanged for the entire life of the guests.
This ensures that TSX remains disabled and also saves MSR reads and
writes, and it's okay to do because with tsx=off we know that guests will
not have the HLE and RTM features in their CPUID.  (If userspace sets
bogus CPUID data, we do not expect HLE and RTM to work in guests anyway).

Cc: stable@vger.kernel.org
Fixes: cbbaa2727a ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-01 12:43:00 -05:00
Paolo Bonzini
9a78e15802 KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode.  Therefore we cannot
WARN in that case.

However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time.  Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.

Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25 18:54:09 -05:00
Lorenzo Brescia
d95df95106 kvm: tracing: Fix unmatched kvm_entry and kvm_exit events
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:

 CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
 CPU-1979 [002] 89.871218: kvm_exit:  reason MSR_WRITE
 CPU-1979 [002] 89.871259: kvm_exit:  reason MSR_WRITE

It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.

Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).

A trace obtained with this patch applied looks like this:

 CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
 CPU-14295 [000] 8388.395392: kvm_exit:  reason MSR_WRITE
 CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
 CPU-14295 [000] 8388.395503: kvm_exit:  reason EXTERNAL_INTERRUPT

Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.

Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25 18:52:08 -05:00
Jay Zhou
1f7becf1b7 KVM: x86: get smi pending status correctly
The injection process of smi has two steps:

    Qemu                        KVM
Step1:
    cpu->interrupt_request &= \
        ~CPU_INTERRUPT_SMI;
    kvm_vcpu_ioctl(cpu, KVM_SMI)

                                call kvm_vcpu_ioctl_smi() and
                                kvm_make_request(KVM_REQ_SMI, vcpu);

Step2:
    kvm_vcpu_ioctl(cpu, KVM_RUN, 0)

                                call process_smi() if
                                kvm_check_request(KVM_REQ_SMI, vcpu) is
                                true, mark vcpu->arch.smi_pending = true;

The vcpu->arch.smi_pending will be set true in step2, unfortunately if
vcpu paused between step1 and step2, the kvm_run->immediate_exit will be
set and vcpu has to exit to Qemu immediately during step2 before mark
vcpu->arch.smi_pending true.
During VM migration, Qemu will get the smi pending status from KVM using
KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status
will be lost.

Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25 18:52:07 -05:00
Linus Torvalds
2a190b22aa x86:
* Fixes for the new scalable MMU
 * Fixes for migration of nested hypervisors on AMD
 * Fix for clang integrated assembler
 * Fix for left shift by 64 (UBSAN)
 * Small cleanups
 * Straggler SEV-ES patch
 
 ARM:
 * VM init cleanups
 * PSCI relay cleanups
 * Kill CONFIG_KVM_ARM_PMU
 * Fixup __init annotations
 * Fixup reg_to_encoding()
 * Fix spurious PMCR_EL0 access
 
 * selftests cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/4YOMUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMg0Qf/eGDOoLL18KhA9MpoPXusR5kU+82G
 AcbFj9siNW46EF7mL+sw/xAx+gZaqSpIEmn/f6BzgiaUBdFTv9CKX3B54e43e59G
 HAKD0NpqwvDIi1b0T6bcgC92mY3Qx/IDCc7/9JYjBs/iORfqyWW6xVtkF/Gfymxt
 eK+MnfMqqNZODgR/cZnCH1E48fuwOvRMxLqilLi3OOMSUfs2cQOSLTNfYQYqjeaJ
 dsQ4YeyPJO5JHtfHFr6VPIo/jDhowniac9CNvOomWWVIx2zPYVSl9d8ub6ESEPNF
 GM7UeBCOMmZG/a3qFEZPAUJ7znW4yYE85Z6pjxlrGhd1I54MJi4dd+RApw==
 =5Nfj
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86:
   - Fixes for the new scalable MMU
   - Fixes for migration of nested hypervisors on AMD
   - Fix for clang integrated assembler
   - Fix for left shift by 64 (UBSAN)
   - Small cleanups
   - Straggler SEV-ES patch

  ARM:
   - VM init cleanups
   - PSCI relay cleanups
   - Kill CONFIG_KVM_ARM_PMU
   - Fixup __init annotations
   - Fixup reg_to_encoding()
   - Fix spurious PMCR_EL0 access

  Misc:
   - selftests cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (38 commits)
  KVM: x86: __kvm_vcpu_halt can be static
  KVM: SVM: Add support for booting APs in an SEV-ES guest
  KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
  KVM: nSVM: mark vmcb as dirty when forcingly leaving the guest mode
  KVM: nSVM: correctly restore nested_run_pending on migration
  KVM: x86/mmu: Clarify TDP MMU page list invariants
  KVM: x86/mmu: Ensure TDP MMU roots are freed after yield
  kvm: check tlbs_dirty directly
  KVM: x86: change in pv_eoi_get_pending() to make code more readable
  MAINTAINERS: Really update email address for Sean Christopherson
  KVM: x86: fix shift out of bounds reported by UBSAN
  KVM: selftests: Implement perf_test_util more conventionally
  KVM: selftests: Use vm_create_with_vcpus in create_vm
  KVM: selftests: Factor out guest mode code
  KVM/SVM: Remove leftover __svm_vcpu_run prototype from svm.c
  KVM: SVM: Add register operand to vmsave call in sev_es_vcpu_load
  KVM: x86/mmu: Optimize not-present/MMIO SPTE check in get_mmio_spte()
  KVM: x86/mmu: Use raw level to index into MMIO walks' sptes array
  KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE
  KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()
  ...
2021-01-08 15:06:02 -08:00
Paolo Bonzini
872f36eb0b KVM: x86: __kvm_vcpu_halt can be static
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-08 05:54:44 -05:00
Tom Lendacky
647daca25d KVM: SVM: Add support for booting APs in an SEV-ES guest
Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
where the guest vCPU register state is updated and then the vCPU is VMRUN
to begin execution of the AP. For an SEV-ES guest, this won't work because
the guest register state is encrypted.

Following the GHCB specification, the hypervisor must not alter the guest
register state, so KVM must track an AP/vCPU boot. Should the guest want
to park the AP, it must use the AP Reset Hold exit event in place of, for
example, a HLT loop.

First AP boot (first INIT-SIPI-SIPI sequence):
  Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
  support. It is up to the guest to transfer control of the AP to the
  proper location.

Subsequent AP boot:
  KVM will expect to receive an AP Reset Hold exit event indicating that
  the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
  awaken it. When the AP Reset Hold exit event is received, KVM will place
  the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
  sequence, KVM will make the vCPU runnable. It is again up to the guest
  to then transfer control of the AP to the proper location.

  To differentiate between an actual HLT and an AP Reset Hold, a new MP
  state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
  placed in upon receiving the AP Reset Hold exit event. Additionally, to
  communicate the AP Reset Hold exit event up to userspace (if needed), a
  new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.

A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
a new function that, for non SEV-ES guests, invokes the original SIPI
delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
implements the logic above.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-07 18:11:37 -05:00
Maxim Levitsky
f2c7ef3ba9 KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
It is possible to exit the nested guest mode, entered by
svm_set_nested_state prior to first vm entry to it (e.g due to pending event)
if the nested run was not pending during the migration.

In this case we must not switch to the nested msr permission bitmap.
Also add a warning to catch similar cases in the future.

Fixes: a7d5c7ce41 ("KVM: nSVM: delay MSR permission processing to first nested VM run")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210107093854.882483-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-07 18:11:35 -05:00
Paolo Bonzini
bc351f0726 Merge branch 'kvm-master' into kvm-next
Fixes to get_mmio_spte, destined to 5.10 stable branch.
2021-01-07 18:06:52 -05:00
Linus Torvalds
6a447b0e31 ARM:
* PSCI relay at EL2 when "protected KVM" is enabled
 * New exception injection code
 * Simplification of AArch32 system register handling
 * Fix PMU accesses when no PMU is enabled
 * Expose CSV3 on non-Meltdown hosts
 * Cache hierarchy discovery fixes
 * PV steal-time cleanups
 * Allow function pointers at EL2
 * Various host EL2 entry cleanups
 * Simplification of the EL2 vector allocation
 
 s390:
 * memcg accouting for s390 specific parts of kvm and gmap
 * selftest for diag318
 * new kvm_stat for when async_pf falls back to sync
 
 x86:
 * Tracepoints for the new pagetable code from 5.10
 * Catch VFIO and KVM irqfd events before userspace
 * Reporting dirty pages to userspace with a ring buffer
 * SEV-ES host support
 * Nested VMX support for wait-for-SIPI activity state
 * New feature flag (AVX512 FP16)
 * New system ioctl to report Hyper-V-compatible paravirtualization features
 
 Generic:
 * Selftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
 lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
 BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
 XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
 bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
 drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
 =ISud
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Much x86 work was pushed out to 5.12, but ARM more than made up for it.

  ARM:
   - PSCI relay at EL2 when "protected KVM" is enabled
   - New exception injection code
   - Simplification of AArch32 system register handling
   - Fix PMU accesses when no PMU is enabled
   - Expose CSV3 on non-Meltdown hosts
   - Cache hierarchy discovery fixes
   - PV steal-time cleanups
   - Allow function pointers at EL2
   - Various host EL2 entry cleanups
   - Simplification of the EL2 vector allocation

  s390:
   - memcg accouting for s390 specific parts of kvm and gmap
   - selftest for diag318
   - new kvm_stat for when async_pf falls back to sync

  x86:
   - Tracepoints for the new pagetable code from 5.10
   - Catch VFIO and KVM irqfd events before userspace
   - Reporting dirty pages to userspace with a ring buffer
   - SEV-ES host support
   - Nested VMX support for wait-for-SIPI activity state
   - New feature flag (AVX512 FP16)
   - New system ioctl to report Hyper-V-compatible paravirtualization features

  Generic:
   - Selftest improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
  KVM: SVM: fix 32-bit compilation
  KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
  KVM: SVM: Provide support to launch and run an SEV-ES guest
  KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
  KVM: SVM: Provide support for SEV-ES vCPU loading
  KVM: SVM: Provide support for SEV-ES vCPU creation/loading
  KVM: SVM: Update ASID allocation to support SEV-ES guests
  KVM: SVM: Set the encryption mask for the SVM host save area
  KVM: SVM: Add NMI support for an SEV-ES guest
  KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
  KVM: SVM: Do not report support for SMM for an SEV-ES guest
  KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
  KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
  KVM: SVM: Add support for EFER write traps for an SEV-ES guest
  KVM: SVM: Support string IO operations for an SEV-ES guest
  KVM: SVM: Support MMIO for an SEV-ES guest
  KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
  KVM: SVM: Create trace events for VMGEXIT processing
  ...
2020-12-20 10:44:05 -08:00
Shakeel Butt
93bb59ca5e mm, kvm: account kvm_vcpu_mmap to kmemcg
A VCPU of a VM can allocate couple of pages which can be mmap'ed by the
user space application. At the moment this memory is not charged to the
memcg of the VMM. On a large machine running large number of VMs or
small number of VMs having large number of VCPUs, this unaccounted
memory can be very significant. So, charge this memory to the memcg of
the VMM. Please note that lifetime of these allocations corresponds to
the lifetime of the VMM.

Link: https://lkml.kernel.org/r/20201106202923.2087414-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19 11:18:37 -08:00
Tom Lendacky
16809ecdc1 KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
The run sequence is different for an SEV-ES guest compared to a legacy or
even an SEV guest. The guest vCPU register state of an SEV-ES guest will
be restored on VMRUN and saved on VMEXIT. There is no need to restore the
guest registers directly and through VMLOAD before VMRUN and no need to
save the guest registers directly and through VMSAVE on VMEXIT.

Update the svm_vcpu_run() function to skip register state saving and
restoring and provide an alternative function for running an SEV-ES guest
in vmenter.S

Additionally, certain host state is restored across an SEV-ES VMRUN. As
a result certain register states are not required to be restored upon
VMEXIT (e.g. FS, GS, etc.), so only do that if the guest is not an SEV-ES
guest.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fb1c66d32f2194e171b95fc1a8affd6d326e10c1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:59 -05:00
Tom Lendacky
861377730a KVM: SVM: Provide support for SEV-ES vCPU loading
An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
hardware will restore certain registers on VMEXIT, but not save them on
VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
following changes:

General vCPU load changes:
  - During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
    save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
    area as these registers will be restored on VMEXIT.

General vCPU put changes:
  - Do not attempt to restore registers that SEV-ES hardware has already
    restored on VMEXIT.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:59 -05:00
Tom Lendacky
ed02b21309 KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
The guest FPU state is automatically restored on VMRUN and saved on VMEXIT
by the hardware, so there is no reason to do this in KVM. Eliminate the
allocation of the guest_fpu save area and key off that to skip operations
related to the guest FPU state.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <173e429b4d0d962c6a443c4553ffdaf31b7665a4.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:56 -05:00
Tom Lendacky
5719455fbd KVM: SVM: Do not report support for SMM for an SEV-ES guest
SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
kvm_x86_ops function to take a struct kvm parameter so that the capability
can be reported at a VM level.

Since this op is also called during KVM initialization and before a struct
kvm instance is available, comments will be added to each implementation
of has_emulated_msr() to indicate the kvm parameter can be null.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:55 -05:00
Tom Lendacky
5265713a07 KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
Since many of the registers used by the SEV-ES are encrypted and cannot
be read or written, adjust the __get_sregs() / __set_sregs() to take into
account whether the VMSA/guest state is encrypted.

For __get_sregs(), return the actual value that is in use by the guest
for all registers being tracked using the write trap support.

For __set_sregs(), skip setting of all guest registers values.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <23051868db76400a9b07a2020525483a1e62dbcf.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:54 -05:00
Tom Lendacky
5b51cb1316 KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.

SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.

Add support to track the value of the guest CR4 register using the control
register write trap so that the hypervisor understands the guest operating
mode.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:53 -05:00
Tom Lendacky
f27ad38aac KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.

SEV-ES support introduces new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.

Add support to track the value of the guest CR0 register using the control
register write trap so that the hypervisor understands the guest operating
mode.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:52 -05:00
Tom Lendacky
7ed9abfe8e KVM: SVM: Support string IO operations for an SEV-ES guest
For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.

For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:51 -05:00
Tom Lendacky
8f423a80d2 KVM: SVM: Support MMIO for an SEV-ES guest
For an SEV-ES guest, MMIO is performed to a shared (un-encrypted) page
so that both the hypervisor and guest can read or write to it and each
see the contents.

The GHCB specification provides software-defined VMGEXIT exit codes to
indicate a request for an MMIO read or an MMIO write. Add support to
recognize the MMIO requests and invoke SEV-ES specific routines that
can complete the MMIO operation. These routines use common KVM support
to complete the MMIO operation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <af8de55127d5bcc3253d9b6084a0144c12307d4d.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:50 -05:00
Tom Lendacky
59e38b58de KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
Add trace events for entry to and exit from VMGEXIT MSR protocol
processing. The vCPU will be common for the trace events. The MSR
protocol processing is guided by the GHCB GPA in the VMCB, so the GHCB
GPA will represent the input and output values for the entry and exit
events, respectively. Additionally, the exit event will contain the
return code for the event.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c5b3b440c3e0db43ff2fc02813faa94fa54896b0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:49 -05:00
Tom Lendacky
d523ab6ba2 KVM: SVM: Create trace events for VMGEXIT processing
Add trace events for entry to and exit from VMGEXIT processing. The vCPU
id and the exit reason will be common for the trace events. The exit info
fields will represent the input and output values for the entry and exit
events, respectively.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <25357dca49a38372e8f483753fb0c1c2a70a6898.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:49 -05:00
Tom Lendacky
8d4846b9b1 KVM: SVM: Prevent debugging under SEV-ES
Since the guest register state of an SEV-ES guest is encrypted, debugging
is not supported. Update the code to prevent guest debugging when the
guest has protected state.

Additionally, an SEV-ES guest must only and always intercept DR7 reads and
writes. Update set_dr_intercepts() and clr_dr_intercepts() to account for
this.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8db966fa2f9803d6454ce773863025d0e2e7f3cc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:46 -05:00
Tom Lendacky
f1c6366e30 KVM: SVM: Add required changes to support intercepts under SEV-ES
When a guest is running under SEV-ES, the hypervisor cannot access the
guest register state. There are numerous places in the KVM code where
certain registers are accessed that are not allowed to be accessed (e.g.
RIP, CR0, etc). Add checks to prevent register accesses and add intercept
update support at various points within the KVM code.

Also, when handling a VMGEXIT, exceptions are passed back through the
GHCB. Since the RDMSR/WRMSR intercepts (may) inject a #GP on error,
update the SVM intercepts to handle this for SEV-ES guests.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[Redo MSR part using the .complete_emulated_msr callback. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:45 -05:00
Paolo Bonzini
f9a4d62176 KVM: x86: introduce complete_emulated_msr callback
This will be used by SEV-ES to inject MSR failure via the GHCB.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:34 -05:00
Paolo Bonzini
8b474427cb KVM: x86: use kvm_complete_insn_gp in emulating RDMSR/WRMSR
Simplify the four functions that handle {kernel,user} {rd,wr}msr, there
is still some repetition between the two instances of rdmsr but the
whole business of calling kvm_inject_gp and kvm_skip_emulated_instruction
can be unified nicely.

Because complete_emulated_wrmsr now becomes essentially a call to
kvm_complete_insn_gp, remove complete_emulated_msr.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-15 05:20:09 -05:00
Paolo Bonzini
71cc849b70 KVM: x86: Fix split-irqchip vs interrupt injection window request
kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
a hodge-podge of conditions, hacked together to get something that
more or less works.  But what is actually needed is much simpler;
in both cases the fundamental question is, do we have a place to stash
an interrupt if userspace does KVM_INTERRUPT?

In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
unnecessarily restrictive.

In split irqchip mode it's a bit more complicated, we need to check
kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
cycle and thus requires ExtINTs not to be masked) as well as
!pending_userspace_extint(vcpu).  However, there is no need to
check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
pending ExtINT state separate from event injection state, and checking
kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
priority than APIC interrupts.  In fact the latter fixes a bug:
when userspace requests an IRQ window vmexit, an interrupt in the
local APIC can cause kvm_cpu_has_interrupt() to be true and thus
kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
happens, vcpu_run does not exit to userspace but the interrupt window
vmexits keep occurring.  The VM loops without any hope of making progress.

Once we try to fix these with something like

     return kvm_arch_interrupt_allowed(vcpu) &&
-        !kvm_cpu_has_interrupt(vcpu) &&
-        !kvm_event_needs_reinjection(vcpu) &&
-        kvm_cpu_accept_dm_intr(vcpu);
+        (!lapic_in_kernel(vcpu)
+         ? !vcpu->arch.interrupt.injected
+         : (kvm_apic_accept_pic_intr(vcpu)
+            && !pending_userspace_extint(v)));

we realize two things.  First, thanks to the previous patch the complex
conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
window request in vcpu_enter_guest()

        bool req_int_win =
                dm_request_for_irq_injection(vcpu) &&
                kvm_cpu_accept_dm_intr(vcpu);

should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
it is unnecessary to ask the processor for an interrupt window
if we would not be able to return to userspace.  Therefore,
kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
ANDed with the existing check for masked ExtINT.  It all makes sense:

- we can accept an interrupt from userspace if there is a place
  to stash it (and, for irqchip split, ExtINTs are not masked).
  Interrupts from userspace _can_ be accepted even if right now
  EFLAGS.IF=0.

- in order to tell userspace we will inject its interrupt ("IRQ
  window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
  KVM and the vCPU need to be ready to accept the interrupt.

... and this is what the patch implements.

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Analyzed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
2020-11-27 09:27:28 -05:00
Jim Mattson
2259c17f01 kvm: x86: Sink cpuid update into vendor-specific set_cr4 functions
On emulated VM-entry and VM-exit, update the CPUID bits that reflect
CR4.OSXSAVE and CR4.PKE.

This fixes a bug where the CPUID bits could continue to reflect L2 CR4
values after emulated VM-exit to L1. It also fixes a related bug where
the CPUID bits could continue to reflect L1 CR4 values after emulated
VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
CPUID is not a required intercept. However, it could also be relevant
to VMX, because the code to conditionally update these CPUID bits
assumes that the guest CPUID and the guest CR4 are always in sync.

Fixes: 8eb3f87d90 ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
Fixes: 2acf923e38 ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
Fixes: b9baba8614 ("KVM, pkeys: expose CPUID/CR4 to guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Dexuan Cui <dexuan.cui@intel.com>
Cc: Huaitong Han <huaitong.han@intel.com>
Message-Id: <20201029170648.483210-1-jmattson@google.com>
2020-11-15 09:49:18 -05:00
Peter Xu
fb04a1eddb KVM: X86: Implement ring-based dirty memory tracking
This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are be dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue will be there for live migration when the guest memory
is huge while the page dirty procedure is trivial.  In that case for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables dirty ring for X86 only.  However it should be
easily extended to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:15 -05:00
Peter Xu
ff5a983cbb KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
Originally, we have three code paths that can dirty a page without
vcpu context for X86:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be setup on
destination VM no matter what (and the guest cannot even see them), so
it does not make sense to track them at all.

To do this, allow __x86_set_memory_region() to return the userspace
address that just allocated to the caller.  Then in both of the
functions we directly write to the userspace address instead of
calling kvm_write_*() APIs.

Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map() because no
matter what we'll write to the whole page with 4M huge page entries.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012044.5151-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:12 -05:00
Vitaly Kuznetsov
c21d54f030 KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl
KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl but its output is now
independent from vCPU and in some cases VMMs may want to use it as a system
ioctl instead. In particular, QEMU doesn CPU feature expansion before any
vCPU gets created so KVM_GET_SUPPORTED_HV_CPUID can't be used.

Convert KVM_GET_SUPPORTED_HV_CPUID to 'dual' system/vCPU ioctl with the
same meaning.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:11 -05:00
Sean Christopherson
ee69c92bac KVM: x86: Return bool instead of int for CR4 and SREGS validity checks
Rework the common CR4 and SREGS checks to return a bool instead of an
int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
clarify the polarity of the return value (which is effectively inverted
by this change).

No functional changed intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:08 -05:00
Sean Christopherson
c2fe3cd460 KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook
Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
KVM_SET_SREGS would return success while failing to actually set CR4.

Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
in __set_sregs() is not a viable option as KVM has already stuffed a
variety of vCPU state.

Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
inverted semantics.  This will be remedied in a future patch.

Fixes: 5e1746d620 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:07 -05:00
Babu Moger
0107973a80 KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
SEV guests fail to boot on a system that supports the PCID feature.

While emulating the RSM instruction, KVM reads the guest CR3
and calls kvm_set_cr3(). If the vCPU is in the long mode,
kvm_set_cr3() does a sanity check for the CR3 value. In this case,
it validates whether the value has any reserved bits set. The
reserved bit range is 63:cpuid_maxphysaddr(). When AMD memory
encryption is enabled, the memory encryption bit is set in the CR3
value. The memory encryption bit may fall within the KVM reserved
bit range, causing the KVM emulation failure.

Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
cache the reserved bits in the CR3 value. This will be initialized
to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).

If the architecture has any special bits(like AMD SEV encryption bit)
that needs to be masked from the reserved bits, should be cleared
in vendor specific kvm_x86_ops.vcpu_after_set_cpuid handler.

Fixes: a780a3ea62 ("KVM: X86: Fix reserved bits check for MOV to CR3")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-13 06:28:37 -05:00
Pankaj Gupta
2cdef91cf8 KVM: x86: handle MSR_IA32_DEBUGCTLMSR with report_ignored_msrs
Windows2016 guest tries to enable LBR by setting the corresponding bits
in MSR_IA32_DEBUGCTLMSR. KVM does not emulate MSR_IA32_DEBUGCTLMSR and
spams the host kernel logs with error messages like:

	kvm [...]: vcpu1, guest rIP: 0xfffff800a8b687d3 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop"

This patch fixes this by enabling error logging only with
'report_ignored_msrs=1'.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Message-Id: <20201105153932.24316-1-pankaj.gupta.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:31 -05:00
Oliver Upton
1e293d1ae8 kvm: x86: request masterclock update any time guest uses different msr
Commit 5b9bb0ebbc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME)
emulation in helper fn", 2020-10-21) subtly changed the behavior of guest
writes to MSR_KVM_SYSTEM_TIME(_NEW). Restore the previous behavior; update
the masterclock any time the guest uses a different msr than before.

Fixes: 5b9bb0ebbc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn", 2020-10-21)
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20201027231044.655110-6-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:30 -05:00
Oliver Upton
01b4f510b9 kvm: x86: ensure pv_cpuid.features is initialized when enabling cap
Make the paravirtual cpuid enforcement mechanism idempotent to ioctl()
ordering by updating pv_cpuid.features whenever userspace requests the
capability. Extract this update out of kvm_update_cpuid_runtime() into a
new helper function and move its other call site into
kvm_vcpu_after_set_cpuid() where it more likely belongs.

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20201027231044.655110-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:29 -05:00
Oliver Upton
1930e5ddce kvm: x86: reads of restricted pv msrs should also result in #GP
commit 66570e966d ("kvm: x86: only provide PV features if enabled in
guest's CPUID") only protects against disallowed guest writes to KVM
paravirtual msrs, leaving msr reads unchecked. Fix this by enforcing
KVM_CPUID_FEATURES for msr reads as well.

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20201027231044.655110-4-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:29 -05:00
Maxim Levitsky
cc4cb01767 KVM: x86: use positive error values for msr emulation that causes #GP
Recent introduction of the userspace msr filtering added code that uses
negative error codes for cases that result in either #GP delivery to
the guest, or handled by the userspace msr filtering.

This breaks an assumption that a negative error code returned from the
msr emulation code is a semi-fatal error which should be returned
to userspace via KVM_RUN ioctl and usually kill the guest.

Fix this by reusing the already existing KVM_MSR_RET_INVALID error code,
and by adding a new KVM_MSR_RET_FILTERED error code for the
userspace filtered msrs.

Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201101115523.115780-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:28 -05:00
Takashi Iwai
d383b3146d KVM: x86: Fix NULL dereference at kvm_msr_ignored_check()
The newly introduced kvm_msr_ignored_check() tries to print error or
debug messages via vcpu_*() macros, but those may cause Oops when NULL
vcpu is passed for KVM_GET_MSRS ioctl.

Fix it by replacing the print calls with kvm_*() macros.

(Note that this will leave vcpu argument completely unused in the
 function, but I didn't touch it to make the fix as small as
 possible.  A clean up may be applied later.)

Fixes: 12bc2132b1 ("KVM: X86: Do the same ignore_msrs check for feature msrs")
BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178280
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Message-Id: <20201030151414.20165-1-tiwai@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-30 13:40:31 -04:00
Paolo Bonzini
c0623f5e5d Merge branch 'kvm-fixes' into 'next'
Pick up bugfixes from 5.9, otherwise various tests fail.
2020-10-21 18:05:58 -04:00
Maxim Levitsky
72f211ecaa KVM: x86: allow kvm_x86_ops.set_efer to return an error value
This will be used to signal an error to the userspace, in case
the vendor code failed during handling of this msr. (e.g -ENOMEM)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201001112954.6258-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:48:48 -04:00
Maxim Levitsky
7dffecaf4e KVM: x86: report negative values from wrmsr emulation to userspace
This will allow the KVM to report such errors (e.g -ENOMEM)
to the userspace.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201001112954.6258-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:34 -04:00
Maxim Levitsky
36385ccc9b KVM: x86: xen_hvm_config: cleanup return values
Return 1 on errors that are caused by wrong guest behavior
(which will inject #GP to the guest)

And return a negative error value on issues that are
the kernel's fault (e.g -ENOMEM)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201001112954.6258-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:34 -04:00
Vitaly Kuznetsov
255cbecfe0 KVM: x86: allocate vcpu->arch.cpuid_entries dynamically
The current limit for guest CPUID leaves (KVM_MAX_CPUID_ENTRIES, 80)
is reported to be insufficient but before we bump it let's switch to
allocating vcpu->arch.cpuid_entries[] array dynamically. Currently,
'struct kvm_cpuid_entry2' is 40 bytes so vcpu->arch.cpuid_entries is
3200 bytes which accounts for 1/4 of the whole 'struct kvm_vcpu_arch'
but having it pre-allocated (for all vCPUs which we also pre-allocate)
gives us no real benefits.

Another plus of the dynamic allocation is that we now do kvm_check_cpuid()
check before we assign anything to vcpu->arch.cpuid_nent/cpuid_entries so
no changes are made in case the check fails.

Opportunistically remove unneeded 'out' labels from
kvm_vcpu_ioctl_set_cpuid()/kvm_vcpu_ioctl_set_cpuid2() and return
directly whenever possible.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20201001130541.1398392-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
2020-10-21 17:36:33 -04:00
Oliver Upton
66570e966d kvm: x86: only provide PV features if enabled in guest's CPUID
KVM unconditionally provides PV features to the guest, regardless of the
configured CPUID. An unwitting guest that doesn't check
KVM_CPUID_FEATURES before use could access paravirt features that
userspace did not intend to provide. Fix this by checking the guest's
CPUID before performing any paravirtual operations.

Introduce a capability, KVM_CAP_ENFORCE_PV_FEATURE_CPUID, to gate the
aforementioned enforcement. Migrating a VM from a host w/o this patch to
a host with this patch could silently change the ABI exposed to the
guest, warranting that we default to the old behavior and opt-in for
the new one.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Change-Id: I202a0926f65035b872bfe8ad15307c026de59a98
Message-Id: <20200818152429.1923996-4-oupton@google.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:32 -04:00
Oliver Upton
210dfd93ea kvm: x86: set wall_clock in kvm_write_wall_clock()
Small change to avoid meaningless duplication in the subsequent patch.
No functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Change-Id: I77ab9cdad239790766b7a49d5cbae5e57a3005ea
Message-Id: <20200818152429.1923996-3-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:31 -04:00
Oliver Upton
5b9bb0ebbc kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn
No functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Change-Id: I7cbe71069db98d1ded612fd2ef088b70e7618426
Message-Id: <20200818152429.1923996-2-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:31 -04:00
Paolo Bonzini
043248b328 KVM: VMX: Forbid userspace MSR filters for x2APIC
Allowing userspace to intercept reads to x2APIC MSRs when APICV is
fully enabled for the guest simply can't work.   But more in general,
the LAPIC could be set to in-kernel after the MSR filter is setup
and allowing accesses by userspace would be very confusing.

We could in principle allow userspace to intercept reads and writes to TPR,
and writes to EOI and SELF_IPI, but while that could be made it work, it
would still be silly.

Cc: Alexander Graf <graf@amazon.com>
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:24 -04:00
Sean Christopherson
9389b9d5d3 KVM: VMX: Ignore userspace MSR filters for x2APIC
Rework the resetting of the MSR bitmap for x2APIC MSRs to ignore userspace
filtering.  Allowing userspace to intercept reads to x2APIC MSRs when
APICV is fully enabled for the guest simply can't work; the LAPIC and thus
virtual APIC is in-kernel and cannot be directly accessed by userspace.
To keep things simple we will in fact forbid intercepting x2APIC MSRs
altogether, independent of the default_allow setting.

Cc: Alexander Graf <graf@amazon.com>
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201005195532.8674-3-sean.j.christopherson@intel.com>
[Modified to operate even if APICv is disabled, adjust documentation. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 17:36:19 -04:00
Paolo Bonzini
0c899c25d7 KVM: x86: do not attempt TSC synchronization on guest writes
KVM special-cases writes to MSR_IA32_TSC so that all CPUs have
the same base for the TSC.  This logic is complicated, and we
do not want it to have any effect once the VM is started.

In particular, if any guest started to synchronize its TSCs
with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST,
the additional effect of kvm_write_tsc code would be uncharted
territory.

Therefore, this patch makes writes to MSR_IA32_TSC behave
essentially the same as writes to MSR_IA32_TSC_ADJUST when
they come from the guest.  A new selftest (which passes
both before and after the patch) checks the current semantics
of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating
from both the host and the guest.

Upcoming work to remove the special side effects
of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST
will be able to build onto this test, adjusting the host side
to use the new APIs and achieve the same effect.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:59:52 -04:00
Paolo Bonzini
729c15c20f KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES
We are going to use it for SVM too, so use a more generic name.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:49 -04:00
Alexander Graf
1a155254ff KVM: x86: Introduce MSR filtering
It's not desireable to have all MSRs always handled by KVM kernel space. Some
MSRs would be useful to handle in user space to either emulate behavior (like
uCode updates) or differentiate whether they are valid based on the CPU model.

To allow user space to specify which MSRs it wants to see handled by KVM,
this patch introduces a new ioctl to push filter rules with bitmaps into
KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
denied MSR events to user space to operate on.

If no filter is populated, MSR handling stays identical to before.

Signed-off-by: Alexander Graf <graf@amazon.com>

Message-Id: <20200925143422.21718-8-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:08 -04:00
Alexander Graf
51de8151bd KVM: x86: Add infrastructure for MSR filtering
In the following commits we will add pieces of MSR filtering.
To ensure that code compiles even with the feature half-merged, let's add
a few stubs and struct definitions before the real patches start.

Signed-off-by: Alexander Graf <graf@amazon.com>

Message-Id: <20200925143422.21718-4-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:05 -04:00
Alexander Graf
1ae099540e KVM: x86: Allow deflecting unknown MSR accesses to user space
MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.

Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not the first category, but still handled
by in-kernel KVM code.

This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.

Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run productive VMs in parallel to experimental ones where you don't care
about proper MSR handling.

Signed-off-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Jim Mattson <jmattson@google.com>

Message-Id: <20200925143422.21718-3-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:04 -04:00
Alexander Graf
90218e434c KVM: x86: Return -ENOENT on unimplemented MSRs
When we find an MSR that we can not handle, bubble up that error code as
MSR error return code. Follow up patches will use that to expose the fact
that an MSR is not handled by KVM to user space.

Suggested-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-2-graf@amazon.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:02 -04:00
Sean Christopherson
7e34fbd05c KVM: x86: Rename "shared_msrs" to "user_return_msrs"
Rename the "shared_msrs" mechanism, which is used to defer restoring
MSRs that are only consumed when running in userspace, to a more banal
but less likely to be confusing "user_return_msrs".

The "shared" nomenclature is confusing as it's not obvious who is
sharing what, e.g. reasonable interpretations are that the guest value
is shared by vCPUs in a VM, or that the MSR value is shared/common to
guest and host, both of which are wrong.

"shared" is also misleading as the MSR value (in hardware) is not
guaranteed to be shared/reused between VMs (if that's indeed the correct
interpretation of the name), as the ability to share values between VMs
is simply a side effect (albiet a very nice side effect) of deferring
restoration of the host value until returning from userspace.

"user_return" avoids the above confusion by describing the mechanism
itself instead of its effects.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:54 -04:00
Sean Christopherson
b2d522552c KVM: x86: Add RIP to the kvm_entry, i.e. VM-Enter, tracepoint
Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit
tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit
tracepoint won't be hit.

Read RIP from within the tracepoint itself to avoid a potential VMREAD
and retpoline if the guest's RIP isn't available.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:49 -04:00