Commit Graph

11727 Commits

Author SHA1 Message Date
Suravee Suthikulpanit
0e311d33bf KVM: SVM: Introduce hybrid-AVIC mode
Currently, AVIC is inhibited when booting a VM w/ x2APIC support.
because AVIC cannot virtualize x2APIC MSR register accesses.
However, the AVIC doorbell can be used to accelerate interrupt
injection into a running vCPU, while all guest accesses to x2APIC MSRs
will be intercepted and emulated by KVM.

With hybrid-AVIC support, the APICV_INHIBIT_REASON_X2APIC is
no longer enforced.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-14-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:51:00 -04:00
Suravee Suthikulpanit
4d1d7942e3 KVM: SVM: Introduce logic to (de)activate x2AVIC mode
Introduce logic to (de)activate AVIC, which also allows
switching between AVIC to x2AVIC mode at runtime.

When an AVIC-enabled guest switches from APIC to x2APIC mode,
the SVM driver needs to perform the following steps:

1. Set the x2APIC mode bit for AVIC in VMCB along with the maximum
APIC ID support for each mode accodingly.

2. Disable x2APIC MSRs interception in order to allow the hardware
to virtualize x2APIC MSRs accesses.

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-12-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:50:35 -04:00
Suravee Suthikulpanit
d2fe6bf5b8 KVM: SVM: Update max number of vCPUs supported for x2AVIC mode
xAVIC and x2AVIC modes can support diffferent number of vcpus.
Update existing logics to support each mode accordingly.

Also, modify the maximum physical APIC ID for AVIC to 255 to reflect
the actual value supported by the architecture.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-5-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:45:08 -04:00
Suravee Suthikulpanit
4bdec12aa8 KVM: SVM: Detect X2APIC virtualization (x2AVIC) support
Add CPUID check for the x2APIC virtualization (x2AVIC) feature.
If available, the SVM driver can support both AVIC and x2AVIC modes
when load the kvm_amd driver with avic=1. The operating mode will be
determined at runtime depending on the guest APIC mode.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-4-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:44:54 -04:00
Suravee Suthikulpanit
bf348f667e KVM: x86: lapic: Rename [GET/SET]_APIC_DEST_FIELD to [GET/SET]_XAPIC_DEST_FIELD
To signify that the macros only support 8-bit xAPIC destination ID.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220519102709.24125-3-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:44:34 -04:00
Suravee Suthikulpanit
aae99a7c9a x86/cpufeatures: Introduce x2AVIC CPUID bit
Introduce a new feature bit for virtualized x2APIC (x2AVIC) in
CPUID_Fn8000000A_EDX [SVM Revision and Feature Identification].

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220519102709.24125-2-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:44:34 -04:00
Jue Wang
281b52780b KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs.
This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A
separate mci_ctl2_banks array is used to keep the existing mce_banks
register layout intact.

In Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of
the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine
Check error reporting is enabled.

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-7-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:03 -04:00
David Matlack
ada51a9de7 KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Splitting a huge page may end up re-using an existing lower level
    shadow page tables. This is unlike the TDP MMU which always allocates
    new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.  As of
this commit, a flush is always done always after dropping the huge page
and before installing the lower level page table.

This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits.  However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).

[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]

Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:00 -04:00
David Matlack
6a97575d5c KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU.  Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page.  The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:58 -04:00
Ben Gardon
084cc29f8b KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.

Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:49 -04:00
Linus Torvalds
ca1fdab7fd Merge tag 'efi-urgent-for-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:

 - remove pointless include of asm/efi.h, which does not exist on ia64

 - fix DXE service marshalling prototype for mixed mode

* tag 'efi-urgent-for-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/x86: libstub: Fix typo in __efi64_argmap* name
  efi: sysfb_efi: remove unnecessary <asm/efi.h> include
2022-06-21 12:20:11 -05:00
Evgeniy Baskov
aa6d1ed107 efi/x86: libstub: Fix typo in __efi64_argmap* name
The actual name of the DXE services function used
is set_memory_space_attributes(), not set_memory_space_descriptor().

Change EFI mixed mode helper macro name to match the function name.

Fixes: 31f1a0edff ("efi/x86: libstub: Make DXE calls mixed mode safe")
Signed-off-by: Evgeniy Baskov <baskov@ispras.ru>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-06-21 18:11:46 +02:00
Sean Christopherson
bfbcc81bb8 KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior
Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT
instructions a NOPs regardless of whether or not they are supported in
guest CPUID.  KVM's current behavior was likely motiviated by a certain
fruity operating system that expects MONITOR/MWAIT to be supported
unconditionally and blindly executes MONITOR/MWAIT without first checking
CPUID.  And because KVM does NOT advertise MONITOR/MWAIT to userspace,
that's effectively the default setup for any VMM that regurgitates
KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2.

Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT.  The
behavior is actually desirable, as userspace VMMs that want to
unconditionally hide MONITOR/MWAIT from the guest can leave the
MISC_ENABLE quirk enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 11:50:42 -04:00
Sean Christopherson
ce0a58f475 KVM: x86: Move "apicv_active" into "struct kvm_lapic"
Move the per-vCPU apicv_active flag into KVM's local APIC instance.
APICv is fully dependent on an in-kernel local APIC, but that's not at
all clear when reading the current code due to the flag being stored in
the generic kvm_vcpu_arch struct.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:24 -04:00
Sean Christopherson
d39850f57d KVM: x86: Drop @vcpu parameter from kvm_x86_ops.hwapic_isr_update()
Drop the unused @vcpu parameter from hwapic_isr_update().  AMD/AVIC is
unlikely to implement the helper, and VMX/APICv doesn't need the vCPU as
it operates on the current VMCS.  The result is somewhat odd, but allows
for a decent amount of (future) cleanup in the APIC code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:23 -04:00
Linus Torvalds
05c6ca8512 Merge tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:

 - Make RESERVE_BRK() work again with older binutils. The recent
   'simplification' broke that.

 - Make early #VE handling increment RIP when successful.

 - Make the #VE code consistent vs. the RIP adjustments and add
   comments.

 - Handle load_unaligned_zeropad() across page boundaries correctly in
   #VE when the second page is shared.

* tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  x86/tdx: Clarify RIP adjustments in #VE handler
  x86/tdx: Fix early #VE handling
  x86/mm: Fix RESERVE_BRK() for older binutils
2022-06-19 09:58:28 -05:00
Linus Torvalds
32efdbffff Merge tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci fix from Bjorn Helgaas:
 "Revert clipping of PCI host bridge windows to avoid E820 regions,
  which broke several machines by forcing unnecessary BAR reassignments
  (Hans de Goede)"

* tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"
2022-06-17 15:12:20 -05:00
Hans de Goede
a2b36ffbf5 x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"
This reverts commit 4c5e242d3e.

Prior to 4c5e242d3e ("x86/PCI: Clip only host bridge windows for E820
regions"), E820 regions did not affect PCI host bridge windows.  We only
looked at E820 regions and avoided them when allocating new MMIO space.
If firmware PCI bridge window and BAR assignments used E820 regions, we
left them alone.

After 4c5e242d3e, we removed E820 regions from the PCI host bridge
windows before looking at BARs, so firmware assignments in E820 regions
looked like errors, and we moved things around to fit in the space left
(if any) after removing the E820 regions.  This unnecessary BAR
reassignment broke several machines.

Guilherme reported that Steam Deck fails to boot after 4c5e242d3e.  We
clipped the window that contained most 32-bit BARs:

  BIOS-e820: [mem 0x00000000a0000000-0x00000000a00fffff] reserved
  acpi PNP0A08:00: clipped [mem 0x80000000-0xf7ffffff window] to [mem 0xa0100000-0xf7ffffff window] for e820 entry [mem 0xa0000000-0xa00fffff]

which forced us to reassign all those BARs, for example, this NVMe BAR:

  pci 0000:00:01.2: PCI bridge to [bus 01]
  pci 0000:00:01.2:   bridge window [mem 0x80600000-0x806fffff]
  pci 0000:01:00.0: BAR 0: [mem 0x80600000-0x80603fff 64bit]
  pci 0000:00:01.2: can't claim window [mem 0x80600000-0x806fffff]: no compatible bridge window
  pci 0000:01:00.0: can't claim BAR 0 [mem 0x80600000-0x80603fff 64bit]: no compatible bridge window

  pci 0000:00:01.2: bridge window: assigned [mem 0xa0100000-0xa01fffff]
  pci 0000:01:00.0: BAR 0: assigned [mem 0xa0100000-0xa0103fff 64bit]

All the reassignments were successful, so the devices should have been
functional at the new addresses, but some were not.

Andy reported a similar failure on an Intel MID platform.  Benjamin
reported a similar failure on a VMWare Fusion VM.

Note: this is not a clean revert; this revert keeps the later change to
make the clipping dependent on a new pci_use_e820 bool, moving the checking
of this bool to arch_remove_reservations().

[bhelgaas: commit log, add more reporters and testers]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216109
Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Reported-by: Jongman Heo <jongman.heo@gmail.com>
Fixes: 4c5e242d3e ("x86/PCI: Clip only host bridge windows for E820 regions")
Link: https://lore.kernel.org/r/20220612144325.85366-1-hdegoede@redhat.com
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2022-06-17 14:24:14 -05:00
Linus Torvalds
2d806a688f Merge tag 'hyperv-fixes-signed-20220617' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:

 - Fix hv_init_clocksource annotation (Masahiro Yamada)

 - Two bug fixes for vmbus driver (Saurabh Sengar)

 - Fix SEV negotiation (Tianyu Lan)

 - Fix comments in code (Xiang Wang)

 - One minor fix to HID driver (Michael Kelley)

* tag 'hyperv-fixes-signed-20220617' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM
  Drivers: hv: vmbus: Release cpu lock in error case
  HID: hyperv: Correctly access fields declared as __le16
  clocksource: hyper-v: unexport __init-annotated hv_init_clocksource()
  Drivers: hv: Fix syntax errors in comments
  Drivers: hv: vmbus: Don't assign VMbus channel interrupts to isolated CPUs
2022-06-17 13:39:12 -05:00
Tianyu Lan
49d6a3c062 x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM
Hyper-V Isolation VM current code uses sev_es_ghcb_hv_call()
to read/write MSR via GHCB page and depends on the sev code.
This may cause regression when sev code changes interface
design.

The latest SEV-ES code requires to negotiate GHCB version before
reading/writing MSR via GHCB page and sev_es_ghcb_hv_call() doesn't
work for Hyper-V Isolation VM. Add Hyper-V ghcb related implementation
to decouple SEV and Hyper-V code. Negotiate GHCB version in the
hyperv_init() and use the version to communicate with Hyper-V
in the ghcb hv call function.

Fixes: 2ea29c5abb ("x86/sev: Save the negotiated GHCB version")
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20220614014553.1915929-1-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-06-15 18:27:40 +00:00
Ma Wupeng
6365a1935c efi: Make code to find mirrored memory ranges generic
Commit b05b9f5f9d ("x86, mirror: x86 enabling - find mirrored memory
ranges") introduce the efi_find_mirror() function on x86. In order to reuse
the API we make it public.

Arm64 can support mirrored memory too, so function efi_find_mirror() is added to
efi_init() to this support for arm64.

Since efi_init() is shared by ARM, arm64 and riscv, this patch will bring
mirror memory support for these architectures, but this support is only tested
in arm64.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Link: https://lore.kernel.org/r/20220614092156.1972846-2-mawupeng1@huawei.com
[ardb: fix subject to better reflect the payload]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-06-15 12:11:19 +02:00
Linus Torvalds
24625f7d91 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "While last week's pull request contained miscellaneous fixes for x86,
  this one covers other architectures, selftests changes, and a bigger
  series for APIC virtualization bugs that were discovered during 5.20
  development. The idea is to base 5.20 development for KVM on top of
  this tag.

  ARM64:

   - Properly reset the SVE/SME flags on vcpu load

   - Fix a vgic-v2 regression regarding accessing the pending state of a
     HW interrupt from userspace (and make the code common with vgic-v3)

   - Fix access to the idreg range for protected guests

   - Ignore 'kvm-arm.mode=protected' when using VHE

   - Return an error from kvm_arch_init_vm() on allocation failure

   - A bunch of small cleanups (comments, annotations, indentation)

  RISC-V:

   - Typo fix in arch/riscv/kvm/vmid.c

   - Remove broken reference pattern from MAINTAINERS entry

  x86-64:

   - Fix error in page tables with MKTME enabled

   - Dirty page tracking performance test extended to running a nested
     guest

   - Disable APICv/AVIC in cases that it cannot implement correctly"

[ This merge also fixes a misplaced end parenthesis bug introduced in
  commit 3743c2f025 ("KVM: x86: inhibit APICv/AVIC on changes to APIC
  ID or APIC base") pointed out by Sean Christopherson ]

Link: https://lore.kernel.org/all/20220610191813.371682-1-seanjc@google.com/

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits)
  KVM: selftests: Restrict test region to 48-bit physical addresses when using nested
  KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2
  KVM: selftests: Clean up LIBKVM files in Makefile
  KVM: selftests: Link selftests directly with lib object files
  KVM: selftests: Drop unnecessary rule for STATIC_LIBS
  KVM: selftests: Add a helper to check EPT/VPID capabilities
  KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h
  KVM: selftests: Refactor nested_map() to specify target level
  KVM: selftests: Drop stale function parameter comment for nested_map()
  KVM: selftests: Add option to create 2M and 1G EPT mappings
  KVM: selftests: Replace x86_page_size with PG_LEVEL_XX
  KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE
  KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put
  KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking
  KVM: x86: disable preemption while updating apicv inhibition
  KVM: x86: SVM: fix avic_kick_target_vcpus_fast
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base
  KVM: x86: document AVIC/APICv inhibit reasons
  KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs
  ...
2022-06-14 07:57:18 -07:00
Linus Torvalds
8e8afafb0b Merge tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MMIO stale data fixes from Thomas Gleixner:
 "Yet another hw vulnerability with a software mitigation: Processor
  MMIO Stale Data.

  They are a class of MMIO-related weaknesses which can expose stale
  data by propagating it into core fill buffers. Data which can then be
  leaked using the usual speculative execution methods.

  Mitigations include this set along with microcode updates and are
  similar to MDS and TAA vulnerabilities: VERW now clears those buffers
  too"

* tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/mmio: Print SMT warning
  KVM: x86/speculation: Disable Fill buffer clear within guests
  x86/speculation/mmio: Reuse SRBDS mitigation for SBDS
  x86/speculation/srbds: Update SRBDS mitigation selection
  x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data
  x86/speculation/mmio: Enable CPU Fill buffer clearing on idle
  x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations
  x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data
  x86/speculation: Add a common function for MD_CLEAR mitigation update
  x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug
  Documentation: Add documentation for Processor MMIO Stale Data
2022-06-14 07:43:15 -07:00
Sandipan Das
c390241a93 perf/x86/amd/uncore: Add PerfMonV2 DF event format
If AMD Performance Monitoring Version 2 (PerfMonV2) is
supported, use bits 0-7, 32-37 as EventSelect and bits
8-15, 24-27 as UnitMask for Data Fabric (DF) events.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/ffc24d5a3375b1d6e457d88e83241114de5c1942.1652954372.git.sandipan.das@amd.com
2022-06-13 10:15:14 +02:00
Sandipan Das
16b48c3f5e perf/x86/amd/uncore: Detect available DF counters
If AMD Performance Monitoring Version 2 (PerfMonV2) is
supported, use CPUID leaf 0x80000022 EBX to detect the
number of Data Fabric (DF) PMCs. This offers more
flexibility if the counts change in later processor
families.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/bac7b2806561e03f2acc7fdc9db94f102df80e1d.1652954372.git.sandipan.das@amd.com
2022-06-13 10:15:13 +02:00
Josh Poimboeuf
e32683c6f7 x86/mm: Fix RESERVE_BRK() for older binutils
With binutils 2.26, RESERVE_BRK() causes a build failure:

  /tmp/ccnGOKZ5.s: Assembler messages:
  /tmp/ccnGOKZ5.s:98: Error: missing ')'
  /tmp/ccnGOKZ5.s:98: Error: missing ')'
  /tmp/ccnGOKZ5.s:98: Error: missing ')'
  /tmp/ccnGOKZ5.s:98: Error: junk at end of line, first unrecognized
  character is `U'

The problem is this line:

  RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE)

Specifically, the INIT_PGT_BUF_SIZE macro which (via PAGE_SIZE's use
_AC()) has a "1UL", which makes older versions of the assembler unhappy.
Unfortunately the _AC() macro doesn't work for inline asm.

Inline asm was only needed here to convince the toolchain to add the
STT_NOBITS flag.  However, if a C variable is placed in a section whose
name is prefixed with ".bss", GCC and Clang automatically set
STT_NOBITS.  In fact, ".bss..page_aligned" already relies on this trick.

So fix the build failure (and simplify the macro) by allocating the
variable in C.

Also, add NOLOAD to the ".brk" output section clause in the linker
script.  This is a failsafe in case the ".bss" prefix magic trick ever
stops working somehow.  If there's a section type mismatch, the GNU
linker will force the ".brk" output section to be STT_NOBITS.  The LLVM
linker will fail with a "section type mismatch" error.

Note this also changes the name of the variable from .brk.##name to
__brk_##name.  The variable names aren't actually used anywhere, so it's
harmless.

Fixes: a1e2c031ec ("x86/mm: Simplify RESERVE_BRK()")
Reported-by: Joe Damato <jdamato@fastly.com>
Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/22d07a44c80d8e8e1e82b9a806ddc8c6bbb2606e.1654759036.git.jpoimboe@kernel.org
2022-06-13 10:15:04 +02:00
Paolo Bonzini
e15f5e6fa6 Merge branch 'kvm-5.20-early'
s390:

* add an interface to provide a hypervisor dump for secure guests

* improve selftests to show tests

x86:

* Intel IPI virtualization

* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS

* PEBS virtualization

* Simplify PMU emulation by just using PERF_TYPE_RAW events

* More accurate event reinjection on SVM (avoid retrying instructions)

* Allow getting/setting the state of the speaker port data bit

* Rewrite gfn-pfn cache refresh

* Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent

* "Notify" VM exit
2022-06-09 11:38:12 -04:00
Maxim Levitsky
3743c2f025 KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base
Neither of these settings should be changed by the guest and it is
a burden to support it in the acceleration code, so just inhibit
this code instead.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:18 -04:00
Maxim Levitsky
a9603ae0e4 KVM: x86: document AVIC/APICv inhibit reasons
These days there are too many AVIC/APICv inhibit
reasons, and it doesn't hurt to have some documentation
for them.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:17 -04:00
Paolo Bonzini
66da65005a Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 5.19, take #1

- Typo fix in arch/riscv/kvm/vmid.c

- Remove broken reference pattern from MAINTAINERS entry
2022-06-09 09:45:00 -04:00
Wyes Karny
6f33a9daff x86: Fix comment for X86_FEATURE_ZEN
The feature X86_FEATURE_ZEN implies that the CPU based on Zen
microarchitecture. Call this out explicitly in the comment.

Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lkml.kernel.org/r/9931b01a85120a0d1faf0f244e8de3f2190e774c.1654538381.git-series.wyes.karny@amd.com
2022-06-08 13:01:58 -07:00
Wyes Karny
aebef63cf7 x86: Remove vendor checks from prefer_mwait_c1_over_halt
Remove vendor checks from prefer_mwait_c1_over_halt function. Restore
the decision tree to support MWAIT C1 as the default idle state based on
CPUID checks as done by Thomas Gleixner in
commit 09fd4b4ef5 ("x86: use cpuid to check MWAIT support for C1")

The decision tree is removed in
commit 69fb3676df ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")

Prefer MWAIT when the following conditions are satisfied:
    1. CPUID_Fn00000001_ECX [Monitor] should be set
    2. CPUID_Fn00000005 should be supported
    3. If CPUID_Fn00000005_ECX [EMX] is set then there should be
       at least one C1 substate available, indicated by
       CPUID_Fn00000005_EDX [MWaitC1SubStates] bits.

Otherwise use HLT for default_idle function.

HPC customers who want to optimize for lower latency are known to
disable Global C-States in the BIOS. In fact, some vendors allow
choosing a BIOS 'performance' profile which explicitly disables
C-States.  In this scenario, the cpuidle driver will not be loaded and
the kernel will continue with the default idle state chosen at boot
time. On AMD systems currently the default idle state is HLT which has
a higher exit latency compared to MWAIT.

The reason for the choice of HLT over MWAIT on AMD systems is:

1. Families prior to 10h didn't support MWAIT
2. Families 10h-15h supported MWAIT, but not MWAIT C1. Hence it was
   preferable to use HLT as the default state on these systems.

However, AMD Family 17h onwards supports MWAIT as well as MWAIT C1. And
it is preferable to use MWAIT as the default idle state on these
systems, as it has lower exit latencies.

The below table represents the exit latency for HLT and MWAIT on AMD
Zen 3 system. Exit latency is measured by issuing a wakeup (IPI) to
other CPU and measuring how many clock cycles it took to wakeup.  Each
iteration measures 10K wakeups by pinning source and destination.

HLT:

25.0000th percentile  :      1900 ns
50.0000th percentile  :      2000 ns
75.0000th percentile  :      2300 ns
90.0000th percentile  :      2500 ns
95.0000th percentile  :      2600 ns
99.0000th percentile  :      2800 ns
99.5000th percentile  :      3000 ns
99.9000th percentile  :      3400 ns
99.9500th percentile  :      3600 ns
99.9900th percentile  :      5900 ns
  Min latency         :      1700 ns
  Max latency         :      5900 ns
Total Samples      9999

MWAIT:

25.0000th percentile  :      1400 ns
50.0000th percentile  :      1500 ns
75.0000th percentile  :      1700 ns
90.0000th percentile  :      1800 ns
95.0000th percentile  :      1900 ns
99.0000th percentile  :      2300 ns
99.5000th percentile  :      2500 ns
99.9000th percentile  :      3200 ns
99.9500th percentile  :      3500 ns
99.9900th percentile  :      4600 ns
  Min latency         :      1200 ns
  Max latency         :      4600 ns
Total Samples      9997

Improvement (99th percentile): 21.74%

Below is another result for context_switch2 micro-benchmark, which
brings out the impact of improved wakeup latency through increased
context-switches per second.

with HLT:
-------------------------------
50.0000th percentile  :  190184
75.0000th percentile  :  191032
90.0000th percentile  :  192314
95.0000th percentile  :  192520
99.0000th percentile  :  192844
MIN  :  190148
MAX  :  192852

with MWAIT:
-------------------------------
50.0000th percentile  :  277444
75.0000th percentile  :  278268
90.0000th percentile  :  278888
95.0000th percentile  :  279164
99.0000th percentile  :  280504
MIN  :  273278
MAX  :  281410

Improvement(99th percentile): ~ 45.46%

Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://ozlabs.org/~anton/junkcode/context_switch2.c
Link: https://lkml.kernel.org/r/0cc675d8fd1f55e41b510e10abf2e21b6e9803d5.1654538381.git-series.wyes.karny@amd.com
2022-06-08 13:00:19 -07:00
Paul Durrant
b172862241 KVM: x86: PIT: Preserve state of speaker port data bit
Currently the state of the speaker port (0x61) data bit (bit 1) is not
saved in the exported state (kvm_pit_state2) and hence is lost when
re-constructing guest state.

This patch removes the 'speaker_data_port' field from kvm_kpit_state and
instead tracks the state using a new KVM_PIT_FLAGS_SPEAKER_DATA_ON flag
defined in the API.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Message-Id: <20220531124421.1427-1-pdurrant@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 13:06:20 -04:00
Linus Torvalds
34f4335c16 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:

 - syzkaller NULL pointer dereference

 - TDP MMU performance issue with disabling dirty logging

 - 5.14 regression with SVM TSC scaling

 - indefinite stall on applying live patches

 - unstable selftest

 - memory leak from wrong copy-and-paste

 - missed PV TLB flush when racing with emulation

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  KVM: x86: do not set st->preempted when going back to user space
  KVM: SVM: fix tsc scaling cache logic
  KVM: selftests: Make hyperv_clock selftest more stable
  KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
  x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
  KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
  entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  KVM: Don't null dereference ops->destroy
2022-06-08 09:16:31 -07:00
Tao Xu
2f4073e08f KVM: VMX: Enable Notify VM exit
There are cases that malicious virtual machines can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs.

VMM can enable notify VM exit that a VM exit generated if no event
window occurs in VM non-root mode for a specified amount of time (notify
window).

Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
  enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
  the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
  can query and enable this feature in per-VM scope. The argument is a
  64bit value: bits 63:32 are used for notify window, and bits 31:0 are
  for flags. Current supported flags:
  - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
    window provided.
  - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
- It's safe to even set notify window to zero since an internal hardware
  threshold is added to vmcs.notify_window.

VM exit handling:
- Introduce a vcpu state notify_window_exits to records the count of
  notify VM exits and expose it through the debugfs.
- Notify VM exit can happen incident to delivery of a vector event.
  Allow it in KVM.
- Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
  bit is set.

Nested handling
- Nested notify VM exits are not supported yet. Keep the same notify
  window control in vmcs02 as vmcs01, so that L1 can't escape the
  restriction of notify VM exits through launching L2 VM.

Notify VM exit is defined in latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:56:24 -04:00
Sean Christopherson
938c8745bc KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings
Add kvm_caps to hold a variety of capabilites and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature.  The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:21:16 -04:00
Chenyi Qiang
ed2351174e KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault
For the triple fault sythesized by KVM, e.g. the RSM path or
nested_vmx_abort(), if KVM exits to userspace before the request is
serviced, userspace could migrate the VM and lose the triple fault.

Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault with a
new event KVM_VCPUEVENT_VALID_FAULT_FAULT so that userspace can save and
restore the triple fault event. This extension is guarded by a new KVM
capability KVM_CAP_TRIPLE_FAULT_EVENT.

Note that in the set_vcpu_events path, userspace is able to set/clear
the triple fault request through triple_fault.pending field.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-2-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:20:53 -04:00
Like Xu
7aadaa988c KVM: x86/pmu: Drop amd_event_mapping[] in the KVM context
All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW,
which means that the table that maps perf_hw_id to event select values is
no longer useful, at least for AMD.

For Intel, the logic to check if the pmu event reported by Intel cpuid is
not available is still required, in which case pmc_perf_hw_id() could be
renamed to hw_event_is_unavail() and a bool value is returned to replace
the semantics of "PERF_COUNT_HW_MAX+1".

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220518132512.37864-12-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:49:06 -04:00
Like Xu
dc852ff5bb perf: x86/core: Add interface to query perfmon_event_map[] directly
Currently, we have [intel|knc|p4|p6]_perfmon_event_map on the Intel
platforms and amd_[f17h]_perfmon_event_map on the AMD platforms.

Early clumsy KVM code or other potential perf_event users may have
hard-coded these perfmon_maps (e.g., arch/x86/kvm/svm/pmu.c), so
it would not make sense to program a common hardware event based
on the generic "enum perf_hw_id" once the two tables do not match.

Let's provide an interface for callers outside the perf subsystem to get
the counter config based on the perfmon_event_map currently in use,
and it also helps to save bytes.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Like Xu <likexu@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220518132512.37864-10-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:49:01 -04:00
Like Xu
854250329c KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
The guest PEBS will be disabled when some users try to perf KVM and
its user-space through the same PEBS facility OR when the host perf
doesn't schedule the guest PEBS counter in a one-to-one mapping manner
(neither of these are typical scenarios).

The PEBS records in the guest DS buffer are still accurate and the
above two restrictions will be checked before each vm-entry only if
guest PEBS is deemed to be enabled.

Suggested-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-15-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:48:14 -04:00
Like Xu
902caeb684 KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.

Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG.By default, the record only contain the Basic group.

When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will
be added to the perf_guest_switch_msr() and switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.

According to Intel SDM, software is recommended to  PEBS Baseline
when the following is true. IA32_PERF_CAPABILITIES.PEBS_BASELINE[14]
&& IA32_PERF_CAPABILITIES.PEBS_FMT[11:8] ≥ 4.

Co-developed-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-12-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:48:06 -04:00
Like Xu
8183a538cd KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points
to the linear address of the first byte of the DS buffer management area,
which is used to manage the PEBS records.

When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the
perf_guest_switch_msr() and switched during the VMX transitions just like
CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0)
if the source register contains a non-canonical address.

Originally-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-11-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:48:03 -04:00
Like Xu
c59a1f106f KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].

When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and atomically switched during
the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.

Based on whether the platform supports x86_pmu.pebs_ept, it has also
refactored the way to add more msrs to arr[] in intel_guest_get_msrs()
for extensibility.

Originally-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-8-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:55 -04:00
Like Xu
2c985527dd KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
The mask value of fixed counter control register should be dynamic
adjusted with the number of fixed counters. This patch introduces a
variable that includes the reserved bits of fixed counter control
registers. This is a generic code refactoring.

Co-developed-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-6-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:50 -04:00
Like Xu
39a4d77954 perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
Splitting the logic for determining the guest values is unnecessarily
confusing, and potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest.

If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
can generate the full set of guest values by grabbing guest ds_area and
pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
vendor agnostic, so we don't see any reason to not just pass the pointer.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-4-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:45 -04:00
Like Xu
fb358e0b81 perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server
Add support for EPT-Friendly PEBS, a new CPU feature that enlightens PEBS
to translate guest linear address through EPT, and facilitates handling
VM-Exits that occur when accessing PEBS records.  More information can
be found in the December 2021 release of Intel's SDM, Volume 3,
18.9.5 "EPT-Friendly PEBS". This new hardware facility makes sure the
guest PEBS records will not be lost, which is available on Intel Ice Lake
Server platforms (and later).

KVM will check this field through perf_get_x86_pmu_capability() instead
of hard coding the CPU models in the KVM code. If it is supported, the
guest PEBS capability will be exposed to the guest. Guest PEBS can be
enabled when and only when "EPT-Friendly PEBS" is supported and
EPT is enabled.

Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:39 -04:00
Chao Gao
d588bb9be1 KVM: VMX: enable IPI virtualization
With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in target vCPU's PIR and may send a
notification (IPI) specified by NDST and NV fields in target vCPU's
Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
engine does when dealing with posted interrupt from devices.

A PID-pointer table is used by the processor to locate the PID of a
vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
ID assigned for current VM session from userspace. Allocating memory
for PID-pointer table is deferred to vCPU creation, because irqchip
mode and VM-scope maximum APIC ID is settled at that point. KVM can
skip PID-pointer table allocation if !irqchip_in_kernel().

Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
notification vector to wakeup vector. This can ensure that when an IPI
for blocked vCPUs arrives, VMM can get control and wake up blocked
vCPUs. And if a VCPU is preempted, its posted interrupt notification
is suppressed.

Note that IPI virtualization can only virualize physical-addressing,
flat mode, unicast IPIs. Sending other IPIs would still cause a
trap-like APIC-write VM-exit and need to be handled by VMM.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154510.11938-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:37 -04:00
Zeng Guang
3587531638 KVM: x86: Allow userspace to set maximum VCPU id for VM
Introduce new max_vcpu_ids in KVM for x86 architecture. Userspace
can assign maximum possible vcpu id for current VM session using
KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().

This is done for x86 only because the sole use case is to guide
memory allocation for PID-pointer table, a structure needed to
enable VMX IPI.

By default, max_vcpu_ids set as KVM_MAX_VCPU_IDS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154444.11888-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:31 -04:00
Robert Hoo
1ad4e5438c KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
Check VMX features on tertiary execution control in VMCS config setup.
Sub-features in tertiary execution control to be enabled are adjusted
according to hardware capabilities although no sub-feature is enabled
in this patch.

EVMCSv1 doesn't support tertiary VM-execution control, so disable it
when EVMCSv1 is in use. And define the auxiliary functions for Tertiary
control field here, using the new BUILD_CONTROLS_SHADOW().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153400.11642-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:18 -04:00
Robert Hoo
465932db25 x86/cpu: Add new VMX feature, Tertiary VM-Execution control
A new 64-bit control field "tertiary processor-based VM-execution
controls", is defined [1]. It's controlled by bit 17 of the primary
processor-based VM-execution controls.

Different from its brother VM-execution fields, this tertiary VM-
execution controls field is 64 bit. So it occupies 2 vmx_feature_leafs,
TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.

Its companion VMX capability reporting MSR,MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from its brothers, whose 64 bits
consist of all allow-1, rather than 32-bit allow-0 and 32-bit allow-1 [1][2].
Therefore, its init_vmx_capabilities() is a little different from others.

[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

[2] SDM Vol3. Appendix A.3

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153240.11549-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:13 -04:00