This enables module_alloc() to allocate huge page for 2MB+ requests.
To check the difference of this change, we need enable config
CONFIG_PTDUMP_DEBUGFS, and call module_alloc(2MB). Before the change,
/sys/kernel/debug/page_tables/kernel shows pte for this map. With the
change, /sys/kernel/debug/page_tables/ show pmd for thie map.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-2-song@kernel.org
Pull perf fixes from Borislav Petkov:
- Intel/PT: filters could crash the kernel
- Intel: default disable the PMU for SMM, some new-ish EFI firmware has
started using CPL3 and the PMU CPL filters don't discriminate against
SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
cycles.
- Fixup for perf_event_attr::sig_data
* tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix crash with stop filters in single-range mode
perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures
selftests/perf_events: Test modification of perf_event_attr::sig_data
perf: Copy perf_event_attr::sig_data on modification
x86/perf: Default set FREEZE_ON_SMI for all
Pull xen fixes from Juergen Gross:
- documentation fixes related to Xen
- enable x2apic mode when available when running as hardware
virtualized guest under Xen
- cleanup and fix a corner case of vcpu enumeration when running a
paravirtualized Xen guest
* tag 'for-linus-5.17a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/Xen: streamline (and fix) PV CPU enumeration
xen: update missing ioctl magic numers documentation
Improve docs for IOCTL_GNTDEV_MAP_GRANT_REF
xen: xenbus_dev.h: delete incorrect file name
xen/x2apic: enable x2apic mode when supported for HVM
Pull kvm fixes from Paolo Bonzini:
"ARM:
- A couple of fixes when handling an exception while a SError has
been delivered
- Workaround for Cortex-A510's single-step erratum
RISC-V:
- Make CY, TM, and IR counters accessible in VU mode
- Fix SBI implementation version
x86:
- Report deprecation of x87 features in supported CPUID
- Preparation for fixing an interrupt delivery race on AMD hardware
- Sparse fix
All except POWER and s390:
- Rework guest entry code to correctly mark noinstr areas and fix
vtime' accounting (for x86, this was already mostly correct but not
entirely; for ARM, MIPS and RISC-V it wasn't)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Use ERR_PTR_USR() to return -EFAULT as a __user pointer
KVM: x86: Report deprecated x87 features in supported CPUID
KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata
KVM: arm64: Stop handle_exit() from handling HVC twice when an SError occurs
KVM: arm64: Avoid consuming a stale esr value when SError occur
RISC-V: KVM: Fix SBI implementation version
RISC-V: KVM: make CY, TM, and IR counters accessible in VU mode
kvm/riscv: rework guest entry logic
kvm/arm64: rework guest entry logic
kvm/x86: rework guest entry logic
kvm/mips: rework guest entry logic
kvm: add guest_state_{enter,exit}_irqoff()
KVM: x86: Move delivery of non-APICv interrupt into vendor code
kvm: Move KVM_GET_XSAVE2 IOCTL definition at the end of kvm.h
KVM/arm64 fixes for 5.17, take #2
- A couple of fixes when handling an exception while a SError has been
delivered
- Workaround for Cortex-A510's single-step[ erratum
blake2s_compress_generic is weakly aliased by blake2s_compress. The
current harness for function selection uses a function pointer, which is
ordinarily inlined and resolved at compile time. But when Clang's CFI is
enabled, CFI still triggers when making an indirect call via a weak
symbol. This seems like a bug in Clang's CFI, as though it's bucketing
weak symbols and strong symbols differently. It also only seems to
trigger when "full LTO" mode is used, rather than "thin LTO".
[ 0.000000][ T0] Kernel panic - not syncing: CFI failure (target: blake2s_compress_generic+0x0/0x1444)
[ 0.000000][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-mainline-06981-g076c855b846e #1
[ 0.000000][ T0] Hardware name: MT6873 (DT)
[ 0.000000][ T0] Call trace:
[ 0.000000][ T0] dump_backtrace+0xfc/0x1dc
[ 0.000000][ T0] dump_stack_lvl+0xa8/0x11c
[ 0.000000][ T0] panic+0x194/0x464
[ 0.000000][ T0] __cfi_check_fail+0x54/0x58
[ 0.000000][ T0] __cfi_slowpath_diag+0x354/0x4b0
[ 0.000000][ T0] blake2s_update+0x14c/0x178
[ 0.000000][ T0] _extract_entropy+0xf4/0x29c
[ 0.000000][ T0] crng_initialize_primary+0x24/0x94
[ 0.000000][ T0] rand_initialize+0x2c/0x6c
[ 0.000000][ T0] start_kernel+0x2f8/0x65c
[ 0.000000][ T0] __primary_switched+0xc4/0x7be4
[ 0.000000][ T0] Rebooting in 5 seconds..
Nonetheless, the function pointer method isn't so terrific anyway, so
this patch replaces it with a simple boolean, which also gets inlined
away. This successfully works around the Clang bug.
In general, I'm not too keen on all of the indirection involved here; it
clearly does more harm than good. Hopefully the whole thing can get
cleaned up down the road when lib/crypto is overhauled more
comprehensively. But for now, we go with a simple bandaid.
Fixes: 6048fdcc5f ("lib/crypto: blake2s: include as built-in")
Link: https://github.com/ClangBuiltLinux/linux/issues/1567
Reported-by: Miles Chen <miles.chen@mediatek.com>
Tested-by: Miles Chen <miles.chen@mediatek.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
CPUID.(EAX=7,ECX=0):EBX.FDP_EXCPTN_ONLY[bit 6] and
CPUID.(EAX=7,ECX=0):EBX.ZERO_FCS_FDS[bit 13] are "defeature"
bits. Unlike most of the other CPUID feature bits, these bits are
clear if the features are present and set if the features are not
present. These bits should be reported in KVM_GET_SUPPORTED_CPUID,
because if these bits are set on hardware, they cannot be cleared in
the guest CPUID. Doing so would claim guest support for a feature that
the hardware doesn't support and that can't be efficiently emulated.
Of course, any software (e.g WIN87EM.DLL) expecting these features to
be present likely predates these CPUID feature bits and therefore
doesn't know to check for them anyway.
Aaron Lewis added the corresponding X86_FEATURE macros in
commit cbb99c0f58 ("x86/cpufeatures: Add FDP_EXCPTN_ONLY and
ZERO_FCS_FDS"), with the intention of reporting these bits in
KVM_GET_SUPPORTED_CPUID, but I was unable to find a proposed patch on
the kvm list.
Opportunistically reordered the CPUID_7_0_EBX capability bits from
least to most significant.
Cc: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220204001348.2844660-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the CPUID feature bit and the model-specific registers needed to
identify and configure the Intel Hardware Feedback Interface.
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This started out with me noticing that "dom0_max_vcpus=<N>" with <N>
larger than the number of physical CPUs reported through ACPI tables
would not bring up the "excess" vCPU-s. Addressing this is the primary
purpose of the change; CPU maps handling is being tidied only as far as
is necessary for the change here (with the effect of also avoiding the
setting up of too much per-CPU infrastructure, i.e. for CPUs which can
never come online).
Noticing that xen_fill_possible_map() is called way too early, whereas
xen_filter_cpu_maps() is called too late (after per-CPU areas were
already set up), and further observing that each of the functions serves
only one of Dom0 or DomU, it looked like it was better to simplify this.
Use the .get_smp_config hook instead, uniformly for Dom0 and DomU.
xen_fill_possible_map() can be dropped altogether, while
xen_filter_cpu_maps() is re-purposed but not otherwise changed.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/2dbd5f0a-9859-ca2d-085e-a02f7166c610@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
In __WARN_FLAGS(), we had two asm statements (abbreviated):
asm volatile("ud2");
asm volatile(".pushsection .discard.reachable");
These pair of statements are used to trigger an exception, but then help
objtool understand that for warnings, control flow will be restored
immediately afterwards.
The problem is that volatile is not a compiler barrier. GCC explicitly
documents this:
> Note that the compiler can move even volatile asm instructions
> relative to other code, including across jump instructions.
Also, no clobbers are specified to prevent instructions from subsequent
statements from being scheduled by compiler before the second asm
statement. This can lead to instructions from subsequent statements
being emitted by the compiler before the second asm statement.
Providing a scheduling model such as via -march= options enables the
compiler to better schedule instructions with known latencies to hide
latencies from data hazards compared to inline asm statements in which
latencies are not estimated.
If an instruction gets scheduled by the compiler between the two asm
statements, then objtool will think that it is not reachable, producing
a warning.
To prevent instructions from being scheduled in between the two asm
statements, merge them.
Also remove an unnecessary unreachable() asm annotation from BUG() in
favor of __builtin_unreachable(). objtool is able to track that the ud2
from BUG() terminates control flow within the function.
Link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile
Link: https://github.com/ClangBuiltLinux/linux/issues/1483
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220202205557.2260694-1-ndesaulniers@google.com
The new PEBS format 5 implies that the number of the fixed counters can
be up to 16. The current INTEL_PMC_MAX_FIXED is still 4. If the current
kernel runs on a future platform which has more than 4 fixed counters,
a warning will be triggered. The number of the fixed counters will be
clipped to 4. Users have to upgrade the kernel to access the new fixed
counters.
Add a new default constraint for PerfMon v5 and up, which can support
up to 16 fixed counters. The pseudo-encoding is applied for the fixed
counters 4 and later. The user can have generic support for the new
fixed counters on the future platfroms without updating the kernel.
Increase the INTEL_PMC_MAX_FIXED to 16.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1643750603-100733-3-git-send-email-kan.liang@linux.intel.com
KVM vPMU doesn't support to emulate all the fixed counters that the
host PMU driver has supported, e.g. the fixed counter 3 used by
Topdown metrics hasn't been supported by KVM so far.
Rename MAX_FIXED_COUNTERS to KVM_PMC_MAX_FIXED to have a more
straightforward naming convention as INTEL_PMC_MAX_FIXED used by the
host PMU driver, and fix vPMU to use the KVM side KVM_PMC_MAX_FIXED
for the virtual fixed counter emulation, instead of the host side
INTEL_PMC_MAX_FIXED.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1643750603-100733-2-git-send-email-kan.liang@linux.intel.com
The new PEBS Record Format 5 is similar to the PEBS Record Format 4. The
only difference is the layout of the Counter Reset fields of the PEBS
Config Buffer in the DS area. For the PEBS format 4, the Counter Reset
fields allocation is for 8 general-purpose counters followed by 4
fixed-function counters. For the PEBS format 5, the Counter Reset fields
allocation is for 32 general-purpose counters followed by 16
fixed-function counters.
Extend the MAX_PEBS_EVENTS to 32. Add MAX_PEBS_EVENTS_FMT4 for the
previous platform. Except for the DS auto-reload code, other places
already assume 32 counters. Only check the PEBS_FMT in the DS
auto-reload code.
Extend the MAX_FIXED_PEBS_EVENTS to 16, which only impacts the size of
struct debug_store and some local temporary variables. The size of
struct debug_store increases 288B, which is small and should be
acceptable.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1643750603-100733-1-git-send-email-kan.liang@linux.intel.com
The requirement for 64-bit address filters is that they are canonical
addresses. In other respects any address range is allowed which would
include user space addresses.
That can be useful for tracing virtual machine guests because address
filtering can be used to advantage in place of current privilege level
(CPL) filtering.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-2-adrian.hunter@intel.com
Add a check for !buf->single before calling pt_buffer_region_size in a
place where a missing check can cause a kernel crash.
Fixes a bug introduced by commit 670638477a ("perf/x86/intel/pt:
Opportunistically use single range output mode"), which added a
support for PT single-range output mode. Since that commit if a PT
stop filter range is hit while tracing, the kernel will crash because
of a null pointer dereference in pt_handle_status due to calling
pt_buffer_region_size without a ToPA configured.
The commit which introduced single-range mode guarded almost all uses of
the ToPA buffer variables with checks of the buf->single variable, but
missed the case where tracing was stopped by the PT hardware, which
happens when execution hits a configured stop filter.
Tested that hitting a stop filter while PT recording successfully
records a trace with this patch but crashes without this patch.
Fixes: 670638477a ("perf/x86/intel/pt: Opportunistically use single range output mode")
Signed-off-by: Tristan Hume <tristan@thume.ca>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@kernel.org
Link: https://lkml.kernel.org/r/20220127220806.73664-1-tristan@thume.ca
PPIN is the Protected Processor Identification Number.
This is used to identify the socket as a Field Replaceable Unit (FRU).
Existing code only displays this when reporting errors. But this makes
it inconvenient for large clusters to use it for its intended purpose
of inventory control.
Add ppin to /sys/devices/system/cpu/cpu*/topology to make what
is already available using RDMSR more easily accessible. Make
the file read only for root in case there are still people
concerned about making a unique system "serial number" available.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220131230111.2004669-6-tony.luck@intel.com
Currently, the PPIN (Protected Processor Inventory Number) MSR is read
by every CPU that processes a machine check, CMCI, or just polls machine
check banks from a periodic timer. This is not a "fast" MSR, so this
adds to overhead of processing errors.
Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the
PPIN during initialization. Use this copy in mce_setup() instead of
reading the MSR.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com
After nine generations of adding to model specific list of CPUs that
support PPIN (Protected Processor Inventory Number) Intel allocated
a CPUID bit to enumerate the MSRs.
CPUID(EAX=7, ECX=1).EBX bit 0 enumerates presence of MSR_PPIN_CTL and
MSR_PPIN. Add it to the "scattered" CPUID bits and add an entry to the
ppin_cpuids[] x86_match_cpu() array to catch Intel CPUs that implement
it.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-3-tony.luck@intel.com
For consistency and clarity, migrate x86 over to the generic helpers for
guest timing and lockdep/RCU/tracing management, and remove the
x86-specific helpers.
Prior to this patch, the guest timing was entered in
kvm_guest_enter_irqoff() (called by svm_vcpu_enter_exit() and
svm_vcpu_enter_exit()), and was exited by the call to
vtime_account_guest_exit() within vcpu_enter_guest().
To minimize duplication and to more clearly balance entry and exit, both
entry and exit of guest timing are placed in vcpu_enter_guest(), using
the new guest_timing_{enter,exit}_irqoff() helpers. When context
tracking is used a small amount of additional time will be accounted
towards guests; tick-based accounting is unnaffected as IRQs are
disabled at this point and not enabled until after the return from the
guest.
This also corrects (benign) mis-balanced context tracking accounting
introduced in commits:
ae95f566b3 ("KVM: X86: TSCDEADLINE MSR emulation fastpath")
26efe2fd92 ("KVM: VMX: Handle preemption timer fastpath")
Where KVM can enter a guest multiple times, calling vtime_guest_enter()
without a corresponding call to vtime_account_guest_exit(), and with
vtime_account_system() called when vtime_account_guest() should be used.
As account_system_time() checks PF_VCPU and calls account_guest_time(),
this doesn't result in any functional problem, but is unnecessarily
confusing.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <20220201132926.3301912-4-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The code to decide whether a system supports the PPIN (Protected
Processor Inventory Number) MSR was cloned from the Intel
implementation. Apart from the X86_FEATURE bit and the MSR numbers it is
identical.
Merge the two functions into common x86 code, but use x86_match_cpu()
instead of the switch (c->x86_model) that was used by the old Intel
code.
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the AMD mce sysfs code to use default_groups field which has
been the preferred way since
aa30f47cf6 ("kobject: Add support for default attribute groups to kobj_type")
so that the obsolete default_attrs field can be removed soon.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220106103537.3663852-1-gregkh@linuxfoundation.org
Handle non-APICv interrupt delivery in vendor code, even though it means
VMX and SVM will temporarily have duplicate code. SVM's AVIC has a race
condition that requires KVM to fall back to legacy interrupt injection
_after_ the interrupt has been logged in the vIRR, i.e. to fix the race,
SVM will need to open code the full flow anyways[*]. Refactor the code
so that the SVM bug without introducing other issues, e.g. SVM would
return "success" and thus invoke trace_kvm_apicv_accept_irq() even when
delivery through the AVIC failed, and to opportunistically prepare for
using KVM_X86_OP to fill each vendor's kvm_x86_ops struct, which will
rely on the vendor function matching the kvm_x86_op pointer name.
No functional change intended.
[*] https://lore.kernel.org/all/20211213104634.199141-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Catch-up with 5.17-rc2 and trying to align with drm-intel-gt-next
for a possible topic branch for merging the split of i915_regs...
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Some of them used to be used by libbfd for a.out coredump handling.
Seeing that
* libbfd has their copies anyway
* we don't export them into userland headers
* we don't support a.out coredumps anymore
let's bury the definitions. They never had in-kernel
users anyway...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull x86 fixes from Borislav Petkov:
- Add another Intel CPU model to the list of CPUs supporting the
processor inventory unique number
- Allow writing to MCE thresholding sysfs files again - a previous
change had accidentally disabled it and no one noticed. Goes to show
how much is this stuff used
* tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
x86/MCE/AMD: Allow thresholding interface updates after init
Pull pci fixes from Bjorn Helgaas:
- Fix compilation warnings in new mt7621 driver (Sergio Paracuellos)
- Restore the sysfs "rom" file for VGA shadow ROMs, which was broken
when converting "rom" to be a static attribute (Bjorn Helgaas)
* tag 'pci-v5.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/sysfs: Find shadow ROM before static attribute initialization
PCI: mt7621: Remove unused function pcie_rmw()
PCI: mt7621: Drop of_match_ptr() to avoid unused variable
Pulltracing fixes from Steven Rostedt:
- Limit mcount build time sorting to only those archs that we know it
works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
* tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't inc err_log entry count if entry allocation fails
tracing: Propagate is_signed to expression
tracing: Fix smatch warning for do while check in event_hist_trigger_parse()
tracing: Fix smatch warning for null glob in event_hist_trigger_parse()
tools/tracing: Update Makefile to build rtla
rtla: Make doc build optional
tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro
tracing: Avoid -Warray-bounds warning for __rel_loc macro
tracing/histogram: Fix a potential memory leak for kstrdup()
ftrace: Have architectures opt-in for mcount build time sorting
Pull kvm fixes from Paolo Bonzini:
"Two larger x86 series:
- Redo incorrect fix for SEV/SMAP erratum
- Windows 11 Hyper-V workaround
Other x86 changes:
- Various x86 cleanups
- Re-enable access_tracking_perf_test
- Fix for #GP handling on SVM
- Fix for CPUID leaf 0Dh in KVM_GET_SUPPORTED_CPUID
- Fix for ICEBP in interrupt shadow
- Avoid false-positive RCU splat
- Enable Enlightened MSR-Bitmap support for real
ARM:
- Correctly update the shadow register on exception injection when
running in nVHE mode
- Correctly use the mm_ops indirection when performing cache
invalidation from the page-table walker
- Restrict the vgic-v3 workaround for SEIS to the two known broken
implementations
Generic code changes:
- Dead code cleanup"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
KVM: eventfd: Fix false positive RCU usage warning
KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use
KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread()
KVM: nVMX: Rename vmcs_to_field_offset{,_table}
KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS
selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP
KVM: x86: add system attribute to retrieve full set of supported xsave states
KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr
selftests: kvm: move vm_xsave_req_perm call to amx_test
KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time
KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS
KVM: x86: Keep MSR_IA32_XSS unchanged for INIT
KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2}
KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02
KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest
KVM: x86: Check .flags in kvm_cpuid_check_equal() too
KVM: x86: Forcibly leave nested virt when SMM state is toggled
KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments()
KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real
...
Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
Enlightened VMCS interface is in use:
"Any VMREAD or VMWRITE instructions while an enlightened VMCS is
active is unsupported and can result in unexpected behavior.""
Windows 11 + WSL2 seems to ignore this, attempts to VMREAD VMCS field
0x4404 ("VM-exit interruption information") are observed. Failing
these attempts with nested_vmx_failInvalid() makes such guests
unbootable.
Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
eventually but for the time being we need a workaround. (Temporary) allow
VMREAD to get data from the currently loaded Enlightened VMCS.
Note: VMWRITE instructions remain forbidden, it is not clear how to
handle them properly and hopefully won't ever be needed.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to allowing reads from Enlightened VMCS from
handle_vmread(), implement evmcs_field_offset() to get the correct
read offset. get_evmcs_offset(), which is being used by KVM-on-Hyper-V,
is almost what's needed but a few things need to be adjusted. First,
WARN_ON() is unacceptable for handle_vmread() as any field can (in
theory) be supplied by the guest and not all fields are defined in
eVMCS v1. Second, we need to handle 'holes' in eVMCS (missing fields).
It also sounds like a good idea to WARN_ON() if such fields are ever
accessed by KVM-on-Hyper-V.
Implement dedicated evmcs_field_offset() helper.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
blob which is not supposed to be accessed directly. In fact,
vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.
Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.
No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field,
PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes
sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too.
Note, none of the currently existing Windows/Hyper-V versions are known
to enable 'save VMX-preemption timer value' when eVMCS is in use, the
change is aimed at making the filtering future proof.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS,
MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair,
MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way
MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely
on 'true' MSR data.
Note, none of the currently existing Windows/Hyper-V versions are known
to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change
is aimed at making the filtering future proof.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
have not been enabled. Probing those, for example so that they can be
passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
The latter is in fact worse, even though that is what the rest of the
API uses, because it would require supported_xcr0 to be moved from the
KVM module to the kernel just for this use. In addition, the value
would be nonsensical (or an error would have to be returned) until
the KVM module is loaded in.
Therefore, to limit the growth of system ioctls, add a /dev/kvm
variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper to handle converting the u64 userspace address embedded in
struct kvm_device_attr into a userspace pointer, it's all too easy to
forget the intermediate "unsigned long" cast as well as the truncation
check.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds AVX assembly accelerated implementation of SM3 secure
hash algorithm. From the benchmark data, compared to pure software
implementation sm3-generic, the performance increase is up to 38%.
The main algorithm implementation based on SM3 AES/BMI2 accelerated
work by libgcrypt at:
https://gnupg.org/software/libgcrypt/index.html
Benchmark on Intel i5-6200U 2.30GHz, performance data of two
implementations, pure software sm3-generic and sm3-avx acceleration.
The data comes from the 326 mode and 422 mode of tcrypt. The abscissas
are different lengths of per update. The data is tabulated and the
unit is Mb/s:
update-size | 16 64 256 1024 2048 4096 8192
------------+-------------------------------------------------------
sm3-generic | 105.97 129.60 182.12 189.62 188.06 193.66 194.88
sm3-avx | 119.87 163.05 244.44 260.92 257.60 264.87 265.88
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
First S390 complained that the sorting of the mcount sections at build
time caused the kernel to crash on their architecture. Now PowerPC is
complaining about it too. And also ARM64 appears to be having issues.
It may be necessary to also update the relocation table for the values
in the mcount table. Not only do we have to sort the table, but also
update the relocations that may be applied to the items in the table.
If the system is not relocatable, then it is fine to sort, but if it is,
some architectures may have issues (although x86 does not as it shifts all
addresses the same).
Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
safe to do the sorting at build time.
Also update the config to compile in build time sorting in the sorttable
code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.
Link: https://lore.kernel.org/all/944D10DA-8200-4BA9-8D0A-3BED9AA99F82@linux.ibm.com/
Link: https://lkml.kernel.org/r/20220127153821.3bc1ac6e@gandalf.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yinan Liu <yinan@linux.alibaba.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Fixes: 72b3942a17 ("scripts: ftrace - move the sort-processing in ftrace_init")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by
both RESET and INIT. The kvm_set_msr_common()'s handling of MSR_IA32_XSS
also needs to update kvm_update_cpuid_runtime(). In the above cases, the
size in bytes of the XSAVE area containing all states enabled by XCR0 or
(XCRO | IA32_XSS) needs to be updated.
For simplicity and consistency, existing helpers are used to write values
and call kvm_update_cpuid_runtime(), and it's not exactly a fast path.
Fixes: a554d207dc ("KVM: X86: Processor States following Reset or INIT")
Cc: stable@vger.kernel.org
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220126172226.2298529-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>