Dynamically enabled features can be requested by any thread of a running
process at any time. The request does neither enable the feature nor
allocate larger buffers. It just stores the permission to use the feature
by adding the features to the permission bitmap and by calculating the
required sizes for kernel and user space.
The reallocation of the kernel buffer happens when the feature is used
for the first time which is caught by an exception. The permission
bitmap is then checked and if the feature is permitted, then it becomes
fully enabled. If not, the task dies similarly to a task which uses an
undefined instruction.
The size information is precomputed to allow proper sigaltstack size checks
once the feature is permitted, but not yet in use because otherwise this
would open race windows where too small stacks could be installed causing
a later fail on signal delivery.
Initialize them to the default feature set and sizes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-5-chang.seok.bae@intel.com
Hyper-V needs to issue the GHCB HV call in order to read/write MSRs in
Isolation VMs. For that, expose sev_es_ghcb_hv_call().
The Hyper-V Isolation VMs are unenlightened guests and run a paravisor
at VMPL0 for communicating. GHCB pages are being allocated and set up
by that paravisor. Linux gets the GHCB page's physical address via
MSR_AMD64_SEV_ES_GHCB from the paravisor and should not change it.
Add a @set_ghcb_msr parameter to sev_es_ghcb_hv_call() to control
whether the function should set the GHCB's address prior to the call or
not and export that function for use by HyperV.
[ bp: - Massage commit message
- add a struct ghcb forward declaration to fix randconfig builds. ]
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-6-ltykernel@gmail.com
For the upcoming AMX support it's necessary to do a proper integration with
KVM. Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current::thread::fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Convert the KVM FPU code over to this new scheme.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185313.019454292@linutronix.de
For the upcoming AMX support it's necessary to do a proper integration with
KVM. Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current::thread::fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Provide:
- An allocator which initializes the state properly
- A replacement for the existing FPU swap mechanim
Aside of the reduced memory footprint, this also makes state switching
more efficient when TIF_FPU_NEED_LOAD is set. It does not require a
memcpy as the state is already correct in the to be swapped out fpstate.
The existing interfaces will be removed once KVM is converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185312.954684740@linutronix.de
For the upcoming AMX support it's necessary to do a proper integration with
KVM. To avoid more nasty hackery in KVM which violate encapsulation extend
struct fpu and fpstate so the fpstate switching can be consolidated and
simplified.
Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current::thread::fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Add fpu::__task_fpstate to save the regular fpstate pointer while the task
is inside vcpu_run(). Add some state fields to fpstate to indicate the
nature of the state.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185312.896403942@linutronix.de
Pull more x86 kvm fixes from Paolo Bonzini:
- Cache coherency fix for SEV live migration
- Fix for instruction emulation with PKU
- fixes for rare delaying of interrupt delivery
- fix for SEV-ES buffer overflow
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed
KVM: SEV-ES: keep INS functions together
KVM: x86: remove unnecessary arguments from complete_emulator_pio_in
KVM: x86: split the two parts of emulator_pio_in
KVM: SEV-ES: clean up kvm_sev_es_ins/outs
KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out
KVM: SEV-ES: rename guest_ins_data to sev_pio_data
KVM: SEV: Flush cache on non-coherent systems before RECEIVE_UPDATE_DATA
KVM: MMU: Reset mmu->pkru_mask to avoid stale data
KVM: nVMX: promptly process interrupts delivered while in guest mode
KVM: x86: check for interrupts before deciding whether to exit the fast path
Compile kretprobe related stacktrace entry recovery code and
unwind_state::kr_cur field only when CONFIG_KRETPROBES=y.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
For bare-metal SGX on real hardware, the hardware provides guarantees
SGX state at reboot. For instance, all pages start out uninitialized.
The vepc driver provides a similar guarantee today for freshly-opened
vepc instances, but guests such as Windows expect all pages to be in
uninitialized state on startup, including after every guest reboot.
Some userspace implementations of virtual SGX would rather avoid having
to close and reopen the /dev/sgx_vepc file descriptor and re-mmap the
virtual EPC. For example, they could sandbox themselves after the guest
starts and forbid further calls to open(), in order to mitigate exploits
from untrusted guests.
Therefore, add a ioctl that does this with EREMOVE. Userspace can
invoke the ioctl to bring its vEPC pages back to uninitialized state.
There is a possibility that some pages fail to be removed if they are
SECS pages, and the child and SECS pages could be in separate vEPC
regions. Therefore, the ioctl returns the number of EREMOVE failures,
telling userspace to try the ioctl again after it's done with all
vEPC regions. A more verbose description of the correct usage and
the possible error conditions is documented in sgx.rst.
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211021201155.1523989-3-pbonzini@redhat.com
Use a rw_semaphore instead of a mutex to coordinate APICv updates so that
vCPUs responding to requests can take the lock for read and run in
parallel. Using a mutex forces serialization of vCPUs even though
kvm_vcpu_update_apicv() only touches data local to that vCPU or is
protected by a different lock, e.g. SVM's ir_list_lock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022004927.1448382-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The PIO scratch buffer is larger than a single page, and therefore
it is not possible to copy it in a single step to vcpu->arch/pio_data.
Bound each call to emulator_pio_in/out to a single page; keep
track of how many I/O operations are left in vcpu->arch.sev_pio_count,
so that the operation can be restarted in the complete_userspace_io
callback.
For OUT, this means that the previous kvm_sev_es_outs implementation
becomes an iterator of the loop, and we can consume the sev_pio_data
buffer before leaving to userspace.
For IN, instead, consuming the buffer and decreasing sev_pio_count
is always done in the complete_userspace_io callback, because that
is when the memcpy is done into sev_pio_data.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will be using this field for OUTS emulation as well, in case the
data that is pushed via OUTS spans more than one page. In that case,
there will be a need to save the data pointer across exits to userspace.
So, change the name to something that refers to any kind of PIO.
Also spell out what it is used for, namely SEV-ES.
No functional change intended.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The microcode loader has been looping through __start_builtin_fw down to
__end_builtin_fw to look for possibly built-in firmware for microcode
updates.
Now that the firmware loader code has exported an API for looping
through the kernel's built-in firmware section, use it and drop the x86
implementation in favor.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211021155843.1969401-4-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paul pointed out the error messages when KVM fails to load are unhelpful
in understanding exactly what went wrong if userspace probes the "wrong"
module.
Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel
and kvm_amd, and use the name for relevant error message when KVM fails
to load so that the user knows which module failed to load.
Opportunistically tweak the "disabled by bios" error message to clarify
that _support_ was disabled, not that the module itself was magically
disabled by BIOS.
Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211018183929.897461-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SDM section 18.2.3 mentioned that:
"IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of
any general-purpose or fixed-function counters via a single WRMSR."
It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing,
the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr
MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop
global_ovf_ctrl variable.
Tested-by: Like Xu <likexu@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unify the flags for rmaps and page tracking data, using a
single flag in struct kvm_arch and a single loop to go
over all the address spaces and memslots. This avoids
code duplication between alloc_all_memslots_rmaps and
kvm_page_track_enable_mmu_write_tracking.
Signed-off-by: David Stevens <stevensd@chromium.org>
[This patch is the delta between David's v2 and v3, with conflicts
fixed and my own commit message. - Paolo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
xfeatures_mask_fpstate() is no longer valid when dynamically enabled
features come into play.
Rework restore_regs_from_fpstate() so it takes a constant mask which will
then be applied against the maximum feature set so that the restore
operation brings all features which are not in the xsave buffer xfeature
bitmap into init state.
This ensures that if the previous task used a dynamically enabled feature
that the task which restores has all unused components properly initialized.
Cleanup the last user of xfeatures_mask_fpstate() as well and remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.461348278@linutronix.de
Use the new fpu_user_cfg to retrieve the information instead of
xfeatures_mask_uabi() which will be no longer correct when dynamically
enabled features become available.
Using fpu_user_cfg is appropriate when setting XCOMP_BV in the
init_fpstate since it has space allocated for "max_features". But,
normal fpstates might only have space for default xfeatures. Since
XRSTOR* derives the format of the XSAVE buffer from XCOMP_BV, this can
lead to XRSTOR reading out of bounds.
So when copying actively used fpstate, simply read the XCOMP_BV features
bits directly out of the fpstate instead.
This correction courtesy of Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.408879849@linutronix.de
Provide a struct to store information about the maximum supported and the
default feature set and buffer sizes for both user and kernel space.
This allows quick retrieval of this information for the upcoming support
for dynamically enabled features.
[ bp: Add vertical spacing between the struct members. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.126107370@linutronix.de
DM&P devices were not being properly identified, which resulted in
unneeded Spectre/Meltdown mitigations being applied.
The manufacturer states that these devices execute always in-order and
don't support either speculative execution or branch prediction, so
they are not vulnerable to this class of attack. [1]
This is something I've personally tested by a simple timing analysis
on my Vortex86MX CPU, and can confirm it is true.
Add identification for some devices that lack the CPUID product name
call, so they appear properly on /proc/cpuinfo.
¹https://www.ssv-embedded.de/doks/infos/DMP_Ann_180108_Meltdown.pdf
[ bp: Massage commit message. ]
Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211017094408.1512158-1-marcos@orca.pet
Prepare for dynamically enabled states per task. The function needs to
retrieve the features and sizes which are valid in a fpstate
context. Retrieve them from fpstate.
Move the function declarations to the core header as they are not
required anywhere else.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145323.233529986@linutronix.de
Add state size and feature mask information to the fpstate container. This
will be used for runtime checks with the upcoming support for dynamically
enabled features and dynamically sized buffers. That avoids conditionals
all over the place as the required information is accessible for both
default and extended buffers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.921388806@linutronix.de
We don't need special hook for graph tracer entry point,
but instead we can use graph_ops::func function to install
the return_hooker.
This moves the graph tracing setup _before_ the direct
trampoline prepares the stack, so the return_hooker will
be called when the direct trampoline is finished.
This simplifies the code, because we don't need to take into
account the direct trampoline setup when preparing the graph
tracer hooker and we can allow function graph tracer on entries
registered with direct trampoline.
Link: https://lkml.kernel.org/r/20211008091336.33616-4-jolsa@kernel.org
[fixed compile error reported by kernel test robot <lkp@intel.com>]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
New xfeatures will not longer be automatically stored in the regular XSAVE
buffer in thread_struct::fpu.
The kernel will provide the default sized buffer for storing the regular
features up to AVX512 in thread_struct::fpu and if a task requests to use
one of the new features then the register storage has to be extended.
The state will be accessed via a pointer in thread_struct::fpu which
defaults to the builtin storage and can be switched when extended storage
is required.
To avoid conditionals all over the code, create a new container for the
register storage which will gain other information, e.g. size, feature
masks etc., later. For now it just contains the register storage, which
gives it exactly the same layout as the exiting fpu::state.
Stick fpu::state and the new fpu::__fpstate into an anonymous union and
initialize the pointer. Add build time checks to validate that both are
at the same place and have the same size.
This allows step by step conversion of all users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.234458659@linutronix.de
internal.h is a kitchen sink which needs to get out of the way to prepare
for the upcoming changes.
Move the context switch and exit to user inlines into a separate header,
which is all that code needs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de