Look for a !leaf=>leaf conversion instead of a PFN change when checking
if a SPTE change removed a TDP MMU shadow page. Convert the PFN check
into a WARN, as KVM should never change the PFN of a shadow page (except
when its being zapped or replaced).
From a purely theoretical perspective, it's not illegal to replace a SP
with a hugepage pointing at the same PFN. In practice, it's impossible
as that would require mapping guest memory overtop a kernel-allocated SP.
Either way, the check is odd.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-8-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
that readers do not ever acquire a reference to an invalid root. After this
patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
same way. kvm_tdp_mmu_zap_invalidated_roots() is different but it also
does not acquire a reference to the invalid root, and it cannot see
refcount=0/invalid because it is guaranteed to run after
kvm_tdp_mmu_invalidate_all_roots().
Opportunistically add a lockdep assertion to the yield-safe iterator.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eager page splitting is an optimization; it does not have to be performed on
invalid roots. It is also the only case in which a reader might acquire
a reference to an invalid root, so after this change we know that readers
will skip both dying and invalid roots.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Assert that mmu_lock is held for write by users of the yield-unfriendly
TDP iterator. The nature of a shared walk means that the caller needs to
play nice with other tasks modifying the page tables, which is more or
less the same thing as playing nice with yielding. Theoretically, KVM
could gain a flow where it could legitimately take mmu_lock for read in
a non-preemptible context, but that's highly unlikely and any such case
should be viewed with a fair amount of scrutiny.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the misleading flush "handling" when zapping invalidated TDP MMU
roots, and document that flushing is unnecessary for all flavors of MMUs
when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU
is dead code, as zap_gfn_range() is called with shared=true, in which
case it will never return true due to the flushing being handled by
tdp_mmu_zap_spte_atomic().
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly ignore the result of zap_gfn_range() when putting the last
reference to a TDP MMU root, and add a pile of comments to formalize the
TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this
only affects the !shared case, as zap_gfn_range() subtly never returns
true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
Putting the root without a flush is ok because even if there are stale
references to the root in the TLB, they are unreachable because KVM will
not run the guest with the same ASID without first flushing (where ASID
in this context refers to both SVM's explicit ASID and Intel's implicit
ASID that is constructed from VPID+PCID+EPT4A+etc...).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-5-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix misleading and arguably wrong comments in the TDP MMU's fast zap
flow. The comments, and the fact that actually zapping invalid roots was
added separately, strongly suggests that zapping invalid roots is an
optimization and not required for correctness. That is a lie.
KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
KVM is relying on it to fully remove all references to the memslot. Once
the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
stale references as the hva=>gfn translation is done via the memslots.
If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
deleting a memslot, KVM will fail to zap in response to the mmu_notifier
due to not finding a memslot corresponding to the notifier's range, which
leads to a variation of use-after-free.
The other misleading comment (and code) explicitly states that roots
without a reference should be skipped. While that's technically true,
it's also extremely misleading as it should be impossible for KVM to
encounter a defunct root on the list while holding mmu_lock for write.
Opportunistically add a WARN to enforce that invariant.
Fixes: b7cccd397f ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Fixes: 4c6654bd16 ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check for present SPTEs when clearing dirty bits in the TDP
MMU. This isn't strictly required for correctness, as setting the dirty
bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
complain if anyone actually turned on KVM's MMU debugging.
Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-3-seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allocations whose size is related to the memslot size can be arbitrarily
large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy"
sizes that fit in 32 bits.
Cc: stable@vger.kernel.org
Fixes: 7661809d49 ("mm: don't allow oversized kvmalloc() calls")
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 spectre fixes from Borislav Petkov:
- Mitigate Spectre v2-type Branch History Buffer attacks on machines
which support eIBRS, i.e., the hardware-assisted speculation
restriction after it has been shown that such machines are vulnerable
even with the hardware mitigation.
- Do not use the default LFENCE-based Spectre v2 mitigation on AMD as
it is insufficient to mitigate such attacks. Instead, switch to
retpolines on all AMD by default.
- Update the docs and add some warnings for the obviously vulnerable
cmdline configurations.
* tag 'x86_bugs_for_v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT
x86/speculation: Warn about Spectre v2 LFENCE mitigation
x86/speculation: Update link to AMD speculation whitepaper
x86/speculation: Use generic retpoline by default on AMD
x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
Documentation/hw-vuln: Update spectre doc
x86/speculation: Add eIBRS + Retpoline options
x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
Pull kvm fixes from Paolo Bonzini:
"x86 guest:
- Tweaks to the paravirtualization code, to avoid using them when
they're pointless or harmful
x86 host:
- Fix for SRCU lockdep splat
- Brown paper bag fix for the propagation of errno"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
KVM: x86: Yield to IPI target vCPU only if it is busy
x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
x86/kvm: Don't waste memory if kvmclock is disabled
x86/kvm: Don't use PV TLB/yield when mwait is advertised
The commit
44a3918c82 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting")
added a warning for the "eIBRS + unprivileged eBPF" combination, which
has been shown to be vulnerable against Spectre v2 BHB-based attacks.
However, there's no warning about the "eIBRS + LFENCE retpoline +
unprivileged eBPF" combo. The LFENCE adds more protection by shortening
the speculation window after a mispredicted branch. That makes an attack
significantly more difficult, even with unprivileged eBPF. So at least
for now the logic doesn't warn about that combination.
But if you then add SMT into the mix, the SMT attack angle weakens the
effectiveness of the LFENCE considerably.
So extend the "eIBRS + unprivileged eBPF" warning to also include the
"eIBRS + LFENCE + unprivileged eBPF + SMT" case.
[ bp: Massage commit message. ]
Suggested-by: Alyssa Milburn <alyssa.milburn@linux.intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
With:
f8a66d608a ("x86,bugs: Unconditionally allow spectre_v2=retpoline,amd")
it became possible to enable the LFENCE "retpoline" on Intel. However,
Intel doesn't recommend it, as it has some weaknesses compared to
retpoline.
Now AMD doesn't recommend it either.
It can still be left available as a cmdline option. It's faster than
retpoline but is weaker in certain scenarios -- particularly SMT, but
even non-SMT may be vulnerable in some cases.
So just unconditionally warn if the user requests it on the cmdline.
[ bp: Massage commit message. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-03-04
We've added 32 non-merge commits during the last 14 day(s) which contain
a total of 59 files changed, 1038 insertions(+), 473 deletions(-).
The main changes are:
1) Optimize BPF stackmap's build_id retrieval by caching last valid build_id,
as consecutive stack frames are likely to be in the same VMA and therefore
have the same build id, from Hao Luo.
2) Several improvements to arm64 BPF JIT, that is, support for JITing
the atomic[64]_fetch_add, atomic[64]_[fetch_]{and,or,xor} and lastly
atomic[64]_{xchg|cmpxchg}. Also fix the BTF line info dump for JITed
programs, from Hou Tao.
3) Optimize generic BPF map batch deletion by only enforcing synchronize_rcu()
barrier once upon return to user space, from Eric Dumazet.
4) For kernel build parse DWARF and generate BTF through pahole with enabled
multithreading, from Kui-Feng Lee.
5) BPF verifier usability improvements by making log info more concise and
replacing inv with scalar type name, from Mykola Lysenko.
6) Two follow-up fixes for BPF prog JIT pack allocator, from Song Liu.
7) Add a new Kconfig to allow for loading kernel modules with non-matching
BTF type info; their BTF info is then removed on load, from Connor O'Brien.
8) Remove reallocarray() usage from bpftool and switch to libbpf_reallocarray()
in order to fix compilation errors for older glibc, from Mauricio Vásquez.
9) Fix libbpf to error on conflicting name in BTF when type declaration
appears before the definition, from Xu Kuohai.
10) Fix issue in BPF preload for in-kernel light skeleton where loaded BPF
program fds prevent init process from setting up fd 0-2, from Yucong Sun.
11) Fix libbpf reuse of pinned perf RB map when max_entries is auto-determined
by libbpf, from Stijn Tintel.
12) Several cleanups for libbpf and a fix to enforce perf RB map #pages to be
non-zero, from Yuntao Wang.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (32 commits)
bpf: Small BPF verifier log improvements
libbpf: Add a check to ensure that page_cnt is non-zero
bpf, x86: Set header->size properly before freeing it
x86: Disable HAVE_ARCH_HUGE_VMALLOC on 32-bit x86
bpf, test_run: Fix overflow in XDP frags bpf_test_finish
selftests/bpf: Update btf_dump case for conflicting names
libbpf: Skip forward declaration when counting duplicated type names
bpf: Add some description about BPF_JIT_ALWAYS_ON in Kconfig
bpf, docs: Add a missing colon in verifier.rst
bpf: Cache the last valid build_id
libbpf: Fix BPF_MAP_TYPE_PERF_EVENT_ARRAY auto-pinning
bpf, selftests: Use raw_tp program for atomic test
bpf, arm64: Support more atomic operations
bpftool: Remove redundant slashes
bpf: Add config to allow loading modules with BTF mismatches
bpf, arm64: Feed byte-offset into bpf line info
bpf, arm64: Call build_prologue() first in first JIT pass
bpf: Fix issue with bpf preload module taking over stdout/stdin of kernel.
bpftool: Bpf skeletons assert type sizes
bpf: Cleanup comments
...
====================
Link: https://lore.kernel.org/r/20220304164313.31675-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since switch to simplefb/simpledrm VESA graphic mode selection with vga=
kernel parameter is no longer available with legacy BIOS.
The x86 realmode boot code enables the VESA graphic modes when option
FB_BOOT_VESA_SUPPORT is enabled.
This option is selected by vesafb but not simplefb/simpledrm.
To enable use of VESA modes with simplefb in legacy BIOS boot mode drop
dependency of BOOT_VESA_SUPPORT on FB, also drop the FB_ prefix. Select
the option from sysfb rather than the drivers that depend on it.
The BOOT_VESA_SUPPORT is not specific to framebuffer but rather to x86
platform, move it from fbdev to x86 Kconfig.
Fixes: e3263ab389 ("x86: provide platform-devices for boot-framebuffers")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/948c39940a4e99f5b43bdbcbe537faae71a43e1d.1645822213.git.msuchanek@suse.de
On x86_64 we must disable preemption before we enable interrupts
for stack faults, int3 and debugging, because the current task is using
a per CPU debug stack defined by the IST. If we schedule out, another task
can come in and use the same stack and cause the stack to be corrupted
and crash the kernel on return.
When CONFIG_PREEMPT_RT is enabled, spinlock_t locks become sleeping, and
one of these is the spin lock used in signal handling.
Some of the debug code (int3) causes do_trap() to send a signal.
This function calls a spinlock_t lock that has been converted to a
sleeping lock. If this happens, the above issues with the corrupted
stack is possible.
Instead of calling the signal right away, for PREEMPT_RT and x86,
the signal information is stored on the stacks task_struct and
TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume
code will send the signal when preemption is enabled.
[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to
ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]
[bigeasy: Add on 32bit as per Yang Shi, minor rewording. ]
[ tglx: Use a config option ]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/Ygq5aBB/qMQw6aP5@linutronix.de
On do_jit failure path, the header is freed by bpf_jit_binary_pack_free.
While bpf_jit_binary_pack_free doesn't require proper ro_header->size,
bpf_prog_pack_free still uses it. Set header->size in bpf_int_jit_compile
before calling bpf_jit_binary_pack_free.
Fixes: 1022a5498f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220302175126.247459-3-song@kernel.org
kernel test robot reported kernel BUG like:
[ 44.587744][ T1] kernel BUG at arch/x86/mm/physaddr.c:76!
[ 44.590151][ T1] __vmalloc_area_node (mm/vmalloc.c:622 mm/vmalloc.c:2995)
[ 44.590151][ T1] __vmalloc_node_range (mm/vmalloc.c:3108)
[ 44.590151][ T1] __vmalloc_node (mm/vmalloc.c:3157)
which is triggered with HAVE_ARCH_HUGE_VMALLOC on 32-bit x86. Since BPF
only uses HAVE_ARCH_HUGE_VMALLOC for x86_64, turn it off for 32-bit x86.
Fixes: fac54e2bfb ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220302175126.247459-2-song@kernel.org
kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
places, namely vcpu_run and post_kvm_run_save, and a third is actually
needed around the call to vcpu->arch.complete_userspace_io to avoid
the following splat:
WARNING: suspicious RCU usage
arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by CPU 28/KVM/370841:
#0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
Call Trace:
<TASK>
dump_stack_lvl+0x59/0x73
reprogram_fixed_counter+0x15d/0x1a0 [kvm]
kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
? free_moved_vector+0x1b4/0x1e0
complete_fast_pio_in+0x8a/0xd0 [kvm]
This splat is not at all unexpected, since complete_userspace_io callbacks
can execute similar code to vmexits. For example, SVM with nrips=false
will call into the emulator from svm_skip_emulated_instruction().
While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
practically speaking there's no penalty to acquiring kvm->srcu "early"
as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU. On
the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
and sync_regs() can theoretically reach code that might access
SRCU-protected data structures, e.g. sync_regs() can trigger forced
existing of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
Reported-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Just like on the optional mmu_alloc_direct_roots() path, once shadow
path reaches "r = -EIO" somewhere, the caller needs to know the actual
state in order to enter error handling and avoid something worse.
Fixes: 4a38162ee9 ("KVM: MMU: load PDPTRs outside mmu_lock")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220301124941.48412-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recent Fam19h EPYC server line of processors from AMD support system
management functionality via HSMP (Host System Management Port) interface.
The Host System Management Port (HSMP) is an interface to provide
OS-level software with access to system management functions via a
set of mailbox registers.
More details on the interface can be found in chapter
"7 Host System Management Port (HSMP)" of the following PPR
https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
This patch adds new amd_hsmp module under the drivers/platforms/x86/
which creates miscdevice with an IOCTL interface to the user space.
/dev/hsmp is for running the hsmp mailbox commands.
Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Nathan Fontenot <nathan.fontenot@amd.com>
Link: https://lore.kernel.org/r/20220222050501.18789-1-nchatrad@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Pull kvm fixes from Paolo Bonzini:
"The bigger part of the change is a revert for x86 hosts. Here the
second patch was supposed to fix the first, but in reality it was just
as broken, so both have to go.
x86 host:
- Revert incorrect assumption that cr3 changes come with preempt
notifier callbacks (they don't when static branches are changed,
for example)
ARM host:
- Correctly synchronise PMR and co on PSCI CPU_SUSPEND
- Skip tests that depend on GICv3 when the HW isn't available"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3
Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"
Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"
KVM: arm64: Don't miss pending interrupts for suspended vCPU
Disable preemption when loading/putting the AVIC during an APICv refresh.
If the vCPU task is preempted and migrated ot a different pCPU, the
unprotected avic_vcpu_load() could set the wrong pCPU in the physical ID
cache/table.
Pull the necessary code out of avic_vcpu_{,un}blocking() and into a new
helper to reduce the probability of introducing this exact bug a third
time.
Fixes: df7e4827c5 ("KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC")
Cc: stable@vger.kernel.org
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This expands generic branch type classification by adding two more entries
there in i.e irq and exception return. Also updates the x86 implementation
to process X86_BR_IRET and X86_BR_IRQ records as appropriate. This changes
branch types reported to user space on x86 platform but it should not be a
problem. The possible scenarios and impacts are enumerated here.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1645681014-3346-1-git-send-email-anshuman.khandual@arm.com
Exit to userspace if setup_vmgexit_scratch() fails due to OOM or because
copying data from guest (userspace) memory failed/faulted. The OOM
scenario is clearcut, it's userspace's decision as to whether it should
terminate the guest, free memory, etc...
As for -EFAULT, arguably, any guest issue is a violation of the guest's
contract with userspace, and thus userspace needs to decide how to
proceed. E.g. userspace defines what is RAM vs. MMIO and communicates
that directly to the guest, KVM is not involved in deciding what is/isn't
RAM nor in communicating that information to the guest. If the scratch
GPA doesn't resolve to a memslot, then the guest is not honoring the
memory configuration as defined by userspace.
And if userspace unmaps an hva for whatever reason, then exiting to
userspace with -EFAULT is absolutely the right thing to do. KVM's ABI
currently sucks and doesn't provide enough information to act on the
-EFAULT, but that will hopefully be remedied in the future as there are
multiple use cases, e.g. uffd and virtiofs truncation, that shouldn't
require any work in KVM beyond returning -EFAULT with a small amount of
metadata.
KVM could define its ABI such that failure to access the scratch area is
reflected into the guest, i.e. establish a contract with userspace, but
that's undesirable as it limits KVM's options in the future, e.g. in the
potential uffd case any failure on a uaccess needs to kick out to
userspace. KVM does have several cases where it reflects these errors
into the guest, e.g. kvm_pv_clock_pairing() and Hyper-V emulation, but
KVM would preferably "fix" those instead of propagating the falsehood
that any memory failure is the guest's fault.
Lastly, returning a boolean as an "error" for that a helper that isn't
named accordingly never works out well.
Fixes: ad5b353240 ("KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure")
Cc: Alper Gun <alpergun@google.com>
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220225205209.3881130-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN and bail if is_unsync_root() is passed a root for which there is no
shadow page, i.e. is passed the physical address of one of the special
roots, which do not have an associated shadow page. The current usage
squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
and 5-level AMD CPUs are not generally available, i.e. no one can coerce
KVM into calling is_unsync_root() on pml5_root.
Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
prevents KVM from crashing.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220225182248.3812651-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zap only obsolete roots when responding to zapping a single root shadow
page. Because KVM keeps root_count elevated when stuffing a previous
root into its PGD cache, shadowing a 64-bit guest means that zapping any
root causes all vCPUs to reload all roots, even if their current root is
not affected by the zap.
For many kernels, zapping a single root is a frequent operation, e.g. in
Linux it happens whenever an mm is dropped, e.g. process exits, etc...
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220225182248.3812651-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the generic kvm_reload_remote_mmus() and open code its
functionality into the two x86 callers. x86 is (obviously) the only
architecture that uses the hook, and is also the only architecture that
uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That
will change in a future patch, as x86's usage when zapping a single
shadow page x86 doesn't actually _need_ to reload all vCPUs' MMUs, only
MMUs whose root is being zapped actually need to be reloaded.
s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
Drop the generic code in anticipation of implementing s390 and x86 arch
specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
Opportunistically reword the x86 TDP MMU comment to avoid making
references to functions (and requests!) when possible, and to remove the
rather ambiguous "this".
No functional change intended.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220225182248.3812651-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
when the guest's CR4.PCIDE changes. This will allow tweaking the logic
of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is
the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
obsolete, needs to unconditionally unload the entire MMU, _and_ affects
only the current vCPU.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220225182248.3812651-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Code segment descriptor can be loaded by jmp/call/ret, iret
and int. The privilege checks are different between those
instructions above realmode. Although, the emulator has
use x86_transfer_type enumerate to differentiate them, but
it is not really used in __load_segment_descriptor(). Note,
far jump/call to call gate, task gate or task state segment
are not implemented in emulator.
As for far jump/call to code segment, if DPL > CPL for conforming
code or (RPL > CPL or DPL != CPL) for non-conforming code, it
should trigger #GP. The current checks are ok.
As for far return, if RPL < CPL or DPL > RPL for conforming
code or DPL != RPL for non-conforming code, it should trigger #GP.
Outer level return is not implemented above virtual-8086 mode in
emulator. So it implies that RPL <= CPL, but the current checks
wouldn't trigger #GP if RPL < CPL.
As for code segment loading in task switch, if DPL > RPL for conforming
code or DPL != RPL for non-conforming code, it should trigger #TS. Since
segment selector is loaded before segment descriptor when load state from
tss, it implies that RPL = CPL, so the current checks are ok.
The only problem in current implementation is missing RPL < CPL check for
far return. However, change code to follow the manual is better.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <e01f5ea70fc1f18f23da1182acdbc5c97c0e5886.1644292363.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per Intel's SDM on the "Instruction Set Reference", when
loading segment descriptor, not-present segment check should
be after all type and privilege checks. But the emulator checks
it first, then #NP is triggered instead of #GP if privilege fails
and segment is not present. Put not-present segment check after
type and privilege checks in __load_segment_descriptor().
Fixes: 38ba30ba51 (KVM: x86 emulator: Emulate task switch in emulator.c)
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <52573c01d369f506cadcf7233812427cf7db81a7.1644292363.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hide the lapic's "raw" write helper inside lapic.c to force non-APIC code
to go through proper helpers when modification the vAPIC state. Keep the
read helper visible to outsiders for now, refactoring KVM to hide it too
is possible, it will just take more work to do so.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulate the x2APIC ICR as a single 64-bit register, as opposed to forking
it across ICR and ICR2 as two 32-bit registers. This mirrors hardware
behavior for Intel's upcoming IPI virtualization support, which does not
split the access.
Previous versions of Intel's SDM and AMD's APM don't explicitly state
exactly how ICR is reflected in the vAPIC page for x2APIC, KVM just
happened to speculate incorrectly.
Handling the upcoming behavior is necessary in order to maintain
backwards compatibility with KVM_{G,S}ET_LAPIC, e.g. failure to shuffle
the 64-bit ICR to ICR+ICR2 and vice versa would break live migration if
IPI virtualization support isn't symmetrical across the source and dest.
Cc: Zeng Guang <guang.zeng@intel.com>
Cc: Chao Gao <chao.gao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add helpers to handle 64-bit APIC read/writes via MSRs to deduplicate the
x2APIC and Hyper-V code needed to service reads/writes to ICR. Future
support for IPI virtualization will add yet another path where KVM must
handle 64-bit APIC MSR reads/write (to ICR).
Opportunistically fix the comment in the write path; ICR2 holds the
destination (if there's no shorthand), not the vector.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the low level read/write lapic helpers static, any accesses to the
local APIC from vendor code or non-APIC code should be routed through
proper helpers.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if KVM emulates an IPI without clearing the BUSY flag, failure to do
so could hang the guest if it waits for the IPI be sent.
Opportunistically use APIC_ICR_BUSY macro instead of open coding the
magic number, and add a comment to clarify why kvm_recalculate_apic_map()
is unconditionally invoked (it's really, really confusing for IPIs due to
the existence of fast paths that don't trigger a potential recalc).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't bother rewriting the ICR value into the vAPIC page on an AVIC IPI
virtualization failure, the access is a trap, i.e. the value has already
been written to the vAPIC page. The one caveat is if hardware left the
BUSY flag set (which appears to happen somewhat arbitrarily), in which
case go through the "nodecode" APIC-write path in order to clear the BUSY
flag.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the common kvm_apic_write_nodecode() to handle AVIC/APIC-write traps
instead of open coding the same exact code. This will allow making the
low level lapic helpers inaccessible outside of lapic.c code.
Opportunistically clean up the params to eliminate a bunch of svm=>vcpu
reflection.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the "raw" helper to read the vAPIC register after an APIC-write trap
VM-Exit. Hardware is responsible for vetting the write, and the caller
is responsible for sanitizing the offset. This is a functional change,
as it means KVM will consume whatever happens to be in the vAPIC page if
the write was dropped by hardware. But, unless userspace deliberately
wrote garbage into the vAPIC page via KVM_SET_LAPIC, the value should be
zero since it's not writable by the guest.
This aligns common x86 with SVM's AVIC logic, i.e. paves the way for
using the nodecode path to handle APIC-write traps when AVIC is enabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the vAPIC offset adjustments done in the APIC-write trap path from
common x86 to VMX in anticipation of using the nodecode path for SVM's
AVIC. The adjustment reflects hardware behavior, i.e. it's technically a
property of VMX, no common x86. SVM's AVIC behavior is identical, so
it's a bit of a moot point, the goal is purely to make it easier to
understand why the adjustment is ok.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulating writes to SELF_IPI with a write to ICR has an unwanted side effect:
the value of ICR in vAPIC page gets changed. The lists SELF_IPI as write-only,
with no associated MMIO offset, so any write should have no visible side
effect in the vAPIC page.
Reported-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In emulation of writing to cr8, one of the lowest four bits in TPR[3:0]
is kept.
According to Intel SDM 10.8.6.1(baremetal scenario):
"APIC.TPR[bits 7:4] = CR8[bits 3:0], APIC.TPR[bits 3:0] = 0";
and SDM 28.3(use TPR shadow):
"MOV to CR8. The instruction stores bits 3:0 of its source operand into
bits 7:4 of VTPR; the remainder of VTPR (bits 3:0 and bits 31:8) are
cleared.";
and AMD's APM 16.6.4:
"Task Priority Sub-class (TPS)-Bits 3 : 0. The TPS field indicates the
current sub-priority to be used when arbitrating lowest-priority messages.
This field is written with zero when TPR is written using the architectural
CR8 register.";
so in KVM emulated scenario, clear TPR[3:0] to make a consistent behavior
as in other scenarios.
This doesn't impact evaluation and delivery of pending virtual interrupts
because processor does not use the processor-priority sub-class to
determine which interrupts to delivery and which to inhibit.
Sub-class is used by hardware to arbitrate lowest priority interrupts,
but KVM just does a round-robin style delivery.
Fixes: b93463aa59 ("KVM: Accelerated apic support")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220210094506.20181-1-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For both CR0 and CR4, disassociate the TLB flush logic from the
MMU role logic. Instead of relying on kvm_mmu_reset_context() being
a superset of various TLB flushes (which is not necessarily going to
be the case in the future), always call it if the role changes
but also set the various TLB flush requests according to what is
in the manual.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD retpoline may be susceptible to speculation. The speculation
execution window for an incorrect indirect branch prediction using
LFENCE/JMP sequence may potentially be large enough to allow
exploitation using Spectre V2.
By default, don't use retpoline,lfence on AMD. Instead, use the
generic retpoline.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>