Commit Graph

577 Commits

Author SHA1 Message Date
Paolo Bonzini
d786c7783b KVM: MMU: inline set_spte in mmu_set_spte
Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about.  For example,
mmu_set_spte looks quite like tdp_mmu_map_handle_target_level, but the
similarity is hidden by set_spte.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:54 -04:00
David Matlack
888104138c KVM: x86/mmu: Avoid memslot lookup in page_fault_handle_page_track
Now that kvm_page_fault has a pointer to the memslot it can be passed
down to the page tracking code to avoid a redundant slot lookup.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:53 -04:00
David Matlack
e710c5f6be KVM: x86/mmu: Pass the memslot around via struct kvm_page_fault
The memslot for the faulting gfn is used throughout the page fault
handling code, so capture it in kvm_page_fault as soon as we know the
gfn and use it in the page fault handling code that has direct access
to the kvm_page_fault struct.  Replace various tests using is_noslot_pfn
with more direct tests on fault->slot being NULL.

This, in combination with the subsequent patch, improves "Populate
memory time" in dirty_log_perf_test by 5% when using the legacy MMU.
There is no discerable improvement to the performance of the TDP MMU.

No functional change intended.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:53 -04:00
Paolo Bonzini
bcc4f2bc50 KVM: MMU: mark page dirty in make_spte
This simplifies set_spte, which we want to remove, and unifies code
between the shadow MMU and the TDP MMU.  The warning will be added
back later to make_spte as well.

There is a small disadvantage in the TDP MMU; it may unnecessarily mark
a page as dirty twice if two vCPUs end up mapping the same page twice.
However, this is a very small cost for a case that is already rare.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:53 -04:00
David Matlack
68be1306ca KVM: x86/mmu: Fold rmap_recycle into rmap_add
Consolidate rmap_recycle and rmap_add into a single function since they
are only ever called together (and only from one place). This has a nice
side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In
addition it makes mmu_set_spte(), which is a very long function, a
little shorter.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:52 -04:00
Sean Christopherson
b1a429fb18 KVM: x86/mmu: Verify shadow walk doesn't terminate early in page faults
WARN and bail if the shadow walk for faulting in a SPTE terminates early,
i.e. doesn't reach the expected level because the walk encountered a
terminal SPTE.  The shadow walks for page faults are subtle in that they
install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop
body, and consume the newly created non-leaf SPTE in the loop control,
e.g. __shadow_walk_next().  In other words, the walks guarantee that the
walk will stop if and only if the target level is reached by installing
non-leaf SPTEs to guarantee the walk remains valid.

Opportunistically use fault->goal-level instead of it.level in
FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at
the target level.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:52 -04:00
Paolo Bonzini
f0066d94c9 KVM: MMU: change tracepoints arguments to kvm_page_fault
Pass struct kvm_page_fault to tracepoints instead of extracting the
arguments from the struct.  This also lets the kvm_mmu_spte_requested
tracepoint pick the gfn directly from fault->gfn, instead of using
the address.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:52 -04:00
Paolo Bonzini
536f0e6ace KVM: MMU: change disallowed_hugepage_adjust() arguments to kvm_page_fault
Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of
extracting the arguments from the struct.  Tweak a bit the conditions
to avoid long lines.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:51 -04:00
Paolo Bonzini
73a3c65947 KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault
Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of
extracting the arguments from the struct; the results are also stored
in the struct, so the callers are adjusted consequently.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:51 -04:00
Paolo Bonzini
3c8ad5a675 KVM: MMU: change fast_page_fault() arguments to kvm_page_fault
Pass struct kvm_page_fault to fast_page_fault() instead of
extracting the arguments from the struct.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:51 -04:00
Paolo Bonzini
2f6305dd56 KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault
Pass struct kvm_page_fault to kvm_tdp_mmu_map() instead of
extracting the arguments from the struct.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:50 -04:00
Paolo Bonzini
43b74355ef KVM: MMU: change __direct_map() arguments to kvm_page_fault
Pass struct kvm_page_fault to __direct_map() instead of
extracting the arguments from the struct.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:50 -04:00
Paolo Bonzini
3a13f4fea3 KVM: MMU: change handle_abnormal_pfn() arguments to kvm_page_fault
Pass struct kvm_page_fault to handle_abnormal_pfn() instead of
extracting the arguments from the struct.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:49 -04:00
Paolo Bonzini
3647cd04b7 KVM: MMU: change kvm_faultin_pfn() arguments to kvm_page_fault
Add fields to struct kvm_page_fault corresponding to outputs of
kvm_faultin_pfn().  For now they have to be extracted again from struct
kvm_page_fault in the subsequent steps, but this is temporary until
other functions in the chain are switched over as well.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:49 -04:00
Paolo Bonzini
b8a5d55115 KVM: MMU: change page_fault_handle_page_track() arguments to kvm_page_fault
Add fields to struct kvm_page_fault corresponding to the arguments
of page_fault_handle_page_track().  The fields are initialized in the
callers, and page_fault_handle_page_track() receives a struct
kvm_page_fault instead of having to extract the arguments out of it.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:49 -04:00
Paolo Bonzini
4326e57ef4 KVM: MMU: change direct_page_fault() arguments to kvm_page_fault
Add fields to struct kvm_page_fault corresponding to
the arguments of direct_page_fault().  The fields are
initialized in the callers, and direct_page_fault()
receives a struct kvm_page_fault instead of having to
extract the arguments out of it.

Also adjust FNAME(page_fault) to store the max_level in
struct kvm_page_fault, to keep it similar to the direct
map path.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:49 -04:00
Paolo Bonzini
c501040abc KVM: MMU: change mmu->page_fault() arguments to kvm_page_fault
Pass struct kvm_page_fault to mmu->page_fault() instead of
extracting the arguments from the struct.  FNAME(page_fault) can use
the precomputed bools from the error code.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:48 -04:00
Paolo Bonzini
d055f028a5 KVM: MMU: pass unadulterated gpa to direct_page_fault
Do not bother removing the low bits of the gpa.  This masking dates back
to the very first commit of KVM but it is unnecessary, as exemplified
by the other call in kvm_tdp_page_fault.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:48 -04:00
Lai Jiangshan
3e44dce4d0 KVM: X86: Move PTE present check from loop body to __shadow_walk_next()
So far, the loop bodies already ensure the PTE is present before calling
__shadow_walk_next():  Some loop bodies simply exit with a !PRESENT
directly and some other loop bodies, i.e. FNAME(fetch) and __direct_map()
do not currently guard their walks with is_shadow_present_pte, but only
because they install present non-leaf SPTEs in the loop itself.

But checking pte present in __shadow_walk_next() (which is called from
shadow_walk_okay()) is more prudent; walking past a !PRESENT SPTE
would lead to attempting to read a the next level SPTE from a garbage
iter->shadow_addr.  It also allows to remove the is_shadow_present_pte()
checks from the loop bodies.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:46 -04:00
Lai Jiangshan
f1c4a88c41 KVM: X86: Don't unsync pagetables when speculative
We'd better only unsync the pagetable when there just was a really
write fault on a level-1 pagetable.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-10-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:10 -04:00
Lai Jiangshan
5591c0694d KVM: X86: Zap the invalid list after remote tlb flushing
In mmu_sync_children(), it can zap the invalid list after remote tlb flushing.
Emptifying the invalid list ASAP might help reduce a remote tlb flushing
in some cases.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:09 -04:00
Lai Jiangshan
c3e5e415bc KVM: X86: Change kvm_sync_page() to return true when remote flush is needed
Currently kvm_sync_page() returns true when there is any present spte.
But the return value is ignored in the callers.

Changing kvm_sync_page() to return true when remote flush is needed and
changing mmu->sync_page() not to directly flush can combine and reduce
remote flush requests.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:09 -04:00
Lai Jiangshan
06152b2dec KVM: X86: Remove kvm_mmu_flush_or_zap()
Because local_flush is useless, kvm_mmu_flush_or_zap() can be removed
and kvm_mmu_remote_flush_or_zap is used instead.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-6-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:09 -04:00
Lai Jiangshan
bd047e5440 KVM: X86: Don't flush current tlb on shadow page modification
After any shadow page modification, flushing tlb only on current VCPU
is weird due to other VCPU's tlb might still be stale.

In other words, if there is any mandatory tlb-flushing after shadow page
modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be
set and the tlbs of all VCPUs should be flushed.  There is not point to
only flush current tlb except when the request is from vCPU's or pCPU's
activities.

If there was any bug that mandatory tlb-flushing is required and
SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush is failed to set, this patch
would expose the bug in a more destructive way.  The related code paths
are checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH is found yet.

Currently, there is no optional tlb-flushing after sync page related code
is changed to flush tlb timely.  So we can just remove these local flushing
code.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:09 -04:00
Sean Christopherson
c6cecc4b93 KVM: x86/mmu: Complete prefetch for trailing SPTEs for direct, legacy MMU
Make a final call to direct_pte_prefetch_many() if there are "trailing"
SPTEs to prefetch, i.e. SPTEs for GFNs following the faulting GFN.  The
call to direct_pte_prefetch_many() in the loop only handles the case
where there are !PRESENT SPTEs preceding a PRESENT SPTE.

E.g. if the faulting GFN is a multiple of 8 (the prefetch size) and all
SPTEs for the following GFNs are !PRESENT, the loop will terminate with
"start = sptep+1" and not prefetch any SPTEs.

Prefetching trailing SPTEs as intended can drastically reduce the number
of guest page faults, e.g. accessing the first byte of every 4kb page in
a 6gb chunk of virtual memory, in a VM with 8gb of preallocated memory,
the number of pf_fixed events observed in L0 drops from ~1.75M to <0.27M.

Note, this only affects memory that is backed by 4kb pages as KVM doesn't
prefetch when installing hugepages.  Shadow paging prefetching is not
affected as it does not batch the prefetches due to the need to process
the corresponding guest PTE.  The TDP MMU is not affected because it
doesn't have prefetching, yet...

Fixes: 957ed9effd ("KVM: MMU: prefetch ptes when intercepted guest #PF")
Cc: Sergey Senozhatsky <senozhatsky@google.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210818235615.2047588-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:08 -04:00
Lai Jiangshan
65855ed8b0 KVM: X86: Synchronize the shadow pagetable before link it
If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM.  So the host must synchronze sp before
link it.  Otherwise the guest might use a wrong mapping.

For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb.  The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.

Fixes: 4731d4c7a0 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 11:01:00 -04:00
Paolo Bonzini
4ac214574d KVM: MMU: mark role_regs and role accessors as maybe unused
It is reasonable for these functions to be used only in some configurations,
for example only if the host is 64-bits (and therefore supports 64-bit
guests).  It is also reasonable to keep the role_regs and role accessors
in sync even though some of the accessors may be used only for one of the
two sets (as is the case currently for CR4.LA57)..

Because clang reports warnings for unused inlines declared in a .c file,
mark both sets of accessors as __maybe_unused.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 06:56:38 -04:00
Sean Christopherson
e7177339d7 Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
Revert a misguided illegal GPA check when "translating" a non-nested GPA.
The check is woefully incomplete as it does not fill in @exception as
expected by all callers, which leads to KVM attempting to inject a bogus
exception, potentially exposing kernel stack information in the process.

 WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
 CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0
 RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
 Call Trace:
  x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853
  kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199
  handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336
  __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline]
  vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038
  vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712
  vcpu_run arch/x86/kvm/x86.c:9779 [inline]
  kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010
  kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652

The bug has escaped notice because practically speaking the GPA check is
useless.  The GPA check in question only comes into play when KVM is
walking guest page tables (or "translating" CR3), and KVM already handles
illegal GPA checks by setting reserved bits in rsvd_bits_mask for each
PxE, or in the case of CR3 for loading PTDPTRs, manually checks for an
illegal CR3.  This particular failure doesn't hit the existing reserved
bits checks because syzbot sets guest.MAXPHYADDR=1, and IA32 architecture
simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging
doesn't define any reserved PA bits checks, which KVM emulates by only
incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32.

Simply remove the bogus check.  There is zero meaningful value and no
architectural justification for supporting guest.MAXPHYADDR < 32, and
properly filling the exception would introduce non-trivial complexity.

This reverts commit ec7771ab47.

Fixes: ec7771ab47 ("KVM: x86: mmu: Add guest physical address check in translate_gpa()")
Cc: stable@vger.kernel.org
Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210831164224.1119728-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 06:18:02 -04:00
Sean Christopherson
a717a780fc KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
Include pml5_root in the set of special roots if and only if the host,
and thus NPT, is using 5-level paging.  mmu_alloc_special_roots() expects
special roots to be allocated as a bundle, i.e. they're either all valid
or all NULL.  But for pml5_root, that expectation only holds true if the
host uses 5-level paging, which causes KVM to WARN about pml5_root being
NULL when the other special roots are valid.

The silver lining of 4-level vs. 5-level NPT being tied to the host
kernel's paging level is that KVM's shadow root level is constant; unlike
VMX's EPT, KVM can't choose 4-level NPT based on guest.MAXPHYADDR.  That
means KVM can still expect pml5_root to be bundled with the other special
roots, it just needs to be conditioned on the shadow root level.

Fixes: cb0f722aff ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210824005824.205536-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 05:56:38 -04:00
Wei Huang
cb0f722aff KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host
When the 5-level page table CPU flag is set in the host, but the guest
has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
the shadow NPT page tables will be fixed, consisting of one pointer to
a lower-level table and 511 non-present entries.  Extend the existing
code that creates the fixed PML4 or PDP table, to provide a fixed PML5
table if needed.

This is not needed on EPT because the number of layers in the tables
is specified in the EPTP instead of depending on the host CR4.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:07:48 -04:00
Wei Huang
746700d21f KVM: x86: Allow CPU to force vendor-specific TDP level
AMD future CPUs will require a 5-level NPT if host CR4.LA57 is set.
To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level
on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force
a fixed TDP level.

Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:44 -04:00
Paolo Bonzini
ec607a564f KVM: x86: clamp host mapping level to max_level in kvm_mmu_max_mapping_level
This change started as a way to make kvm_mmu_hugepage_adjust a bit simpler,
but it does fix two bugs as well.

One bug is in zapping collapsible PTEs.  If a large page size is
disallowed but not all of them, kvm_mmu_max_mapping_level will return the
host mapping level and the small PTEs will be zapped up to that level.
However, if e.g. 1GB are prohibited, we can still zap 4KB mapping and
preserve the 2MB ones. This can happen for example when NX huge pages
are in use.

The second would happen when userspace backs guest memory
with a 1gb hugepage but only assign a subset of the page to
the guest.  1gb pages would be disallowed by the memslot, but
not 2mb.  kvm_mmu_max_mapping_level() would fall through to the
host_pfn_mapping_level() logic, see the 1gb hugepage, and map the whole
thing into the guest.

Fixes: 2f57b7051f ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:41 -04:00
Mingwei Zhang
71f51d2c32 KVM: x86/mmu: Add detailed page size stats
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large page of 1GB (or larger) is adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.

So remove the lpages since it is easy for user space to aggregate the info.
Instead, provide a comprehensive page stats of all sizes from 4K to 512G.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Mingwei Zhang
4293ddb788 KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte
Drop an unnecessary is_shadow_present_pte() check when updating the rmaps
after installing a non-MMIO SPTE.  set_spte() is used only to create
shadow-present SPTEs, e.g. MMIO SPTEs are handled early on, mmu_set_spte()
runs with mmu_lock held for write, i.e. the SPTE can't be zapped between
writing the SPTE and updating the rmaps.

Opportunistically combine the "new SPTE" logic for large pages and rmaps.

No functional change intended.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20210803044607.599629-2-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Maxim Levitsky
9cc13d60ba KVM: x86/mmu: allow APICv memslot to be enabled but invisible
on AMD, APIC virtualization needs to dynamicaly inhibit the AVIC in a
response to some events, and this is problematic and not efficient to do by
enabling/disabling the memslot that covers APIC's mmio range.

Plus due to SRCU locking, it makes it more complex to
request AVIC inhibition.

Instead, the APIC memslot will be always enabled, but be invisible
to the guest, such as the MMU code will not install a SPTE for it,
when it is inhibited and instead jump straight to emulating the access.

When inhibiting the AVIC, this SPTE will be zapped.

This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:22 -04:00
Maxim Levitsky
8f32d5e563 KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code
This will allow it to return RET_PF_EMULATE for APIC mmio
emulation.

This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:20 -04:00
Maxim Levitsky
33a5c0009d KVM: x86/mmu: rename try_async_pf to kvm_faultin_pfn
try_async_pf is a wrong name for this function, since this code
is used when asynchronous page fault is not enabled as well.

This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:20 -04:00
Maxim Levitsky
edb298c663 KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range
This together with previous patch, ensures that
kvm_zap_gfn_range doesn't race with page fault
running on another vcpu, and will make this page fault code
retry instead.

This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:19 -04:00
Maxim Levitsky
88f585358b KVM: x86/mmu: add comment explaining arguments to kvm_zap_gfn_range
This comment makes it clear that the range of gfns that this
function receives is non inclusive.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:18 -04:00
Maxim Levitsky
2822da4466 KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address
kvm_flush_remote_tlbs_with_address expects (start gfn, number of pages),
and not (start gfn, end gfn)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:16 -04:00
Sean Christopherson
5a324c24b6 Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"
This together with the next patch will fix a future race between
kvm_zap_gfn_range and the page fault handler, which will happen
when AVIC memslot is going to be only partially disabled.

The performance impact is minimal since kvm_zap_gfn_range is only
called by users, update_mtrr() and kvm_post_set_cr0().

Both only use it if the guest has non-coherent DMA, in order to
honor the guest's UC memtype.

MTRR and CD setup only happens at boot, and generally in an area
where the page tables should be small (for CD) or should not
include the affected GFNs at all (for MTRRs).

This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:15 -04:00
Peter Xu
3bcd0662d6 KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
Use this file to dump rmap statistic information.  The statistic is done by
calculating the rmap count and the result is log-2-based.

An example output of this looks like (idle 6GB guest, right after boot linux):

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3086676 53045   12330   1272    502     121     76      2       0       0       0
Level=2M:       5947    231     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-5-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:11 -04:00
Paolo Bonzini
9a63b4517c Merge branch 'kvm-tdpmmu-fixes' into HEAD
Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
2021-08-13 03:35:01 -04:00
Sean Christopherson
ce25681d59 KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.

Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations.  But underflow is the best case scenario.  The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.

Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok.  Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.

For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write.  If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.

Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled.  Holding mmu_lock for read also
prevents the indirect shadow page from being freed.  But as above, keep
it simple and always take the lock.

Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP.  Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).

Alternative #2, the unsync logic could be made thread safe.  In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).

Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:32:14 -04:00
Paolo Bonzini
c3e9434c98 Merge branch 'kvm-vmx-secctl' into HEAD
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
2021-08-10 13:45:26 -04:00
David Matlack
93e083d4f4 KVM: x86/mmu: Rename __gfn_to_rmap to gfn_to_rmap
gfn_to_rmap was removed in the previous patch so there is no need to
retain the double underscore on __gfn_to_rmap.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:58 -04:00
David Matlack
601f8af01e KVM: x86/mmu: Leverage vcpu->last_used_slot for rmap_add and rmap_recycle
rmap_add() and rmap_recycle() both run in the context of the vCPU and
thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This
enables rmap_add() and rmap_recycle() to take advantage of
vcpu->last_used_slot and avoid expensive memslot searching.

This change improves the performance of "Populate memory time" in
dirty_log_perf_test with tdp_mmu=N. In addition to improving the
performance, "Populate memory time" no longer scales with the number
of memslots in the VM.

Command                         | Before           | After
------------------------------- | ---------------- | -------------
./dirty_log_perf_test -v64 -x1  | 15.18001570s     | 14.99469366s
./dirty_log_perf_test -v64 -x64 | 18.71336392s     | 14.98675076s

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:29 -04:00
Sean Christopherson
d5aaad6f83 KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds
Take a signed 'long' instead of an 'unsigned long' for the number of
pages to add/subtract to the total number of pages used by the MMU.  This
fixes a zero-extension bug on 32-bit kernels that effectively corrupts
the per-cpu counter used by the shrinker.

Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
4294967295 and effectively corrupts the counter.

This was found by a staggering amount of sheer dumb luck when running
kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
in while running tests and do_shrink_slab() logged an error about trying
to free a negative number of objects.  The truly lucky part is that the
kernel just happened to be a slightly stale build, as the shrinker no
longer yells about negative objects as of commit 18bb473e50 ("mm:
vmscan: shrink deferred objects proportional to priority").

 vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460

Fixes: bc8a3d8925 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210804214609.1096003-1-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05 03:33:56 -04:00
Peter Xu
a75b540451 KVM: X86: Optimize zapping rmap
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
slow.  The easy way is to travers the rmap list, collecting the a/d bits and
free the slots along the way.

Provide a pte_list_destroy() and do exactly that.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220605.26377-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00
Peter Xu
13236e25eb KVM: X86: Optimize pte_list_desc with per-array counter
Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
logic.  E.g., we don't need to loop over the array any more for most reasons.

This will make more sense after we've switched the array size to be larger
otherwise the counter will be a waste.

Initially I wanted to store a tail pointer at the head of the array list so we
don't need to traverse the list at least for pushing new ones (if without the
counter we traverse both the list and the array).  However that'll need
slightly more change without a huge lot benefit, e.g., after we grow entry
numbers per array the list traversing is not so expensive.

So let's be simple but still try to get as much benefit as we can with just
these extra few lines of changes (not to mention the code looks easier too
without looping over arrays).

I used the same a test case to fork 500 child and recycle them ("./rmap_fork
500" [1]), this patch further speeds up the total fork time of about 4%, which
is a total of 33% of vanilla kernel:

        Vanilla:      473.90 (+-5.93%)
        3->15 slots:  366.10 (+-4.94%)
        Add counter:  351.00 (+-3.70%)

[1] 825436f825

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220602.26327-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00