Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about. For example,
mmu_set_spte looks quite like tdp_mmu_map_handle_target_level, but the
similarity is hidden by set_spte.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memslot for the faulting gfn is used throughout the page fault
handling code, so capture it in kvm_page_fault as soon as we know the
gfn and use it in the page fault handling code that has direct access
to the kvm_page_fault struct. Replace various tests using is_noslot_pfn
with more direct tests on fault->slot being NULL.
This, in combination with the subsequent patch, improves "Populate
memory time" in dirty_log_perf_test by 5% when using the legacy MMU.
There is no discerable improvement to the performance of the TDP MMU.
No functional change intended.
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This simplifies set_spte, which we want to remove, and unifies code
between the shadow MMU and the TDP MMU. The warning will be added
back later to make_spte as well.
There is a small disadvantage in the TDP MMU; it may unnecessarily mark
a page as dirty twice if two vCPUs end up mapping the same page twice.
However, this is a very small cost for a case that is already rare.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consolidate rmap_recycle and rmap_add into a single function since they
are only ever called together (and only from one place). This has a nice
side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In
addition it makes mmu_set_spte(), which is a very long function, a
little shorter.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN and bail if the shadow walk for faulting in a SPTE terminates early,
i.e. doesn't reach the expected level because the walk encountered a
terminal SPTE. The shadow walks for page faults are subtle in that they
install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop
body, and consume the newly created non-leaf SPTE in the loop control,
e.g. __shadow_walk_next(). In other words, the walks guarantee that the
walk will stop if and only if the target level is reached by installing
non-leaf SPTEs to guarantee the walk remains valid.
Opportunistically use fault->goal-level instead of it.level in
FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at
the target level.
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to tracepoints instead of extracting the
arguments from the struct. This also lets the kvm_mmu_spte_requested
tracepoint pick the gfn directly from fault->gfn, instead of using
the address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of
extracting the arguments from the struct. Tweak a bit the conditions
to avoid long lines.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of
extracting the arguments from the struct; the results are also stored
in the struct, so the callers are adjusted consequently.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to fast_page_fault() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to kvm_tdp_mmu_map() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to __direct_map() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to handle_abnormal_pfn() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add fields to struct kvm_page_fault corresponding to outputs of
kvm_faultin_pfn(). For now they have to be extracted again from struct
kvm_page_fault in the subsequent steps, but this is temporary until
other functions in the chain are switched over as well.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add fields to struct kvm_page_fault corresponding to the arguments
of page_fault_handle_page_track(). The fields are initialized in the
callers, and page_fault_handle_page_track() receives a struct
kvm_page_fault instead of having to extract the arguments out of it.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add fields to struct kvm_page_fault corresponding to
the arguments of direct_page_fault(). The fields are
initialized in the callers, and direct_page_fault()
receives a struct kvm_page_fault instead of having to
extract the arguments out of it.
Also adjust FNAME(page_fault) to store the max_level in
struct kvm_page_fault, to keep it similar to the direct
map path.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to mmu->page_fault() instead of
extracting the arguments from the struct. FNAME(page_fault) can use
the precomputed bools from the error code.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not bother removing the low bits of the gpa. This masking dates back
to the very first commit of KVM but it is unnecessary, as exemplified
by the other call in kvm_tdp_page_fault.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
So far, the loop bodies already ensure the PTE is present before calling
__shadow_walk_next(): Some loop bodies simply exit with a !PRESENT
directly and some other loop bodies, i.e. FNAME(fetch) and __direct_map()
do not currently guard their walks with is_shadow_present_pte, but only
because they install present non-leaf SPTEs in the loop itself.
But checking pte present in __shadow_walk_next() (which is called from
shadow_walk_okay()) is more prudent; walking past a !PRESENT SPTE
would lead to attempting to read a the next level SPTE from a garbage
iter->shadow_addr. It also allows to remove the is_shadow_present_pte()
checks from the loop bodies.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently kvm_sync_page() returns true when there is any present spte.
But the return value is ignored in the callers.
Changing kvm_sync_page() to return true when remote flush is needed and
changing mmu->sync_page() not to directly flush can combine and reduce
remote flush requests.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After any shadow page modification, flushing tlb only on current VCPU
is weird due to other VCPU's tlb might still be stale.
In other words, if there is any mandatory tlb-flushing after shadow page
modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be
set and the tlbs of all VCPUs should be flushed. There is not point to
only flush current tlb except when the request is from vCPU's or pCPU's
activities.
If there was any bug that mandatory tlb-flushing is required and
SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush is failed to set, this patch
would expose the bug in a more destructive way. The related code paths
are checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH is found yet.
Currently, there is no optional tlb-flushing after sync page related code
is changed to flush tlb timely. So we can just remove these local flushing
code.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make a final call to direct_pte_prefetch_many() if there are "trailing"
SPTEs to prefetch, i.e. SPTEs for GFNs following the faulting GFN. The
call to direct_pte_prefetch_many() in the loop only handles the case
where there are !PRESENT SPTEs preceding a PRESENT SPTE.
E.g. if the faulting GFN is a multiple of 8 (the prefetch size) and all
SPTEs for the following GFNs are !PRESENT, the loop will terminate with
"start = sptep+1" and not prefetch any SPTEs.
Prefetching trailing SPTEs as intended can drastically reduce the number
of guest page faults, e.g. accessing the first byte of every 4kb page in
a 6gb chunk of virtual memory, in a VM with 8gb of preallocated memory,
the number of pf_fixed events observed in L0 drops from ~1.75M to <0.27M.
Note, this only affects memory that is backed by 4kb pages as KVM doesn't
prefetch when installing hugepages. Shadow paging prefetching is not
affected as it does not batch the prefetches due to the need to process
the corresponding guest PTE. The TDP MMU is not affected because it
doesn't have prefetching, yet...
Fixes: 957ed9effd ("KVM: MMU: prefetch ptes when intercepted guest #PF")
Cc: Sergey Senozhatsky <senozhatsky@google.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210818235615.2047588-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM. So the host must synchronze sp before
link it. Otherwise the guest might use a wrong mapping.
For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb. The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.
Fixes: 4731d4c7a0 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is reasonable for these functions to be used only in some configurations,
for example only if the host is 64-bits (and therefore supports 64-bit
guests). It is also reasonable to keep the role_regs and role accessors
in sync even though some of the accessors may be used only for one of the
two sets (as is the case currently for CR4.LA57)..
Because clang reports warnings for unused inlines declared in a .c file,
mark both sets of accessors as __maybe_unused.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert a misguided illegal GPA check when "translating" a non-nested GPA.
The check is woefully incomplete as it does not fill in @exception as
expected by all callers, which leads to KVM attempting to inject a bogus
exception, potentially exposing kernel stack information in the process.
WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0
RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
Call Trace:
x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853
kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199
handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336
__vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline]
vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038
vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712
vcpu_run arch/x86/kvm/x86.c:9779 [inline]
kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010
kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652
The bug has escaped notice because practically speaking the GPA check is
useless. The GPA check in question only comes into play when KVM is
walking guest page tables (or "translating" CR3), and KVM already handles
illegal GPA checks by setting reserved bits in rsvd_bits_mask for each
PxE, or in the case of CR3 for loading PTDPTRs, manually checks for an
illegal CR3. This particular failure doesn't hit the existing reserved
bits checks because syzbot sets guest.MAXPHYADDR=1, and IA32 architecture
simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging
doesn't define any reserved PA bits checks, which KVM emulates by only
incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32.
Simply remove the bogus check. There is zero meaningful value and no
architectural justification for supporting guest.MAXPHYADDR < 32, and
properly filling the exception would introduce non-trivial complexity.
This reverts commit ec7771ab47.
Fixes: ec7771ab47 ("KVM: x86: mmu: Add guest physical address check in translate_gpa()")
Cc: stable@vger.kernel.org
Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210831164224.1119728-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Include pml5_root in the set of special roots if and only if the host,
and thus NPT, is using 5-level paging. mmu_alloc_special_roots() expects
special roots to be allocated as a bundle, i.e. they're either all valid
or all NULL. But for pml5_root, that expectation only holds true if the
host uses 5-level paging, which causes KVM to WARN about pml5_root being
NULL when the other special roots are valid.
The silver lining of 4-level vs. 5-level NPT being tied to the host
kernel's paging level is that KVM's shadow root level is constant; unlike
VMX's EPT, KVM can't choose 4-level NPT based on guest.MAXPHYADDR. That
means KVM can still expect pml5_root to be bundled with the other special
roots, it just needs to be conditioned on the shadow root level.
Fixes: cb0f722aff ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210824005824.205536-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the 5-level page table CPU flag is set in the host, but the guest
has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
the shadow NPT page tables will be fixed, consisting of one pointer to
a lower-level table and 511 non-present entries. Extend the existing
code that creates the fixed PML4 or PDP table, to provide a fixed PML5
table if needed.
This is not needed on EPT because the number of layers in the tables
is specified in the EPTP instead of depending on the host CR4.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD future CPUs will require a 5-level NPT if host CR4.LA57 is set.
To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level
on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force
a fixed TDP level.
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change started as a way to make kvm_mmu_hugepage_adjust a bit simpler,
but it does fix two bugs as well.
One bug is in zapping collapsible PTEs. If a large page size is
disallowed but not all of them, kvm_mmu_max_mapping_level will return the
host mapping level and the small PTEs will be zapped up to that level.
However, if e.g. 1GB are prohibited, we can still zap 4KB mapping and
preserve the 2MB ones. This can happen for example when NX huge pages
are in use.
The second would happen when userspace backs guest memory
with a 1gb hugepage but only assign a subset of the page to
the guest. 1gb pages would be disallowed by the memslot, but
not 2mb. kvm_mmu_max_mapping_level() would fall through to the
host_pfn_mapping_level() logic, see the 1gb hugepage, and map the whole
thing into the guest.
Fixes: 2f57b7051f ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large page of 1GB (or larger) is adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.
So remove the lpages since it is easy for user space to aggregate the info.
Instead, provide a comprehensive page stats of all sizes from 4K to 512G.
Suggested-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop an unnecessary is_shadow_present_pte() check when updating the rmaps
after installing a non-MMIO SPTE. set_spte() is used only to create
shadow-present SPTEs, e.g. MMIO SPTEs are handled early on, mmu_set_spte()
runs with mmu_lock held for write, i.e. the SPTE can't be zapped between
writing the SPTE and updating the rmaps.
Opportunistically combine the "new SPTE" logic for large pages and rmaps.
No functional change intended.
Suggested-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20210803044607.599629-2-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
on AMD, APIC virtualization needs to dynamicaly inhibit the AVIC in a
response to some events, and this is problematic and not efficient to do by
enabling/disabling the memslot that covers APIC's mmio range.
Plus due to SRCU locking, it makes it more complex to
request AVIC inhibition.
Instead, the APIC memslot will be always enabled, but be invisible
to the guest, such as the MMU code will not install a SPTE for it,
when it is inhibited and instead jump straight to emulating the access.
When inhibiting the AVIC, this SPTE will be zapped.
This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This together with the next patch will fix a future race between
kvm_zap_gfn_range and the page fault handler, which will happen
when AVIC memslot is going to be only partially disabled.
The performance impact is minimal since kvm_zap_gfn_range is only
called by users, update_mtrr() and kvm_post_set_cr0().
Both only use it if the guest has non-coherent DMA, in order to
honor the guest's UC memtype.
MTRR and CD setup only happens at boot, and generally in an area
where the page tables should be small (for CD) or should not
include the affected GFNs at all (for MTRRs).
This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use this file to dump rmap statistic information. The statistic is done by
calculating the rmap count and the result is log-2-based.
An example output of this looks like (idle 6GB guest, right after boot linux):
Rmap_Count: 0 1 2-3 4-7 8-15 16-31 32-63 64-127 128-255 256-511 512-1023
Level=4K: 3086676 53045 12330 1272 502 121 76 2 0 0 0
Level=2M: 5947 231 0 0 0 0 0 0 0 0 0
Level=1G: 32 0 0 0 0 0 0 0 0 0 0
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-5-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
rmap_add() and rmap_recycle() both run in the context of the vCPU and
thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This
enables rmap_add() and rmap_recycle() to take advantage of
vcpu->last_used_slot and avoid expensive memslot searching.
This change improves the performance of "Populate memory time" in
dirty_log_perf_test with tdp_mmu=N. In addition to improving the
performance, "Populate memory time" no longer scales with the number
of memslots in the VM.
Command | Before | After
------------------------------- | ---------------- | -------------
./dirty_log_perf_test -v64 -x1 | 15.18001570s | 14.99469366s
./dirty_log_perf_test -v64 -x64 | 18.71336392s | 14.98675076s
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Take a signed 'long' instead of an 'unsigned long' for the number of
pages to add/subtract to the total number of pages used by the MMU. This
fixes a zero-extension bug on 32-bit kernels that effectively corrupts
the per-cpu counter used by the shrinker.
Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
an unsigned 32-bit value on 32-bit kernels. As a result, the value used
to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
4294967295 and effectively corrupts the counter.
This was found by a staggering amount of sheer dumb luck when running
kvm-unit-tests on a 32-bit KVM build. The shrinker just happened to kick
in while running tests and do_shrink_slab() logged an error about trying
to free a negative number of objects. The truly lucky part is that the
kernel just happened to be a slightly stale build, as the shrinker no
longer yells about negative objects as of commit 18bb473e50 ("mm:
vmscan: shrink deferred objects proportional to priority").
vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
Fixes: bc8a3d8925 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210804214609.1096003-1-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
slow. The easy way is to travers the rmap list, collecting the a/d bits and
free the slots along the way.
Provide a pte_list_destroy() and do exactly that.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220605.26377-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
logic. E.g., we don't need to loop over the array any more for most reasons.
This will make more sense after we've switched the array size to be larger
otherwise the counter will be a waste.
Initially I wanted to store a tail pointer at the head of the array list so we
don't need to traverse the list at least for pushing new ones (if without the
counter we traverse both the list and the array). However that'll need
slightly more change without a huge lot benefit, e.g., after we grow entry
numbers per array the list traversing is not so expensive.
So let's be simple but still try to get as much benefit as we can with just
these extra few lines of changes (not to mention the code looks easier too
without looping over arrays).
I used the same a test case to fork 500 child and recycle them ("./rmap_fork
500" [1]), this patch further speeds up the total fork time of about 4%, which
is a total of 33% of vanilla kernel:
Vanilla: 473.90 (+-5.93%)
3->15 slots: 366.10 (+-4.94%)
Add counter: 351.00 (+-3.70%)
[1] 825436f825
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220602.26327-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>