Commit Graph

812192 Commits

Author SHA1 Message Date
Sean Christopherson
eca6be566d KVM: doc: Document the life cycle of a VM and its resources
The series to add memcg accounting to KVM allocations[1] states:

  There are many KVM kernel memory allocations which are tied to the
  life of the VM process and should be charged to the VM process's
  cgroup.

While it is correct to account KVM kernel allocations to the cgroup of
the process that created the VM, it's technically incorrect to state
that the KVM kernel memory allocations are tied to the life of the VM
process.  This is because the VM itself, i.e. struct kvm, is not tied to
the life of the process which created it, rather it is tied to the life
of its associated file descriptor.  In other words, kvm_destroy_vm() is
not invoked until fput() decrements its associated file's refcount to
zero.  A simple example is to fork() in Qemu and have the child sleep
indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
descriptor *and* the rogue child is killed.

The allocations are guaranteed to be *accounted* to the process which
created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
the VM "alive" but can't do anything useful with its reference.

Note that because 'struct kvm' also holds a reference to the mm_struct
of its owner, the above behavior also applies to userspace allocations.

Given that mucking with a VM's file descriptor can lead to subtle and
undesirable behavior, e.g. memcg charges persisting after a VM is shut
down, explicitly document a VM's lifecycle and its impact on the VM's
resources.

Alternatively, KVM could aggressively free resources when the creating
process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
isn't guaranteed to be available, and freeing resources when the creator
exits is likely to be error prone and fragile as KVM would need to
ensure that it only freed resources that are truly out of reach. In
practice, the existing behavior shouldn't be problematic as a properly
configured system will prevent a child process from being moved out of
the appropriate cgroup hierarchy, i.e. prevent hiding the process from
the OOM killer, and will prevent an unprivileged user from being able to
to hold a reference to struct kvm via another method, e.g. debugfs.

[1]https://patchwork.kernel.org/patch/10806707/

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-15 19:24:33 +01:00
Paolo Bonzini
c7a0e83cb6 Third PPC KVM update for 5.1
- Tell userspace about whether a particular hardware workaround for
   one of the Spectre vulnerabilities is available, so that userspace
   can inform the guest.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJceLfNAAoJEJ2a6ncsY3GfWnMH/1H4BQOcWnyBMkGX/IIVYmYP
 Fxr9QdYdp+sRDmIgtKugARiW/cNdmrcTlW9qsOSD/sNpCtEjQ0t3R717f51DvSvC
 J+RAnBD6dOP5qplvsUTPoU5xnkqBhxkyTJnEbG6zWGyvQNqLQlJEe3xI91c5r6RB
 PiurQMtE7Mw/YyPpkthTQcm+JSgZX5QDd+KNkzL1rcfAV/4HlSXblAklwYgdsbch
 m/X1HmnZVG/Seh9hH3q8hvp/YaPFvHOvoPWXaifwrZ3xNQ/tgCyGNJp+wupoay6K
 RYHfsYmfoW0AbLnJqLH0Zbm9xDwkoZYjLbdWht4N/B5G0P0vkpQ86bdauabxCXI=
 =mkez
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-5.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

Third PPC KVM update for 5.1

- Tell userspace about whether a particular hardware workaround for
  one of the Spectre vulnerabilities is available, so that userspace
  can inform the guest.
2019-03-15 19:16:51 +01:00
Sean Christopherson
4633323648 MAINTAINERS: Add KVM selftests to existing KVM entry
It's safe to assume Paolo and Radim are maintaining the KVM selftests
given that the vast majority of commits have their SOBs.  Play nice
with get_maintainers and make it official.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-15 19:16:48 +01:00
Ben Gardon
92da008fa2 Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
This reverts commit 71883a62fc.

The above commit contains an optimization to kvm_zap_gfn_range which
uses gfn-limited TLB flushes, if enabled. If using these limited flushes,
kvm_zap_gfn_range passes lock_flush_tlb=false to slot_handle_level_range
which creates a race when the function unlocks to call cond_resched.
See an example of this race below:

CPU 0                   CPU 1                           CPU 3
// zap_direct_gfn_range
mmu_lock()
// *ptep == pte_1
*ptep = 0
if (lock_flush_tlb)
        flush_tlbs()
mmu_unlock()
                        // In invalidate range
                        // MMU notifier
                        mmu_lock()
                        if (pte != 0)
                                *ptep = 0
                                flush = true
                        if (flush)
                                flush_remote_tlbs()
                        mmu_unlock()
                        return
                        // Host MM reallocates
                        // page previously
                        // backing guest memory.
                                                        // Guest accesses
                                                        // invalid page
                                                        // through pte_1
                                                        // in its TLB!!

Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
	without this patch. The patch introduced no new failures.

Signed-off-by: Ben Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-15 19:16:45 +01:00
Suraj Jitindar Singh
2b57ecd020 KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
Add KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST &
KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE to the characteristics returned
from the H_GET_CPU_CHARACTERISTICS H-CALL, as queried from either the
hypervisor or the device tree.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-03-01 15:11:14 +11:00
Paul Mackerras
e74d53e30e KVM: PPC: Fix compilation when KVM is not enabled
Compiling with CONFIG_PPC_POWERNV=y and KVM disabled currently gives
an error like this:

  CC      arch/powerpc/kernel/dbell.o
In file included from arch/powerpc/kernel/dbell.c:20:0:
arch/powerpc/include/asm/kvm_ppc.h: In function ‘xics_on_xive’:
arch/powerpc/include/asm/kvm_ppc.h:625:9: error: implicit declaration of function ‘xive_enabled’ [-Werror=implicit-function-declaration]
  return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE);
         ^
cc1: all warnings being treated as errors
scripts/Makefile.build:276: recipe for target 'arch/powerpc/kernel/dbell.o' failed
make[3]: *** [arch/powerpc/kernel/dbell.o] Error 1

Fix this by making the xics_on_xive() definition conditional on the
same symbol (CONFIG_KVM_BOOK3S_64_HANDLER) that determines whether we
include <asm/xive.h> or not, since that's the header that defines
xive_enabled().

Fixes: 03f953329b ("KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-27 09:14:44 +11:00
Paolo Bonzini
71783e09b4 KVM/arm updates for Linux v5.1
- A number of pre-nested code rework
 - Direct physical timer assignment on VHE systems
 - kvm_call_hyp type safety enforcement
 - Set/Way cache sanitisation for 32bit guests
 - Build system cleanups
 - A bunch of janitorial fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlxwH3EVHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpD9j4P/i1lIUKfg/bxGw46wwoqF1MjOEPD
 uDQ6irms65FfqFkUkIPaU1du51UcI9nwncUPeJh+E3g2wp2f5EXsAp7tksARfIWU
 YCLLez5AiuYH6Otrs7YxLm8L/Sqqc3DacGWuyOamXmdWM9wZlv7F295Yfo5nX8zk
 IhksfBQH4/KvOPxkzbY6yy1StKOreuXQuboecrcfUP0lxwaUcbqxHMuynP7DneCv
 EHNo5TUjK975xH4jS/K61Ji9FmTlA/PgGqgn+EOw5KXGnKlphFBaTrzuE7vPLveR
 XPV1VeNEuEitH/+qVhZr8k2Du+3kKqQA8Ikxv6SasYAnqyVFTPPPMtUEgZXTLpHa
 6D4kIc+5jxgxF6Dyk3PKnjoNHPolCApj/uPCcTiD8dyY4smpJQ3+gxGiJkX68e92
 EkJlBj0Hn4xgudHi9UWLP+eZHT+v3L8mvVLP9N9oqapwc0x4g9YqJVbJMyAnT5Pw
 pLPSKTx9ApmyAEkdzRHjB89gG5cwjUzmx5BF7gASYSmTS9el9r5Kaaxx8zCmMt1R
 gM1TF7rBrgyW5s+bsIBf2rqk+5WAxag+FmeCQwghuNc+uhNfboRgoJzx7qFTpxeX
 KFS86QmQPRMWGR0klXgW1+hNOD8ACnqCOPGB+3d41ql3bgQ0otLItvj4RoA44JYG
 0Guq7o9EZNUYqDzA
 =iEs0
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-for-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-next

KVM/arm updates for Linux v5.1

- A number of pre-nested code rework
- Direct physical timer assignment on VHE systems
- kvm_call_hyp type safety enforcement
- Set/Way cache sanitisation for 32bit guests
- Build system cleanups
- A bunch of janitorial fixes
2019-02-22 17:45:05 +01:00
Paolo Bonzini
8f060f5355 KVM: s390: Features for 5.1
- Clarify KVM related kernel messages
 - Interrupt cleanup
 - Introduction of the Guest Information Block (GIB)
 - Preparation for processor subfunctions in cpu model
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJcb9iJAAoJEONU5rjiOLn4mI8P/1J3IyN3imnZUkTM5JOMWhNM
 TSwh5obvn7dT6URZ6BgM1DWIz9E/FKrEb2kU4xr1hwf/a69Q1cYKVmHSnzzIpxHQ
 ZNjr7QbcBCsVJ8LtasOoMmgGnVvtBYKKHr4J8UcqeW9raP3YfPJmqyETufiE2lFy
 G50r8EBFr9rPh7nK7ImAabKC/7Q/qxZ0729m71cu729/uBb/Wf6frqaDmFlA8362
 YZC7KY+xEHZbWqKQqAt/x1TWAOb7nA5dCzemeRckNrs5+FN7rSBrje6SbWApZPfn
 weteCVbJMLCoRMUTFjRy3YNz1x0gAC9VQT6Qz5Kz7dColVfJjTPWdYuKpbRsj+n1
 PEv1uuDBNbDqdS29KG3Dk9cfzUgAU12g+Xsb+3168HsQbU7XU1v6gCoRaR8ccaoq
 3k8Em0xusHa+uGI6K4knKmWboRrCA6FWHIaink4B2K7qIaVdWqTebhHaDiDx8qB8
 JRNjxQDho92FpRzxHyajHtamFKPjGT/Guc0yWMIrPHBn97GktUnDD6E5AdhTRVxs
 aXTZv7XFq5j307lc3qWsdAf4zGEaPbi9f2nHgFK8hJf+z560CmNbye9Rw6L96Lil
 gy0rvSQgN+3xBtSKvq3DNrgoouupOS6kFu5iyYLBS8UUOztXttKEzTCs+M87/3AP
 fphwixKEEXsMRWR2SJvG
 =YeIb
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next

KVM: s390: Features for 5.1

- Clarify KVM related kernel messages
- Interrupt cleanup
- Introduction of the Guest Information Block (GIB)
- Preparation for processor subfunctions in cpu model
2019-02-22 17:44:23 +01:00
Leo Yan
a242010776 KVM: Minor cleanups for kvm_main.c
This patch contains two minor cleanups: firstly it puts exported symbol
for kvm_io_bus_write() by following the function definition; secondly it
removes a redundant blank line.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22 17:43:57 +01:00
Paolo Bonzini
54a1f393ce PPC KVM update for 5.1
There are no major new features this time, just a collection of bug
 fixes and improvements in various areas, including machine check
 handling and context switching of protection-key-related registers.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJcb3lEAAoJEJ2a6ncsY3GflNwH/2ezxhHv7CRy18d2D3F+Kna+
 YQs3V/pJfBRvVdV7ZLxnR03H/NmzAK3UOzRfqGodYUtbF+gUDqSuM27lAxMKrjBv
 S87X5g/1ZdiQNnqYK7PIBn75Tx27vnw2kJAif8rXTfqbj8qLUsXcNhsziA16sJOA
 azbD5PBp9mOVzTojawyriJ3H8LYqw+vinad0idvFrApFCuNmMxv56FR6H+IBadt7
 1UJyx6AegQACdhxvy0CzmZjzzXw02z9zeFUa4lakm2sORc4fbbyyZ68CtkGURg7A
 8rt2j9SGt649ExpjfG2Cz/UihMGIMXSAOrpqTZMfyd9UPzPgHeKx2FidnxASUBc=
 =PIT8
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next

PPC KVM update for 5.1

There are no major new features this time, just a collection of bug
fixes and improvements in various areas, including machine check
handling and context switching of protection-key-related registers.
2019-02-22 17:43:05 +01:00
Christian Borntraeger
11ba5961a2 KVM: s390: add debug logging for cpu model subfunctions
As userspace can now get/set the subfunctions we want to trace those.
This will allow to also check QEMUs cpu model vs. what the real
hardware provides.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
2019-02-22 11:04:35 +01:00
Christian Borntraeger
346fa2f891 KVM: s390: implement subfunction processor calls
While we will not implement interception for query functions yet, we can
and should disable functions that have a control bit based on the given
CPU model.

Let us start with enabling the subfunction interface.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2019-02-22 11:04:35 +01:00
Dave Martin
c88b093693 arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
Due to what looks like a typo dating back to the original addition
of FPEXC32_EL2 handling, KVM currently initialises this register to
an architecturally invalid value.

As a result, the VECITR field (RES1) in bits [10:8] is initialised
with 0, and the two reserved (RES0) bits [6:5] are initialised with
1.  (In the Common VFP Subarchitecture as specified by ARMv7-A,
these two bits were IMP DEF.  ARMv8-A removes them.)

This patch changes the reset value from 0x70 to 0x700, which
reflects the architectural constraints and is presumably what was
originally intended.

Cc: <stable@vger.kernel.org> # 4.12.x-
Cc: Christoffer Dall <christoffer.dall@arm.com>
Fixes: 62a89c4495 ("arm64: KVM: 32bit handling of coprocessor traps")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2019-02-22 09:59:20 +00:00
Shaokun Zhang
7f5d9c1bc0 KVM: arm/arm64: Remove unused timer variable
The 'timer' local variable became unused after commit bee038a674
("KVM: arm/arm64: Rework the timer code to use a timer_map").
Remove it to avoid [-Wunused-but-set-variable] warning.

Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Pouloze <suzuki.poulose@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2019-02-22 09:41:52 +00:00
Paul Mackerras
0a0c50f771 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
series of commits that touch both general arch/powerpc code and KVM
code.  These commits will be merged both via the KVM tree and the
powerpc tree.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 13:52:30 +11:00
Michael Ellerman
c3c7470c75 powerpc/kvm: Save and restore host AMR/IAMR/UAMOR
When the hash MMU is active the AMR, IAMR and UAMOR are used for
pkeys. The AMR is directly writable by user space, and the UAMOR masks
those writes, meaning both registers are effectively user register
state. The IAMR is used to create an execute only key.

Also we must maintain the value of at least the AMR when running in
process context, so that any memory accesses done by the kernel on
behalf of the process are correctly controlled by the AMR.

Although we are correctly switching all registers when going into a
guest, on returning to the host we just write 0 into all regs, except
on Power9 where we restore the IAMR correctly.

This could be observed by a user process if it writes the AMR, then
runs a guest and we then return immediately to it without
rescheduling. Because we have written 0 to the AMR that would have the
effect of granting read/write permission to pages that the process was
trying to protect.

In addition, when using the Radix MMU, the AMR can prevent inadvertent
kernel access to userspace data, writing 0 to the AMR disables that
protection.

So save and restore AMR, IAMR and UAMOR.

Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 13:41:13 +11:00
Alexey Kardashevskiy
716cb11608 KVM: PPC: Book3S: Improve KVM reference counting
The anon fd's ops releases the KVM reference in the release hook.
However we reference the KVM object after we create the fd so there is
small window when the release function can be called and
dereferenced the KVM object which potentially may free it.

It is not a problem at the moment as the file is created and KVM is
referenced under the KVM lock and the release function obtains the same
lock before dereferencing the KVM (although the lock is not held when
calling kvm_put_kvm()) but it is potentially fragile against future changes.

This references the KVM object before creating a file.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 12:51:02 +11:00
Jordan Niethe
e40542aff9 KVM: PPC: Book3S HV: Fix build failure without IOMMU support
Currently trying to build without IOMMU support will fail:

  (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
  (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
  (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
  (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'

This happens because turning off IOMMU support will prevent
book3s_64_vio_hv.c from being built because it is only built when
SPAPR_TCE_IOMMU is set, which depends on IOMMU support.

Fix it using ifdefs for the undefined references.

Fixes: 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 12:51:02 +11:00
Paul Mackerras
c057720184 powerpc/64s: Better printing of machine check info for guest MCEs
This adds an "in_guest" parameter to machine_check_print_event_info()
so that we can avoid trying to translate guest NIP values into
symbolic form using the host kernel's symbol table.

Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21 23:16:45 +11:00
Paul Mackerras
884dfb722d KVM: PPC: Book3S HV: Simplify machine check handling
This makes the handling of machine check interrupts that occur inside
a guest simpler and more robust, with less done in assembler code and
in real mode.

Now, when a machine check occurs inside a guest, we always get the
machine check event struct and put a copy in the vcpu struct for the
vcpu where the machine check occurred.  We no longer call
machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
on POWER8, when a vcpu is running on an offline secondary thread and
we call machine_check_queue_event(), that calls irq_work_queue(),
which doesn't work because the CPU is offline, but instead triggers
the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
fires again and again because nothing clears the condition).

All that machine_check_queue_event() actually does is to cause the
event to be printed to the console.  For a machine check occurring in
the guest, we now print the event in kvmppc_handle_exit_hv()
instead.

The assembly code at label machine_check_realmode now just calls C
code and then continues exiting the guest.  We no longer either
synthesize a machine check for the guest in assembly code or return
to the guest without a machine check.

The code in kvmppc_handle_exit_hv() is extended to handle the case
where the guest is not FWNMI-capable.  In that case we now always
synthesize a machine check interrupt for the guest.  Previously, if
the host thinks it has recovered the machine check fully, it would
return to the guest without any notification that the machine check
had occurred.  If the machine check was caused by some action of the
guest (such as creating duplicate SLB entries), it is much better to
tell the guest that it has caused a problem.  Therefore we now always
generate a machine check interrupt for guests that are not
FWNMI-capable.

Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21 23:16:44 +11:00
Michael Ellerman
d976f6807e KVM: PPC: Book3S HV: Context switch AMR on Power9
kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
when guest and host are both running with the Radix MMU.

Currently in that path we don't save the host AMR (Authority Mask
Register) value, and we always restore 0 on return to the host. That
is OK at the moment because the AMR is not used for storage keys with
the Radix MMU.

However we plan to start using the AMR on Radix to prevent the kernel
from reading/writing to userspace outside of copy_to/from_user(). In
order to make that work we need to save/restore the AMR value.

We only restore the value if it is different from the guest value,
which is already in the register when we exit to the host. This should
mean we rarely need to actually restore the value when running a
modern Linux as a guest, because it will be using the same value as
us.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Russell Currey <ruscur@russell.cc>
2019-02-21 13:19:52 +11:00
Lan Tianyu
a67794cafb Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
The value of "dirty_bitmap[i]" is already check before setting its value
to mask. The following check of "mask" is redundant. The check of "mask" was
introduced by commit 58d2930f4e ("KVM: Eliminate extra function calls in
kvm_get_dirty_log_protect()"), revert it.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:52 +01:00
Marcelo Tosatti
7539b174ae x86: kvmguest: use TSC clocksource if invariant TSC is exposed
The invariant TSC bit has the following meaning:

"The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processor's support for invariant TSC
is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run
at a constant rate in all ACPI P-, C-. and T-states. This is the
architectural behavior moving forward. On processors with invariant TSC
support, the OS may use the TSC for wall clock timer services (instead
of ACPI or HPET timers). TSC reads are much more efficient and do not
incur the overhead associated with a ring transition or access to a
platform resource."

IOW, TSC does not change frequency. In such case, and with
TSC scaling hardware available to handle migration, it is possible
to use the TSC clocksource directly, whose system calls are
faster.

Reduce the rating of kvmclock clocksource to allow TSC clocksource
to be the default if invariant TSC is exposed.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

v2: Use feature bits and tsc_unstable() check (Sean Christopherson)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:52 +01:00
Nir Weiner
dee339b5c1 KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
grow_halt_poll_ns() have a strange behaviour in case
(vcpu->halt_poll_ns != 0) &&
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).

In this case, vcpu->halt_poll_ns will be multiplied by grow factor
(halt_poll_ns_grow) which will require several grow iteration in order
to reach a value bigger than halt_poll_ns_grow_start.
This means that growing vcpu->halt_poll_ns from value of 0 is slower
than growing it from a positive value less than halt_poll_ns_grow_start.
Which is misleading and inaccurate.

Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
to halt_poll_ns_grow_start in any case that
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).
Regardless if vcpu->halt_poll_ns is 0.

use READ_ONCE to get a consistent number for all cases.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:51 +01:00
Nir Weiner
49113d360b KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
start value when raising up vcpu->halt_poll_ns.
It actually sets the first timeout to the first polling session.
This value has significant effect on how tolerant we are to outliers.
On the standard case, higher value is better - we will spend more time
in the polling busyloop, handle events/interrupts faster and result
in better performance.
But on outliers it puts us in a busy loop that does nothing.
Even if the shrink factor is zero, we will still waste time on the first
iteration.
The optimal value changes between different workloads. It depends on
outliers rate and polling sessions length.
As this value has significant effect on the dynamic halt-polling
algorithm, it should be configurable and exposed.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:50 +01:00
Nir Weiner
7fa08e71b4 KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
grow_halt_poll_ns() have a strange behavior in case
(halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

In this case, vcpu->halt_pol_ns will be set to zero.
That results in shrinking instead of growing.

Fix issue by changing grow_halt_poll_ns() to not modify
vcpu->halt_poll_ns in case halt_poll_ns_grow is zero

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:50 +01:00
Sean Christopherson
8ab3c471ee KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
...via a new helper, __kvm_mmu_zap_all().  An alternative to passing a
'bool mmio_only' would be to pass a callback function to filter the
shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but
zapping all shadow pages is a last resort, i.e. making the helper less
extensible is a feature of sorts.  And the explicit MMIO parameter makes
it easy to preserve the WARN_ON_ONCE() if a restart is triggered when
zapping MMIO sptes.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:49 +01:00
Sean Christopherson
24efe61f69 KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a
quadratic runtime[1], i.e. restarting the spte walk while zapping only
MMIO sptes could result in re-walking large portions of the list over
and over due to the non-MMIO sptes encountered before the restart not
being removed.

At the time, the concern was legitimate as the walk was restarted when
any spte was zapped.  But that is no longer the case as the walk is now
restarted iff one or more children have been zapped, which is necessary
because zapping children makes the active_mmu_pages list unstable.

Furthermore, it should be impossible for an MMIO spte to have children,
i.e. zapping an MMIO spte should never result in zapping children.  In
other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and
so should always execute in linear time.  WARN if this assertion fails.

Although it should never be needed, leave the restart logic in place.
In normal operation, the cost is at worst an extra CMP+Jcc, and if for
some reason the list does become unstable, not restarting would likely
crash KVM, or worse, the kernel.

[1] https://patchwork.kernel.org/patch/10756589/#22452085

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:48 +01:00
Sean Christopherson
83cdb56864 KVM: x86/mmu: Differentiate between nr zapped and list unstable
The return value of kvm_mmu_prepare_zap_page() has evolved to become
overloaded to convey two separate pieces of information.  1) was at
least one page zapped and 2) has the list of MMU pages become unstable.

In it's original incarnation (as kvm_mmu_zap_page()), there was no
return value at all.  Commit 0738541396 ("KVM: MMU: awareness of new
kvm_mmu_zap_page behaviour") added a return value in preparation for
commit 4731d4c7a0 ("KVM: MMU: out of sync shadow core").  Although
the return value was of type 'int', it was actually used as a boolean
to indicate whether or not active_mmu_pages may have become unstable due
to zapping children.  Walking a list with list_for_each_entry_safe()
only protects against deleting/moving the current entry, i.e. zapping a
child page would break iteration due to modifying any number of entries.

Later, commit 60c8aec6e2 ("KVM: MMU: use page array in unsync walk")
modified mmu_zap_unsync_children() to return an approximation of the
number of children zapped.  This was not intentional, it was simply a
side effect of how the code was written.

The unintented side affect was then morphed into an actual feature by
commit 77662e0028 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
path"), which modified kvm_mmu_change_mmu_pages() to use the number of
zapped pages when determining the number of MMU pages in use by the VM.

Finally, commit 54a4f0239f ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed") added the initial page to the
return value to make its behavior more consistent with what most users
would expect.  Incorporating the initial parent page in the return value
of kvm_mmu_zap_page() breaks the original usage of restarting a list
walk on a non-zero return value to handle a potentially unstable list,
i.e. walks will unnecessarily restart when any page is zapped.

Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
return a boolean to indicate that the list may be unstable and move the
number of zapped children to a dedicated parameter.  Since the majority
of callers to kvm_mmu_prepare_zap_page() don't care about either return
value, preserve the current definition of kvm_mmu_prepare_zap_page() by
making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page().  This
avoids having to update every call site and also provides cleaner code
for functions that only care about the number of pages zapped.

Fixes: 54a4f0239f ("KVM: MMU: make kvm_mmu_zap_page() return
                      the number of pages it actually freed")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:48 +01:00
Sean Christopherson
ea145aacf4 Revert "KVM: MMU: fast invalidate all pages"
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1], now that all users of the fast invalidate
mechanism are gone.

This reverts commit 5304b8d37c.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:47 +01:00
Sean Christopherson
5d6317ca4e KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes
Call cond_resched_lock() when zapping all sptes to reschedule if needed
or to release and reacquire mmu_lock in case of contention.  There is no
need to flush or zap when temporarily dropping mmu_lock as zapping all
sptes is done only when the owning userspace VMM has exited or when the
VM is being destroyed, i.e. there is no interplay with memslots or MMIO
generations to worry about.

Be paranoid and restart the walk if mmu_lock is dropped to avoid any
potential issues with consuming a stale iterator.  The overhead in doing
so is negligible as at worst there will be a few root shadow pages at
the head of the list, i.e. the iterator is essentially the head of the
list already.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:46 +01:00
Sean Christopherson
8a674adc11 KVM: x86/mmu: skip over invalid root pages when zapping all sptes
...to guarantee forward progress.  When zapped, root pages are marked
invalid and moved to the head of the active pages list until they are
explicitly freed.  Theoretically, having unzappable root pages at the
head of the list could prevent kvm_mmu_zap_all() from making forward
progress were a future patch to add a loop restart after processing a
page, e.g. to drop mmu_lock on contention.

Although kvm_mmu_prepare_zap_page() can theoretically take action on
invalid pages, e.g. to zap unsync children, functionally it's not
necessary (root pages will be re-zapped when freed) and practically
speaking the odds of e.g. @unsync or @unsync_children becoming %true
while zapping all pages is basically nil.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:46 +01:00
Sean Christopherson
7390de1e99 Revert "KVM: x86: use the fast way to invalidate all pages"
Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all().
Flushing all shadow entries is only done during VM teardown, i.e.
kvm_arch_flush_shadow_all() is only called when the associated MM struct
is being released or when the VM instance is being freed.

Although the performance of teardown itself isn't critical, KVM should
still voluntarily schedule to play nice with the rest of the kernel;
but that can be done without the fast invalidate mechanism in a future
patch.

This reverts commit 6ca18b6950.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:45 +01:00
Sean Christopherson
b59c4830ca Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
is one part of a revert all patches from the series that introduced the
mechanism[1].

This reverts commit 2248b02321.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:44 +01:00
Sean Christopherson
42560fb1f3 Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
is one part of a revert all patches from the series that introduced the
mechanism[1].

This reverts commit 35006126f0.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:44 +01:00
Sean Christopherson
43d2b14b10 Revert "KVM: MMU: zap pages in batch"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].

This reverts commit e7d11c7a89.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:43 +01:00
Sean Christopherson
210f494261 Revert "KVM: MMU: collapse TLB flushes when zap all pages"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].

This reverts commit f34d251d66.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:42 +01:00
Sean Christopherson
52d5dedc79 Revert "KVM: MMU: reclaim the zapped-obsolete page first"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].

This reverts commit 365c886860.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:42 +01:00
Sean Christopherson
5ff0568374 KVM: x86/mmu: Remove is_obsolete() call
Unwinding usage of is_obsolete() is a step towards removing x86's fast
invalidate mechanism, i.e. this is one part of a revert all patches from
the series that introduced the mechanism[1].

This is a partial revert of commit 05988d728d ("KVM: MMU: reduce
KVM_REQ_MMU_RELOAD when root page is zapped").

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:41 +01:00
Sean Christopherson
571c5af06e KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes
Call cond_resched_lock() when zapping MMIO to reschedule if needed or to
release and reacquire mmu_lock in case of contention.  There is no need
to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes
is done when holding the memslots lock and with the "update in-progress"
bit set in the memslots generation, which disables MMIO spte caching.
The walk does need to be restarted if mmu_lock is dropped as the active
pages list may be modified.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:40 +01:00
Sean Christopherson
4771450c34 Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
Revert back to a dedicated (and slower) mechanism for handling the
scenario where all MMIO shadow PTEs need to be zapped due to overflowing
the MMIO generation number.  The MMIO generation scenario is almost
literally a one-in-a-million occurrence, i.e. is not a performance
sensitive scenario.

Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
the fast invalidate mechanism altogether.

This reverts commit a8eca9dcc6.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:40 +01:00
Sean Christopherson
a592a3b8fc Revert "KVM: MMU: document fast invalidate all pages"
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1].

Though not explicitly stated, for all intents and purposes the fast
invalidate mechanism was added to speed up the scenario where removing
a memslot, e.g. as part of accessing reading PCI ROM, caused KVM to
flush all shadow entries[1].  Now that the memslot case flushes only
shadow entries belonging to the memslot, i.e. doesn't use the fast
invalidate mechanism, the only remaining usage of the mechanism are
when the VM is being destroyed and when the MMIO generation rolls
over.

When a VM is being destroyed, either there are no active vcpus, i.e.
there's no lock contention, or the VM has ungracefully terminated, in
which case we want to reclaim its pages as quickly as possible, i.e.
not release the MMU lock if there are still CPUs executing in the VM.

The MMIO generation scenario is almost literally a one-in-a-million
occurrence, i.e. is not a performance sensitive scenario.

Given that lock-breaking is not desirable (VM teardown) or irrelevant
(MMIO generation overflow), remove the fast invalidate mechanism to
simplify the code (a small amount) and to discourage future code from
zapping all pages as using such a big hammer should be a last restort.

This reverts commit f6f8adeef5.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:39 +01:00
Sean Christopherson
4e103134b8 KVM: x86/mmu: Zap only the relevant pages when removing a memslot
Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
that actually belong to the memslot being removed.  This improves
performance, especially why the deleted memslot has only a few shadow
entries, or even no entries.  E.g. a microbenchmark to access regular
memory while concurrently reading PCI ROM to trigger memslot deletion
showed a 5% improvement in throughput.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:39 +01:00
Sean Christopherson
a21136345c KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()
...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does
not require a vcpu so that the code can be (re)used by
kvm_mmu_invalidate_zap_pages_in_memslot().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:38 +01:00
Sean Christopherson
85875a133e KVM: x86/mmu: Move slot_level_*() helper functions up a few lines
...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the
helpers in future patches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:37 +01:00
Sean Christopherson
164bf7e56c KVM: Move the memslot update in-progress flag to bit 63
...now that KVM won't explode by moving it out of bit 0.  Using bit 63
eliminates the need to jump over bit 0, e.g. when calculating a new
memslots generation or when propagating the memslots generation to an
MMIO spte.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:37 +01:00
Sean Christopherson
0e32958ec4 KVM: Remove the hack to trigger memslot generation wraparound
x86 captures a subset of the memslot generation (19 bits) in its MMIO
sptes so that it can expedite emulated MMIO handling by checking only
the releveant spte, i.e. doesn't need to do a full page fault walk.

Because the MMIO sptes capture only 19 bits (due to limited space in
the sptes), there is a non-zero probability that the MMIO generation
could wrap, e.g. after 500k memslot updates.  Since normal usage is
extremely unlikely to result in 500k memslot updates, a hack was added
by commit 69c9ea93ea ("KVM: MMU: init kvm generation close to mmio
wrap-around value") to offset the MMIO generation in order to trigger
a wraparound, e.g. after 150 memslot updates.

When separate memslot generation sequences were assigned to each
address space, commit 00f034a12f ("KVM: do not bias the generation
number in kvm_current_mmio_generation") moved the offset logic into the
initialization of the memslot generation itself so that the per-address
space bit(s) were not dropped/corrupted by the MMIO shenanigans.

Remove the offset hack for three reasons:

  - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
    wrapping the generation doesn't actually test the interesting case
    of having stale MMIO sptes with the new generation number, e.g. old
    sptes with a generation number of 0.

  - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
    performance rather important since the probability of invalidating
    MMIO sptes jumps from "effectively never" to "fairly likely".  This
    limits what can be done in future patches, e.g. to simplify the
    invalidation code, as doing so without proper caution could lead to
    a noticeable performance regression.

  - Forcing the memslots generation, which is a 64-bit number, to wrap
    prevents KVM from assuming the memslots generation will never wrap.
    This in turn prevents KVM from using an arbitrary bit for the
    "update in-progress" flag, e.g. using bit 63 would immediately
    collide with using a large value as the starting generation number.
    The "update in-progress" flag is effectively forced into bit 0 so
    that it's (subtly) taken into account when incrementing the
    generation.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:36 +01:00
Sean Christopherson
cae7ed3c2c KVM: x86: Refactor the MMIO SPTE generation handling
The code to propagate the memslots generation number into MMIO sptes is
a bit convoluted.  The "what" is relatively straightfoward, e.g. the
comment explaining which bits go where is quite readable, but the "how"
requires a lot of staring to understand what is happening.  For example,
'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the
spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.

Refactor the code to:

  - use #defines whose values align with the bits defined in the comment
  - use consistent code for both the high and low mask
  - explicitly highlight the handling of bit 0 (update in-progress flag)
  - explicitly call out that the defines are for MMIO sptes (to avoid
    confusion with the per-vCPU MMIO cache, which uses the full memslots
    generation)

In addition to making the code a little less magical, this paves the way
for moving the update in-progress flag to bit 63 without having to
simultaneously rewrite all of the MMIO spte code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:35 +01:00
Sean Christopherson
5192f9b976 KVM: x86: Use a u64 when passing the MMIO gen around
KVM currently uses an 'unsigned int' for the MMIO generation number
despite it being derived from the 64-bit memslots generation and
being propagated to (potentially) 64-bit sptes.  There is no hidden
agenda behind using an 'unsigned int', it's done simply because the
MMIO generation will never set bits above bit 19.

Passing a u64 will allow the "update in-progress" flag to be relocated
from bit 0 to bit 63 and removes the need to cast the generation back
to a u64 when propagating it to a spte.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:35 +01:00
Sean Christopherson
361209e054 KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing.  Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.

Prior to commit 4bd518f159 ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...

Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.

Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag.  This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:34 +01:00