linux

Author	SHA1	Message	Date
Ben Gardon	2de4085ccc	KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent Recursively zap all to-be-orphaned children, unsynced or otherwise, when zapping a shadow page for a nested TDP MMU. KVM currently only zaps the unsynced child pages, but not the synced ones. This can create problems over time when running many nested guests because it leaves unlinked pages which will not be freed until the page quota is hit. With the default page quota of 20 shadow pages per 1000 guest pages, this looks like a memory leak and can degrade MMU performance. In a recent benchmark, substantial performance degradation was observed: An L1 guest was booted with 64G memory. 2G nested Windows guests were booted, 10 at a time for 20 iterations. (200 total boots) Windows was used in this benchmark because they touch all of their memory on startup. By the end of the benchmark, the nested guests were taking ~10% longer to boot. With this patch there is no degradation in boot time. Without this patch the benchmark ends with hundreds of thousands of stale EPT02 pages cluttering up rmaps and the page hash map. As a result, VM shutdown is also much slower: deleting memslot 0 was observed to take over a minute. With this patch it takes just a few miliseconds. Cc: Peter Shier <pshier@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923221406.16297-3-sean.j.christopherson@intel.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:35 -04:00
Sean Christopherson	ace569e015	KVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to FNAME(invlpg) Move the logic that controls whether or not FNAME(invlpg) needs to flush fully into FNAME(invlpg) so that mmu_page_zap_pte() doesn't return a value. This allows a future patch to redefine the return semantics for mmu_page_zap_pte() so that it can recursively zap orphaned child shadow pages for nested TDP MMUs. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923221406.16297-2-sean.j.christopherson@intel.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:34 -04:00
Sean Christopherson	4d710de964	KVM: x86/mmu: Stash 'kvm' in a local variable in kvm_mmu_free_roots() To make kvm_mmu_free_roots() a bit more readable, capture 'kvm' in a local variable instead of doing vcpu->kvm over and over (and over). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923191204.8410-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:32 -04:00
Sean Christopherson	dc46515cf8	KVM: x86: Move illegal GPA helper out of the MMU code Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it to cpuid.h so that's it's colocated with cpuid_maxphyaddr(). The helper is not MMU specific and will gain a user that is completely unrelated to the MMU in a future patch. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200924194250.19137-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:27 -04:00
Sean Christopherson	09e3e2a1cc	KVM: x86: Add kvm_x86_ops hook to short circuit emulation Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a more generic is_emulatable(), and unconditionally call the new function in x86_emulate_instruction(). KVM will use the generic hook to support multiple security related technologies that prevent emulation in one way or another. Similar to the existing AMD #NPF case where emulation of the current instruction is not possible due to lack of information, AMD's SEV-ES and Intel's SGX and TDX will introduce scenarios where emulation is impossible due to the guest's register state being inaccessible. And again similar to the existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(), i.e. outside of the control of vendor-specific code. While the cause and architecturally visible behavior of the various cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean resume or complete shutdown, and SEV-ES and TDX "return" an error, the impact on the common emulation code is identical: KVM must stop emulation immediately and resume the guest. Query is_emulatable() in handle_ud() as well so that the force_emulation_prefix code doesn't incorrectly modify RIP before calling emulate_instruction() in the absurdly unlikely scenario that KVM encounters forced emulation in conjunction with "do not emulate". Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:19 -04:00
Paolo Bonzini	bf3c0e5e71	Merge branch 'x86-seves-for-paolo' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD	2020-09-22 06:43:17 -04:00
Linus Torvalds	84b1349972	ARM: - Multiple stolen time fixes, with a new capability to match x86 - Fix for hugetlbfs mappings when PUD and PMD are the same level - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example) - Fix tracing output of 64bit values x86: - nSVM state restore fixes - Async page fault fixes - Lots of small fixes everywhere -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9dM5kUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroM+Iwf+LbISO7ccpPMK1kKtOeug/jZv+xQA sVaBGRzYo+k2e0XtV8E8IV4N30FBtYSwXsbBKkMAoy2FpmMebgDWDQ7xspb6RJMS /y8t1iqPwdOaLIkUkgc7UihSTlZm05Es3f3q6uZ9+oaM4Fe+V7xWzTUX4Oy89JO7 KcQsTD7pMqS4bfZGADK781ITR/WPgCi0aYx5s6dcwcZAQXhb1K1UKEjB8OGKnjUh jliReJtxRA16rjF+S5aJ7L07Ce/ksrfwkI4NXJ4GxW+lyOfVNdSBJUBaZt1m7G2M 1We5+i5EjKCjuxmgtUUUfVdazpj1yl+gBGT7KKkLte9T9WZdXyDnixAbvg== =OFb3 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "A bit on the bigger side, mostly due to me being on vacation, then busy, then on parental leave, but there's nothing worrisome. ARM: - Multiple stolen time fixes, with a new capability to match x86 - Fix for hugetlbfs mappings when PUD and PMD are the same level - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example) - Fix tracing output of 64bit values x86: - nSVM state restore fixes - Async page fault fixes - Lots of small fixes everywhere" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: emulator: more strict rsm checks. KVM: nSVM: more strict SMM checks when returning to nested guest SVM: nSVM: setup nested msr permission bitmap on nested state load SVM: nSVM: correctly restore GIF on vmexit from nesting after migration x86/kvm: don't forget to ACK async PF IRQ x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macro KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit KVM: SVM: avoid emulation with stale next_rip KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN KVM: SVM: Periodically schedule when unregistering regions on destroy KVM: MIPS: Change the definition of kvm type kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control KVM: fix memory leak in kvm_io_bus_unregister_dev() KVM: Check the allocation of pv cpu mask KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected KVM: arm64: Update page shift if stage 2 block mapping not supported KVM: arm64: Fix address truncation in traces KVM: arm64: Do not try to map PUDs when they are folded into PMD arm64/x86: KVM: Introduce steal-time cap ...	2020-09-13 08:34:47 -07:00
Lai Jiangshan	f6f6195b88	kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed When kvm_mmu_get_page() gets a page with unsynced children, the spt pagetable is unsynchronized with the guest pagetable. But the guest might not issue a "flush" operation on it when the pagetable entry is changed from zero or other cases. The hypervisor has the responsibility to synchronize the pagetables. KVM behaved as above for many years, But commit `8c8560b833` ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes") inadvertently included a line of code to change it without giving any reason in the changelog. It is clear that the commit's intention was to change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't needlessly flush other contexts; however, one of the hunks changed a nearby KVM_REQ_MMU_SYNC instead. This patch changes it back. Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/ Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com> fixes: `8c8560b833` ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-11 13:16:55 -04:00
Madhuparna Bhowmik	df9a30fd1f	kvm: mmu: page_track: Fix RCU list API usage Use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu() as it also checkes if the right lock is held. Using hlist_for_each_entry_rcu() with a condition argument will not report the cases where a SRCU protected list is traversed using rcu_read_lock(). Hence, use hlist_for_each_entry_srcu(). Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: <kvm@vger.kernel.org>	2020-08-24 18:36:23 -07:00
Gustavo A. R. Silva	df561f6688	treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-08-23 17:36:59 -05:00
Will Deacon	fdfe7cbd58	KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-2-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-08-21 18:03:47 -04:00
Linus Torvalds	921d2597ab	s390: implement diag318 x86: * Report last CPU for debugging * Emulate smaller MAXPHYADDR in the guest than in the host * .noinstr and tracing fixes from Thomas * nested SVM page table switching optimization and fixes Generic: * Unify shadow MMU cache data structures across architectures -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl8pC+oUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcOwgAjomqtEqQNlp7DdZT7VyyklzbxX1/ ud7v+oOJ8K4sFlf64lSthjPo3N9rzZCcw+yOXmuyuITngXOGc3tzIwXpCzpLtuQ1 WO1Ql3B/2dCi3lP5OMmsO1UAZqy9pKLg1dfeYUPk48P5+p7d/NPmk+Em5kIYzKm5 JsaHfCp2EEXomwmljNJ8PQ1vTjIQSSzlgYUBZxmCkaaX7zbEUMtxAQCStHmt8B84 33LczwXBm3viSWrzsoBV37I70+tseugiSGsCfUyupXOvq55d6D9FCqtCb45Hn4Vh Ik8ggKdalsk/reiGEwNw1/3nr6mRMkHSbl+Mhc4waOIFf9dn0urgQgOaDg== =YVx0 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "s390: - implement diag318 x86: - Report last CPU for debugging - Emulate smaller MAXPHYADDR in the guest than in the host - .noinstr and tracing fixes from Thomas - nested SVM page table switching optimization and fixes Generic: - Unify shadow MMU cache data structures across architectures" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) KVM: SVM: Fix sev_pin_memory() error handling KVM: LAPIC: Set the TDCR settable bits KVM: x86: Specify max TDP level via kvm_configure_mmu() KVM: x86/mmu: Rename max_page_level to max_huge_page_level KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch KVM: x86: Pull the PGD's level from the MMU instead of recalculating it KVM: VMX: Make vmx_load_mmu_pgd() static KVM: x86/mmu: Add separate helper for shadow NPT root page role calc KVM: VMX: Drop a duplicate declaration of construct_eptp() KVM: nSVM: Correctly set the shadow NPT root level in its MMU role KVM: Using macros instead of magic values MIPS: KVM: Fix build error caused by 'kvm_run' cleanup KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support KVM: VMX: optimize #PF injection when MAXPHYADDR does not match KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: VMX: introduce vmx_need_pf_intercept KVM: x86: update exception bitmap on CPUID changes KVM: x86: rename update_bp_intercept to update_exception_bitmap ...	2020-08-06 12:59:31 -07:00
Linus Torvalds	99ea1521a0	Remove uninitialized_var() macro for v5.9-rc1 - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var() -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+ drhmmImUr1YylrtVOw== =xHmk -----END PGP SIGNATURE----- Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()	2020-08-04 13:49:43 -07:00
Sean Christopherson	83013059bd	KVM: x86: Specify max TDP level via kvm_configure_mmu() Capture the max TDP level during kvm_configure_mmu() instead of using a kvm_x86_ops hook to do it at every vCPU creation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-30 18:18:08 -04:00
Sean Christopherson	1d92d2e8e7	KVM: x86/mmu: Rename max_page_level to max_huge_page_level Rename max_page_level to explicitly call out that it tracks the max huge page level so as to avoid confusion when a future patch moves the max TDP level, i.e. max root level, into the MMU and kvm_configure_mmu(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-30 18:17:58 -04:00
Sean Christopherson	d468d94b7b	KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR Calculate the desired TDP level on the fly using the max TDP level and MAXPHYADDR instead of doing the same when CPUID is updated. This avoids the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior across x86. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-30 18:17:16 -04:00
Sean Christopherson	59505b55aa	KVM: x86/mmu: Add separate helper for shadow NPT root page role calc Refactor the shadow NPT role calculation into a separate helper to better differentiate it from the non-nested shadow MMU, e.g. the NPT variant is never direct and derives its root level from the TDP level. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-3-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-30 18:14:34 -04:00
Sean Christopherson	096586fda5	KVM: nSVM: Correctly set the shadow NPT root level in its MMU role Move the initialization of shadow NPT MMU's shadow_root_level into kvm_init_shadow_npt_mmu() and explicitly set the level in the shadow NPT MMU's role to be the TDP level. This ensures the role and MMU levels are synchronized and also initialized before __kvm_mmu_new_pgd(), which consumes the level when attempting a fast PGD switch. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: `9fa72119b2` ("kvm: x86: Introduce kvm_mmu_calc_root_page_role()") Fixes: `a506fdd223` ("KVM: nSVM: implement nested_svm_load_cr3() and use it for host->guest switch") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-2-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-30 18:13:23 -04:00
Kees Cook	3f649ab728	treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' \| cut -d: -f1 \| sort -u \| \ xargs perl -pi -e \ 's/\buninitialized_var$([^$]+)\)/\1/g; s:\s/\ (GCC be quiet\|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-16 12:35:15 -07:00
Mohammed Gamal	ec7771ab47	KVM: x86: mmu: Add guest physical address check in translate_gpa() Intel processors of various generations have supported 36, 39, 46 or 52 bits for physical addresses. Until IceLake introduced MAXPHYADDR==52, running on a machine with higher MAXPHYADDR than the guest more or less worked, because software that relied on reserved address bits (like KVM) generally used bit 51 as a marker and therefore the page faults where generated anyway. Unfortunately this is not true anymore if the host MAXPHYADDR is 52, and this can cause problems when migrating from a MAXPHYADDR<52 machine to one with MAXPHYADDR==52. Typically, the latter are machines that support 5-level page tables, so they can be identified easily from the LA57 CPUID bit. When that happens, the guest might have a physical address with reserved bits set, but the host won't see that and trap it. Hence, we need to check page faults' physical addresses against the guest's maximum physical memory and if it's exceeded, we need to add the PFERR_RSVD_MASK bits to the page fault error code. This patch does this for the MMU's page walks. The next patches will ensure that the correct exception and error code is produced whenever no host-reserved bits are set in page table entries. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-4-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 13:09:59 -04:00
Mohammed Gamal	cd313569f5	KVM: x86: mmu: Move translate_gpa() to mmu.c Also no point of it being inline since it's always called through function pointers. So remove that. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-3-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 13:09:40 -04:00
Vitaly Kuznetsov	fe9304d318	KVM: x86: drop superfluous mmu_check_root() from fast_pgd_switch() The mmu_check_root() check in fast_pgd_switch() seems to be superfluous: when GPA is outside of the visible range cached_root_available() will fail for non-direct roots (as we can't have a matching one on the list) and we don't seem to care for direct ones. Also, raising #TF immediately when a non-existent GFN is written to CR3 doesn't seem to mach architectural behavior. Drop the check. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710141157.1640173-10-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 13:03:12 -04:00
Vitaly Kuznetsov	a506fdd223	KVM: nSVM: implement nested_svm_load_cr3() and use it for host->guest switch Undesired triple fault gets injected to L1 guest on SVM when L2 is launched with certain CR3 values. #TF is raised by mmu_check_root() check in fast_pgd_switch() and the root cause is that when kvm_set_cr3() is called from nested_prepare_vmcb_save() with NPT enabled CR3 points to a nGPA so we can't check it with kvm_is_visible_gfn(). Using generic kvm_set_cr3() when switching to nested guest is not a great idea as we'll have to distinguish between 'real' CR3s and 'nested' CR3s to e.g. not call kvm_mmu_new_pgd() with nGPA. Following nVMX implement nested-specific nested_svm_load_cr3() doing the job. To support the change, nested_svm_load_cr3() needs to be re-ordered with nested_svm_init_mmu_context(). Note: the current implementation is sub-optimal as we always do TLB flush/MMU sync but this is still an improvement as we at least stop doing kvm_mmu_reset_context(). Fixes: `7c390d350f` ("kvm: x86: Add fast CR3 switch code path") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710141157.1640173-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 12:57:37 -04:00
Paolo Bonzini	8c008659aa	KVM: MMU: stop dereferencing vcpu->arch.mmu to get the context for MMU init kvm_init_shadow_mmu() was actually the only function that could be called with different vcpu->arch.mmu values. Now that kvm_init_shadow_npt_mmu() is separated from kvm_init_shadow_mmu(), we always know the MMU context we need to use and there is no need to dereference vcpu->arch.mmu pointer. Based on a patch by Vitaly Kuznetsov <vkuznets@redhat.com>. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710141157.1640173-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 12:53:58 -04:00
Vitaly Kuznetsov	0f04a2ac4f	KVM: nSVM: split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu() As a preparatory change for moving kvm_mmu_new_pgd() from nested_prepare_vmcb_save() to nested_svm_init_mmu_context() split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu(). This also makes the code look more like nVMX (kvm_init_shadow_ept_mmu()). No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710141157.1640173-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 12:53:27 -04:00
Sean Christopherson	6926f95acc	KVM: Move x86's MMU memory cache helpers to common KVM code Move x86's memory cache helpers to common KVM code so that they can be reused by arm64 and MIPS in future patches. Suggested-by: Christoffer Dall <christoffer.dall@arm.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-16-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:42 -04:00
Sean Christopherson	94ce87ef81	KVM: x86/mmu: Prepend "kvm_" to memory cache helpers that will be global Rename the memory helpers that will soon be moved to common code and be made globaly available via linux/kvm_host.h. "mmu" alone is not a sufficient namespace for globally available KVM symbols. Opportunistically add "nr_" in mmu_memory_cache_free_objects() to make it clear the function returns the number of free objects, as opposed to freeing existing objects. Suggested-by: Christoffer Dall <christoffer.dall@arm.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-14-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:41 -04:00
Sean Christopherson	378f5cd64a	KVM: x86/mmu: Skip filling the gfn cache for guaranteed direct MMU topups Don't bother filling the gfn array cache when the caller is a fully direct MMU, i.e. won't need a gfn array for shadow pages. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-13-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:41 -04:00
Sean Christopherson	9688088378	KVM: x86/mmu: Zero allocate shadow pages (outside of mmu_lock) Set __GFP_ZERO for the shadow page memory cache and drop the explicit clear_page() from kvm_mmu_get_page(). This moves the cost of zeroing a page to the allocation time of the physical page, i.e. when topping up the memory caches, and thus avoids having to zero out an entire page while holding mmu_lock. Cc: Peter Feiner <pfeiner@google.com> Cc: Peter Shier <pshier@google.com> Cc: Junaid Shahid <junaids@google.com> Cc: Jim Mattson <jmattson@google.com> Suggested-by: Ben Gardon <bgardon@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-12-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:40 -04:00
Sean Christopherson	5f6078f9f1	KVM: x86/mmu: Make __GFP_ZERO a property of the memory cache Add a gfp_zero flag to 'struct kvm_mmu_memory_cache' and use it to control __GFP_ZERO instead of hardcoding a call to kmem_cache_zalloc(). A future patch needs such a flag for the __get_free_page() path, as gfn arrays do not need/want the allocator to zero the memory. Convert the kmem_cache paths to __GFP_ZERO now so as to avoid a weird and inconsistent API in the future. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:40 -04:00
Sean Christopherson	171a90d70f	KVM: x86/mmu: Separate the memory caches for shadow pages and gfn arrays Use separate caches for allocating shadow pages versus gfn arrays. This sets the stage for specifying __GFP_ZERO when allocating shadow pages without incurring extra cost for gfn arrays. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:40 -04:00
Sean Christopherson	531281ad98	KVM: x86/mmu: Clean up the gorilla math in mmu_topup_memory_caches() Clean up the minimums in mmu_topup_memory_caches() to document the driving mechanisms behind the minimums. Now that encountering an empty cache is unlikely to trigger BUG_ON(), it is less dangerous to be more precise when defining the minimums. For rmaps, the logic is 1 parent PTE per level, plus a single rmap, and prefetched rmaps. The extra objects in the current '8 + PREFETCH' minimum came about due to an abundance of paranoia in commit `c41ef344de` ("KVM: MMU: increase per-vcpu rmap cache alloc size"), i.e. it could have increased the minimum to 2 rmaps. Furthermore, the unexpected extra rmap case was killed off entirely by commits `f759e2b4c7` ("KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write") and `f5a1e9f895` ("KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr"). For the so called page cache, replace '8' with 2*PT64_ROOT_MAX_LEVEL. The 2x multiplier is needed because the cache is used for both shadow pages and gfn arrays for indirect MMUs. And finally, for page headers, replace '4' with PT64_ROOT_MAX_LEVEL. Note, KVM now supports 5-level paging, i.e. the old minimums that used a baseline derived from 4-level paging were technically wrong. But, KVM always allocates roots in a separate flow, e.g. it's impossible in the current implementation to actually need 5 new shadow pages in a single flow. Use PT64_ROOT_MAX_LEVEL unmodified instead of subtracting 1, as the direct usage is likely more intuitive to uninformed readers, and the inflated minimum is unlikely to affect functionality in practice. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:39 -04:00
Sean Christopherson	f3747a5a9e	KVM: x86/mmu: Topup memory caches after walking GVA->GPA Topup memory caches after walking the GVA->GPA translation during a shadow page fault, there is no need to ensure the caches are full when walking the GVA. As of commit `f5a1e9f895` ("KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr"), the FNAME(walk_addr) flow no longer add rmaps via kvm_mmu_pte_write(). This avoids allocating memory in the case that the GVA is unmapped in the guest, and also provides a paper trail of why/when the memory caches need to be filled. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:39 -04:00
Sean Christopherson	832914452a	KVM: x86/mmu: Move fast_page_fault() call above mmu_topup_memory_caches() Avoid refilling the memory caches and potentially slow reclaim/swap when handling a fast page fault, which does not need to allocate any new objects. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:39 -04:00
Sean Christopherson	53a3f48771	KVM: x86/mmu: Try to avoid crashing KVM if a MMU memory cache is empty Attempt to allocate a new object instead of crashing KVM (and likely the kernel) if a memory cache is unexpectedly empty. Use GFP_ATOMIC for the allocation as the caches are used while holding mmu_lock. The immediate BUG_ON() makes the code unnecessarily explosive and led to confusing minimums being used in the past, e.g. allocating 4 objects where 1 would suffice. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:38 -04:00
Sean Christopherson	284aa86868	KVM: x86/mmu: Remove superfluous gotos from mmu_topup_memory_caches() Return errors directly from mmu_topup_memory_caches() instead of branching to a label that does the same. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:38 -04:00
Sean Christopherson	356ec69adf	KVM: x86/mmu: Use consistent "mc" name for kvm_mmu_memory_cache locals Use "mc" for local variables to shorten line lengths and provide consistent names, which will be especially helpful when some of the helpers are moved to common KVM code in future patches. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:37 -04:00
Sean Christopherson	45177cccd9	KVM: x86/mmu: Consolidate "page" variant of memory cache helpers Drop the "page" variants of the topup/free memory cache helpers, using the existence of an associated kmem_cache to select the correct alloc or free routine. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:37 -04:00
Sean Christopherson	5962bfb748	KVM: x86/mmu: Track the associated kmem_cache in the MMU caches Track the kmem_cache used for non-page KVM MMU memory caches instead of passing in the associated kmem_cache when filling the cache. This will allow consolidating code and other cleanups. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 13:29:37 -04:00
Vitaly Kuznetsov	995decb6c4	KVM: x86: take as_id into account when checking PGD OVMF booted guest running on shadow pages crashes on TRIPLE FAULT after enabling paging from SMM. The crash is triggered from mmu_check_root() and is caused by kvm_is_visible_gfn() searching through memslots with as_id = 0 while vCPU may be in a different context (address space). Introduce kvm_vcpu_is_visible_gfn() and use it from mmu_check_root(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200708140023.1476020-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:37 -04:00
Sean Christopherson	e47c4aee5b	KVM: x86/mmu: Rename page_header() to to_shadow_page() Rename KVM's accessor for retrieving a 'struct kvm_mmu_page' from the associated host physical address to better convey what the function is doing. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:55 -04:00
Sean Christopherson	573546820b	KVM: x86/mmu: Add sptep_to_sp() helper to wrap shadow page lookup Introduce sptep_to_sp() to reduce the boilerplate code needed to get the shadow page associated with a spte pointer, and to improve readability as it's not immediately obvious that "page_header" is a KVM-specific accessor for retrieving a shadow page. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:55 -04:00
Sean Christopherson	985ab27801	KVM: x86/mmu: Make kvm_mmu_page definition and accessor internal-only Make 'struct kvm_mmu_page' MMU-only, nothing outside of the MMU should be poking into the gory details of shadow pages. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:54 -04:00
Sean Christopherson	6ca9a6f3ad	KVM: x86/mmu: Add MMU-internal header Add mmu/mmu_internal.h to hold declarations and definitions that need to be shared between various mmu/ files, but should not be used by anything outside of the MMU. Begin populating mmu_internal.h with declarations of the helpers used by page_track.c. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:54 -04:00
Sean Christopherson	afe8d7e611	KVM: x86/mmu: Move kvm_mmu_available_pages() into mmu.c Move kvm_mmu_available_pages() from mmu.h to mmu.c, it has a single caller and has no business being exposed via mmu.h. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:53 -04:00
Sean Christopherson	33e3042dac	KVM: x86/mmu: Move mmu_audit.c and mmutrace.h into the mmu/ sub-directory Move mmu_audit.c and mmutrace.h under mmu/ where they belong. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:53 -04:00
Sean Christopherson	7bd7ded642	KVM: x86/mmu: Exit to userspace on make_mmu_pages_available() error Propagate any error returned by make_mmu_pages_available() out to userspace instead of resuming the guest if the error occurs while handling a page fault. Now that zapping the oldest MMU pages skips active roots, i.e. fails if and only if there are no zappable pages, there is no chance for a false positive, i.e. no chance of returning a spurious error to userspace. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:52 -04:00
Sean Christopherson	ebdb292dac	KVM: x86/mmu: Batch zap MMU pages when shrinking the slab Use the recently introduced kvm_mmu_zap_oldest_mmu_pages() to batch zap MMU pages when shrinking a slab. This fixes a long standing issue where KVM's shrinker implementation is completely ineffective due to zapping only a single page. E.g. without batch zapping, forcing a scan via drop_caches basically has no impact on a VM with ~2k shadow pages. With batch zapping, the number of shadow pages can be reduced to a few hundred pages in one or two runs of drop_caches. Note, if the default batch size (currently 128) is problematic, e.g. zapping 128 pages holds mmu_lock for too long, KVM can bound the batch size by setting @batch in mmu_shrinker. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:52 -04:00
Sean Christopherson	6b82ef2c9c	KVM: x86/mmu: Batch zap MMU pages when recycling oldest pages Collect MMU pages for zapping in a loop when making MMU pages available, and skip over active roots when doing so as zapping an active root can never immediately free up a page. Batching the zapping avoids multiple remote TLB flushes and remedies the issue where the loop would bail early if an active root was encountered. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:51 -04:00
Sean Christopherson	f95eec9bed	KVM: x86/mmu: Don't put invalid SPs back on the list of active pages Delete a shadow page from the invalidation list instead of throwing it back on the list of active pages when it's a root shadow page with active users. Invalid active root pages will be explicitly freed by mmu_free_root_page() when the root_count hits zero, i.e. they don't need to be put on the active list to avoid leakage. Use sp->role.invalid to detect that a shadow page has already been zapped, i.e. is not on a list. WARN if an invalid page is encountered when zapping pages, as it should now be impossible. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:51 -04:00
Sean Christopherson	fb58a9c345	KVM: x86/mmu: Optimize MMU page cache lookup for fully direct MMUs Skip the unsync checks and the write flooding clearing for fully direct MMUs, which are guaranteed to not have unsync'd or indirect pages (write flooding detection only applies to indirect pages). For TDP, this avoids unnecessary memory reads and writes, and for the write flooding count will also avoid dirtying a cache line (unsync_child_bitmap itself consumes a cache line, i.e. write_flooding_count is guaranteed to be in a different cache line than parent_ptes). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623194027.23135-3-sean.j.christopherson@intel.com> Reviewed-By: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:50 -04:00
Sean Christopherson	ac101b7cb1	KVM: x86/mmu: Avoid multiple hash lookups in kvm_get_mmu_page() Refactor for_each_valid_sp() to take the list of shadow pages instead of retrieving it from a gfn to avoid doing the gfn->list hash and lookup multiple times during kvm_get_mmu_page(). Cc: Peter Feiner <pfeiner@google.com> Cc: Jon Cargille <jcargill@google.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623194027.23135-2-sean.j.christopherson@intel.com> Reviewed-By: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:50 -04:00
Sean Christopherson	02f5fb2e69	KVM: x86/mmu: Make .write_log_dirty a nested operation Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it from the non-nested dirty log hooks. And because it's a nested-only operation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:38 -04:00
Sean Christopherson	f25a9dec2d	KVM: x86/mmu: Drop kvm_arch_write_log_dirty() wrapper Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty() directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually used for x86 functions that are invoked from generic KVM, and implies that there are external callers, neither of which is true. Remove the check for a non-NULL kvm_x86_ops hook as the call is wrapped in PTTYPE_EPT and is unconditionally set by VMX. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:37 -04:00
Vitaly Kuznetsov	e8c22266e6	KVM: async_pf: change kvm_setup_async_pf()/kvm_arch_setup_async_pf() return type to bool Unlike normal 'int' functions returning '0' on success, kvm_setup_async_pf()/ kvm_arch_setup_async_pf() return '1' when a job to handle page fault asynchronously was scheduled and '0' otherwise. To avoid the confusion change return type to 'bool'. No functional change intended. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200615121334.91300-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:36 -04:00
Vitaly Kuznetsov	9ce372b33a	KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault() KVM guest code in Linux enables APF only when KVM_FEATURE_ASYNC_PF_INT is supported, this means we will never see KVM_PV_REASON_PAGE_READY when handling page fault vmexit in KVM. While on it, make sure we only follow genuine page fault path when APF reason is zero. If we happen to see something else this means that the underlying hypervisor is misbehaving. Leave WARN_ON_ONCE() to catch that. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:35 -04:00
Paolo Bonzini	5ecad245de	KVM: x86: bit 8 of non-leaf PDPEs is not reserved Bit 8 would be the "global" bit, which does not quite make sense for non-leaf page table entries. Intel ignores it; AMD ignores it in PDEs and PDPEs, but reserves it in PML4Es. Probably, earlier versions of the AMD manual documented it as reserved in PDPEs as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it. Cc: stable@vger.kernel.org Reported-by: Nadav Amit <namit@vmware.com> Fixes: `a0c0feb579` ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-30 07:07:20 -04:00
Sean Christopherson	2dbebf7ae1	KVM: nVMX: Plumb L2 GPA through to PML emulation Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: `c5f983f6e8` ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-22 18:23:03 -04:00
Vitaly Kuznetsov	312d16c7c0	KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic() translate_gpa() returns a GPA, assigning it to 'real_gfn' seems obviously wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and we don't use the value in 'real_gfn' as a GFN, we do real_gfn = gpa_to_gfn(real_gfn); instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust your eyes', but let's fix it for good. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200622151435.752560-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-22 13:38:30 -04:00
Michel Lespinasse	89154dd531	mmap locking API: convert mmap_sem call sites missed by coccinelle Convert the last few remaining mmap_sem rwsem calls to use the new mmap locking API. These were missed by coccinelle for some reason (I think coccinelle does not support some of the preprocessor constructs in these files ?) [akpm@linux-foundation.org: convert linux-next leftovers] [akpm@linux-foundation.org: more linux-next leftovers] [akpm@linux-foundation.org: more linux-next leftovers] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Linus Torvalds	039aeb9deb	ARM: - Move the arch-specific code into arch/arm64/kvm - Start the post-32bit cleanup - Cherry-pick a few non-invasive pre-NV patches x86: - Rework of TLB flushing - Rework of event injection, especially with respect to nested virtualization - Nested AMD event injection facelift, building on the rework of generic code and fixing a lot of corner cases - Nested AMD live migration support - Optimization for TSC deadline MSR writes and IPIs - Various cleanups - Asynchronous page fault cleanups (from tglx, common topic branch with tip tree) - Interrupt-based delivery of asynchronous "page ready" events (host side) - Hyper-V MSRs and hypercalls for guest debugging - VMX preemption timer fixes s390: - Cleanups Generic: - switch vCPU thread wakeup from swait to rcuwait The other architectures, and the guest side of the asynchronous page fault work, will come next week. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl7VJcYUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPf6QgAq4wU5wdd1lTGz/i3DIhNVJNJgJlp ozLzRdMaJbdbn5RpAK6PEBd9+pt3+UlojpFB3gpJh2Nazv2OzV4yLQgXXXyyMEx1 5Hg7b4UCJYDrbkCiegNRv7f/4FWDkQ9dx++RZITIbxeskBBCEI+I7GnmZhGWzuC4 7kj4ytuKAySF2OEJu0VQF6u0CvrNYfYbQIRKBXjtOwuRK4Q6L63FGMJpYo159MBQ asg3B1jB5TcuGZ9zrjL5LkuzaP4qZZHIRs+4kZsH9I6MODHGUxKonrkablfKxyKy CFK+iaHCuEXXty5K0VmWM3nrTfvpEjVjbMc7e1QGBQ5oXsDM0pqn84syRg== =v7Wn -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - Move the arch-specific code into arch/arm64/kvm - Start the post-32bit cleanup - Cherry-pick a few non-invasive pre-NV patches x86: - Rework of TLB flushing - Rework of event injection, especially with respect to nested virtualization - Nested AMD event injection facelift, building on the rework of generic code and fixing a lot of corner cases - Nested AMD live migration support - Optimization for TSC deadline MSR writes and IPIs - Various cleanups - Asynchronous page fault cleanups (from tglx, common topic branch with tip tree) - Interrupt-based delivery of asynchronous "page ready" events (host side) - Hyper-V MSRs and hypercalls for guest debugging - VMX preemption timer fixes s390: - Cleanups Generic: - switch vCPU thread wakeup from swait to rcuwait The other architectures, and the guest side of the asynchronous page fault work, will come next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits) KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test KVM: check userspace_addr for all memslots KVM: selftests: update hyperv_cpuid with SynDBG tests x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls x86/kvm/hyper-v: enable hypercalls regardless of hypercall page x86/kvm/hyper-v: Add support for synthetic debugger interface x86/hyper-v: Add synthetic debugger definitions KVM: selftests: VMX preemption timer migration test KVM: nVMX: Fix VMX preemption timer migration x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit KVM: x86/pmu: Support full width counting KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT KVM: x86: acknowledgment mechanism for async pf page ready notifications KVM: x86: interrupt based APF 'page ready' event delivery KVM: introduce kvm_read_guest_offset_cached() KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously" KVM: VMX: Replace zero-length array with flexible-array ...	2020-06-03 15:13:47 -07:00
Vitaly Kuznetsov	68fd66f100	KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:06 -04:00
Paolo Bonzini	929d1cfaa6	KVM: MMU: pass arbitrary CR0/CR4/EFER to kvm_init_shadow_mmu This allows fetching the registers from the hsave area when setting up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which runs long after the CR0, CR4 and EFER values in vcpu have been switched to hold L2 guest state). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:03 -04:00
彭浩(Richard)	88197e6ab3	kvm/x86: Remove redundant function implementations pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the same implementation. Signed-off-by: Peng Hao <richard.peng@oppo.com> Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:11:10 -04:00
Paolo Bonzini	7529e767c2	Merge branch 'kvm-master' into HEAD Merge AMD fixes before doing more development work.	2020-05-27 13:10:29 -04:00
Paolo Bonzini	e7581caca4	KVM: x86: simplify is_mmio_spte We can simply look at bits 52-53 to identify MMIO entries in KVM's page tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:08:29 -04:00
Sean Christopherson	6129ed877d	KVM: x86/mmu: Set mmio_value to '0' if reserved #PF can't be generated Set the mmio_value to '0' instead of simply clearing the present bit to squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains about the mmio_value overlapping the lower GFN mask on systems with 52 bits of PA space. Opportunistically clean up the code and comments. Cc: stable@vger.kernel.org Fixes: `d43e2675e9` ("KVM: x86: only do L1TF workaround on affected processors") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200527084909.23492-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:06:45 -04:00
Paolo Bonzini	9d5272f5e3	Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD	2020-05-20 03:40:09 -04:00
Thomas Gleixner	6bca69ada4	x86/kvm: Sanitize kvm_async_pf_task_wait() While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled around rcu_irq_enter()/exit() invocations are just cargo cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective. The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. the guest is not in a state where it has interrupts disabled. Aside of that the actual code is a convoluted one fits it all swiss army knife. It is invoked from different places with different RCU constraints: 1) Host side: vcpu_enter_guest() kvm_x86_ops->handle_exit() kvm_handle_page_fault() kvm_async_pf_task_wait() The invocation happens from fully preemptible context. 2) Guest side: The async page fault interrupted: a) user space b) preemptible kernel code which is not in a RCU read side critical section c) non-preemtible kernel code or a RCU read side critical section or kernel code with CONFIG_PREEMPTION=n which allows not to differentiate between #2b and #2c. RCU is watching for: #1 The vCPU exited and current is definitely not the idle task #2a The #PF entry code on the guest went through enter_from_user_mode() which reactivates RCU #2b There is no preemptible, interrupts enabled code in the kernel which can run with RCU looking away. (The idle task is always non preemptible). I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point to worry about it so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but this will be done as an extra step in course of the entry code consolidation work. So the proper solution for this is to: - Split kvm_async_pf_task_wait() into schedule and halt based waiting interfaces which share the enqueueing code. - Add comments (condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of uncomprehensible changelogs and code history. - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode and schedulable kernel side async page faults (#1, #2a, #2b) - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt as it is just a pointless exercise: - vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host and this is not any different except that it voluntary triggers the exit via halt. - The interrupted context could have RCU watching already. So the rcu_irq_exit() before the halt is not gaining anything aside of confusing the reader. Claiming that this might prevent RCU stalls is just an illusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de	2020-05-19 15:53:58 +02:00
Paolo Bonzini	d43e2675e9	KVM: x86: only do L1TF workaround on affected processors KVM stores the gfn in MMIO SPTEs as a caching optimization. These are split in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits in an L1TF attack. This works as long as there are 5 free bits between MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO access triggers a reserved-bit-set page fault. The bit positions however were computed wrongly for AMD processors that have encryption support. In this case, x86_phys_bits is reduced (for example from 48 to 43, to account for the C bit at position 47 and four bits used internally to store the SEV ASID and other stuff) while x86_cache_bits in would remain set to 48, and _all_ bits between the reduced MAXPHYADDR and bit 51 are set. Then low_phys_bits would also cover some of the bits that are set in the shadow_mmio_value, terribly confusing the gfn caching mechanism. To fix this, avoid splitting gfns as long as the processor does not have the L1TF bug (which includes all AMD processors). When there is no splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing the overlap. This fixes "npt=0" operation on EPYC processors. Thanks to Maxim Levitsky for bisecting this bug. Cc: stable@vger.kernel.org Fixes: `52918ed5fc` ("KVM: SVM: Override default MMIO mask if memory encryption is enabled") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-19 05:47:06 -04:00
Sean Christopherson	8123f26524	KVM: x86/mmu: Add a helper to consolidate root sp allocation Add a helper, mmu_alloc_root(), to consolidate the allocation of a root shadow page, which has the same basic mechanics for all flavors of TDP and shadow paging. Note, __pa(sp->spt) doesn't need to be protected by mmu_lock, sp->spt points at a kernel page. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428023714.31923-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:12 -04:00
Sean Christopherson	3bae0459bc	KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enums Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:11 -04:00
Sean Christopherson	e662ec3e07	KVM: x86/mmu: Move max hugepage level to a separate #define Rename PT_MAX_HUGEPAGE_LEVEL to KVM_MAX_HUGEPAGE_LEVEL and make it a separate define in anticipation of dropping KVM's PT__LEVEL enums in favor of the kernel's PG_LEVEL_ enums. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:11 -04:00
Sean Christopherson	b2f432f872	KVM: x86/mmu: Tweak PSE hugepage handling to avoid 2M vs 4M conundrum Change the PSE hugepage handling in walk_addr_generic() to fire on any page level greater than PT_PAGE_TABLE_LEVEL, a.k.a. PG_LEVEL_4K. PSE paging only has two levels, so "== 2" and "> 1" are functionally the same, i.e. this is a nop. A future patch will drop KVM's PT__LEVEL enums in favor of the kernel's PG_LEVEL_ enums, at which point "walker->level == PG_LEVEL_2M" is semantically incorrect (though still functionally ok). No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:10 -04:00
Sean Christopherson	e93fd3b3e8	KVM: x86/mmu: Capture TDP level when updating CPUID Snapshot the TDP level now that it's invariant (SVM) or dependent only on host capabilities and guest CPUID (VMX). This avoids having to call kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or calculating the page role, and thus avoids the associated retpoline. Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is active is legal, if dodgy. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:14 -04:00
Paolo Bonzini	c36b71503a	KVM: x86/mmu: Avoid an extra memslot lookup in try_async_pf() for L2 Create a new function kvm_is_visible_memslot() and use it from kvm_is_visible_gfn(); use the new function in try_async_pf() too, to avoid an extra memslot lookup. Opportunistically squish a multi-line comment into a single-line comment. Note, the end result, KVM_PFN_NOSLOT, is unchanged. Cc: Jim Mattson <jmattson@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:08 -04:00
Sean Christopherson	c583eed6d7	KVM: x86/mmu: Set @writable to false for non-visible accesses by L2 Explicitly set @writable to false in try_async_pf() if the GFN->PFN translation is short-circuited due to the requested GFN not being visible to L2. Leaving @writable ('map_writable' in the callers) uninitialized is ok in that it's never actually consumed, but one has to track it all the way through set_spte() being short-circuited by set_mmio_spte() to understand that the uninitialized variable is benign, and relying on @writable being ignored is an unnecessary risk. Explicitly setting @writable also aligns try_async_pf() with __gfn_to_pfn_memslot(). Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415214414.10194-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:08 -04:00
Sean Christopherson	be01e8e2c6	KVM: x86: Replace "cr3" with "pgd" in "new cr3/pgd" related code Rename functions and variables in kvm_mmu_new_cr3() and related code to replace "cr3" with "pgd", i.e. continue the work started by commit `727a7e27cf` ("KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd"). kvm_mmu_new_cr3() and company are not always loading a new CR3, e.g. when nested EPT is enabled "cr3" is actually an EPTP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-37-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:59 -04:00
Sean Christopherson	9805c5f74b	KVM: nVMX: Don't flush TLB on nested VMX transition Unconditionally skip the TLB flush triggered when reusing a root for a nested transition as nested_vmx_transition_tlb_flush() ensures the TLB is flushed when needed, regardless of whether the MMU can reuse a cached root (or the last root). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-35-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:58 -04:00
Sean Christopherson	41fab65e7c	KVM: nVMX: Skip MMU sync on nested VMX transition when possible Skip the MMU sync when reusing a cached root if EPT is enabled or L1 enabled VPID for L2. If EPT is enabled, guest-physical mappings aren't flushed even if VPID is disabled, i.e. L1 can't expect stale TLB entries to be flushed if it has enabled EPT and L0 isn't shadowing PTEs (for L1 or L2) if L1 has EPT disabled. If VPID is enabled (and EPT is disabled), then L1 can't expect stale TLB entries to be flushed (for itself or L2). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-34-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:57 -04:00
Sean Christopherson	71fe70130d	KVM: x86/mmu: Add module param to force TLB flush on root reuse Add a module param, flush_on_reuse, to override skip_tlb_flush and skip_mmu_sync when performing a so called "fast cr3 switch", i.e. when reusing a cached root. The primary motiviation for the control is to provide a fallback mechanism in the event that TLB flushing and/or MMU sync bugs are exposed/introduced by upcoming changes to stop unconditionally flushing on nested VMX transitions. Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-33-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:57 -04:00
Sean Christopherson	4a632ac6ca	KVM: x86/mmu: Add separate override for MMU sync during fast CR3 switch Add a separate "skip" override for MMU sync, a future change to avoid TLB flushes on nested VMX transitions may need to sync the MMU even if the TLB flush is unnecessary. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-32-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:56 -04:00
Sean Christopherson	b869855bad	KVM: x86/mmu: Move fast_cr3_switch() side effects to __kvm_mmu_new_cr3() Handle the side effects of a fast CR3 (PGD) switch up a level in __kvm_mmu_new_cr3(), which is the only caller of fast_cr3_switch(). This consolidates handling all side effects in __kvm_mmu_new_cr3() (where freeing the current root when KVM can't do a fast switch is already handled), and ameliorates the pain of adding a second boolean in a future patch to provide a separate "skip" override for the MMU sync. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-31-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:56 -04:00
Sean Christopherson	8c8560b833	KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes Flush only the current ASID/context when requesting a TLB flush due to a change in the current vCPU's MMU to avoid blasting away TLB entries associated with other ASIDs/contexts, e.g. entries cached for L1 when a change in L2's MMU requires a flush. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-26-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:54 -04:00
Sean Christopherson	7780938cc7	KVM: x86: Rename ->tlb_flush() to ->tlb_flush_all() Rename ->tlb_flush() to ->tlb_flush_all() in preparation for adding a new hook to flush only the current ASID/context. Opportunstically replace the comment in vmx_flush_tlb() that explains why it flushes all EPTP/VPID contexts with a comment explaining why it unconditionally uses INVEPT when EPT is enabled. I.e. rely on the "all" part of the name to clarify why it does global INVEPT/INVVPID. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-23-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:52 -04:00
Sean Christopherson	f55ac304ca	KVM: x86: Drop @invalidate_gpa param from kvm_x86_ops' tlb_flush() Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now that all callers pass %true for said param, or ignore the param (SVM has an internal call to svm_flush_tlb() in svm_flush_tlb_guest that somewhat arbitrarily passes %false). Remove __vmx_flush_tlb() as it is no longer used. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-17-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:49 -04:00
Mauro Carvalho Chehab	3ecad8c2c1	docs: fix broken references for ReST files that moved around Some broken references happened due to shifting files around and ReST renames. Those can't be auto-fixed by the script, so let's fix them manually. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Acked-by: Corentin Labbe <clabbe.montjoie@gmail.com> Link: https://lore.kernel.org/r/64773a12b4410aaf3e3be89e3ec7e34de2484eea.1586881715.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2020-04-20 15:45:03 -06:00
Paolo Bonzini	0cd665bd20	KVM: x86: cleanup kvm_inject_emulated_page_fault To reconstruct the kvm_mmu to be used for page fault injection, we can simply use fault->nested_page_fault. This matches how fault->nested_page_fault is assigned in the first place by FNAME(walk_addr_generic). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:05 -04:00
Paolo Bonzini	5efac0741c	KVM: x86: introduce kvm_mmu_invalidate_gva Wrap the combination of mmu->invlpg and kvm_x86_ops->tlb_flush_gva into a new function. This function also lets us specify the host PGD to invalidate and also the MMU, both of which will be useful in fixing and simplifying kvm_inject_emulated_page_fault. A nested guest's MMU however has g_context->invlpg == NULL. Instead of setting it to nonpaging_invlpg, make kvm_mmu_invalidate_gva the only entry point to mmu->invlpg and make a NULL invlpg pointer equivalent to nonpaging_invlpg, saving a retpoline. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:25:55 -04:00
Linus Torvalds	8c1b724ddb	ARM: * GICv4.1 support * 32bit host removal PPC: * secure (encrypted) using under the Protected Execution Framework ultravisor s390: * allow disabling GISA (hardware interrupt injection) and protected VMs/ultravisor support. x86: * New dirty bitmap flag that sets all bits in the bitmap when dirty page logging is enabled; this is faster because it doesn't require bulk modification of the page tables. * Initial work on making nested SVM event injection more similar to VMX, and less buggy. * Various cleanups to MMU code (though the big ones and related optimizations were delayed to 5.8). Instead of using cr3 in function names which occasionally means eptp, KVM too has standardized on "pgd". * A large refactoring of CPUID features, which now use an array that parallels the core x86_features. * Some removal of pointer chasing from kvm_x86_ops, which will also be switched to static calls as soon as they are available. * New Tigerlake CPUID features. * More bugfixes, optimizations and cleanups. Generic: * selftests: cleanups, new MMU notifier stress test, steal-time test * CSV output for kvm_stat. KVM/MIPS has been broken since 5.5, it does not compile due to a patch committed by MIPS maintainers. I had already prepared a fix, but the MIPS maintainers prefer to fix it in generic code rather than KVM so they are taking care of it. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl6GOnIUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroMfxwf/ZKLZiRoaovXCOG71M/eHtQb8ZIqU 3MPy+On3eC5Sk/aBxWUL9EFZsbYG6kYdbZ1VOvG9XPBoLlnkDSm/IR0kaELHtnjj oGVda/tvGn46Ne39y8xBptmb91WDcWH0vFthT/CwlMxAw3xjr+gG7Qyo+8F2CW6m SSSuLiHSBnyO1cQKruBTHZ8qnR8LlnfXEqtd6Y4LFLic0LbLIoIdRcT3wjQrcZrm Djd7wbTEYZjUfoqZ72ekwEDUsONcDLDSKcguDO9pSMSCGhpxCVT5Vy68KRpoIMs2 nzNWDKjvqQo5zb2+GWxJgkd12Hv+n7PCXZMbVrWBu1pQsewUns9m4mkpGw== =6fGt -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - GICv4.1 support - 32bit host removal PPC: - secure (encrypted) using under the Protected Execution Framework ultravisor s390: - allow disabling GISA (hardware interrupt injection) and protected VMs/ultravisor support. x86: - New dirty bitmap flag that sets all bits in the bitmap when dirty page logging is enabled; this is faster because it doesn't require bulk modification of the page tables. - Initial work on making nested SVM event injection more similar to VMX, and less buggy. - Various cleanups to MMU code (though the big ones and related optimizations were delayed to 5.8). Instead of using cr3 in function names which occasionally means eptp, KVM too has standardized on "pgd". - A large refactoring of CPUID features, which now use an array that parallels the core x86_features. - Some removal of pointer chasing from kvm_x86_ops, which will also be switched to static calls as soon as they are available. - New Tigerlake CPUID features. - More bugfixes, optimizations and cleanups. Generic: - selftests: cleanups, new MMU notifier stress test, steal-time test - CSV output for kvm_stat" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits) x86/kvm: fix a missing-prototypes "vmread_error" KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y KVM: VMX: Add a trampoline to fix VMREAD error handling KVM: SVM: Annotate svm_x86_ops as __initdata KVM: VMX: Annotate vmx_x86_ops as __initdata KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup() KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes KVM: VMX: Configure runtime hooks using vmx_x86_ops KVM: VMX: Move hardware_setup() definition below vmx_x86_ops KVM: x86: Move init-only kvm_x86_ops to separate struct KVM: Pass kvm_init()'s opaque param to additional arch funcs s390/gmap: return proper error code on ksm unsharing KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move() KVM: Fix out of range accesses to memslots KVM: X86: Micro-optimize IPI fastpath delay KVM: X86: Delay read msr data iff writes ICR MSR KVM: PPC: Book3S HV: Add a capability for enabling secure guests KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs ...	2020-04-02 15:13:15 -07:00
Linus Torvalds	fdf5563a72	Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "This topic tree contains more commits than usual: - most of it are uaccess cleanups/reorganization by Al - there's a bunch of prototype declaration (--Wmissing-prototypes) cleanups - misc other cleanups all around the map" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/mm/set_memory: Fix -Wmissing-prototypes warnings x86/efi: Add a prototype for efi_arch_mem_reserve() x86/mm: Mark setup_emu2phys_nid() static x86/jump_label: Move 'inline' keyword placement x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt() kill uaccess_try() x86: unsafe_put-style macro for sigmask x86: x32_setup_rt_frame(): consolidate uaccess areas x86: __setup_rt_frame(): consolidate uaccess areas x86: __setup_frame(): consolidate uaccess areas x86: setup_sigcontext(): list user_access_{begin,end}() into callers x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit) x86: ia32_setup_rt_frame(): consolidate uaccess areas x86: ia32_setup_frame(): consolidate uaccess areas x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers x86/alternatives: Mark text_poke_loc_init() static x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl() x86/mm: Drop pud_mknotpresent() x86: Replace setup_irq() by request_irq() x86/configs: Slightly reduce defconfigs ...	2020-03-31 11:04:05 -07:00
Sean Christopherson	afaf0b2f9b	KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection Replace the kvm_x86_ops pointer in common x86 with an instance of the struct to save one pointer dereference when invoking functions. Copy the struct by value to set the ops during kvm_init(). Arbitrarily use kvm_x86_ops.hardware_enable to track whether or not the ops have been initialized, i.e. a vendor KVM module has been loaded. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200321202603.19355-7-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-31 10:48:08 -04:00
Paolo Bonzini	727a7e27cf	KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd The set_cr3 callback is not setting the guest CR3, it is setting the root of the guest page tables, either shadow or two-dimensional. To make this clearer as well as to indicate that the MMU calls it via kvm_mmu_load_cr3, rename it to load_mmu_pgd. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:58:52 +01:00
Paolo Bonzini	689f3bf216	KVM: x86: unify callbacks to load paging root Similar to what kvm-intel.ko is doing, provide a single callback that merges svm_set_cr3, set_tdp_cr3 and nested_svm_set_tdp_cr3. This lets us unify the set_cr3 and set_tdp_cr3 entries in kvm_x86_ops. I'm doing that in this same patch because splitting it adds quite a bit of churn due to the need for forward declarations. For the same reason the assignment to vcpu->arch.mmu->set_cr3 is moved to kvm_init_shadow_mmu from init_kvm_softmmu and nested_svm_init_mmu_context. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:58:51 +01:00
Sean Christopherson	23493d0a17	KVM x86: Extend AMD specific guest behavior to Hygon virtual CPUs Extend guest_cpuid_is_amd() to cover Hygon virtual CPUs and rename it accordingly. Hygon CPUs use an AMD-based core and so have the same basic behavior as AMD CPUs. Fixes: `b8f4abb652` ("x86/kvm: Add Hygon Dhyana support to KVM") Cc: Pu Wen <puwen@hygon.cn> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:58:48 +01:00
Sean Christopherson	703c335d06	KVM: x86/mmu: Configure max page level during hardware setup Configure the max page level during hardware setup to avoid a retpoline in the page fault handler. Drop ->get_lpage_level() as the page fault handler was the last user. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:58:40 +01:00
Sean Christopherson	bde7723559	KVM: x86/mmu: Merge kvm_{enable,disable}_tdp() into a common function Combine kvm_enable_tdp() and kvm_disable_tdp() into a single function, kvm_configure_mmu(), in preparation for doing additional configuration during hardware setup. And because having separate helpers is silly. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:58:39 +01:00
Sean Christopherson	2f728d66e8	KVM: x86: Move kvm_emulate.h into KVM's private directory Now that the emulation context is dynamically allocated and not embedded in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public asm directory and into KVM's private x86 directory. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:57:52 +01:00
Sean Christopherson	d8dd54e063	KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd() Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to note that it will return something other than CR3 when nested EPT is in use. Hopefully the new name will also make it more obvious that L1's nested_cr3 is returned in SVM's nested NPT case. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:57:46 +01:00
Sean Christopherson	bb1fcc70d9	KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT Add support for 5-level nested EPT, and advertise said support in the EPT capabilities MSR. KVM's MMU can already handle 5-level legacy page tables, there's no reason to force an L1 VMM to use shadow paging if it wants to employ 5-level page tables. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:57:44 +01:00

1 2 3 4 5

206 Commits