linux

Author	SHA1	Message	Date
Jakub Kicinski	d480a26bdf	crypto: x86/aesni - don't require alignment of data x86 AES-NI routines can deal with unaligned data. Crypto context (key, iv etc.) have to be aligned but we take care of that separately by copying it onto the stack. We were feeding unaligned data into crypto routines up until commit `83c83e6588` ("crypto: aesni - refactor scatterlist processing") switched to use the full skcipher API which uses cra_alignmask to decide data alignment. This fixes 21% performance regression in kTLS. Tested by booting with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (and running thru various kTLS packets). CC: stable@vger.kernel.org # 5.15+ Fixes: `83c83e6588` ("crypto: aesni - refactor scatterlist processing") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-12-31 18:10:55 +11:00
Jakub Kicinski	aec53e60e0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/mellanox/mlx5/core/en_tc.c commit `077cdda764` ("net/mlx5e: TC, Fix memory leak with rules with internal port") commit `31108d142f` ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") commit `4390c6edc0` ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/ net/smc/smc_wr.c commit `49dc9013e3` ("net/smc: Use the bitmap API when applicable") commit `349d43127d` ("net/smc: fix kernel panic caused by race of smc_sock") bitmap_zero()/memset() is removed by the fix Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-30 12:12:12 -08:00
Huang Rui	89aa94b4a2	x86/msr: Add AMD CPPC MSR definitions AMD CPPC (Collaborative Processor Performance Control) function uses MSR registers to manage the performance hints. So add the MSR register macro here. Signed-off-by: Huang Rui <ray.huang@amd.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-12-30 18:51:17 +01:00
Huang Rui	d341db8f48	x86/cpufeatures: Add AMD Collaborative Processor Performance Control feature flag Add Collaborative Processor Performance Control feature flag for AMD processors. This feature flag will be used on the following AMD P-State driver. The AMD P-State driver has two approaches to implement the frequency control behavior. That depends on the CPU hardware implementation. One is "Full MSR Support" and another is "Shared Memory Support". The feature flag indicates the current processors with "Full MSR Support". Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-12-30 16:58:03 +01:00
Masahiro Yamada	9102fa3460	x86/purgatory: Remove -nostdlib compiler flag The -nostdlib option requests the compiler to not use the standard system startup files or libraries when linking. It is effective only when $(CC) is used as a linker driver. $(LD) is directly used for linking purgatory.{ro,chk} here, hence -nostdlib is unneeded. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20211107162641.324688-2-masahiroy@kernel.org	2021-12-30 14:13:06 +01:00
Masahiro Yamada	a41f5b78ac	x86/vdso: Remove -nostdlib compiler flag The -nostdlib option requests the compiler to not use the standard system startup files or libraries when linking. It is effective only when $(CC) is used as a linker driver. Since `379d98ddf4` ("x86: vdso: Use $LD instead of $CC to link") $(LD) is directly used, hence -nostdlib is unneeded. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20211107162641.324688-1-masahiroy@kernel.org	2021-12-30 14:08:20 +01:00
Ingo Molnar	ff936357b4	x86/defconfig: Enable CONFIG_LOCALVERSION_AUTO=y in the defconfig With CONFIG_LOCALVERSION_AUTO=y enabled, 'uname' provides much more useful output on debug kernels: Before: # CONFIG_LOCALVERSION_AUTO is not set $ uname -a Linux localhost 5.16.0-rc7+ #4563 SMP PREEMPT Thu Dec 30 08:28:38 CET 2021 x86_64 GNU/Linux After: # CONFIG_LOCALVERSION_AUTO=y $ uname -a Linux localhost 5.16.0-rc7-02294-g5537f9709b16 #4562 SMP PREEMPT Thu Dec 30 08:27:17 CET 2021 x86_64 GNU/Linux This is particularly valuable during bisection, if we want to double check the exact kernel version we are testing. (Just remove the config line, the global Kconfig default for this is default-y.) Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-12-30 08:37:54 +01:00
Lukas Bulwahn	d6f12f8398	x86/build: Use the proper name CONFIG_FW_LOADER Commit in Fixes intends to add the expression regex only when FW_LOADER is enabled - not FW_LOADER_BUILTIN. Latter is a leftover from a previous patchset and not a valid config item. So, adjust the condition to the actual name of the config. [ bp: Cleanup commit message. ] Fixes: `c8dcf655ec` ("x86/build: Tuck away built-in firmware under FW_LOADER") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20211229111553.5846-1-lukas.bulwahn@gmail.com	2021-12-29 22:20:38 +01:00
Tony Luck	244122b4d2	x86/lib: Add fast-short-rep-movs check to copy_user_enhanced_fast_string() Commit `f444a5ff95` ("x86/cpufeatures: Add support for fast short REP; MOVSB") fixed memmove() with an ALTERNATIVE that will use REP MOVSB for all string lengths. copy_user_enhanced_fast_string() has a similar run time check to avoid using REP MOVSB for copies less that 64 bytes. Add an ALTERNATIVE to patch out the short length check and always use REP MOVSB on X86_FEATURE_FSRM CPUs. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211216172431.1396371-1-tony.luck@intel.com	2021-12-29 13:46:02 +01:00
Colin Ian King	0be4838f01	x86/events/amd/iommu: Remove redundant assignment to variable shift Variable shift is being initialized with a value that is never read, it is being re-assigned later inside a loop. The assignment is redundant and can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211207185001.1412413-1-colin.i.king@gmail.com	2021-12-28 21:30:05 +01:00
Michael Kelley	e1878402ab	x86/hyperv: Fix definition of hv_ghcb_pg variable The percpu variable hv_ghcb_pg is incorrectly defined. The __percpu qualifier should be associated with the union hv_ghcb * (i.e., a pointer), not with the target of the pointer. This distinction makes no difference to gcc and the generated code, but sparse correctly complains. Fix the definition in the interest of general correctness in addition to making sparse happy. No functional change. Fixes: `0cc4f6d9f0` ("x86/hyperv: Initialize GHCB page in Isolation VM") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/1640662315-22260-2-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>	2021-12-28 14:18:47 +00:00
Zhang Zixun	de768416b2	x86/mce/inject: Avoid out-of-bounds write when setting flags A contrived zero-length write, for example, by using write(2): ... ret = write(fd, str, 0); ... to the "flags" file causes: BUG: KASAN: stack-out-of-bounds in flags_write Write of size 1 at addr ffff888019be7ddf by task writefile/3787 CPU: 4 PID: 3787 Comm: writefile Not tainted 5.16.0-rc7+ #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 due to accessing buf one char before its start. Prevent such out-of-bounds access. [ bp: Productize into a proper patch. Link below is the next best thing because the original mail didn't get archived on lore. ] Fixes: `0451d14d05` ("EDAC, mce_amd_inj: Modify flags attribute to use string arguments") Signed-off-by: Zhang Zixun <zhang133010@icloud.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/linux-edac/YcnePfF1OOqoQwrX@zn.tnic/	2021-12-28 11:45:36 +01:00
Linus Torvalds	a8ad9a2434	Another EFI fix for v5.16 - Prevent missing prototype warning from breaking the build under CONFIG_WERROR=y -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmHJrdYACgkQw08iOZLZ jyRS/wwAvrrv6xidZNqm/BAr4bhY9saQ4BZTracZuHJwD7300H9mjlMDMnVbLhOU oVv9hEHDqBqBKgZxIYc0ok4SMZEpieFhLuTdXzVf5dMg+BL+qHR0EQp8kLp8mssL xbPnMidecL2TfjPqLn34MVPe46uQo2yioxET4hDvWA5xsWNKKd0AeKVzUzdV2AVE elnHsdsBWpCexaIJkQEFJWf/Y3xUlzcqp7zcqgJq4hq3fZlEbD8SJ7TUM7ReAQJ6 5afzYXBm086sn602hZVbAtOy94vLBBQswnbKntmFpcT+z/GLNjycx+A4Z2h+6jjx ImiIkNZA2N5ij12riXdx3n6OG0LMuILlRhhqyckgrgHyc1oYCwaUq+LOuVQENZF+ YNGfyntQ+Y0F9dfG28atnFiL5ZFZcCSJxVZ3Dl2bql2DHptbQNHOBdXKOcDWTkpc JKTs3j3rigZ1I01b1SgWzH418do8s8y4iv4nmp3EdMUO8v8rEFMx3yHIHKSRjMqj 1vAPZtWQ =coEy -----END PGP SIGNATURE----- Merge tag 'efi-urgent-for-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: "Another EFI fix for v5.16: - Prevent missing prototype warning from breaking the build under CONFIG_WERROR=y" * tag 'efi-urgent-for-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: Move efifb_setup_from_dmi() prototype from arch headers	2021-12-27 08:58:35 -08:00
Yazen Ghannam	4fb0abfee4	x86/amd_nb: Add AMD Family 19h Models (10h-1Fh) and (A0h-AFh) PCI IDs Add the new PCI Device IDs to support new generation of AMD 19h family of processors. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Acked-by: Krzysztof Wilczyński <kw@linux.com> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h Link: https://lore.kernel.org/r/163640828133.955062.18349019796157170473.stgit@bmoger-ubuntu Signed-off-by: Guenter Roeck <linux@roeck-us.net>	2021-12-26 15:02:04 -08:00
Linus Torvalds	e8ffcd3ab0	- Prevent potential undefined behavior due to shifting pkey constants into the sign bit - Move the EFI memory reservation code after the efi= cmdline parsing has happened - Revert two commits which turned out to be the wrong direction to chase when accommodating early memblock reservations consolidation and command line parameters parsing -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHIYbAACgkQEsHwGGHe VUphMhAAptI7YGJFqfH6bUuVh7HX38FOx5pCODpZUXJWIsvgC0MRnxqUF0+Srzbt ZBMhh7WUiOpIe/ZeUpRouX5Al4nhiDdG4QeddJLc1DaSaAgOJWD9NH0V0K1aWf4k poAUoX7mp5M1mfc/boOQGBdtX+l+3vCYU0RzyQ+IhfDNM6GzJpAC1txWFaL/lWE+ 3IVb4OF37mW7Hlk+HdizcTgm1qaHRobwP7fxbqRwdHZy+YRjz0eOd2dR625jff6b y3Z7WVkwx+YtNVD7YnXvmJKyh43rbbMmx0CU6WiYlhGkz5dooe9Er2LuouS1kzRr 4XWvZrr28qLc9eCdv6MaZGv62KyP0QBm4Fg7E/nC/0Jo8NA3GQXulF2OHofSCX/h BZa3mJa+heoXpUb7zDTUMa8UCdOYGJpvHRGYcU3JiaIwv+dnTRBdAzyVcqCMWt5P wPOl4A08vVZ+w3aNrZh3Zfg03pG4/9miveIa7EsuCZF/oN+5KAVQdcU74zneH4Qx nUfF7b80PyiSnCblQ3DL+efKVG4/Bud6crxo1tOtw6cDQxiAYDzupuZBlHV9791O 0TwPuLJSAqv4aMiobi5/j32pSLaohYJij9stbfB8zeD49rq5uvT+jDoxcKiZurJP jamo6LHffF+o8+F2YwFOSIPayZOViXxWvwPd+E6uL7EodvZgiS0= =j12C -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v5.16_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Prevent potential undefined behavior due to shifting pkey constants into the sign bit - Move the EFI memory reservation code after the efi= cmdline parsing has happened - Revert two commits which turned out to be the wrong direction to chase when accommodating early memblock reservations consolidation and command line parameters parsing * tag 'x86_urgent_for_v5.16_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pkey: Fix undefined behaviour with PKRU_WD_BIT x86/boot: Move EFI range reservation after cmdline parsing Revert "x86/boot: Pull up cmdline preparation and early param parsing" Revert "x86/boot: Mark prepare_command_line() __init"	2021-12-26 10:28:55 -08:00
Jason A. Donenfeld	acd93f8a4c	crypto: x86/curve25519 - use in/out register constraints more precisely Rather than passing all variables as modified, pass ones that are only read into that parameter. This helps with old gcc versions when alternatives are additionally used, and lets gcc's codegen be a little bit more efficient. This also syncs up with the latest Vale/EverCrypt output. Reported-by: Mathias Krause <minipli@grsecurity.net> Cc: Aymeric Fromherz <aymeric.fromherz@inria.fr> Link: https://lore.kernel.org/wireguard/1554725710.1290070.1639240504281.JavaMail.zimbra@inria.fr/ Link: https://github.com/project-everest/hacl-star/pull/501 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-12-24 14:18:22 +11:00
Nathan Chancellor	5fe392ff9d	x86/boot/compressed: Move CLANG_FLAGS to beginning of KBUILD_CFLAGS When cross compiling i386_defconfig on an arm64 host with clang, there are a few instances of '-Waddress-of-packed-member' and '-Wgnu-variable-sized-type-not-at-end' in arch/x86/boot/compressed/, which should both be disabled with the cc-disable-warning calls in that directory's Makefile, which indicates that cc-disable-warning is failing at the point of testing these flags. The cc-disable-warning calls fail because at the point that the flags are tested, KBUILD_CFLAGS has '-march=i386' without $(CLANG_FLAGS), which has the '--target=' flag to tell clang what architecture it is targeting. Without the '--target=' flag, the host architecture (arm64) is used and i386 is not a valid value for '-march=' in that case. This error can be seen by adding some logging to try-run: clang-14: error: the clang compiler does not support '-march=i386' Invoking the compiler has to succeed prior to calling cc-option or cc-disable-warning in order to accurately test whether or not the flag is supported; if it doesn't, the requested flag can never be added to the compiler flags. Move $(CLANG_FLAGS) to the beginning of KBUILD_FLAGS so that any new flags that might be added in the future can be accurately tested. Fixes: `d5cbd80e30` ("x86/boot: Add $(CLANG_FLAGS) to compressed KBUILD_CFLAGS") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211222163040.1961481-1-nathan@kernel.org	2021-12-23 11:03:28 +01:00
Christoph Hellwig	4d5cff69fb	x86/mtrr: Remove the mtrr_bp_init() stub Add an IS_ENABLED() check in setup_arch() and call pat_disable() directly if MTRRs are not supported. This allows to remove the <asm/memtype.h> include in <asm/mtrr.h>, which pull in lowlevel x86 headers that should not be included for UML builds and will cause build warnings with a following patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211215165612.554426-2-hch@lst.de	2021-12-22 19:50:26 +01:00
Christoph Hellwig	8bb227ac34	um: remove set_fs Remove address space overrides using set_fs() for User Mode Linux. Note that just like the existing kernel access case of the uaccess routines the new nofault kernel handlers do not actually have any exception handling. This is probably broken, but not change to the status quo. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-22 17:56:56 +01:00
Yazen Ghannam	91f75eb481	x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration AMD systems currently lay out MCA bank types such that the type of bank number "i" is either the same across all CPUs or is Reserved/Read-as-Zero. For example: Bank # \| CPUx \| CPUy 0 LS LS 1 RAZ UMC 2 CS CS 3 SMU RAZ Future AMD systems will lay out MCA bank types such that the type of bank number "i" may be different across CPUs. For example: Bank # \| CPUx \| CPUy 0 LS LS 1 RAZ UMC 2 CS NBIO 3 SMU RAZ Change the structures that cache MCA bank types to be per-CPU and update smca_get_bank_type() to handle this change. Move some SMCA-specific structures to amd.c from mce.h, since they no longer need to be global. Break out the "count" for bank types from struct smca_hwid, since this should provide a per-CPU count rather than a system-wide count. Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The values in this array should not change at runtime. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com	2021-12-22 17:22:09 +01:00
Yazen Ghannam	5176a93ab2	x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types Add HWID and McaType values for new SMCA bank types, and add their error descriptions to edac_mce_amd. The "PHY" bank types all have the same error descriptions, and the NBIF and SHUB bank types have the same error descriptions. So reuse the same arrays where appropriate. [ bp: Remove useless comments over hwid types. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211216162905.4132657-2-yazen.ghannam@amd.com	2021-12-22 17:19:18 +01:00
Borislav Petkov	b64dfcde1c	x86/mm: Prevent early boot triple-faults with instrumentation Commit in Fixes added a global TLB flush on the early boot path, after the kernel switches off of the trampoline page table. Compiler profiling options enabled with GCOV_PROFILE add additional measurement code on clang which needs to be initialized prior to use. The global flush in x86_64_start_kernel() happens before those initializations can happen, leading to accessing invalid memory. GCOV_PROFILE builds with gcc are still ok so this is clang-specific. The second issue this fixes is with KASAN: for a similar reason, kasan_early_init() needs to have happened before KASAN-instrumented functions are called. Therefore, reorder the flush to happen after the KASAN early init and prevent the compilers from adding profiling instrumentation to native_write_cr4(). Fixes: `f154f29085` ("x86/mm/64: Flush global TLB on boot and AP bringup") Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Carel Si <beibei.si@intel.com> Tested-by: "J. Bruce Fields" <bfields@fieldses.org> Link: https://lore.kernel.org/r/20211209144141.GC25654@xsang-OptiPlex-9020	2021-12-22 11:51:20 +01:00
Al Viro	2610ed63ea	um, x86: bury crypto_tfm_ctx_offset unused since 2011 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:31:35 +01:00
Al Viro	2098e213dd	uml/i386: missing include in barrier.h we need cpufeatures.h there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:31:35 +01:00
Al Viro	dbba7f704a	um: stop polluting the namespace with registers.h contents Only one extern in there is needed in processor-generic.h, and it's not needed anywhere else. So move it over there and get rid of the include in processor-generic.h, adding includes of registers.h to the few files that need the declarations in it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:31:35 +01:00
Al Viro	577ade59b9	um: move amd64 variant of mmap(2) to arch/x86/um/syscalls_64.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:30:44 +01:00
Al Viro	8f5c84f367	uml: trim unused junk from arch/x86/um/sys_call_table_*.c a bunch of detritus there - definitions that are never expanded or checked. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:30:44 +01:00
Linus Torvalds	ca0ea8a60b	* Fix for compilation of selftests on non-x86 architectures * Fix for kvm_run->if_flag on SEV-ES * Fix for page table use-after-free if yielding during exit_mm() * Improve behavior when userspace starts a nested guest with invalid state * Fix missed wakeup with assigned devices but no VT-d posted interrupts * Do not tell userspace to save/restore an unsupported PMU MSR -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHCGbIUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNMSwf+PpOl69xqQCHzLpDtA9MTBZZi5uKc eZqMtQ7D+T7XYbtMiHBI9zofcbtHlQVC9Na74j8J7bCuvqe8UGS/z93RoZZQ+/dg seEbQNEgcNlPD1G3JzaipcGhRqQl6j5G9LjuixyFpkS2O27YsRTpb1HJag31uSFV Xd7Ih0R2YnKCNJqJov324KHpDM3c4FZL8HdCS/gG94qyLHPA2dJ8fOTwCcAK4LP6 tl9XW5InXMVcck9ssDKFpAfipMNNPNPeAHMONbIm6hyG5HkVjcS3JT1bZmkauxky VIegPlAmorRCig5Hj2QfWEIOfvTUPdmm7WWh+SSZamY6cyvolWs1hV86Bg== =333O -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: - Fix for compilation of selftests on non-x86 architectures - Fix for kvm_run->if_flag on SEV-ES - Fix for page table use-after-free if yielding during exit_mm() - Improve behavior when userspace starts a nested guest with invalid state - Fix missed wakeup with assigned devices but no VT-d posted interrupts - Do not tell userspace to save/restore an unsupported PMU MSR * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU KVM: selftests: Add test to verify TRIPLE_FAULT on invalid L2 guest state KVM: VMX: Fix stale docs for kvm-intel.emulate_invalid_guest_state KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required KVM: VMX: Always clear vmx->fail on emulation_required selftests: KVM: Fix non-x86 compiling KVM: x86: Always set kvm_run->if_flag KVM: x86/mmu: Don't advance iterator after restart due to yielding KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all	2021-12-21 12:25:57 -08:00
Randy Dunlap	077b732094	um: registers: Rename function names to avoid conflicts and build problems The function names init_registers() and restore_registers() are used in several net/ethernet/ and gpu/drm/ drivers for other purposes (not calls to UML functions), so rename them. This fixes multiple build errors. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: linux-um@lists.infradead.org Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:22:19 +01:00
Johannes Berg	494545aa9b	uml: x86: add FORCE to user_constants.h The build system has started warning when filechk is called without FORCE: arch/x86/um/Makefile:44: FORCE prerequisite is missing Add FORCE to make sure the file is checked/rebuilt when necessary (and to quiet up the warning.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-12-21 21:13:44 +01:00
Paolo Bonzini	855fb0384a	Merge remote-tracking branch 'kvm/master' into HEAD Pick commit `fdba608f15` ("KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU"). In addition to fixing a bug, it also aligns the non-nested and nested usage of triggering posted interrupts, allowing for additional cleanups. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-21 12:51:09 -05:00
Sean Christopherson	fdba608f15	KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU Drop a check that guards triggering a posted interrupt on the currently running vCPU, and more importantly guards waking the target vCPU if triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE. If a vIRQ is delivered from asynchronous context, the target vCPU can be the currently running vCPU and can also be blocking, in which case skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to be a wake event for the vCPU. The "do nothing" logic when "vcpu == running_vcpu" mostly works only because the majority of calls to ->deliver_posted_interrupt(), especially when using posted interrupts, come from synchronous KVM context. But if a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to use posted interrupts, wake events from the device will be delivered to KVM from IRQ context, e.g. vfio_msihandler() \| \|-> eventfd_signal() \| \|-> ... \| \|-> irqfd_wakeup() \| \|->kvm_arch_set_irq_inatomic() \| \|-> kvm_irq_delivery_to_apic_fast() \| \|-> kvm_apic_set_irq() This also aligns the non-nested and nested usage of triggering posted interrupts, and will allow for additional cleanups. Fixes: `379a3c8ee4` ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath") Cc: stable@vger.kernel.org Reported-by: Longpeng (Mike) <longpeng2@huawei.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-21 12:39:03 -05:00
Tianyu Lan	846da38de0	net: netvsc: Add Isolation VM support for netvsc driver In Isolation VM, all shared memory with host needs to mark visible to host via hvcall. vmbus_establish_gpadl() has already done it for netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_ pagebuffer() stills need to be handled. Use DMA API to map/umap these memory during sending/receiving packet and Hyper-V swiotlb bounce buffer dma address will be returned. The swiotlb bounce buffer has been masked to be visible to host during boot up. rx/tx ring buffer is allocated via vzalloc() and they need to be mapped into unencrypted address space(above vTOM) before sharing with host and accessing. Add hv_map/unmap_memory() to map/umap rx /tx ring buffer. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20211213071407.314309-6-ltykernel@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>	2021-12-20 18:01:09 +00:00
Tianyu Lan	062a5c4260	hyper-v: Enable swiotlb bounce buffer for Isolation VM hyperv Isolation VM requires bounce buffer support to copy data from/to encrypted memory and so enable swiotlb force mode to use swiotlb bounce buffer for DMA transaction. In Isolation VM with AMD SEV, the bounce buffer needs to be accessed via extra address space which is above shared_gpa_boundary (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access physical address will be original physical address + shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of memory(vTOM). Memory addresses below vTOM are automatically treated as private while memory above vTOM is treated as shared. Swiotlb bounce buffer code calls set_memory_decrypted() to mark bounce buffer visible to host and map it in extra address space via memremap. Populate the shared_gpa_boundary (vTOM) via swiotlb_unencrypted_base variable. The map function memremap() can't work in the early place (e.g ms_hyperv_init_platform()) and so call swiotlb_update_mem_ attributes() in the hyperv_init(). Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20211213071407.314309-4-ltykernel@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>	2021-12-20 18:01:09 +00:00
Tianyu Lan	c789b90a69	x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has() Hyper-V provides Isolation VM for confidential computing support and guest memory is encrypted in it. Places checking cc_platform_has() with GUEST_MEM_ENCRYPT attr should return "True" in Isolation VM. Hyper-V Isolation VMs need to adjust the SWIOTLB size just like SEV guests. Add a hyperv_cc_platform_has() variant which enables that. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Acked-by: Borislav Petkov <bp@suse.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20211213071407.314309-3-ltykernel@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>	2021-12-20 18:01:09 +00:00
Sean Christopherson	cd0e615c49	KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required Synthesize a triple fault if L2 guest state is invalid at the time of VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs guest state via ioctls(), e.g. KVM_SET_SREGS. KVM should never emulate invalid guest state, since from L1's perspective, it's architecturally impossible for L2 to have invalid state while L2 is running in hardware. E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit or #GP. Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that can trigger this scenario, as nested VM-Enter correctly rejects any attempt to enter L2 with invalid state. RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits via RSM results in shutdown, i.e. there is precedent for KVM's behavior. Following AMD's SMRAM layout is important as AMD's layout saves/restores the descriptor cache information, including CS.RPL and SS.RPL, and also defines all the fields relevant to invalid guest state as read-only, i.e. so long as the vCPU had valid state before the SMI, which is guaranteed for L2, RSM will generate valid state unless SMRAM was modified. Intel's layout saves/restores only the selector, which means that scenarios where the selector and cached RPL don't match, e.g. conforming code segments, would yield invalid guest state. Intel CPUs fudge around this issued by stuffing SS.RPL and CS.RPL on RSM. Per Intel's SDM on the "Default Treatment of RSM", paraphrasing for brevity: IF internal storage indicates that the [CPU was post-VMXON] THEN enter VMX operation (root or non-root); restore VMX-critical state as defined in Section 34.14.1; set to their fixed values any bits in CR0 and CR4 whose values must be fixed in VMX operation [unless coming from an unrestricted guest]; IF RFLAGS.VM = 0 AND (in VMX root operation OR the “unrestricted guest” VM-execution control is 0) THEN CS.RPL := SS.DPL; SS.RPL := SS.DPL; FI; restore current VMCS pointer; FI; Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM will sythesize TRIPLE_FAULT in this scenario. KVM's behavior is allowed as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the only way for CR0 and/or CR4 to have illegal values is if they were modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map" section states "modifying these registers will result in unpredictable behavior". KVM's ioctl() behavior is less straightforward. Because KVM allows ioctls() to be executed in any order, rejecting an ioctl() if it would result in invalid L2 guest state is not an option as KVM cannot know if a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE. Ideally, KVM would reject KVM_RUN if L2 contained invalid guest state, but that carries the risk of a false positive, e.g. if RSM loaded invalid guest state and KVM exited to userspace. Setting a flag/request to detect such a scenario is undesirable because (a) it's extremely unlikely to add value to KVM as a whole, and (b) KVM would need to consider ioctl() interactions with such a flag, e.g. if userspace migrated the vCPU while the flag were set. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211207193006.120997-3-seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-20 08:06:54 -05:00
Sean Christopherson	a80dfc0259	KVM: VMX: Always clear vmx->fail on emulation_required Revert a relatively recent change that set vmx->fail if the vCPU is in L2 and emulation_required is true, as that behavior is completely bogus. Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong: (a) it's impossible to have both a VM-Fail and VM-Exit (b) vmcs.EXIT_REASON is not modified on VM-Fail (c) emulation_required refers to guest state and guest state checks are always VM-Exits, not VM-Fails. For KVM specifically, emulation_required is handled before nested exits in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect, i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored. Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit() firing when tearing down the VM as KVM never expects vmx->fail to be set when L2 is active, KVM always reflects those errors into L1. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548 nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547 Modules linked in: CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547 Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80 Call Trace: vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline] nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330 vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799 kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989 kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline] kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline] kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220 kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489 __fput+0x3fc/0x870 fs/file_table.c:280 task_work_run+0x146/0x1c0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0x705/0x24f0 kernel/exit.c:832 do_group_exit+0x168/0x2d0 kernel/exit.c:929 get_signal+0x1740/0x2120 kernel/signal.c:2852 arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300 do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: `c8607e4a08` ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry") Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211207193006.120997-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-20 08:06:54 -05:00
Marc Orr	c5063551bf	KVM: x86: Always set kvm_run->if_flag The kvm_run struct's if_flag is a part of the userspace/kernel API. The SEV-ES patches failed to set this flag because it's no longer needed by QEMU (according to the comment in the source code). However, other hypervisors may make use of this flag. Therefore, set the flag for guests with encrypted registers (i.e., with guest_state_protected set). Fixes: `f1c6366e30` ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Signed-off-by: Marc Orr <marcorr@google.com> Message-Id: <20211209155257.128747-1-marcorr@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>	2021-12-20 08:06:53 -05:00
Sean Christopherson	3a0f64de47	KVM: x86/mmu: Don't advance iterator after restart due to yielding After dropping mmu_lock in the TDP MMU, restart the iterator during tdp_iter_next() and do not advance the iterator. Advancing the iterator results in skipping the top-level SPTE and all its children, which is fatal if any of the skipped SPTEs were not visited before yielding. When zapping all SPTEs, i.e. when min_level == root_level, restarting the iter and then invoking tdp_iter_next() is always fatal if the current gfn has as a valid SPTE, as advancing the iterator results in try_step_side() skipping the current gfn, which wasn't visited before yielding. Sprinkle WARNs on iter->yielded being true in various helpers that are often used in conjunction with yielding, and tag the helper with __must_check to reduce the probabily of improper usage. Failing to zap a top-level SPTE manifests in one of two ways. If a valid SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(), the shadow page will be leaked and KVM will WARN accordingly. WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm] RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2a0 [kvm] kvm_vcpu_release+0x34/0x60 [kvm] __fput+0x82/0x240 task_work_run+0x5c/0x90 do_exit+0x364/0xa10 ? futex_unqueue+0x38/0x60 do_group_exit+0x33/0xa0 get_signal+0x155/0x850 arch_do_signal_or_restart+0xed/0x750 exit_to_user_mode_prepare+0xc5/0x120 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x48/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of marking a struct page as dirty/accessed after it has been put back on the free list. This directly triggers a WARN due to encountering a page with page_count() == 0, but it can also lead to data corruption and additional errors in the kernel. WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171 RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm] Call Trace: <TASK> kvm_set_pfn_dirty+0x120/0x1d0 [kvm] __handle_changed_spte+0x92e/0xca0 [kvm] __handle_changed_spte+0x63c/0xca0 [kvm] __handle_changed_spte+0x63c/0xca0 [kvm] __handle_changed_spte+0x63c/0xca0 [kvm] zap_gfn_range+0x549/0x620 [kvm] kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm] mmu_free_root_page+0x219/0x2c0 [kvm] kvm_mmu_free_roots+0x1b4/0x4e0 [kvm] kvm_mmu_unload+0x1c/0xa0 [kvm] kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm] kvm_put_kvm+0x3b1/0x8b0 [kvm] kvm_vcpu_release+0x4e/0x70 [kvm] __fput+0x1f7/0x8c0 task_work_run+0xf8/0x1a0 do_exit+0x97b/0x2230 do_group_exit+0xda/0x2a0 get_signal+0x3be/0x1e50 arch_do_signal_or_restart+0x244/0x17f0 exit_to_user_mode_prepare+0xcb/0x120 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x4d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Note, the underlying bug existed even before commit `1af4a96025` ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still incorrectly advance past a top-level entry when yielding on a lower-level entry. But with respect to leaking shadow pages, the bug was introduced by yielding before processing the current gfn. Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or callers could jump to their "retry" label. The downside of that approach is that tdp_mmu_iter_cond_resched() _must_ be called before anything else in the loop, and there's no easy way to enfornce that requirement. Ideally, KVM would handling the cond_resched() fully within the iterator macro (the code is actually quite clean) and avoid this entire class of bugs, but that is extremely difficult do while also supporting yielding after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE may block operations on the SPTE for a significant amount of time. Fixes: `faaf05b00a` ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: `1af4a96025` ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") Reported-by: Ignat Korchagin <ignat@cloudflare.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211214033528.123268-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-20 08:06:53 -05:00
Borislav Petkov	1acd85feba	x86/mce: Check regs before accessing it Commit in Fixes accesses pt_regs before checking whether it is NULL or not. Make sure the NULL pointer check happens first. Fixes: `0a5b288e85` ("x86/mce: Prevent severity computation from being instrumented") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20211217102029.GA29708@kili	2021-12-20 11:41:02 +01:00
Wei Wang	9fb12fe5b9	KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all The fixed counter 3 is used for the Topdown metrics, which hasn't been enabled for KVM guests. Userspace accessing to it will fail as it's not included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines, which have this counter. To reproduce it on ICX+ machines, ./state_test reports: ==== Test Assertion Failure ==== lib/x86_64/processor.c:1078: r == nmsrs pid=4564 tid=4564 - Argument list too long 1 0x000000000040b1b9: vcpu_save_state at processor.c:1077 2 0x0000000000402478: main at state_test.c:209 (discriminator 6) 3 0x00007fbe21ed5f92: ?? ??:0 4 0x000000000040264d: _start at ??:? Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c) With this patch, it works well. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-20 10:51:19 +01:00
Andrew Cooper	57690554ab	x86/pkey: Fix undefined behaviour with PKRU_WD_BIT Both __pkru_allows_write() and arch_set_user_pkey_access() shift PKRU_WD_BIT (a signed constant) by up to 30 bits, hitting the sign bit. Use unsigned constants instead. Clearly pkey 15 has not been used in combination with UBSAN yet. Noticed by code inspection only. I can't actually provoke the compiler into generating incorrect logic as far as this shift is concerned. [ dhansen: add stable@ tag, plus minor changelog massaging, For anyone doing backports, these #defines were in arch/x86/include/asm/pgtable.h before `784a46618f`. ] Fixes: `33a709b25a` ("mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211216000856.4480-1-andrew.cooper3@citrix.com	2021-12-19 22:44:34 +01:00
Linus Torvalds	f291e2d899	Two small fixes, one of which was being worked around in selftests. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmG/fW8UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOVBAf/dZDUEnrsu/o4xVPjPinMihtMstGk RP7PqmJ+6nHeDsfoivIiigaFchpc0DYq2kU/8iOPkbsEnLyFuhKg1jUHMQiK5ZGY shWvJq9HEwqAQ4v3H3PC5SomwlO4UmRkbrR4NT1SwFBSIk9s+soVbDZrdL8bJW7x 6XS01LsXeUPEcJj+h9bDGKmnibkiE4hsf9jGMgfiXgF33DB+LDNU9TH9MFn4K6J2 ColgIqZ+o5dXn7wMcsDsw+DTj81nki+gkghmW9k5rC5FA3qmKd3e5T2sgKey3Ztd J+acNmGyG/hhYNE4RHXaN+ckNEcqywcCA31DfWq87aw6yP4pF2dKWu+BYw== =yEYt -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "Two small fixes, one of which was being worked around in selftests" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Retry page fault if MMU reload is pending and root has no sp KVM: selftests: vmx_pmu_msrs_test: Drop tests mangling guest visible CPUIDs KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES	2021-12-19 12:44:03 -08:00
Sean Christopherson	18c841e1f4	KVM: x86: Retry page fault if MMU reload is pending and root has no sp Play nice with a NULL shadow page when checking for an obsolete root in the page fault handler by flagging the page fault as stale if there's no shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending. Invalidating memslots, which is the only case where _all_ roots need to be reloaded, requests all vCPUs to reload their MMUs while holding mmu_lock for lock. The "special" roots, e.g. pae_root when KVM uses PAE paging, are not backed by a shadow page. Running with TDP disabled or with nested NPT explodes spectaculary due to dereferencing a NULL shadow page pointer. Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the root. Zapping shadow pages in response to guest activity, e.g. when the guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current vCPU isn't using the affected root. I.e. KVM_REQ_MMU_RELOAD can be seen with a completely valid root shadow page. This is a bit of a moot point as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will be cleaned up in the future. Fixes: `a955cad84c` ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211209060552.2956723-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-19 19:38:58 +01:00
Vitaly Kuznetsov	1aa2abb33a	KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES The ability to write to MSR_IA32_PERF_CAPABILITIES from the host should not depend on guest visible CPUID entries, even if just to allow creating/restoring guest MSRs and CPUIDs in any sequence. Fixes: `27461da310` ("KVM: x86/pmu: Support full width counting") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211216165213.338923-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-19 19:35:18 +01:00
Colin Ian King	015e42c85f	crypto: x86/des3 - remove redundant assignment of variable nbytes The variable nbytes is being assigned a value that is never read, it is being re-assigned in the following statement. The assignment is redundant and can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-12-17 16:59:46 +11:00
Dave Airlie	eacef9fd61	Merge tag 'drm-intel-next-2021-12-14' of ssh://git.freedesktop.org/git/drm/drm-intel into drm-next drm/i915 feature pull #2 for v5.17: Features and functionality: - Add eDP privacy screen support (Hans) - Add Raptor Lake S (RPL-S) support (Anusha) - Add CD clock squashing support (Mika) - Properly support ADL-P without force probe (Clint) - Enable pipe color support (10 bit gamma) for display 13 platforms (Uma) - Update ADL-P DMC firmware to v2.14 (Madhumitha) Refactoring and cleanups: - More FBC refactoring preparing for multiple FBC instances (Ville) - Plane register cleanups (Ville) - Header refactoring and include cleanups (Jani) - Crtc helper and vblank wait function cleanups (Jani, Ville) - Move pipe/transcoder/abox masks under intel_device_info.display (Ville) Fixes: - Add a delay to let eDP source OUI write take effect (Lyude) - Use div32 version of MPLLB word clock for UHBR on SNPS PHY (Jani) - Fix DMC firmware loader overflow check (Harshit Mogalapalli) - Fully disable FBC on FIFO underruns (Ville) - Disable FBC with double wide pipe as mutually exclusive (Ville) - DG2 workarounds (Matt) - Non-x86 build fixes (Siva) - Fix HDR plane max width for NV12 (Vidya) - Disable IRQ for selftest timestamp calculation (Anshuman) - ADL-P VBT DDC pin mapping fix (Tejas) Merges: - Backmerge drm-next for privacy screen plumbing (Jani) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/87ee6f5h9u.fsf@intel.com	2021-12-17 15:23:49 +10:00
Jakub Kicinski	7cd2802d74	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-16 16:13:19 -08:00
Linus Torvalds	180f3bcfe3	Networking fixes for 5.16-rc6, including fixes from mac80211, wifi, bpf. Current release - regressions: - dpaa2-eth: fix buffer overrun when reporting ethtool statistics Current release - new code bugs: - bpf: fix incorrect state pruning for <8B spill/fill - iavf: - add missing unlocks in iavf_watchdog_task() - do not override the adapter state in the watchdog task (again) - mlxsw: spectrum_router: consolidate MAC profiles when possible Previous releases - regressions: - mac80211, fix: - rate control, avoid driver crash for retransmitted frames - regression in SSN handling of addba tx - a memory leak where sta_info is not freed - marking TX-during-stop for TX in in_reconfig, prevent stall - cfg80211: acquire wiphy mutex on regulatory work - wifi drivers: fix build regressions and LED config dependency - virtio_net: fix rx_drops stat for small pkts - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down() Previous releases - always broken: - bpf, fix: - kernel address leakage in atomic fetch - kernel address leakage in atomic cmpxchg's r0 aux reg - signed bounds propagation after mov32 - extable fixup offset - extable address check - mac80211: - fix the size used for building probe request - send ADDBA requests using the tid/queue of the aggregation session - agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid deadlocks - validate extended element ID is present - mptcp: - never allow the PM to close a listener subflow (null-defer) - clear 'kern' flag from fallback sockets, prevent crash - fix deadlock in __mptcp_push_pending() - inet_diag: fix kernel-infoleak for UDP sockets - xsk: do not sleep in poll() when need_wakeup set - smc: avoid very long waits in smc_release() - sch_ets: don't remove idle classes from the round-robin list - netdevsim: - zero-initialize memory for bpf map's value, prevent info leak - don't let user space overwrite read only (max) ethtool parms - ixgbe: set X550 MDIO speed before talking to PHY - stmmac: - fix null-deref in flower deletion w/ VLAN prio Rx steering - dwmac-rk: fix oob read in rk_gmac_setup - ice: time stamping fixes - systemport: add global locking for descriptor life cycle Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmG7rdUACgkQMUZtbf5S IrtRvw//etsgeg2+zxe+fBSbe7ZihcCB4yzWUoRDdNzPrLNLsnWxKT1wYblDcZft b1f/SpTy9ycfg+fspn2qET8gzydn4m9xHkjmlQPzmXB9tdIDF6mECFTAXYlar1hQ RQIijpfZYyrZeGdgHpsyq72YC4dpNdbZrxmQFVdpMr3cK8P2N0Dn32bBVa//+jb+ LCv3Uw9C0yNbqhtRIiukkWIE20+/pXtKm0uErDVmvonqFMWPo6mYD0C2PwC20PwR Kv5ok6jH+44fCSwDoLChbB+Wes0AtrIQdUvUwXGXaF3MDfZl+24oLkX5xJl3EHWT 90Mh0k0NhRORgBZ3NItwK7OliohrRHCYxlAXPjg1Dicxl+kxl0wPlva8v64eAA+u ZhwXwaQpCrZNdKoxHJw9kQ/CmbggtxcWkVolbZp3TzDjYY1E7qxuwg51YMhGmGT1 FPjradYGvHKi+thizJiEdiZaMKRc8bpaL0hbpROxFQvfjNwFOwREQhtnXYP3W5Kd lK88fWaH86dxqL+ABvbrMnSZKuNlSL8R/CROWpZuF+vyLRXaxhAvYRrL79bgmkKq zvImnh1mFovdyKGJhibFMdy92X14z8FzoyX3VQuFcl9EB+2NQXnNZ6abDLJlufZX A0jQ5r46Ce/yyaXXmS61PrP7Pf5sxhs/69fqAIDQfSSzpyUKHd4= =VIbd -----END PGP SIGNATURE----- Merge tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from mac80211, wifi, bpf. Relatively large batches of fixes from BPF and the WiFi stack, calm in general networking. Current release - regressions: - dpaa2-eth: fix buffer overrun when reporting ethtool statistics Current release - new code bugs: - bpf: fix incorrect state pruning for <8B spill/fill - iavf: - add missing unlocks in iavf_watchdog_task() - do not override the adapter state in the watchdog task (again) - mlxsw: spectrum_router: consolidate MAC profiles when possible Previous releases - regressions: - mac80211 fixes: - rate control, avoid driver crash for retransmitted frames - regression in SSN handling of addba tx - a memory leak where sta_info is not freed - marking TX-during-stop for TX in in_reconfig, prevent stall - cfg80211: acquire wiphy mutex on regulatory work - wifi drivers: fix build regressions and LED config dependency - virtio_net: fix rx_drops stat for small pkts - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down() Previous releases - always broken: - bpf fixes: - kernel address leakage in atomic fetch - kernel address leakage in atomic cmpxchg's r0 aux reg - signed bounds propagation after mov32 - extable fixup offset - extable address check - mac80211: - fix the size used for building probe request - send ADDBA requests using the tid/queue of the aggregation session - agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid deadlocks - validate extended element ID is present - mptcp: - never allow the PM to close a listener subflow (null-defer) - clear 'kern' flag from fallback sockets, prevent crash - fix deadlock in __mptcp_push_pending() - inet_diag: fix kernel-infoleak for UDP sockets - xsk: do not sleep in poll() when need_wakeup set - smc: avoid very long waits in smc_release() - sch_ets: don't remove idle classes from the round-robin list - netdevsim: - zero-initialize memory for bpf map's value, prevent info leak - don't let user space overwrite read only (max) ethtool parms - ixgbe: set X550 MDIO speed before talking to PHY - stmmac: - fix null-deref in flower deletion w/ VLAN prio Rx steering - dwmac-rk: fix oob read in rk_gmac_setup - ice: time stamping fixes - systemport: add global locking for descriptor life cycle" * tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits) bpf, selftests: Fix racing issue in btf_skc_cls_ingress test selftest/bpf: Add a test that reads various addresses. bpf: Fix extable address check. bpf: Fix extable fixup offset. bpf, selftests: Add test case trying to taint map value pointer bpf: Make 32->64 bounds propagation slightly more robust bpf: Fix signed bounds propagation after mov32 sit: do not call ipip6_dev_free() from sit_init_net() net: systemport: Add global locking for descriptor lifecycle net/smc: Prevent smc_release() from long blocking net: Fix double 0x prefix print in SKB dump virtio_net: fix rx_drops stat for small pkts dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED sfc_ef100: potential dereference of null pointer net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup net: usb: lan78xx: add Allied Telesis AT29M2-AF net/packet: rx_owner_map depends on pg_vec netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc dpaa2-eth: fix ethtool statistics ixgbe: set X550 MDIO speed before talking to PHY ...	2021-12-16 15:02:14 -08:00
Thomas Gleixner	f2948df5f8	x86/pci/xen: Use msi_for_each_desc() Replace the about to vanish iterators. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20211206210748.198359105@linutronix.de	2021-12-16 22:22:18 +01:00
Thomas Gleixner	173ffad79d	PCI/MSI: Use msi_desc::msi_index The usage of msi_desc::pci::entry_nr is confusing at best. It's the index into the MSI[X] descriptor table. Use msi_desc::msi_index which is shared between all MSI incarnations instead of having a PCI specific storage for no value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Nishanth Menon <nm@ti.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20211210221814.602911509@linutronix.de	2021-12-16 22:16:40 +01:00
Thomas Gleixner	b3f8236411	x86/apic/msi: Use PCI device MSI property instead of fiddling with MSI descriptors. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20211210221813.372357371@linutronix.de	2021-12-16 22:16:37 +01:00
Thomas Gleixner	0bcfade920	x86/pci/XEN: Use PCI device property instead of fiddling with MSI descriptors. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20211210221813.311410967@linutronix.de	2021-12-16 22:16:37 +01:00
Mateusz Jończyk	0dd8d6cb9e	rtc: Check return value from mc146818_get_time() There are 4 users of mc146818_get_time() and none of them was checking the return value from this function. Change this. Print the appropriate warnings in callers of mc146818_get_time() instead of in the function mc146818_get_time() itself, in order not to add strings to rtc-mc146818-lib.c, which is kind of a library. The callers of alpha_rtc_read_time() and cmos_read_time() may use the contents of (struct rtc_time ) even when the functions return a failure code. Therefore, set the contents of (struct rtc_time ) to 0x00, which looks more sensible then 0xff and aligns with the (possibly stale?) comment in cmos_read_time: /* * If pm_trace abused the RTC for storage, set the timespec to 0, * which tells the caller that this RTC value is unusable. */ For consistency, do this in mc146818_get_time(). Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a second. It is very unlikely, though, that the RTC suddenly stops working and mc146818_get_time() would consistently fail. Only compile-tested on alpha. Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: linux-alpha@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl	2021-12-16 21:50:06 +01:00
Alexei Starovoitov	588a25e924	bpf: Fix extable address check. The verifier checks that PTR_TO_BTF_ID pointer is either valid or NULL, but it cannot distinguish IS_ERR pointer from valid one. When offset is added to IS_ERR pointer it may become small positive value which is a user address that is not handled by extable logic and has to be checked for at the runtime. Tighten BPF_PROBE_MEM pointer check code to prevent this case. Fixes: `4c5de12759` ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Reported-by: Lorenzo Fontana <lorenzo.fontana@elastic.co> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2021-12-16 21:41:04 +01:00
Alexei Starovoitov	433956e912	bpf: Fix extable fixup offset. The prog - start_of_ldx is the offset before the faulting ldx to the location after it, so this will be used to adjust pt_regs->ip for jumping over it and continuing, and with old temp it would have been fixed up to the wrong offset, causing crash. Fixes: `4c5de12759` ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2021-12-16 21:18:26 +01:00
Arnd Bergmann	91f7d2dbf9	x86/xen: Use correct #ifdef guard for xen_initdom_restore_msi() The #ifdef check around the definition doesn't match the one around the declaration, leading to a link failure when CONFIG_XEN_DOM0 is enabled but CONFIG_XEN_PV_DOM0 is not: x86_64-linux-ld: arch/x86/kernel/apic/msi.o: in function `arch_restore_msi_irqs': msi.c:(.text+0x29a): undefined reference to `xen_initdom_restore_msi' Change the declaration to use the same check that was already present around the function definition. Fixes: `ae72f31567` ("PCI/MSI: Make arch_restore_msi_irqs() less horrible.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20211215140209.451379-1-arnd@kernel.org	2021-12-15 16:13:23 +01:00
Mike Rapoport	2f5b3514c3	x86/boot: Move EFI range reservation after cmdline parsing The memory reservation in arch/x86/platform/efi/efi.c depends on at least two command line parameters. Put it back later in the boot process and move efi_memblock_x86_reserve_range() out of early_memory_reserve(). An attempt to fix this was done in `8d48bf8206` ("x86/boot: Pull up cmdline preparation and early param parsing") but that caused other troubles so it got reverted. The bug this is addressing is: Dan reports that Anjaneya Chagam can no longer use the efi=nosoftreserve kernel command line parameter to suppress "soft reservation" behavior. This is due to the fact that the following call-chain happens at boot: early_reserve_memory \|-> efi_memblock_x86_reserve_range \|-> efi_fake_memmap_early which does if (!efi_soft_reserve_enabled()) return; and that would have set EFI_MEM_NO_SOFT_RESERVE after having parsed "nosoftreserve". However, parse_early_param() gets called after it, leading to the boot cmdline not being taken into account. See also https://lore.kernel.org/r/e8dd8993c38702ee6dd73b3c11f158617e665607.camel@intel.com [ bp: Turn into a proper patch. ] Signed-off-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211213112757.2612-4-bp@alien8.de	2021-12-15 14:07:54 +01:00
Borislav Petkov	fbe6183998	Revert "x86/boot: Pull up cmdline preparation and early param parsing" This reverts commit `8d48bf8206`. It turned out to be a bad idea as it broke supplying mem= cmdline parameters due to parse_memopt() requiring preparatory work like setting up the e820 table in e820__memory_setup() in order to be able to exclude the range specified by mem=. Pulling that up would've broken Xen PV again, see threads at https://lkml.kernel.org/r/20210920120421.29276-1-jgross@suse.com due to xen_memory_setup() needing the first reservations in early_reserve_memory() - kernel and initrd - to have happened already. This could be fixed again by having Xen do those reservations itself... Long story short, revert this and do a simpler fix in a later patch. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211213112757.2612-3-bp@alien8.de	2021-12-15 11:38:57 +01:00
Borislav Petkov	58e138d624	Revert "x86/boot: Mark prepare_command_line() __init" This reverts commit `c0f2077baa`. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211213112757.2612-2-bp@alien8.de	2021-12-15 11:14:28 +01:00
Thomas Gleixner	09eb3ad55f	Merge branch 'irq/urgent' into irq/msi to pick up the PCI/MSI-x fixes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2021-12-14 13:30:34 +01:00
Rob Herring	369461ce8f	x86: perf: Move RDPMC event flag to a common definition In preparation to enable user counter access on arm64 and to move some of the user access handling to perf core, create a common event flag for user counter access and convert x86 to use it. Since the architecture specific flags start at the LSB, starting at the MSB for common flags. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: linux-perf-users@vger.kernel.org Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20211208201124.310740-2-robh@kernel.org Signed-off-by: Will Deacon <will@kernel.org>	2021-12-14 11:30:54 +00:00
Eric W. Biederman	0e25498f8c	exit: Add and use make_task_dead. There are two big uses of do_exit. The first is it's design use to be the guts of the exit(2) system call. The second use is to terminate a task after something catastrophic has happened like a NULL pointer in kernel code. Add a function make_task_dead that is initialy exactly the same as do_exit to cover the cases where do_exit is called to handle catastrophic failure. In time this can probably be reduced to just a light wrapper around do_task_dead. For now keep it exactly the same so that there will be no behavioral differences introducing this new concept. Replace all of the uses of do_exit that use it for catastraphic task cleanup with make_task_dead to make it clear what the code is doing. As part of this rename rewind_stack_do_exit rewind_stack_and_make_dead. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2021-12-13 12:04:45 -06:00
Jiri Olsa	f92c1e1836	bpf: Add get_func_[arg\|ret\|arg_cnt] helpers Adding following helpers for tracing programs: Get n-th argument of the traced function: long bpf_get_func_arg(void ctx, u32 n, u64 value) Get return value of the traced function: long bpf_get_func_ret(void ctx, u64 value) Get arguments count of the traced function: long bpf_get_func_arg_cnt(void *ctx) The trampoline now stores number of arguments on ctx-8 address, so it's easy to verify argument index and find return value argument's position. Moving function ip address on the trampoline stack behind the number of functions arguments, so it's now stored on ctx-16 address if it's needed. All helpers above are inlined by verifier. Also bit unrelated small change - using newly added function bpf_prog_has_trampoline in check_get_func_ip. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211208193245.172141-5-jolsa@kernel.org	2021-12-13 09:25:59 -08:00
Jiri Olsa	5edf6a1983	bpf, x64: Replace some stack_size usage with offset variables As suggested by Andrii, adding variables for registers and ip address offsets, which makes the code more clear, rather than abusing single stack_size variable for everything. Also describing the stack layout in the comment. There is no function change. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211208193245.172141-4-jolsa@kernel.org	2021-12-13 09:24:22 -08:00
Javier Martinez Canillas	4bc5e64e6c	efi: Move efifb_setup_from_dmi() prototype from arch headers Commit `8633ef82f1` ("drivers/firmware: consolidate EFI framebuffer setup for all arches") made the Generic System Framebuffers (sysfb) driver able to be built on non-x86 architectures. But it left the efifb_setup_from_dmi() function prototype declaration in the architecture specific headers. This could lead to the following compiler warning as reported by the kernel test robot: drivers/firmware/efi/sysfb_efi.c:70:6: warning: no previous prototype for function 'efifb_setup_from_dmi' [-Wmissing-prototypes] void efifb_setup_from_dmi(struct screen_info si, const char opt) ^ drivers/firmware/efi/sysfb_efi.c:70:1: note: declare 'static' if the function is not intended to be used outside of this translation unit void efifb_setup_from_dmi(struct screen_info si, const char opt) Fixes: `8633ef82f1` ("drivers/firmware: consolidate EFI framebuffer setup for all arches") Reported-by: kernel test robot <lkp@intel.com> Cc: <stable@vger.kernel.org> # 5.15.x Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20211126001333.555514-1-javierm@redhat.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2021-12-13 15:07:16 +01:00
Borislav Petkov	e3d72e8eee	x86/mce: Mark mce_start() noinstr Fixes vmlinux.o: warning: objtool: do_machine_check()+0x4ae: call to __const_udelay() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-13-bp@alien8.de	2021-12-13 14:14:05 +01:00
Borislav Petkov	edb3d07e24	x86/mce: Mark mce_timed_out() noinstr Fixes vmlinux.o: warning: objtool: do_machine_check()+0x482: call to mce_timed_out() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-12-bp@alien8.de	2021-12-13 14:13:54 +01:00
Borislav Petkov	75581a203e	x86/mce: Move the tainting outside of the noinstr region add_taint() is yet another external facility which the #MC handler calls. Move that tainting call into the instrumentation-allowed part of the handler. Fixes vmlinux.o: warning: objtool: do_machine_check()+0x617: call to add_taint() leaves .noinstr.text section While at it, allow instrumentation around the mce_log() call. Fixes vmlinux.o: warning: objtool: do_machine_check()+0x690: call to mce_log() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-11-bp@alien8.de	2021-12-13 14:13:35 +01:00
Borislav Petkov	db6c996d6c	x86/mce: Mark mce_read_aux() noinstr Fixes vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de	2021-12-13 14:13:23 +01:00
Borislav Petkov	b4813539d3	x86/mce: Mark mce_end() noinstr It is called by the #MC handler which is noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de	2021-12-13 14:13:12 +01:00
Borislav Petkov	3c7ce80a81	x86/mce: Mark mce_panic() noinstr And allow instrumentation inside it because it does calls to other facilities which will not be tagged noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de	2021-12-13 14:13:01 +01:00
Borislav Petkov	0a5b288e85	x86/mce: Prevent severity computation from being instrumented Mark all the MCE severity computation logic noinstr and allow instrumentation when it "calls out". Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc5d: call to mce_severity() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-7-bp@alien8.de	2021-12-13 14:12:48 +01:00
Borislav Petkov	4fbce464db	x86/mce: Allow instrumentation during task work queueing Fixes vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de	2021-12-13 14:12:35 +01:00
Borislav Petkov	487d654db3	x86/mce: Remove noinstr annotation from mce_setup() Instead, sandwitch around the call which is done in noinstr context and mark the caller - mce_gather_info() - as noinstr. Also, document what the whole instrumentation strategy with #MC is going to be in the future and where it all is supposed to be going to. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-5-bp@alien8.de	2021-12-13 14:12:21 +01:00
Borislav Petkov	88f66a4235	x86/mce: Use mce_rdmsrl() in severity checking code MCA has its own special MSR accessors. Use them. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-4-bp@alien8.de	2021-12-13 14:12:08 +01:00
Borislav Petkov	ad669ec16a	x86/mce: Remove function-local cpus variables Use num_online_cpus() directly. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-3-bp@alien8.de	2021-12-13 14:11:53 +01:00
Borislav Petkov	cd5e0d1fc9	x86/mce: Do not use memset to clear the banks bitmaps The bitmap is a single unsigned long so no need for the function call. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-2-bp@alien8.de	2021-12-13 14:11:22 +01:00
Linus Torvalds	773602256a	A single fix for the x86 scheduler topology: Using cluster topology on hybrid CPUs, e.g. Alder Lake, biases the scheduler towards the ATOM cluster as that has more total capacity. Use selection based on CPU priority instead. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmG1xxETHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoWhaEADFlr1l6QwIJs8Bv3Z9u/kkW1fPIoqP aFDrqLiQ/T+gBuS+d7R3YkRf1cICS+gbvv/dw8qx/v6gerUC71tEXxYxvOUqR7QL wX2zFNrhQqBaIKKGnbTHe47pB2+/AlpbXuQ2/j415AHxCxB4fOC1N4VKq0hTUyPu Ubz+QmOHcB0NhAf1jeN9DAlTJgQNbRW6oYYPJDsr+n+WjdlN3IZW/hCxLN1cvnNd NzsDEK1EeuccteoT5btzYy+XB31+WNrZ9afcJgG7goO/zmJfZ1lWAsyjKPocn6dC ozZSvHQhbgJIAwBaZJFyjZocgHTIkdFX6NzlpKaWw05tk984hJJ9pvg4KO5YeNx/ tIehPnkmd9GwwO541+t3DhcbCPQIiKypRZKnXuczkjIuG3t2oD25b1EAqwhoPKGk TEX2o4DjixS3oKUP6khDteQkj8B0qOjeY18eGjBilJ764mRfbL+GRk/NoqVidrnt JWSCPEMM842Y8W7GYfZZZAypqDz9U0yRl+BCKpNDcx3Nr5OSbfkWC/LdvM3LlgIE /a2C6CXIZXgiifTF2PHOmv0O3/VPAtHXZ01ExzZu/MJP0hBUjLrxsj+TwILgociw WLVYnBgCf5XN1rMPmora8P5y+O/UpEYhXbmIFa6HQAQyfc0Ffiof4ZOl5VYf3mNh 1GLB+ByEgCz4ig== =rr0v -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single fix for the x86 scheduler topology: Using cluster topology on hybrid CPUs, e.g. Alder Lake, biases the scheduler towards the ATOM cluster as that has more total capacity. Use selection based on CPU priority instead" * tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched,x86: Don't use cluster topology for x86 hybrid CPUs	2021-12-12 09:38:04 -08:00
Peter Zijlstra	e5eefda5aa	x86: Remove .fixup section No moar users, kill it dead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101326.201590122@infradead.org	2021-12-11 09:09:50 +01:00
Peter Zijlstra	b776078025	x86/word-at-a-time: Remove .fixup usage Rewrite load_unaligned_zeropad() to not require .fixup text. This is easiest done using asm-goto-output, where we can stick a C label in the exception table entry. The fallback version isn't nearly so nice but should work. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101326.141775772@infradead.org	2021-12-11 09:09:50 +01:00
Peter Zijlstra	d5d797dcbd	x86/usercopy: Remove .fixup usage Typically usercopy does whole word copies followed by a number of byte copies to finish the tail. This means that on exception it needs to compute the remaining length as: words*sizeof(long) + bytes. Create a new extable handler to do just this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101326.081701085@infradead.org	2021-12-11 09:09:50 +01:00
Peter Zijlstra	13e4bf1bdd	x86/usercopy_32: Simplify __copy_user_intel_nocache() Have an exception jump to a .fixup to only immediately jump out is daft, jump to the right place in one go. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101326.021517780@infradead.org	2021-12-11 09:09:50 +01:00
Peter Zijlstra	5ce8e39f55	x86/sgx: Remove .fixup usage Create EX_TYPE_FAULT_SGX which does as EX_TYPE_FAULT does, except adds this extra bit that SGX really fancies having. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.961246679@infradead.org	2021-12-11 09:09:49 +01:00
Peter Zijlstra	fedb24cda1	x86/checksum_32: Remove .fixup usage Simply add EX_FLAG_CLEAR_AX to do as the .fixup used to do. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.899657959@infradead.org	2021-12-11 09:09:49 +01:00
Peter Zijlstra	3e8ea7803a	x86/vmx: Remove .fixup usage In the vmread exceptin path, use the, thus far, unused output register to push the @fault argument onto the stack. This, in turn, enables the exception handler to not do pushes and only modify that register when an exception does occur. As noted by Sean the input constraint needs to be changed to "=&r" to avoid the value and field occupying the same register. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.781308550@infradead.org	2021-12-11 09:09:49 +01:00
Peter Zijlstra	c9a34c3f4e	x86/kvm: Remove .fixup usage KVM instruction emulation has a gnarly hack where the .fixup does a return, however there's already a ret right after the 10b label, so mark that as 11 and have the exception clear %esi to remove the .fixup. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.722157053@infradead.org	2021-12-11 09:09:48 +01:00
Peter Zijlstra	5fc77b916c	x86/segment: Remove .fixup usage Create and use EX_TYPE_ZERO_REG to clear the register and retry the segment load on exception. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.663529463@infradead.org	2021-12-11 09:09:48 +01:00
Peter Zijlstra	1c3b9091d0	x86/fpu: Remove .fixup usage Employ EX_TYPE_EFAULT_REG to store '-EFAULT' into the %[err] register on exception. All the callers only ever test for 0, so the change from -1 to -EFAULT is immaterial. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.604494664@infradead.org	2021-12-11 09:09:48 +01:00
Peter Zijlstra	e2b48e4328	x86/xen: Remove .fixup usage Employ the fancy new EX_TYPE_IMM_REG to store -EFAULT in the return register and use this to remove some Xen .fixup usage. All callers of these functions only test for 0 return, so the actual return value change from -1 to -EFAULT is immaterial. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.545019822@infradead.org	2021-12-11 09:09:48 +01:00
Peter Zijlstra	99641e094d	x86/uaccess: Remove .fixup usage For the !CC_AS_ASM_GOTO_OUTPUT (aka. the legacy codepath), remove the .fixup usage by employing both EX_TYPE_EFAULT_REG and EX_FLAG_CLEAR. Like was already done for X86_32's version of __get_user_asm_u64() use the "a" register for output, specifically so we can use CLEAR_AX. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.485154848@infradead.org	2021-12-11 09:09:47 +01:00
Peter Zijlstra	4c132d1d84	x86/futex: Remove .fixup usage Use the new EX_TYPE_IMM_REG to store -EFAULT into the designated 'ret' register, this removes the need for anonymous .fixup code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.426016322@infradead.org	2021-12-11 09:09:47 +01:00
Peter Zijlstra	d52a7344bd	x86/msr: Remove .fixup usage Rework the MSR accessors to remove .fixup usage. Add two new extable types (to the 4 already existing msr ones) using the new register infrastructure to record which register should get the error value. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.364084212@infradead.org	2021-12-11 09:09:47 +01:00
Peter Zijlstra	4b5305decc	x86/extable: Extend extable functionality In order to remove further .fixup usage, extend the extable infrastructure to take additional information from the extable entry sites. Specifically add _ASM_EXTABLE_TYPE_REG() and EX_TYPE_IMM_REG that extend the existing _ASM_EXTABLE_TYPE() by taking an additional register argument and encoding that and an s16 immediate into the existing s32 type field. This limits the actual types to the first byte, 255 seem plenty. Also add a few flags into the type word, specifically CLEAR_AX and CLEAR_DX which clear the return and extended return register. Notes: - due to the % in our register names it's hard to make it more generally usable as arm64 did. - the s16 is far larger than used in these patches, future extentions can easily shrink this to get more bits. - without the bitfield fix this will not compile, because: 0xFF > -1 and we can't even extract the TYPE field. [nathanchance: Build fix for clang-lto builds: https://lkml.kernel.org/r/20211210234953.3420108-1-nathan@kernel.org ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20211110101325.303890153@infradead.org	2021-12-11 09:09:46 +01:00
Peter Zijlstra	aa93e2ad74	x86/entry_32: Remove .fixup usage Where possible, push the .fixup into code, at the tail of functions. This is hard for macros since they're used in multiple functions, therefore introduce a new extable handler to pop zeros. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.245184699@infradead.org	2021-12-11 09:09:46 +01:00
Peter Zijlstra	16e617d05e	x86/entry_64: Remove .fixup usage Place the anonymous .fixup code at the tail of the regular functions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Link: https://lore.kernel.org/r/20211110101325.186049322@infradead.org	2021-12-11 09:09:46 +01:00
Peter Zijlstra	ab0fedcc71	x86/copy_mc_64: Remove .fixup usage Place the anonymous .fixup code at the tail of the regular functions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211110101325.127055887@infradead.org	2021-12-11 09:09:46 +01:00
Peter Zijlstra	acba44d243	x86/copy_user_64: Remove .fixup usage Place the anonymous .fixup code at the tail of the regular functions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211110101325.068505810@infradead.org	2021-12-11 09:09:45 +01:00
Peter Zijlstra	c6dbd3e5e6	x86/mmx_32: Remove X86_USE_3DNOW This code puts an exception table entry on the PREFETCH instruction to overwrite it with a JMP.d8 when it triggers an exception. Except of course, our code is no longer writable, also SMP. Instead of fixing this broken mess, simply take it out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/YZKQzUmeNuwyvZpk@hirez.programming.kicks-ass.net	2021-12-11 09:09:45 +01:00
Jakub Kicinski	be3158290d	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Andrii Nakryiko says: ==================== bpf-next 2021-12-10 v2 We've added 115 non-merge commits during the last 26 day(s) which contain a total of 182 files changed, 5747 insertions(+), 2564 deletions(-). The main changes are: 1) Various samples fixes, from Alexander Lobakin. 2) BPF CO-RE support in kernel and light skeleton, from Alexei Starovoitov. 3) A batch of new unified APIs for libbpf, logging improvements, version querying, etc. Also a batch of old deprecations for old APIs and various bug fixes, in preparation for libbpf 1.0, from Andrii Nakryiko. 4) BPF documentation reorganization and improvements, from Christoph Hellwig and Dave Tucker. 5) Support for declarative initialization of BPF_MAP_TYPE_PROG_ARRAY in libbpf, from Hengqi Chen. 6) Verifier log fixes, from Hou Tao. 7) Runtime-bounded loops support with bpf_loop() helper, from Joanne Koong. 8) Extend branch record capturing to all platforms that support it, from Kajol Jain. 9) Light skeleton codegen improvements, from Kumar Kartikeya Dwivedi. 10) bpftool doc-generating script improvements, from Quentin Monnet. 11) Two libbpf v0.6 bug fixes, from Shuyi Cheng and Vincent Minet. 12) Deprecation warning fix for perf/bpf_counter, from Song Liu. 13) MAX_TAIL_CALL_CNT unification and MIPS build fix for libbpf, from Tiezhu Yang. 14) BTF_KING_TYPE_TAG follow-up fixes, from Yonghong Song. 15) Selftests fixes and improvements, from Ilya Leoshkevich, Jean-Philippe Brucker, Jiri Olsa, Maxim Mikityanskiy, Tirthendu Sarkar, Yucong Sun, and others. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (115 commits) libbpf: Add "bool skipped" to struct bpf_map libbpf: Fix typo in btf__dedup@LIBBPF_0.0.2 definition bpftool: Switch bpf_object__load_xattr() to bpf_object__load() selftests/bpf: Remove the only use of deprecated bpf_object__load_xattr() selftests/bpf: Add test for libbpf's custom log_buf behavior selftests/bpf: Replace all uses of bpf_load_btf() with bpf_btf_load() libbpf: Deprecate bpf_object__load_xattr() libbpf: Add per-program log buffer setter and getter libbpf: Preserve kernel error code and remove kprobe prog type guessing libbpf: Improve logging around BPF program loading libbpf: Allow passing user log setting through bpf_object_open_opts libbpf: Allow passing preallocated log_buf when loading BTF into kernel libbpf: Add OPTS-based bpf_btf_load() API libbpf: Fix bpf_prog_load() log_buf logic for log_level 0 samples/bpf: Remove unneeded variable bpf: Remove redundant assignment to pointer t selftests/bpf: Fix a compilation warning perf/bpf_counter: Use bpf_map_create instead of bpf_create_map samples: bpf: Fix 'unknown warning group' build warning on Clang samples: bpf: Fix xdp_sample_user.o linking with Clang ... ==================== Link: https://lore.kernel.org/r/20211210234746.2100561-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-10 15:56:13 -08:00
Linus Torvalds	b9172f9e88	More x86 fixes: * Logic bugs in CR0 writes and Hyper-V hypercalls * Don't use Enlightened MSR Bitmap for L3 * Remove user-triggerable WARN Plus a few selftest fixes and a regression test for the user-triggerable WARN. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGzZuIUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroMD4Qf+Im7Q0XNRZGzK6x4Blu3ZZJSuIIkW gEK5mDX/BWSYxoGhRN0IOkyf1Tx/A5qYwbZts87wZSvKONG2MuVzdeQ0mkDxgKc3 cYwvvIPxCKaW/dQLD2OKVlqdAv6YbeJiFURWXgszMkrcgHvw39H5Tn6ldi0B5nvg Gvpj8LtbPDXGXab//Xrhia3+1F9TKOrcOG+obGC5G2mrGKTkG2+pi9L6LohvENhd sOSWdpmvQTU4PeqGlhW8RCwcN+vpa+NasHT2i2tHcWZA9Lqp4P81+4ZyQLIBsRB3 psANG0c40EW+lfjFGbLL/6VR5kypxa6zy9RgX+QiRcj6C0+dJOgnwNY35A== =cREz -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "More x86 fixes: - Logic bugs in CR0 writes and Hyper-V hypercalls - Don't use Enlightened MSR Bitmap for L3 - Remove user-triggerable WARN Plus a few selftest fixes and a regression test for the user-triggerable WARN" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: selftests: KVM: Add test to verify KVM doesn't explode on "bad" I/O KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode selftests: KVM: avoid failures due to reserved HyperTransport region KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall KVM: x86: selftests: svm_int_ctl_test: fix intercept calculation KVM: nVMX: Don't use Enlightened MSR Bitmap for L3	2021-12-10 14:09:12 -08:00
Kees Cook	bc7aaf52f9	x86/boot/string: Add missing function prototypes Silence "warning: no previous prototype for ... [-Wmissing-prototypes]" warnings from string.h when building under W=1. [ bp: Clarify commit message. ] Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211119175325.3668419-1-keescook@chromium.org	2021-12-10 19:49:06 +01:00
Shaokun Zhang	20735d24ad	x86/fpu: Remove duplicate copy_fpstate_to_sigframe() prototype The function prototype of copy_fpstate_to_sigframe() is declared twice in `0ae67cc34f` ("x86/fpu: Remove internal.h dependency from fpu/signal.h"). Remove one of them. [ bp: Massage ] Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211209015550.51916-1-zhangshaokun@hisilicon.com	2021-12-10 19:13:06 +01:00
Kees Cook	61646ca83d	x86/uaccess: Move variable into switch case statement When building with automatic stack variable initialization, GCC 12 complains about variables defined outside of switch case statements. Move the variable into the case that uses it, which silences the warning: ./arch/x86/include/asm/uaccess.h:317:23: warning: statement will never be executed [-Wswitch-unreachable] 317 \| unsigned char x_u8__; \ \| ^~~~~~ Fixes: `865c50e1d2` ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211209043456.1377875-1-keescook@chromium.org	2021-12-10 19:13:00 +01:00
Sean Christopherson	d07898eaf3	KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit Replace a WARN with a comment to call out that userspace can modify RCX during an exit to userspace to handle string I/O. KVM doesn't actually support changing the rep count during an exit, i.e. the scenario can be ignored, but the WARN needs to go as it's trivial to trigger from userspace. Cc: stable@vger.kernel.org Fixes: `3b27de2718` ("KVM: x86: split the two parts of emulator_pio_in") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211025201311.1881846-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-10 09:38:02 -05:00
Lai Jiangshan	777ab82d7c	KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode In the SDM: If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an attempt to clear CR0.PG causes a general-protection exception (#GP). Software should transition to compatibility mode and clear CR4.PCIDE before attempting to disable paging. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211207095230.53437-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-10 09:38:01 -05:00
Peter Zijlstra	1614b2b11f	arch: Make ARCH_STACKWALK independent of STACKTRACE Make arch_stack_walk() available for ARCH_STACKWALK architectures without it being entangled in STACKTRACE. Link: https://lore.kernel.org/lkml/20211022152104.356586621@infradead.org/ Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Mark: rebase, drop unnecessary arm change] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Link: https://lore.kernel.org/r/20211129142849.3056714-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2021-12-10 14:06:03 +00:00
Sean Christopherson	3244867af8	KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req Do not bail early if there are no bits set in the sparse banks for a non-sparse, a.k.a. "all CPUs", IPI request. Per the Hyper-V spec, it is legal to have a variable length of '0', e.g. VP_SET's BankContents in this case, if the request can be serviced without the extra info. It is possible that for a given invocation of a hypercall that does accept variable sized input headers that all the header input fits entirely within the fixed size header. In such cases the variable sized input header is zero-sized and the corresponding bits in the hypercall input should be set to zero. Bailing early results in KVM failing to send IPIs to all CPUs as expected by the guest. Fixes: `214ff83d44` ("KVM: x86: hyperv: implement PV IPI send hypercalls") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-10 07:12:42 -05:00
Vitaly Kuznetsov	1ebfaa11eb	KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall Prior to commit `0baedd7927` ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH \| KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs and KVM_REQ_TLB_FLUSH is/was defined as: (0 \| KVM_REQUEST_WAIT \| KVM_REQUEST_NO_WAKEUP) so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that "This call guarantees that by the time control returns back to the caller, the observable effects of all flushes on the specified virtual processors have occurred." and without KVM_REQUEST_WAIT there's a small chance that the vCPU making the TLB flush will resume running before all IPIs get delivered to other vCPUs and a stale mapping can get read there. Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST: kvm_hv_flush_tlb() is the sole caller which uses it for kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where KVM_REQUEST_WAIT makes a difference. Cc: stable@kernel.org Fixes: `0baedd7927` ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211209102937.584397-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-10 07:12:41 -05:00
Marco Elver	d93414e375	x86/qspinlock, kcsan: Instrument barrier of pv_queued_spin_unlock() If CONFIG_PARAVIRT_SPINLOCKS=y, queued_spin_unlock() is implemented using pv_queued_spin_unlock() which is entirely inline asm based. As such, we do not receive any KCSAN barrier instrumentation via regular atomic operations. Add the missing KCSAN barrier instrumentation for the CONFIG_PARAVIRT_SPINLOCKS case. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-12-09 16:42:28 -08:00
Marco Elver	cd8730c3ab	x86/barriers, kcsan: Use generic instrumentation for non-smp barriers Prefix all barriers with __, now that asm-generic/barriers.h supports defining the final instrumented version of these barriers. The change is limited to barriers used by x86-64. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-12-09 16:42:28 -08:00
Sebastian Andrzej Siewior	35fa745286	x86/mm: Include spinlock_t definition in pgtable. This header file provides forward declartion for pgd_lock but does not include the header defining its type. This works since the definition of spinlock_t is usually included somehow via printk. By trying to avoid recursive includes on PREEMPT_RT I avoided the loop in printk and as a consequnce kernel/intel.c failed to compile due to missing type definition. Include the needed definition for spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211102165224.wpz4zyhsvwccx5p3@linutronix.de	2021-12-09 10:58:48 -08:00
Colin Ian King	df0114f1f8	x86/resctrl: Remove redundant assignment to variable chunks The variable chunks is being shifted right and re-assinged the shifted value which is then returned. Since chunks is not being read afterwards the assignment is redundant and the >>= operator can be replaced with a shift >> operator instead. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Fenghua Yu <fenghua.yu@intel.com> Link: https://lkml.kernel.org/r/20211207223735.35173-1-colin.i.king@gmail.com	2021-12-09 09:57:16 -08:00
David Woodhouse	6f2cdbdba4	KVM: Add Makefile.kvm for common files, use it for x86 Splitting kvm_main.c out into smaller and better-organized files is slightly non-trivial when it involves editing a bunch of per-arch KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211121125451.9489-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-09 12:56:02 -05:00
David Woodhouse	dc70ec217c	KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING I'd like to make the build include dirty_ring.c based on whether the arch wants it or not. That's a whole lot simpler if there's a config symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET being set to something non-zero. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211121125451.9489-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-09 12:55:52 -05:00
Jarkko Sakkinen	50468e4313	x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node == Problem == The amount of SGX memory on a system is determined by the BIOS and it varies wildly between systems. It can be as small as dozens of MB's and as large as many GB's on servers. Just like how applications need to know how much regular RAM is available, enclave builders need to know how much SGX memory an enclave can consume. == Solution == Introduce a new sysfs file: /sys/devices/system/node/nodeX/x86/sgx_total_bytes to enumerate the amount of SGX memory available in each NUMA node. This serves the same function for SGX as /proc/meminfo or /sys/devices/system/node/nodeX/meminfo does for normal RAM. 'sgx_total_bytes' is needed today to help drive the SGX selftests. SGX-specific swap code is exercised by creating overcommitted enclaves which are larger than the physical SGX memory on the system. They currently use a CPUID-based approach which can diverge from the actual amount of SGX memory available. 'sgx_total_bytes' ensures that the selftests can work efficiently and do not attempt stupid things like creating a 100,000 MB enclave on a system with 128 MB of SGX memory. == Implementation Details == Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an arch specific attribute group, and add an attribute for the amount of SGX memory in bytes to each NUMA node: == ABI Design Discussion == As opposed to the per-node ABI, a single, global ABI was considered. However, this would prevent enclaves from being able to size themselves so that they fit on a single NUMA node. Essentially, a single value would rule out NUMA optimizations for enclaves. Create a new "x86/" directory inside each "nodeX/" sysfs directory. 'sgx_total_bytes' is expected to be the first of at least a few sgx-specific files to be placed in the new directory. Just scanning /proc/meminfo, these are the no-brainers that we have for RAM, but we need for SGX: MemTotal: xxxx kB // sgx_total_bytes (implemented here) MemFree: yyyy kB // sgx_free_bytes SwapTotal: zzzz kB // sgx_swapped_bytes So, at least three. I think we will eventually end up needing something more along the lines of a dozen. A new directory (as opposed to being in the nodeX/ "root") directory avoids cluttering the root with several "sgx_*" files. Place the new file in a new "nodeX/x86/" directory because SGX is highly x86-specific. It is very unlikely that any other architecture (or even non-Intel x86 vendor) will ever implement SGX. Using "sgx/" as opposed to "x86/" was also considered. But, there is a real chance this can get used for other arch-specific purposes. [ dhansen: rewrite changelog ] Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org	2021-12-09 07:02:22 -08:00
Sean Christopherson	45af1bb99b	KVM: VMX: Clean up PI pre/post-block WARNs Move the WARN sanity checks out of the PI descriptor update loop so as not to spam the kernel log if the condition is violated and the update takes multiple attempts due to another writer. This also eliminates a few extra uops from the retry path. Technically not checking every attempt could mean KVM will now fail to WARN in a scenario that would have failed before, but any such failure would be inherently racy as some other agent (CPU or device) would have to concurrent modify the PI descriptor. Add a helper to handle the actual write and more importantly to document why the write may need to be retried. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-09 09:30:47 -05:00
Sean Christopherson	83c98007d9	KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails Add a memory barrier between writing vcpu->requests and reading vcpu->guest_mode to ensure the read is ordered after the write when (potentially) delivering an IRQ to L2 via nested posted interrupt. If the request were to be completed after reading vcpu->mode, it would be possible for the target vCPU to enter the guest without posting the interrupt and without handling the event request. Note, the barrier is only for documentation since atomic operations are serializing on x86. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: `6b6977117f` ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2") Fixes: `705699a139` ("KVM: nVMX: Enable nested posted interrupt processing") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-09 09:30:46 -05:00
Maxim Levitsky	8e819d75cb	KVM: x86: add a tracepoint for APICv/AVIC interrupt delivery This allows to see how many interrupts were delivered via the APICv/AVIC from the host. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211209115440.394441-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-09 09:07:39 -05:00
Peter Zijlstra	e463a09af2	x86: Add straight-line-speculation mitigation Make use of an upcoming GCC feature to mitigate straight-line-speculation for x86: https://gcc.gnu.org/g:53a643f8568067d7700a9f2facc8ba39974973d3 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102952 https://bugs.llvm.org/show_bug.cgi?id=52323 It's built tested on x86_64-allyesconfig using GCC-12 and GCC-11. Maintenance overhead of this should be fairly low due to objtool validation. Size overhead of all these additional int3 instructions comes to: text data bss dec hex filename 22267751 6933356 2011368 31212475 1dc43bb defconfig-build/vmlinux 22804126 6933356 1470696 31208178 1dc32f2 defconfig-build/vmlinux.sls Or roughly 2.4% additional text. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211204134908.140103474@infradead.org	2021-12-09 13:32:25 +01:00
Thomas Gleixner	ae72f31567	PCI/MSI: Make arch_restore_msi_irqs() less horrible. Make arch_restore_msi_irqs() return a boolean which indicates whether the core code should restore the MSI message or not. Get rid of the indirection in x86. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # PCI Link: https://lore.kernel.org/r/20211206210224.485668098@linutronix.de	2021-12-09 11:52:21 +01:00
Thomas Gleixner	1982afd6c0	x86/hyperv: Refactor hv_msi_domain_free_irqs() No point in looking up things over and over. Just look up the associated irq data and work from there. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20211206210224.429625690@linutronix.de	2021-12-09 11:52:21 +01:00
Thomas Gleixner	e58f2259b9	genirq/msi, treewide: Use a named struct for PCI/MSI attributes The unnamed struct sucks and is in the way of further cleanups. Stick the PCI related MSI data into a real data structure and cleanup all users. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211206210224.374863119@linutronix.de	2021-12-09 11:52:21 +01:00
Peter Zijlstra	26c44b776d	x86/alternative: Relax text_poke_bp() constraint Currently, text_poke_bp() is very strict to only allow patching a single instruction; however with straight-line-speculation it will be required to patch: ret; int3, which is two instructions. As such, relax the constraints a little to allow int3 padding for all instructions that do not imply the execution of the next instruction, ie: RET, JMP.d8 and JMP.d32. While there, rename the text_poke_loc::rel32 field to ::disp. Note: this fills up the text_poke_loc structure which is now a round 16 bytes big. [ bp: Put comments ontop instead of on the side. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211204134908.082342723@infradead.org	2021-12-09 11:04:50 +01:00
Peter Zijlstra	cabdc3a847	sched,x86: Don't use cluster topology for x86 hybrid CPUs For x86 hybrid CPUs like Alder Lake, the order of CPU selection should be based strictly on CPU priority. Don't include cluster topology for hybrid CPUs to avoid interference with such CPU selection order. On Alder Lake, the Atom CPU cluster has more capacity (4 Atom CPUs) vs Big core cluster (2 hyperthread CPUs). This could potentially bias CPU selection towards Atom over Big Core, when Big core CPU has higher priority. Fixes: `66558b730f` ("sched: Add cluster scheduler level for x86") Suggested-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Link: https://lkml.kernel.org/r/20211204091402.GM16608@worktop.programming.kicks-ass.net	2021-12-08 22:15:37 +01:00
Anusha Srivatsa	52407c220c	drm/i915/rpl-s: Add PCI IDS for Raptor Lake S Raptor Lake S(RPL-S) is a version 12 Display, Media and Render. For all i915 purposes it is the same as Alder Lake S (ADL-S). Introduce RPL-S as a subplatform of ADL-S. This patch adds PCI ids for RPL-S. BSpec: 53655 Cc: x86@kernel.org Cc: dri-devel@lists.freedesktop.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Signed-off-by: Anusha Srivatsa <anusha.srivatsa@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # arch/x86 Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20211203063545.2254380-2-anusha.srivatsa@intel.com	2021-12-08 13:02:54 -08:00
Peter Zijlstra	b17c2baa30	x86: Prepare inline-asm for straight-line-speculation Replace all ret/retq instructions with ASM_RET in preparation of making it more than a single instruction. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211204134907.964635458@infradead.org	2021-12-08 19:23:12 +01:00
Kirill A. Shutemov	20f07a044a	x86/sev: Move common memory encryption code to mem_encrypt.c SEV and TDX both protect guest memory from host accesses. They both use guest physical address bits to communicate to the hardware which pages receive protection or not. SEV and TDX both assume that all I/O (real devices and virtio) must be performed to pages without protection. To add this support, AMD SEV code forces force_dma_unencrypted() to decrypt DMA pages when DMA pages were allocated for I/O. It also uses swiotlb_update_mem_attributes() to update decryption bits in SWIOTLB DMA buffers. Since TDX also uses a similar memory sharing design, all the above mentioned changes can be reused. So move force_dma_unencrypted(), SWIOTLB update code and virtio changes out of mem_encrypt_amd.c to mem_encrypt.c. Introduce a new config option X86_MEM_ENCRYPT that can be selected by platforms which use x86 memory encryption features (needed in both AMD SEV and Intel TDX guest platforms). Since the code is moved from mem_encrypt_amd.c, inherit the same make flags. This is preparation for enabling TDX memory encryption support and it has no functional changes. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20211206135505.75045-4-kirill.shutemov@linux.intel.com	2021-12-08 16:49:53 +01:00
Kuppuswamy Sathyanarayanan	dbca5e1a04	x86/sev: Rename mem_encrypt.c to mem_encrypt_amd.c Both Intel TDX and AMD SEV implement memory encryption features. But the bulk of the code in mem_encrypt.c is AMD-specific. Rename the file to mem_encrypt_amd.c. A subsequent patch will extract the parts that can be shared by both TDX and AMD SEV/SME into a generic file. No functional changes. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20211206135505.75045-3-kirill.shutemov@linux.intel.com	2021-12-08 16:49:47 +01:00
Kuppuswamy Sathyanarayanan	8260b9820f	x86/sev: Use CC_ATTR attribute to generalize string I/O unroll INS/OUTS are not supported in TDX guests and cause #UD. Kernel has to avoid them when running in TDX guest. To support existing usage, string I/O operations are unrolled using IN/OUT instructions. AMD SEV platform implements this support by adding unroll logic in ins#bwl()/outs#bwl() macros with SEV-specific checks. Since TDX VM guests will also need similar support, use CC_ATTR_GUEST_UNROLL_STRING_IO and generic cc_platform_has() API to implement it. String I/O helpers were the last users of sev_key_active() interface and sev_enable_key static key. Remove them. [ bp: Move comment too and do not delete it. ] Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20211206135505.75045-2-kirill.shutemov@linux.intel.com	2021-12-08 16:49:42 +01:00
David Woodhouse	74d9555580	PM: hibernate: Allow ACPI hardware signature to be honoured Theoretically, when the hardware signature in FACS changes, the OS is supposed to gracefully decline to attempt to resume from S4: "If the signature has changed, OSPM will not restore the system context and can boot from scratch" In practice, Windows doesn't do this and many laptop vendors do allow the signature to change especially when docking/undocking, so it would be a bad idea to simply comply with the specification by default in the general case. However, there are use cases where we do want the compliant behaviour and we know it's safe. Specifically, when resuming virtual machines where we know the hypervisor has changed sufficiently that resume will fail. We really want to be able to tell the guest kernel not to try, so it boots cleanly and doesn't just crash. This patch provides a way to opt in to the spec-compliant behaviour on the command line. A follow-up patch may do this automatically for certain "known good" machines based on a DMI match, or perhaps just for all hypervisor guests since there's no good reason a hypervisor would change the hardware_signature that it exposes to guests unless it wants them to obey the ACPI specification. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-12-08 16:06:10 +01:00
Vitaly Kuznetsov	502d2bf5f2	KVM: nVMX: Implement Enlightened MSR Bitmap feature Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to inform L0 when it changes MSR bitmap, this eliminates the need to examine L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets constructed. Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature. Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a nested hypervisor on top of Hyper-V. The newly introduced feature is going to be used by Hyper-V guests on KVM. When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU cycles from a nested vmexit cost (tight cpuid loop test). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 09:06:05 -05:00
Vitaly Kuznetsov	ed2a4800ae	KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt Introduce a flag to keep track of whether MSR bitmap for L2 needs to be rebuilt due to changes in MSR bitmap for L1 or switching to a different L2. This information will be used for Enlightened MSR Bitmap feature for Hyper-V guests. Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is not really needed for Enlightened MSR Bitmap as the feature can only be used in conjunction with Enlightened VMCS but let's keep tracking information complete, it's cheap and in the future similar PV feature can easily be implemented for KVM on KVM too. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 09:06:04 -05:00
Vitaly Kuznetsov	b84155c380	KVM: VMX: Introduce vmx_msr_bitmap_l01_changed() helper In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V guests move MSR bitmap update tracking to a dedicated helper. Note: vmx_msr_bitmap_l01_changed() is called when MSR bitmap might be updated. KVM doesn't check if the bit we're trying to set is already set (or the bit it's trying to clear is already cleared). Such situations should not be common and a few false positives should not be a problem. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211129094704.326635-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 09:06:04 -05:00
Peter Zijlstra	f94909ceb1	x86: Prepare asm files for straight-line-speculation Replace all ret/retq instructions with RET in preparation of making RET a macro. Since AS is case insensitive it's a big no-op without RET defined. find arch/x86/ -name \.S \| while read file do sed -i 's/\<ret[q]\>/RET/' $file done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org	2021-12-08 12:25:37 +01:00
Smita Koralahalli	1e56279a49	x86/mce/inject: Set the valid bit in MCA_STATUS before error injection MCA handlers check the valid bit in each status register (MCA_STATUS[Val]) and continue processing the error only if the valid bit is set. Set the valid bit unconditionally in the corresponding MCA_STATUS register and correct any Val=0 injections made by the user as such errors will get ignored and such injections will be largely pointless. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211104215846.254012-3-Smita.KoralahalliChannabasappa@amd.com	2021-12-08 12:01:01 +01:00
Smita Koralahalli	e48d008bd1	x86/mce/inject: Check if a bank is populated before injecting The MCA_IPID register uniquely identifies a bank's type on Scalable MCA (SMCA) systems. When an MCA bank is not populated, the MCA_IPID register will read as zero and writes to it will be ignored. On a hw-type error injection (injection which writes the actual MCA registers in an attempt to cause a real MCE) check the value of this register before trying to inject the error. Do not impose any limitations on a sw injection and allow the user to test out all the decoding paths without relying on the available hardware, as its purpose is to just test the code. [ bp: Heavily massage. ] Link: https://lkml.kernel.org/r/20211019233641.140275-2-Smita.KoralahalliChannabasappa@amd.com Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211104215846.254012-2-Smita.KoralahalliChannabasappa@amd.com	2021-12-08 12:00:56 +01:00
Peter Zijlstra	22da5a07c7	x86/lib/atomic64_386_32: Rename things Principally, in order to get rid of #define RET in this code to make place for a new RET, but also to clarify the code, rename a bunch of things: s/UNLOCK/IRQ_RESTORE/ s/LOCK/IRQ_SAVE/ s/BEGIN/BEGIN_IRQ_SAVE/ s/\<RET\>/RET_IRQ_RESTORE/ s/RET_ENDP/\tRET_IRQ_RESTORE\rENDP/ which then leaves RET unused so it can be removed. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211204134907.841623970@infradead.org	2021-12-08 11:57:08 +01:00
Peter Zijlstra	68cf4f2a72	x86: Use -mindirect-branch-cs-prefix for RETPOLINE builds In order to further enable commit: `bbe2df3f6b` ("x86/alternative: Try inline spectre_v2=retpoline,amd") add the new GCC flag -mindirect-branch-cs-prefix: https://gcc.gnu.org/g:2196a681d7810ad8b227bf983f38ba716620545e https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102952 https://bugs.llvm.org/show_bug.cgi?id=52323 to RETPOLINE=y builds. This should allow fully inlining retpoline,amd for GCC builds. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lkml.kernel.org/r/20211119165630.276205624@infradead.org	2021-12-08 11:57:04 +01:00
Peter Zijlstra	b2f825bfed	x86: Move RETPOLINE_CFLAGS to arch Makefile Currently, RETPOLINE_CFLAGS are defined in the top-level Makefile but only x86 makes use of them. Move them there. If ever another architecture finds the need, it can be reconsidered. [ bp: Massage a bit. ] Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lkml.kernel.org/r/20211119165630.219152765@infradead.org	2021-12-08 11:39:42 +01:00
Paolo Bonzini	93b350f884	Merge branch 'kvm-on-hv-msrbm-fix' into HEAD Merge bugfix for enlightened MSR Bitmap, before adding support to KVM for exposing the feature to nested guests.	2021-12-08 05:30:48 -05:00
Eric Dumazet	3411506550	x86/csum: Rewrite/optimize csum_partial() With more NICs supporting CHECKSUM_COMPLETE, and IPv6 being widely used csum_partial() is heavily used with small amount of bytes, and is consuming many cycles. IPv6 header size, for instance, is 40 bytes. Another thing to consider is that NET_IP_ALIGN is 0 on x86, meaning that network headers are not word-aligned, unless the driver forces this. This means that csum_partial() fetches one u16 to 'align the buffer', then performs three u64 additions with carry in a loop, then a remaining u32, then a remaining u16. With this new version, it performs a loop only for the 64 bytes blocks, then the remaining is bisected. Testing on various CPUs, all of them show a big reduction in csum_partial() cost (by 50 to 80 %) Before: 4.16% [kernel] [k] csum_partial After: 0.83% [kernel] [k] csum_partial If run in a loop 1,000,000 times: Before: 26,922,913 cycles # 3846130.429 GHz 80,302,961 instructions # 2.98 insn per cycle 21,059,816 branches # 3008545142.857 M/sec 2,896 branch-misses # 0.01% of all branches After: 17,960,709 cycles # 3592141.800 GHz 41,292,805 instructions # 2.30 insn per cycle 11,058,119 branches # 2211623800.000 M/sec 2,997 branch-misses # 0.03% of all branches [ bp: Massage, merge in subsequent fixes into a single patch: - um compilation error due to missing load_unaligned_zeropad(): - Reported-by: kernel test robot <lkp@intel.com> - Link: https://lkml.kernel.org/r/20211118175239.1525650-1-eric.dumazet@gmail.com - Fix initial seed for odd buffers - Reported-by: Noah Goldstein <goldstein.w.n@gmail.com> - Link: https://lkml.kernel.org/r/20211125141817.3541501-1-eric.dumazet@gmail.com ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20211112161950.528886-1-eric.dumazet@gmail.com	2021-12-08 11:26:09 +01:00
Hou Wenlong	adbfb12d4c	KVM: x86: Exit to userspace if emulation prepared a completion callback em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if MSR accesses required an exit to userspace. However, x86_emulate_insn() doesn't return X86EMUL_*, so x86_emulate_instruction() doesn't directly act on X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate between PIO, MMIO, etc. causing RDMSR/WRMSR emulation not to exit to userspace now. Nevertheless, if the userspace_msr_exit_test testcase in selftests is changed to test RDMSR/WRMSR with a forced emulation prefix, the test passes. What happens is that first userspace exit information is filled but the userspace exit does not happen. Because x86_emulate_instruction() returns 1, the guest retries the instruction---but this time RIP has already been adjusted past the forced emulation prefix, so the guest executes RDMSR/WRMSR and the userspace exit finally happens. Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io callback, x86_emulate_instruction() can just return 0 if the callback is not NULL. Then RDMSR/WRMSR instruction emulation will exit to userspace directly, without the RDMSR/WRMSR vmexit. Fixes: `1ae099540e` ("KVM: x86: Allow deflecting unknown MSR accesses to user space") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679.git.houwenlong93@linux.alibaba.com>	2021-12-08 05:24:11 -05:00
Vitaly Kuznetsov	250552b925	KVM: nVMX: Don't use Enlightened MSR Bitmap for L3 When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from clean fields to make Hyper-V aware of the change. For KVM's L1s, this is done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr(). MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't check if the resulting bitmap is different and never cleans HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and may result in Hyper-V missing the update. The issue could've been solved by calling evmcs_touch_msr_bitmap() for eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so would not give any performance benefits (compared to not using Enlightened MSR Bitmap at all). 3-level nesting is also not a very common setup nowadays. Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for now. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:53:13 -05:00
Hou Wenlong	d2f7d49826	KVM: x86: Use different callback if msr access comes from the emulator If msr access triggers an exit to userspace, the complete_userspace_io callback would skip instruction by vendor callback for kvm_skip_emulated_instruction(). However, when msr access comes from the emulator, e.g. if kvm.force_emulation_prefix is enabled and the guest uses rdmsr/wrmsr with kvm prefix, VM_EXIT_INSTRUCTION_LEN in vmcs is invalid and kvm_emulate_instruction() should be used to skip instruction instead. As Sean noted, unlike the previous case, there's no #UD if unrestricted guest is disabled and the guest accesses an MSR in Big RM. So the correct way to fix this is to attach a different callback when the msr access comes from the emulator. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679.git.houwenlong93@linux.alibaba.com>	2021-12-08 04:25:16 -05:00
Hou Wenlong	906fa90416	KVM: x86: Add an emulation type to handle completion of user exits The next patch would use kvm_emulate_instruction() with EMULTYPE_SKIP in complete_userspace_io callback to fix a problem in msr access emulation. However, EMULTYPE_SKIP only updates RIP, more things like updating interruptibility state and injecting single-step #DBs would be done in the callback. Since the emulator also does those things after x86_emulate_insn(), add a new emulation type to pair with EMULTYPE_SKIP to do those things for completion of user exits within the emulator. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>	2021-12-08 04:25:15 -05:00
Sean Christopherson	5e854864ee	KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code seg Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the decode phase does not truncate _eip. Wrapping the 32-bit boundary is legal if and only if CS is a flat code segment, but that check is implicitly handled in the form of limit checks in the decode phase. Opportunstically prepare for a future fix by storing the result of any truncation in "eip" instead of "_eip". Fixes: `1957aa63be` ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679.git.houwenlong93@linux.alibaba.com>	2021-12-08 04:25:15 -05:00
Li RongQing	51b1209c61	KVM: Clear pv eoi pending bit only when it is set merge pv_eoi_get_pending and pv_eoi_clr_pending into a single function pv_eoi_test_and_clear_pending, which returns and clear the value of the pending bit. This makes it possible to clear the pending bit only if the guest had set it, and otherwise skip the call to pv_eoi_put_user(). This can save up to 300 nsec on AMD EPYC processors. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1636026974-50555-2-git-send-email-lirongqing@baidu.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:14 -05:00
Li RongQing	ce5977b181	KVM: x86: don't print when fail to read/write pv eoi memory If guest gives MSR_KVM_PV_EOI_EN a wrong value, this printk() will be trigged, and kernel log is spammed with the useless message Fixes: `0d88800d54` ("kvm: x86: ioapic and apic debug macros cleanup") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: stable@kernel.org Message-Id: <1636026974-50555-1-git-send-email-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:14 -05:00
Lai Jiangshan	2df4a5eb6c	KVM: X86: Remove mmu parameter from load_pdptrs() It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs, and nested NPT treats them like all other non-leaf page table levels instead of caching them. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:14 -05:00
Lai Jiangshan	bb3b394d35	KVM: X86: Rename gpte_is_8_bytes to has_4_byte_gpte and invert the direction This bit is very close to mean "role.quadrant is not in use", except that it is false also when the MMU is mapping guest physical addresses directly. In that case, role.quadrant is indeed not in use, but there are no guest PTEs at all. Changing the name and direction of the bit removes the special case, since a guest with paging disabled, or not considering guest paging structures as is the case for two-dimensional paging, does not have to deal with 4-byte guest PTEs. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:13 -05:00
Lai Jiangshan	f8cd457f06	KVM: VMX: Use ept_caps_to_lpage_level() in hardware_setup() Using ept_caps_to_lpage_level is simpler. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:12 -05:00
Lai Jiangshan	cc022ae144	KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu() The level of supported large page on nEPT affects the rsvds_bits_mask. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:12 -05:00
Lai Jiangshan	84ea5c09a6	KVM: X86: Add huge_page_level to __reset_rsvds_bits_mask_ept() Bit 7 on pte depends on the level of supported large page. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:11 -05:00
Lai Jiangshan	c59a0f57fa	KVM: X86: Remove mmu->translate_gpa Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:11 -05:00
Lai Jiangshan	1f5a21ee84	KVM: X86: Add parameter struct kvm_mmu mmu into mmu->gva_to_gpa() The mmu->gva_to_gpa() has no "struct kvm_mmu mmu", so an extra FNAME(gva_to_gpa_nested) is needed. Add the parameter can simplify the code. And it makes it explicit that the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2 gpa in translate_nested_gpa() via the new parameter. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:10 -05:00
Lai Jiangshan	b46a13cb7e	KVM: X86: Calculate quadrant when !role.gpte_is_8_bytes role.quadrant is only valid when gpte size is 4 bytes and only be calculated when gpte size is 4 bytes. Although "vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL" also means gpte size is 4 bytes, but using "!role.gpte_is_8_bytes" is clearer Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:10 -05:00
Lai Jiangshan	41e35604ea	KVM: X86: Remove useless code to set role.gpte_is_8_bytes when role.direct role.gpte_is_8_bytes is unused when role.direct; there is no point in changing a bit in the role, the value that was set when the MMU is initialized is just fine. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-14-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:09 -05:00
Lai Jiangshan	42f34c20a1	KVM: X86: Remove unused declaration of __kvm_mmu_free_some_pages() The body of __kvm_mmu_free_some_pages() has been removed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-13-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:09 -05:00
Lai Jiangshan	84432316cd	KVM: X86: Fix comment in __kvm_mmu_create() The allocation of special roots is moved to mmu_alloc_special_roots(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:08 -05:00
Lai Jiangshan	27f4fca29f	KVM: X86: Skip allocating pae_root for vcpu->arch.guest_mmu when !tdp_enabled It is never used. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:08 -05:00
Lai Jiangshan	5835676710	KVM: SVM: Allocate sd->save_area with __GFP_ZERO And remove clear_page() on it. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:07 -05:00
Lai Jiangshan	1af4a1199a	KVM: SVM: Rename get_max_npt_level() to get_npt_level() It returns the only proper NPT level, so the "max" in the name is not appropriate. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:07 -05:00
Lai Jiangshan	fe26f91d30	KVM: VMX: Change comments about vmx_get_msr() The variable name is changed in the code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:06 -05:00
Lai Jiangshan	ed07ef5a66	KVM: VMX: Use kvm_set_msr_common() for MSR_IA32_TSC_ADJUST in the default way MSR_IA32_TSC_ADJUST can be left to the default way which also uese kvm_set_msr_common(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:06 -05:00
Lai Jiangshan	15ad9762d6	KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest() The host CR3 in the vcpu thread can only be changed when scheduling. Moving the code in vmx_prepare_switch_to_guest() makes the code simpler. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:06 -05:00
Lai Jiangshan	3ab4ac877c	KVM: VMX: Update msr value after kvm_set_user_return_msr() succeeds Aoid earlier modification. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:05 -05:00
Lai Jiangshan	6ab8a4053f	KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP) The value of host MSR_IA32_SYSENTER_ESP is known to be constant for each CPU: (cpu_entry_stack(cpu) + 1) when 32 bit syscall is enabled or NULL is 32 bit syscall is not enabled. So rdmsrl() can be avoided for the first case and both rdmsrl() and vmcs_writel() can be avoided for the second case. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:04 -05:00
Lai Jiangshan	24cd19a28c	KVM: X86: Update mmu->pdptrs only when it is changed It is unchanged in most cases. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211111144527.88852-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:04 -05:00
Lai Jiangshan	2e9ebd5509	KVM: X86: Remove kvm_register_clear_available() It has no user. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:03 -05:00
Paolo Bonzini	41e68b6964	KVM: vmx, svm: clean up mass updates to regs_avail/regs_dirty bits Document the meaning of the three combinations of regs_avail and regs_dirty. Update regs_dirty just after writeback instead of doing it later after vmexit. After vmexit, instead, we clear the regs_avail bits corresponding to lazily-loaded registers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:03 -05:00
Lai Jiangshan	c62c7bd4f9	KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3 can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:02 -05:00
Lai Jiangshan	3883bc9d28	KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed When vcpu->arch.cr3 is changed, it should be marked dirty unless it is being updated to the value of the architecture guest CR3 (i.e. VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled). This patch has no functionality changed because kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional change to vcpu->arch.regs_dirty, but no code uses regs_dirty for VCPU_EXREG_CR3. (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses regs_avail for VCPU_EXREG_CR3 dirty information.) Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:02 -05:00
Lai Jiangshan	aec9c2402f	KVM: SVM: Remove references to VCPU_EXREG_CR3 VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is merely a NOP, and testing it will likewise always succeed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:01 -05:00
Lai Jiangshan	8f29bf12a3	KVM: SVM: Remove outdated comment in svm_load_mmu_pgd() The comment had been added in the commit `689f3bf216` ("KVM: x86: unify callbacks to load paging root") and its related code was removed later, and it has nothing to do with the next line of code. So the comment should be removed too. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:01 -05:00
Lai Jiangshan	e63f315d74	KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS Not functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:00 -05:00
Lai Jiangshan	a37ebdce16	KVM: VMX: Add and use X86_CR4_PDPTR_BITS when !enable_ept In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be intercepted in an unclear way. Add X86_CR4_PDPTR_BITS to make it clear and self-documented. No functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:00 -05:00
Lai Jiangshan	5ec60aad54	KVM: VMX: Add and use X86_CR4_TLBFLUSH_BITS when !enable_ept In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when !enable_ept just because X86_CR4_PGE is the only bit that is responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS. It is clearer and self-documented to use X86_CR4_TLBFLUSH_BITS instead. No functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:59 -05:00
Lai Jiangshan	40e49c4f5f	KVM: SVM: Track dirtiness of PDPTRs even if NPT is disabled Use the same logic to handle the availability of VCPU_EXREG_PDPTR as VMX, also removing a branch in svm_vcpu_run(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:59 -05:00
Lai Jiangshan	c0d6956e43	KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs() mmu->pdptrs[] and vmcs.GUEST_PDPTR[0-3] are synced, so mmu->pdptrs is available and GUEST_PDPTR[0-3] is not dirty. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:59 -05:00
Lai Jiangshan	2c5653caec	KVM: X86: Ensure that dirty PDPTRs are loaded For VMX with EPT, dirty PDPTRs need to be loaded before the next vmentry via vmx_load_mmu_pgd() But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd() to be invoked. Normally, kvm_mmu_reset_context() is used to cause KVM_REQ_LOAD_MMU_PGD, but sometimes it is skipped: * commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR0.CD and CR0.NW. * commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after load_pdptrs() when rewriting the CR3 with the same value. * commit a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR4.PGE. Fixes: `d81135a57a` ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") Fixes: `21823fbda5` ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") Fixes: `a91a7c7096` ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:58 -05:00
Like Xu	b1d66dad65	KVM: x86/svm: Add module param to control PMU virtualization For Intel, the guest PMU can be disabled via clearing the PMU CPUID. For AMD, all hw implementations support the base set of four performance counters, with current mainstream hardware indicating the presence of two additional counters via X86_FEATURE_PERFCTR_CORE. In the virtualized world, the AMD guest driver may detect the presence of at least one counter MSR. Most hypervisor vendors would introduce a module param (like lbrv for svm) to disable PMU for all guests. Another control proposal per-VM is to pass PMU disable information via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00]. Both of methods require some guest-side changes, so a module parameter may not be sufficiently granular, but practical enough. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20211117080304.38989-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:58 -05:00
Sean Christopherson	baed82c8e4	KVM: VMX: Remove vCPU from PI wakeup list before updating PID.NV Remove the vCPU from the wakeup list before updating the notification vector in the posted interrupt post-block helper. There is no need to wake the current vCPU as it is by definition not blocking. Practically speaking this is a nop as it only shaves a few meager cycles in the unlikely case that the vCPU was migrated and the previous pCPU gets a wakeup IRQ right before PID.NV is updated. The real motivation is to allow for more readable code in the future, when post-block is merged with vmx_vcpu_pi_load(), at which point removal from the list will be conditional on the old notification vector. Opportunistically add comments to document why KVM has a per-CPU spinlock that, at first glance, appears to be taken only on the owning CPU. Explicitly call out that the spinlock must be taken with IRQs disabled, a detail that was "lost" when KVM switched from spin_lock_irqsave() to spin_lock(), with IRQs disabled for the entirety of the relevant path. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:57 -05:00
Sean Christopherson	724b3962ef	KVM: VMX: Move Posted Interrupt ndst computation out of write loop Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor out of the loop to write the descriptor, preemption is disabled so the CPU won't change, and if the APIC ID changes KVM has bigger problems. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:57 -05:00
Sean Christopherson	cfb0e1306a	KVM: VMX: Read Posted Interrupt "control" exactly once per loop iteration Use READ_ONCE() when loading the posted interrupt descriptor control field to ensure "old" and "new" have the same base value. If the compiler emits separate loads, and loads into "new" before "old", KVM could theoretically drop the ON bit if it were set between the loads. Fixes: `28b835d60f` ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:56 -05:00
Sean Christopherson	89ef0f21cf	KVM: VMX: Save/restore IRQs (instead of CLI/STI) during PI pre/post block Save/restore IRQs when disabling IRQs in posted interrupt pre/post block in preparation for moving the code into vcpu_put/load(), where it would be called with IRQs already disabled. No functional changed intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:56 -05:00
Sean Christopherson	29802380b6	KVM: VMX: Drop pointless PI.NDST update when blocking Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the pre-block path, as NDST is guaranteed to be up-to-date. The comment about the vCPU being preempted during the update is simply wrong, as the update path runs with IRQs disabled (from before snapshotting vcpu->cpu, until after the update completes). Since commit `8b306e2f3c` ("KVM: VMX: avoid double list add with VT-d posted interrupts", 2017-09-27) The vCPU can get preempted _before_ the update starts, but not during. And if the vCPU is preempted before, vmx_vcpu_pi_load() is responsible for updating NDST when the vCPU is scheduled back in. In that case, the check against the wakeup vector in vmx_vcpu_pi_load() cannot be true as that would require the notification vector to have been set to the wakeup vector _before_ blocking. Opportunistically switch to using vcpu->cpu for the list/lock lookups, which do not need pre_pcpu since the same commit. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:56 -05:00
Sean Christopherson	74ba5bc872	KVM: VMX: Use boolean returns for Posted Interrupt "test" helpers Return bools instead of ints for the posted interrupt "test" helpers. The bit position of the flag being test does not matter to the callers, and is in fact lost by virtue of test_bit() itself returning a bool. Returning ints is potentially dangerous, e.g. "pi_test_on(pi_desc) == 1" is safe-ish because ON is bit 0 and thus any sane implementation of pi_test_on() will work, but for SN (bit 1), checking "== 1" would rely on pi_test_on() to return 0 or 1, a.k.a. bools, as opposed to 0 or 2 (the positive bit position). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:55 -05:00
Sean Christopherson	c95717218a	KVM: VMX: Drop unnecessary PI logic to handle impossible conditions Drop sanity checks on the validity of the previous pCPU when handling vCPU block/unlock for posted interrupts. The intention behind the sanity checks is to avoid memory corruption in case of a race or incorrect locking, but the code has been stable for a few years now and the checks get in the way of eliminating kvm_vcpu.pre_cpu. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:55 -05:00
Sean Christopherson	057aa61bc9	KVM: VMX: Skip Posted Interrupt updates if APICv is hard disabled Explicitly skip posted interrupt updates if APICv is disabled in all of KVM, or if the guest doesn't have an in-kernel APIC. The PI descriptor is kept up-to-date if APICv is inhibited, e.g. so that re-enabling APICv doesn't require a bunch of updates, but neither the module param nor the APIC type can be changed on-the-fly. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:54 -05:00
Sean Christopherson	d92a5d1c6c	KVM: Add helpers to wake/query blocking vCPU Add helpers to wake and query a blocking vCPU. In addition to providing nice names, the helpers reduce the probability of KVM neglecting to use kvm_arch_vcpu_get_wait(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:54 -05:00
Sean Christopherson	cdafece4b9	KVM: x86: Invoke kvm_vcpu_block() directly for non-HALTED wait states Call kvm_vcpu_block() directly for all wait states except HALTED so that kvm_vcpu_halt() is no longer a misnomer on x86. Functionally, this means KVM will never attempt halt-polling or adjust vcpu->halt_poll_ns for INIT_RECEIVED (a.k.a. Wait-For-SIPI (WFS)) or AP_RESET_HOLD; UNINITIALIZED is handled in kvm_arch_vcpu_ioctl_run(), and x86 doesn't use any other "wait" states. As mentioned above, the motivation of this is purely so that "halt" isn't overloaded on x86, e.g. in KVM's stats. Skipping halt-polling for WFS (and RESET_HOLD) has no meaningful effect on guest performance as there are typically single-digit numbers of INIT-SIPI sequences per AP vCPU, per boot, versus thousands of HLTs just to boot to console. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:53 -05:00
Sean Christopherson	c91d449714	KVM: x86: Directly block (instead of "halting") UNINITIALIZED vCPUs Go directly to kvm_vcpu_block() when handling the case where userspace attempts to run an UNINITIALIZED vCPU. The vCPU is not halted, nor is it likely that halt-polling will be successful in this case. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:53 -05:00
Sean Christopherson	91b99ea706	KVM: Rename kvm_vcpu_block() => kvm_vcpu_halt() Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting the actual "block" sequences into a separate helper (to be named kvm_vcpu_block()). x86 will use the standalone block-only path to handle non-halt cases where the vCPU is not runnable. Rename block_ns to halt_ns to match the new function name. No functional change intended. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:51 -05:00
Sean Christopherson	005467e06b	KVM: Drop obsolete kvm_arch_vcpu_block_finish() Drop kvm_arch_vcpu_block_finish() now that all arch implementations are nops. No functional change intended. Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:50 -05:00
Sean Christopherson	1460179dcd	KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt() Rename a variety of HLT-related helpers to free up the function name "kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate between "block" and "halt". No functional change intended. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:50 -05:00
Sean Christopherson	91b0189507	KVM: SVM: Ensure target pCPU is read once when signalling AVIC doorbell Ensure vcpu->cpu is read once when signalling the AVIC doorbell. If the compiler rereads the field and the vCPU is migrated between the check and writing the doorbell, KVM would signal the wrong physical CPU. Functionally, signalling the wrong CPU in this case is not an issue as task migration means the vCPU has exited and will pick up any pending interrupts on the next VMRUN. Add the READ_ONCE() purely to clean up the code. Opportunistically add a comment explaining the task migration behavior, and rename cpuid=>cpu to avoid conflating the CPU number with KVM's more common usage of CPUID. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:45 -05:00
Paolo Bonzini	1831fa44df	KVM: VMX: Don't unblock vCPU w/ Posted IRQ if IRQs are disabled in guest Don't configure the wakeup handler when a vCPU is blocking with IRQs disabled, in which case any IRQ, posted or otherwise, should not be recognized and thus should not wake the vCPU. Fixes: `bf9f6ac8d7` ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:45 -05:00
Vihas Mak	98a26b69d8	KVM: x86: change TLB flush indicator to bool change 0 to false and 1 to true to fix following cocci warnings: arch/x86/kvm/mmu/mmu.c:1485:9-10: WARNING: return of 0/1 in function 'kvm_set_pte_rmapp' with return type bool arch/x86/kvm/mmu/mmu.c:1636:10-11: WARNING: return of 0/1 in function 'kvm_test_age_rmapp' with return type bool Signed-off-by: Vihas Mak <makvihas@gmail.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Message-Id: <20211114164312.GA28736@makvihas> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:44 -05:00
Ben Gardon	fb43496c83	KVM: x86/MMU: Simplify flow of vmx_get_mt_mask Remove the gotos from vmx_get_mt_mask. It's easier to build the whole memory type at once, than it is to combine separate cacheability and ipat fields. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-12-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:43 -05:00
Ben Gardon	8283e36abf	KVM: x86/mmu: Propagate memslot const qualifier In preparation for implementing in-place hugepage promotion, various functions will need to be called from zap_collapsible_spte_range, which has the const qualifier on its memslot argument. Propagate the const qualifier to the various functions which will be needed. This just serves to simplify the following patch. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-11-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:43 -05:00
Ben Gardon	4d78d0b39a	KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages The vCPU argument to mmu_try_to_unsync_pages is now only used to get a pointer to the associated struct kvm, so pass in the kvm pointer from the beginning to remove the need for a vCPU when calling the function. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:42 -05:00
Ben Gardon	9d395a0a7a	KVM: x86/mmu: Remove need for a vcpu from kvm_slot_page_track_is_active kvm_slot_page_track_is_active only uses its vCPU argument to get a pointer to the assoicated struct kvm, so just pass in the struct KVM to remove the need for a vCPU pointer. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:42 -05:00
Sean Christopherson	ce92ef7604	KVM: x86/mmu: Use shadow page role to detect PML-unfriendly pages for L2 Rework make_spte() to query the shadow page's role, specifically whether or not it's a guest_mode page, a.k.a. a page for L2, when determining if the SPTE is compatible with PML. This eliminates a dependency on @vcpu, with a future goal of being able to create SPTEs without a specific vCPU. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:41 -05:00
Emanuele Giuseppe Esposito	8fc78909c0	KVM: nSVM: introduce struct vmcb_ctrl_area_cached This structure will replace vmcb_control_area in svm_nested_state, providing only the fields that are actually used by the nested state. This avoids having and copying around uninitialized fields. The cost of this, however, is that all functions (in this case vmcb_is_intercept) expect the old structure, so they need to be duplicated. In addition, in svm_get_nested_state() user space expects a vmcb_control_area struct, so we need to copy back all fields in a temporary structure before copying it to userspace. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211103140527.752797-7-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:40 -05:00
Paolo Bonzini	bd95926c2b	KVM: nSVM: split out __nested_vmcb_check_controls Remove the struct vmcb_control_area parameter from nested_vmcb_check_controls, for consistency with the functions that operate on the save area. This way, VMRUN uses the version without underscores for both areas, while KVM_SET_NESTED_STATE uses the version with underscores. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:40 -05:00
Emanuele Giuseppe Esposito	355d0473b1	KVM: nSVM: use svm->nested.save to load vmcb12 registers and avoid TOC/TOU races Use the already checked svm->nested.save cached fields (EFER, CR0, CR4, ...) instead of vmcb12's in nested_vmcb02_prepare_save(). This prevents from creating TOC/TOU races, since the guest could modify the vmcb12 fields. This also avoids the need of force-setting EFER_SVME in nested_vmcb02_prepare_save. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211103140527.752797-6-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:40 -05:00
Emanuele Giuseppe Esposito	b7a3d8b6f4	KVM: nSVM: use vmcb_save_area_cached in nested_vmcb_valid_sregs() Now that struct vmcb_save_area_cached contains the required vmcb fields values (done in nested_load_save_from_vmcb12()), check them to see if they are correct in nested_vmcb_valid_sregs(). While at it, rename nested_vmcb_valid_sregs in nested_vmcb_check_save. __nested_vmcb_check_save takes the additional @save parameter, so it is helpful when we want to check a non-svm save state, like in svm_set_nested_state. The reason for that is that save is the L1 state, not L2, so we check it without moving it to svm->nested.save. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-5-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:39 -05:00
Emanuele Giuseppe Esposito	7907160dbf	KVM: nSVM: rename nested_load_control_from_vmcb12 in nested_copy_vmcb_control_to_cache Following the same naming convention of the previous patch, rename nested_load_control_from_vmcb12. In addition, inline copy_vmcb_control_area as it is only called by this function. __nested_copy_vmcb_control_to_cache() works with vmcb_control_area parameters and it will be useful in next patches, when we use local variables instead of svm cached state. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-4-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:39 -05:00
Emanuele Giuseppe Esposito	f2740a8d85	KVM: nSVM: introduce svm->nested.save to cache save area before checks This is useful in the next patch, to keep a saved copy of vmcb12 registers and pass it around more easily. Instead of blindly copying everything, we just copy EFER, CR0, CR3, CR4, DR6 and DR7 which are needed by the VMRUN checks. If more fields will need to be checked, it will be quite obvious to see that they must be added in struct vmcb_save_area_cached and in nested_copy_vmcb_save_to_cache(). __nested_copy_vmcb_save_to_cache() takes a vmcb_save_area_cached parameter, which is useful in order to save the state to a local variable. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-3-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:38 -05:00
Emanuele Giuseppe Esposito	907afa48e9	KVM: nSVM: move nested_vmcb_check_cr3_cr4 logic in nested_vmcb_valid_sregs Inline nested_vmcb_check_cr3_cr4 as it is not called by anyone else. Doing so simplifies next patches. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211103140527.752797-2-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:38 -05:00
Maciej S. Szmigiero	f4209439b5	KVM: Optimize gfn lookup in kvm_zap_gfn_range() Introduce a memslots gfn upper bound operation and use it to optimize kvm_zap_gfn_range(). This way this handler can do a quick lookup for intersecting gfns and won't have to do a linear scan of the whole memslot set. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <ef242146a87a335ee93b441dcf01665cb847c902.1638817641.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:35 -05:00
Maciej S. Szmigiero	a54d806688	KVM: Keep memslots in tree-based structures instead of array-based ones The current memslot code uses a (reverse gfn-ordered) memslot array for keeping track of them. Because the memslot array that is currently in use cannot be modified every memslot management operation (create, delete, move, change flags) has to make a copy of the whole array so it has a scratch copy to work on. Strictly speaking, however, it is only necessary to make copy of the memslot that is being modified, copying all the memslots currently present is just a limitation of the array-based memslot implementation. Two memslot sets, however, are still needed so the VM continues to run on the currently active set while the requested operation is being performed on the second, currently inactive one. In order to have two memslot sets, but only one copy of actual memslots it is necessary to split out the memslot data from the memslot sets. The memslots themselves should be also kept independent of each other so they can be individually added or deleted. These two memslot sets should normally point to the same set of memslots. They can, however, be desynchronized when performing a memslot management operation by replacing the memslot to be modified by its copy. After the operation is complete, both memslot sets once again point to the same, common set of memslot data. This commit implements the aforementioned idea. For tracking of gfns an ordinary rbtree is used since memslots cannot overlap in the guest address space and so this data structure is sufficient for ensuring that lookups are done quickly. The "last used slot" mini-caches (both per-slot set one and per-vCPU one), that keep track of the last found-by-gfn memslot, are still present in the new code. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:34 -05:00
Maciej S. Szmigiero	ed922739c9	KVM: Use interval tree to do fast hva lookup in memslots The current memslots implementation only allows quick binary search by gfn, quick lookup by hva is not possible - the implementation has to do a linear scan of the whole memslots array, even though the operation being performed might apply just to a single memslot. This significantly hurts performance of per-hva operations with higher memslot counts. Since hva ranges can overlap between memslots an interval tree is needed for tracking them. [sean: handle interval tree updates in kvm_replace_memslot()] Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <d66b9974becaa9839be9c4e1a5de97b177b4ac20.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:32 -05:00
Maciej S. Szmigiero	f5756029ee	KVM: x86: Use nr_memslot_pages to avoid traversing the memslots array There is no point in recalculating from scratch the total number of pages in all memslots each time a memslot is created or deleted. Use KVM's cached nr_memslot_pages to compute the default max number of MMU pages. Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can possibly overflow an unsigned long variable. Write this "* 20 / 1000" operation as "/ 50" instead to avoid such overflow. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: use common KVM field and rework changelog accordingly] Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:29 -05:00
Maciej S. Szmigiero	e0c2b6338a	KVM: x86: Don't call kvm_mmu_change_mmu_pages() if the count hasn't changed There is no point in calling kvm_mmu_change_mmu_pages() for memslot operations that don't change the total page count, so do it just for KVM_MR_CREATE and KVM_MR_DELETE. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <9e56b7616a11f5654e4ab486b3237366b7ba9f2a.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:28 -05:00
Sean Christopherson	77aedf26fe	KVM: x86: Don't assume old/new memslots are non-NULL at memslot commit Play nice with a NULL @old or @new when handling memslot updates so that common KVM can pass NULL for one or the other in CREATE and DELETE cases instead of having to synthesize a dummy memslot. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <2eb7788adbdc2bc9a9c5f86844dd8ee5c8428732.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:26 -05:00
Sean Christopherson	6a99c6e3f5	KVM: Stop passing kvm_userspace_memory_region to arch memslot hooks Drop the @mem param from kvm_arch_{prepare,commit}_memory_region() now that its use has been removed in all architectures. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <aa5ed3e62c27e881d0d8bc0acbc1572bc336dc19.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:25 -05:00
Sean Christopherson	9d7d18ee3f	KVM: x86: Use "new" memslot instead of userspace memory region Get the number of pages directly from the new memslot instead of computing the same from the userspace memory region when allocating memslot metadata. This will allow a future patch to drop @mem. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <ef44892eb615f5c28e682bbe06af96aff9ce2a9f.1638817639.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:23 -05:00
Sean Christopherson	537a17b314	KVM: Let/force architectures to deal with arch specific memslot data Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch code to handle propagating arch specific data from "new" to "old" when necessary. This is a baby step towards dynamically allocating "new" from the get go, and is a (very) minor performance boost on x86 due to not unnecessarily copying arch data. For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE and FLAGS_ONLY. This is functionally a nop as the previous behavior would overwrite the pointer for CREATE, and eventually discard/ignore it for DELETE. For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV, x86 needs to reallocate arch data in the MOVE case as the size of x86's allocations depend on the alignment of the memslot's gfn. Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to match the "commit" prototype. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [mss: add missing RISCV kvm_arch_prepare_memory_region() change] Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:20 -05:00
Marc Zyngier	46808a4cb8	KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu index. Unfortunately, we're about to move rework the iterator, which requires this to be upgrade to an unsigned long. Let's bite the bullet and repaint all of it in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-7-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:15 -05:00
Marc Zyngier	27592ae8db	KVM: Move wiping of the kvm->vcpus array to common code All architectures have similar loops iterating over the vcpus, freeing one vcpu at a time, and eventually wiping the reference off the vcpus array. They are also inconsistently taking the kvm->lock mutex when wiping the references from the array. Make this code common, which will simplify further changes. The locking is dropped altogether, as this should only be called when there is no further references on the kvm structure. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-2-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:13 -05:00
Paolo Bonzini	dc1ce45575	KVM: MMU: update comment on the number of page role combinations Fix the number of bits in the role, and simplify the explanation of why several bits or combinations of bits are redundant. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:13 -05:00
Linus Torvalds	55a677b256	EFI fix for v5.16 Ensure that the EFI memory map resides in encrypted memory even after it has been reallocated. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmGt9nUACgkQw08iOZLZ jyT6ZgwAt1jkqt/ILev5MPE2rEQ8BrYhR7pMBcNW7su4BECwRfmQBGjYFeeXjlqp 7tK2L18DsHJxgt699gumDD7IyjbLNUx9zqsd7nJzlvnpwjGSiUhAGLPwoAeRSEgD sxK/HIt6J5ZX/zMsEbtptP9M1eitbZQhtt+0aCLcJvt1NCZc/+FvUaEEPI+6BrMt TMWdXd6MOZ8FV2iq+FUpTeG9SGzze6QaIiL7H5Z/otvXbBG1iWmlWR0bR17CXGUC thIuqCjEKSM7P45kFF+byCq5ajo55ULy3ZrPVQUYLt/FdMdI+pLxK8TTg46xq8KS mXqexjtCPLiJP728lWTETIPB7x9aQW/i6cwesh0O6cmhPznJybW/+uR9zBsyiSGs 8+9ua+JT9+B3bEn1rsbWYKEcy3G4KrxwtJ5aqRNVJM96pel/yxA4EYkmfyf4TlrS pxbY2Agzlci18qCHb5lycc1yM6WHdirzY1l0m8uwqk7o8GHtRMMMU/7e/kvmtxUA j9V3vqm6 =mDRi -----END PGP SIGNATURE----- Merge tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: "Ensure that the EFI memory map resides in encrypted memory even after it has been reallocated" * tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/sme: Explicitly map new EFI memmap table as encrypted	2021-12-06 10:09:00 -08:00
Joerg Roedel	71d5049b05	x86/mm: Flush global TLB when switching to trampoline page-table Move the switching code into a function so that it can be re-used and add a global TLB flush. This makes sure that usage of memory which is not mapped in the trampoline page-table is reliably caught. Also move the clearing of CR4.PCIDE before the CR3 switch because the cr4_clear_bits() function will access data not mapped into the trampoline page-table. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211202153226.22946-4-joro@8bytes.org	2021-12-06 09:54:10 +01:00
Joerg Roedel	f154f29085	x86/mm/64: Flush global TLB on boot and AP bringup The AP bringup code uses the trampoline_pgd page-table which establishes global mappings in the user range of the address space. Flush the global TLB entries after the indentity mappings are removed so no stale entries remain in the TLB. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211202153226.22946-3-joro@8bytes.org	2021-12-06 09:38:48 +01:00
Linus Torvalds	f5d54a42d3	- Fix a couple of SWAPGS fencing issues in the x86 entry code - Use the proper operand types in __{get,put}_user() to prevent truncation in SEV-ES string io - Make sure the kernel mappings are present in trampoline_pgd in order to prevent any potential accesses to unmapped memory after switching to it - Fix a trivial list corruption in objtool's pv_ops validation - Disable the clocksource watchdog for TSC on platforms which claim that the TSC is constant, doesn't stop in sleep states, CPU has TSC adjust and the number of sockets of the platform are max 2, to prevent erroneous markings of the TSC as unstable. - Make sure TSC adjust is always checked not only when going idle - Prevent a stack leak by initializing struct _fpx_sw_bytes properly in the FPU code - Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGsnWcACgkQEsHwGGHe VUoR+g/9FcOP0/XLH+LKHumYc9JHXsp5BvYGihyypMFgU0fXQBORtGqdls8jZtiJ kEdbW6iL0MRlyN8aHJCqr7dJqs7KJlpWes6hky7BY+U+7uewtjL5y3eSyZnA34T3 M/Raecx27Hh0L0kHQlHXTUN73v1cgDvq3dCXWsP7Jqgjf5cEmCcV/tPEateqhq/f 8TkLVIm55rJlbJ0LBO/cT0V3Q8QH9JPKm7nviOZuKCh9gcttFEPaM9MkaJyKUhoy O13jlenDoVkVWRXIQec1EZp2pTLxVAm/3Y0plge1yEVsejzh07gsQnMpoNeF+yFC 8mDgSv8ZAED/vbsnB+BcgoRVj6ajG0+ilpLzcfPwUquiqS9pZrBSTddlvYDPjRMC MEXO548xiYgxmipu3r62H89nqmLEYQPk914rJu6bDnDeJ1gaabh8RXbNtQcRqqj3 RETgVOp78iWn+aT33RLLD1EyodZb2IkMy087a3+TZICIXG81aDj9VgHvrVRnWnfY yKuldyrEKzi60yMQkV6h1oc8KSWQhspUSLtOVS9zrulCinYphFOfYFrzFmcKUWIq GdVb9eaP2oNBGfPybXP+TBLGZ4Zv9iXZmaEUk7ZGCjgv3ZmGMWJ18Hs/ufs2cwWK RNNUo3sz/y3OsreHowkWIk1eSxI16MabB7G/PDMnBSHlioVT390= =d6nS -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Fix a couple of SWAPGS fencing issues in the x86 entry code - Use the proper operand types in __{get,put}_user() to prevent truncation in SEV-ES string io - Make sure the kernel mappings are present in trampoline_pgd in order to prevent any potential accesses to unmapped memory after switching to it - Fix a trivial list corruption in objtool's pv_ops validation - Disable the clocksource watchdog for TSC on platforms which claim that the TSC is constant, doesn't stop in sleep states, CPU has TSC adjust and the number of sockets of the platform are max 2, to prevent erroneous markings of the TSC as unstable. - Make sure TSC adjust is always checked not only when going idle - Prevent a stack leak by initializing struct _fpx_sw_bytes properly in the FPU code - Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention * tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/xen: Add xenpv_restore_regs_and_return_to_usermode() x86/entry: Use the correct fence macro after swapgs in kernel CR3 x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry() x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword x86/64/mm: Map all kernel memory into trampoline_pgd objtool: Fix pv_ops noinstr validation x86/tsc: Disable clocksource watchdog for TSC on qualified platorms x86/tsc: Add a timer to make sure TSC_adjust is always checked x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog() x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define	2021-12-05 08:43:35 -08:00
Linus Torvalds	90bf8d98b4	* Static analysis fix * New SEV-ES protocol for communicating invalid VMGEXIT requests * Ensure APICv is considered inactive if there is no APIC * Fix reserved bits for AMD PerfEvtSeln register -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGscuAUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOnzAgAg/0tifAOC8F31xu+gbbFa2hJsfH1 sr2QXqVI6XUWeAHw5GEEcSFHGcFc0BrmfPEL4T3tEhkTGcijL2uTK8dwgq3ue4yg 5LzaZzh03AWi4x84rV3XNVHHatzF69tgbUG49rSlK2T6BkGWzh4gI5LZV7XqNqLh acW+92YcCGo/O9RAUYYakofX4bp0rsQaZornQiD/R5X6AlrtMUyhAYHH5Wnv69n+ MHf4K9MzrtixXbTvkOXflN5yz6TkIGvCpCK+gmppeuJ3JyZshjn+y95XcDDZtB6o +4Ypap3SmIkjTg4VwS8lRwIFkkY31hxhW2ohRnorfP2Q5Ckcz2/kVcIEew== =uJMy -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull more kvm fixes from Paolo Bonzini: - Static analysis fix - New SEV-ES protocol for communicating invalid VMGEXIT requests - Ensure APICv is considered inactive if there is no APIC - Fix reserved bits for AMD PerfEvtSeln register * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails KVM: x86/mmu: Retry page fault if root is invalidated by memslot update KVM: VMX: Set failure code in prepare_vmcs02() KVM: ensure APICv is considered inactive if there is no APIC KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register	2021-12-05 08:25:33 -08:00
Tom Lendacky	1ff2fc0286	x86/sme: Explicitly map new EFI memmap table as encrypted Reserving memory using efi_mem_reserve() calls into the x86 efi_arch_mem_reserve() function. This function will insert a new EFI memory descriptor into the EFI memory map representing the area of memory to be reserved and marking it as EFI runtime memory. As part of adding this new entry, a new EFI memory map is allocated and mapped. The mapping is where a problem can occur. This new memory map is mapped using early_memremap() and generally mapped encrypted, unless the new memory for the mapping happens to come from an area of memory that is marked as EFI_BOOT_SERVICES_DATA memory. In this case, the new memory will be mapped unencrypted. However, during replacement of the old memory map, efi_mem_type() is disabled, so the new memory map will now be long-term mapped encrypted (in efi.memmap), resulting in the map containing invalid data and causing the kernel boot to crash. Since it is known that the area will be mapped encrypted going forward, explicitly map the new memory map as encrypted using early_memremap_prot(). Cc: <stable@vger.kernel.org> # 4.14.x Fixes: `8f716c9b5f` ("x86/mm: Add support to access boot related data in the clear") Link: https://lore.kernel.org/all/ebf1eb2940405438a09d51d121ec0d02c8755558.1634752931.git.thomas.lendacky@amd.com/ Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> [ardb: incorporate Kconfig fix by Arnd] Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2021-12-05 16:44:52 +01:00
Tom Lendacky	ad5b353240	KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT exit code or exit parameters fails. The VMGEXIT instruction can be issued from userspace, even though userspace (likely) can't update the GHCB. To prevent userspace from being able to kill the guest, return an error through the GHCB when validation fails rather than terminating the guest. For cases where the GHCB can't be updated (e.g. the GHCB can't be mapped, etc.), just return back to the guest. The new error codes are documented in the lasest update to the GHCB specification. Fixes: `291bd20d5d` ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-05 03:02:04 -05:00
Sean Christopherson	a655276a59	KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so that KVM falls back to __vmalloc() if physically contiguous memory isn't available. The buffer is purely a KVM software construct, i.e. there's no need for it to be physically contiguous. Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109222350.2266045-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-05 03:02:03 -05:00
Sean Christopherson	75236f5f22	KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails Return appropriate error codes if setting up the GHCB scratch area for an SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM when allocating the kernel buffer could be confusing as userspace would likely suspect a guest issue. Fixes: `8f423a80d2` ("KVM: SVM: Support MMIO for an SEV-ES guest") Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109222350.2266045-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-05 03:02:03 -05:00
Joerg Roedel	9de4999050	x86/realmode: Add comment for Global bit usage in trampoline_pgd Document the fact that using the trampoline_pgd will result in the creation of global TLB entries in the user range of the address space. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211202153226.22946-2-joro@8bytes.org	2021-12-04 13:50:08 +01:00
Lai Jiangshan	5c8f6a2e31	x86/xen: Add xenpv_restore_regs_and_return_to_usermode() In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the trampoline stack. But XEN pv doesn't use trampoline stack, so PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is also the kernel stack. In that case, source and destination stacks are identical, which means that reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv would cause %rsp to move up to the top of the kernel stack and leave the IRET frame below %rsp. This is dangerous as it can be corrupted if #NMI / #MC hit as either of these events occurring in the middle of the stack pushing would clobber data on the (original) stack. And, with XEN pv, swapgs_restore_regs_and_return_to_usermode() pushing the IRET frame on to the original address is useless and error-prone when there is any future attempt to modify the code. [ bp: Massage commit message. ] Fixes: `7f2590a110` ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lkml.kernel.org/r/20211126101209.8613-4-jiangshanlai@gmail.com	2021-12-03 19:21:15 +01:00
Lai Jiangshan	1367afaa2e	x86/entry: Use the correct fence macro after swapgs in kernel CR3 The commit `c758907004` ("x86/entry/64: Remove unneeded kernel CR3 switching") removed a CR3 write in the faulting path of load_gs_index(). But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI is enabled, see spectre_v1_select_mitigation(). Rather, it depended on the serializing CR3 write of SWITCH_TO_KERNEL_CR3 and since it got removed, add a FENCE_SWAPGS_KERNEL_ENTRY call to make sure speculation is blocked. [ bp: Massage commit message and comment. ] Fixes: `c758907004` ("x86/entry/64: Remove unneeded kernel CR3 switching") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211126101209.8613-3-jiangshanlai@gmail.com	2021-12-03 19:13:53 +01:00
Lai Jiangshan	c07e45553d	x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry() Commit `18ec54fdd6` ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations") added FENCE_SWAPGS_{KERNEL\|USER}_ENTRY for conditional SWAPGS. In paranoid_entry(), it uses only FENCE_SWAPGS_KERNEL_ENTRY for both branches. This is because the fence is required for both cases since the CR3 write is conditional even when PTI is enabled. But `96b2371413` ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry") changed the order of SWAPGS and the CR3 write. And it missed the needed FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case. Add it back by changing the branches so that FENCE_SWAPGS_KERNEL_ENTRY can cover both branches. [ bp: Massage, fix typos, remove obsolete comment while at it. ] Fixes: `96b2371413` ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211126101209.8613-2-jiangshanlai@gmail.com	2021-12-03 18:55:47 +01:00
Ingo Molnar	e1cd82a339	x86/mm: Add missing <asm/cpufeatures.h> dependency to <asm/page_64.h> In the following commit: `025768a966` x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant ... we added the new task_size_max() inline, which uses X86_FEATURE_LA57, but doesn't include <asm/cpufeatures.h> which defines the constant. Due to the way alternatives macros work currently this doesn't get reported as an immediate build error, only as a link error, if a .c file happens to include <asm/page.h> first: > ld: kernel/fork.o:(.altinstructions+0x98): undefined reference to `X86_FEATURE_LA57' In the current upstream kernel no .c file includes <asm/page.h> before including some other header that includes <asm/cpufeatures.h>, which is why this dependency bug went unnoticed. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-12-03 09:30:45 -08:00
Geert Uytterhoeven	9e4d52a00a	x86/ce4100: Replace "ti,pcf8575" by "nxp,pcf8575" The TI part is equivalent to the NXP part, and its compatible value is not documented in the DT bindings. Note that while the Linux driver DT match table does not contain the compatible value of the TI part, it could still match to this part, as i2c_device_id-based matching ignores the vendor part of the compatible value. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lkml.kernel.org/r/0c00cec971f5c405e47d04e493d854de0efc2e49.1638539629.git.geert+renesas@glider.be	2021-12-03 18:23:57 +01:00
Michael Sterritt	1d5379d047	x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword Properly type the operands being passed to __put_user()/__get_user(). Otherwise, these routines truncate data for dependent instructions (e.g., INSW) and only read/write one byte. This has been tested by sending a string with REP OUTSW to a port and then reading it back in with REP INSW on the same port. Previous behavior was to only send and receive the first char of the size. For example, word operations for "abcd" would only read/write "ac". With change, the full string is now written and read back. Fixes: `f980f9c31a` (x86/sev-es: Compile early handler code into kernel image) Signed-off-by: Michael Sterritt <sterritt@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com	2021-12-03 18:09:30 +01:00
Joerg Roedel	51523ed1c2	x86/64/mm: Map all kernel memory into trampoline_pgd The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff range of kernel memory (with 4-level paging). This range contains the kernel's text+data+bss mappings and the module mapping space but not the direct mapping and the vmalloc area. This is enough to get the application processors out of real-mode, but for code that switches back to real-mode the trampoline_pgd is missing important parts of the address space. For example, consider this code from arch/x86/kernel/reboot.c, function machine_real_restart() for a 64-bit kernel: #ifdef CONFIG_X86_32 load_cr3(initial_page_table); #else write_cr3(real_mode_header->trampoline_pgd); /* Exiting long mode will fail if CR4.PCIDE is set. / if (boot_cpu_has(X86_FEATURE_PCID)) cr4_clear_bits(X86_CR4_PCIDE); #endif / Jump to the identity-mapped low memory code / #ifdef CONFIG_X86_32 asm volatile("jmpl %0" : : "rm" (real_mode_header->machine_real_restart_asm), "a" (type)); #else asm volatile("ljmpl *%0" : : "m" (real_mode_header->machine_real_restart_asm), "D" (type)); #endif The code switches to the trampoline_pgd, which unmaps the direct mapping and also the kernel stack. The call to cr4_clear_bits() will find no stack and crash the machine. The real_mode_header pointer below points into the direct mapping, and dereferencing it also causes a crash. The reason this does not crash always is only that kernel mappings are global and the CR3 switch does not flush those mappings. But if theses mappings are not in the TLB already, the above code will crash before it can jump to the real-mode stub. Extend the trampoline_pgd to contain all kernel mappings to prevent these crashes and to make code which runs on this page-table more robust. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org	2021-12-03 09:11:43 +01:00
Sean Christopherson	a955cad84c	KVM: x86/mmu: Retry page fault if root is invalidated by memslot update Bail from the page fault handler if the root shadow page was obsoleted by a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU doesn't rely on the memslot/MMU generation, and instead relies on the root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes mmu_lock for write. For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has moved past the gfn associated with the SP. For other MMUs, the resulting behavior is far more convoluted, though unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete root isn't directly problematic, as the obsolete root will be unloaded and dropped before the vCPU re-enters the guest. But because the legacy MMU tracks shadow pages by their role, any SP created by the fault can can be reused in the new post-reload root. Again, that _shouldn't_ be problematic as any leaf child SPTEs will be created for the current/valid memslot generation, and kvm_mmu_get_page() will not reuse child SPs from the old generation as they will be flagged as obsolete. But, given that continuing with the fault is pointess (the root will be unloaded), apply the check to all MMUs. Fixes: `b7cccd397f` ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:12 -05:00
Dan Carpenter	bfbb307c62	KVM: VMX: Set failure code in prepare_vmcs02() The error paths in the prepare_vmcs02() function are supposed to set *entry_failure_code but this path does not. It leads to using an uninitialized variable in the caller. Fixes: `71f7347025` ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Message-Id: <20211130125337.GB24578@kili> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:11 -05:00
Paolo Bonzini	ef8b4b7203	KVM: ensure APICv is considered inactive if there is no APIC kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel local APIC, however kvm_apicv_activated might still be true if there are no reasons to disable APICv; in fact it is quite likely that there is none because APICv is inhibited by specific configurations of the local APIC and those configurations cannot be programmed. This triggers a WARN: WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu)); To avoid this, introduce another cause for APICv inhibition, namely the absence of an in-kernel local APIC. This cause is enabled by default, and is dropped by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. Reported-by: Ignat Korchagin <ignat@cloudflare.com> Fixes: `ee49a89329` ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22) Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Ignat Korchagin <ignat@cloudflare.com> Message-Id: <20211130123746.293379-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:11 -05:00
Like Xu	cb1d220da0	KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register If we run the following perf command in an AMD Milan guest: perf stat \ -e cpu/event=0x1d0/ \ -e cpu/event=0x1c7/ \ -e cpu/umask=0x1f,event=0x18e/ \ -e cpu/umask=0x7,event=0x18e/ \ -e cpu/umask=0x18,event=0x18e/ \ ./workload dmesg will report a #GP warning from an unchecked MSR access error on MSR_F15H_PERF_CTLx. This is because according to APM (Revision: 4.03) Figure 13-7, the bits [35:32] of AMD PerfEvtSeln register is a part of the event select encoding, which extends the EVENT_SELECT field from 8 bits to 12 bits. Opportunistically update pmu->reserved_bits for reserved bit 19. Reported-by: Jim Mattson <jmattson@google.com> Fixes: `ca724305a2` ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20211118130320.95997-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:11:50 -05:00
Feng Tang	b50db7095f	x86/tsc: Disable clocksource watchdog for TSC on qualified platorms There are cases that the TSC clocksource is wrongly judged as unstable by the clocksource watchdog mechanism which tries to validate the TSC against HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to check the validity of a watchdog, Thomas Gleixner proposed [1]: "I'm inclined to lift that requirement when the CPU has: 1) X86_FEATURE_CONSTANT_TSC 2) X86_FEATURE_NONSTOP_TSC 3) X86_FEATURE_NONSTOP_TSC_S3 4) X86_FEATURE_TSC_ADJUST 5) At max. 4 sockets After two decades of horrors we're finally at a point where TSC seems to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST was really key as we can now detect even small modifications reliably and the important point is that we can cure them as well (not pretty but better than all other options)." As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive to use maximal 2 sockets. The check is done inside tsc_init() before registering 'tsc-early' and 'tsc' clocksources, as there were cases that both of them had been wrongly judged as unreliable. For more background of tsc/watchdog, there is a good summary in [2] [tglx} Update vs. jiffies: On systems where the only remaining clocksource aside of TSC is jiffies there is no way to make this work because that creates a circular dependency. Jiffies accuracy depends on not missing a periodic timer interrupt, which is not guaranteed. That could be detected by TSC, but as TSC is not trusted this cannot be compensated. The consequence is a circulus vitiosus which results in shutting down TSC and falling back to the jiffies clocksource which is even more unreliable. [1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/ [2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/ [ tglx: Refine comment and amend changelog ] Fixes: `6e3cd95234` ("x86/hpet: Use another crystalball to evaluate HPET usability") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com	2021-12-02 00:40:36 +01:00
Feng Tang	c7719e7934	x86/tsc: Add a timer to make sure TSC_adjust is always checked The TSC_ADJUST register is checked every time a CPU enters idle state, but Thomas Gleixner mentioned there is still a caveat that a system won't enter idle [1], either because it's too busy or configured purposely to not enter idle. Setup a periodic timer (every 10 minutes) to make sure the check is happening on a regular base. [1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/ Fixes: `6e3cd95234` ("x86/hpet: Use another crystalball to evaluate HPET usability") Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com	2021-12-02 00:40:35 +01:00
Marco Elver	52d0b8b187	x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog() save_sw_bytes() did not fully initialize sw_bytes, which caused KMSAN to report an infoleak (see below). Initialize sw_bytes explicitly to avoid this. KMSAN report follows: ===================================================== BUG: KMSAN: kernel-infoleak in instrument_copy_to_user ./include/linux/instrumented.h:121 BUG: KMSAN: kernel-infoleak in __copy_to_user ./include/linux/uaccess.h:154 BUG: KMSAN: kernel-infoleak in save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127 instrument_copy_to_user ./include/linux/instrumented.h:121 __copy_to_user ./include/linux/uaccess.h:154 save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245 get_sigframe+0x656/0x7e0 arch/x86/kernel/signal.c:296 __setup_rt_frame+0x14d/0x2a60 arch/x86/kernel/signal.c:471 setup_rt_frame arch/x86/kernel/signal.c:781 handle_signal arch/x86/kernel/signal.c:825 arch_do_signal_or_restart+0x417/0xdd0 arch/x86/kernel/signal.c:870 handle_signal_work kernel/entry/common.c:149 exit_to_user_mode_loop+0x1f6/0x490 kernel/entry/common.c:173 exit_to_user_mode_prepare kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 syscall_exit_to_user_mode+0x7e/0xc0 kernel/entry/common.c:302 do_syscall_64+0x60/0xd0 arch/x86/entry/common.c:88 entry_SYSCALL_64_after_hwframe+0x44/0xae ??:? Local variable sw_bytes created at: save_xstate_epilog+0x80/0x510 arch/x86/kernel/fpu/signal.c:121 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245 Bytes 20-47 of 48 are uninitialized Memory access of size 48 starts at ffff8880801d3a18 Data copied to user address 00007ffd90e2ef50 ===================================================== Link: https://lore.kernel.org/all/CAG_fn=V9T6OKPonSjsi9PmWB0hMHFC=yawozdft8i1-MSxrv=w@mail.gmail.com/ Fixes: `53599b4d54` ("x86/fpu/signal: Prepare for variable sigframe length") Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lkml.kernel.org/r/20211126124746.761278-1-glider@google.com	2021-11-30 15:13:47 -08:00
Mark Rutland	dca99fb643	x86: Snapshot thread flags Some thread flags can be set remotely, and so even when IRQs are disabled, the flags can change under our feet. Generally this is unlikely to cause a problem in practice, but it is somewhat unsound, and KCSAN will legitimately warn that there is a data race. To avoid such issues, a snapshot of the flags has to be taken prior to using them. Some places already use READ_ONCE() for that, others do not. Convert them all to the new flag accessor helpers. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20211129130653.2037928-12-mark.rutland@arm.com	2021-12-01 00:06:43 +01:00
Kirill A. Shutemov	c494eb366d	x86/sev-es: Use insn_decode_mmio() for MMIO implementation Switch SEV implementation to insn_decode_mmio(). The helper is going to be used by TDX too. No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Joerg Roedel <jroedel@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20211130184933.31005-5-kirill.shutemov@linux.intel.com	2021-11-30 14:53:30 -08:00
Kirill A. Shutemov	70a81f99e4	x86/insn-eval: Introduce insn_decode_mmio() In preparation for sharing MMIO instruction decode between SEV-ES and TDX, factor out the common decode into a new insn_decode_mmio() helper. For regular virtual machine, MMIO is handled by the VMM and KVM emulates instructions that caused MMIO. But, this model doesn't work for a secure VMs (like SEV or TDX) as VMM doesn't have access to the guest memory and register state. So, for TDX or SEV VMM needs assistance in handling MMIO. It induces exception in the guest. Guest has to decode the instruction and handle it on its own. The code is based on the current SEV MMIO implementation. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Joerg Roedel <jroedel@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20211130184933.31005-4-kirill.shutemov@linux.intel.com	2021-11-30 14:53:19 -08:00

... 3 4 5 6 7 ...

40454 Commits