Commit Graph

10974 Commits

Jan Beulich
8e1034a526 xen/pci-swiotlb: reduce visibility of symbols
xen_swiotlb and pci_xen_swiotlb_init() are only used within the file
defining them, so make them static and remove the stubs. On the other
hand, pci_xen_swiotlb_detect() has a use (as a function pointer) from the
main pci-swiotlb.c file, so convert its stub to a #define to NULL.
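
A minimal sketch of the resulting header pattern (simplified; surrounding
declarations elided):

  #ifdef CONFIG_SWIOTLB_XEN
  int pci_xen_swiotlb_detect(void);
  #else
  /* No stub function needed: the function-pointer user just sees NULL. */
  #define pci_xen_swiotlb_detect NULL
  #endif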

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Link: https://lore.kernel.org/r/aef5fc33-9c02-4df0-906a-5c813142e13c@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-20 17:01:19 +02:00
Peter Zijlstra
847d9317b2 x86/xen: Mark xen_force_evtchn_callback() noinstr
vmlinux.o: warning: objtool: check_events()+0xd: call to xen_force_evtchn_callback() leaves .noinstr.text section
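
A hedged sketch of the fix's shape (the real noinstr marker also disables
sanitizer instrumentation; the define below is a simplified stand-in):

  /* Simplified stand-in for the kernel's noinstr annotation: */
  #define noinstr __attribute__((__section__(".noinstr.text"), \
                                  __no_instrument_function__))

  static noinstr void xen_force_evtchn_callback(void)
  {
          /* hypercall elided; staying in .noinstr.text satisfies objtool */
  }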

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.996055323@infradead.org
2021-09-17 13:20:25 +02:00
Peter Zijlstra
20125c872a x86/xen: Make save_fl() noinstr
vmlinux.o: warning: objtool: pv_ops[30]: native_save_fl
vmlinux.o: warning: objtool: pv_ops[30]: __raw_callee_save_xen_save_fl
vmlinux.o: warning: objtool: pv_ops[30]: xen_save_fl_direct
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x73: call to pv_ops[30]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.749712274@infradead.org
2021-09-17 13:14:44 +02:00
Peter Zijlstra
7361fac046 x86/xen: Make set_debugreg() noinstr
vmlinux.o: warning: objtool: pv_ops[2]: xen_set_debugreg
vmlinux.o: warning: objtool: pv_ops[2]: native_set_debugreg
vmlinux.o: warning: objtool: exc_debug()+0x3b: call to pv_ops[2]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.687755639@infradead.org
2021-09-17 13:14:39 +02:00
Peter Zijlstra
f4afb713e5 x86/xen: Make get_debugreg() noinstr
vmlinux.o: warning: objtool: pv_ops[1]: xen_get_debugreg
vmlinux.o: warning: objtool: pv_ops[1]: native_get_debugreg
vmlinux.o: warning: objtool: exc_debug()+0x25: call to pv_ops[1]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.625523645@infradead.org
2021-09-17 13:12:34 +02:00
Peter Zijlstra
209cfd0cbb x86/xen: Make write_cr2() noinstr
vmlinux.o: warning: objtool: pv_ops[42]: native_write_cr2
vmlinux.o: warning: objtool: pv_ops[42]: xen_write_cr2
vmlinux.o: warning: objtool: exc_nmi()+0x127: call to pv_ops[42]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.563524913@infradead.org
2021-09-17 13:12:16 +02:00
Peter Zijlstra
0a53c9acf4 x86/xen: Make read_cr2() noinstr
vmlinux.o: warning: objtool: pv_ops[41]: native_read_cr2
vmlinux.o: warning: objtool: pv_ops[41]: xen_read_cr2
vmlinux.o: warning: objtool: pv_ops[41]: xen_read_cr2_direct
vmlinux.o: warning: objtool: exc_double_fault()+0x15: call to pv_ops[41]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.500331616@infradead.org
2021-09-17 13:11:50 +02:00
Peter Zijlstra
eac46b323b x86/paravirt: Use PVOP_* for paravirt calls
Doing unconditional indirect calls through the pv_ops vector is weird.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.437720419@infradead.org
2021-09-15 15:51:48 +02:00
Peter Zijlstra
e9382440de x86/paravirt: Mark arch_local_irq_*() __always_inline
vmlinux.o: warning: objtool: lockdep_hardirqs_on()+0x72: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x73: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: match_held_lock()+0x11f: call to arch_local_save_flags() leaves .noinstr.text section

vmlinux.o: warning: objtool: lock_is_held_type()+0x4e: call to arch_local_irq_save() leaves .noinstr.text section

vmlinux.o: warning: objtool: lock_is_held_type()+0x65: call to arch_local_irq_disable() leaves .noinstr.text section

vmlinux.o: warning: objtool: lock_is_held_type()+0xfe: call to arch_local_irq_enable() leaves .noinstr.text section

It makes no sense to not inline these things.
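
A sketch of the annotation, shown on the native (non-paravirt) flavor of
the flags read:

  #define __always_inline inline __attribute__((__always_inline__))

  static __always_inline unsigned long arch_local_save_flags(void)
  {
          unsigned long flags;

          /* pushf/pop pair, as in the kernel's native_save_fl() */
          asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
          return flags;
  }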

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.373073648@infradead.org
2021-09-15 15:51:47 +02:00
Peter Zijlstra
c6b01dace2 x86: Always inline ip_within_syscall_gap()
vmlinux.o: warning: objtool: vc_switch_off_ist()+0x20: call to ip_within_syscall_gap.isra.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095148.188166492@infradead.org
2021-09-15 15:51:47 +02:00
Thomas Gleixner
f3305be5fe x86/fpu/signal: Change return type of fpu__restore_sig() to boolean
None of the call sites cares about the error code. All they need to know is
whether the function succeeded or not.
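
A sketch of the call-site effect (buffer arguments hypothetical):

  /* Before: int return, 0 on success */
  if (fpu__restore_sig(buf, ia32_fxstate))
          goto badframe;

  /* After: bool return, true on success */
  if (!fpu__restore_sig(buf, ia32_fxstate))
          goto badframe;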

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.909065931@linutronix.de
2021-09-14 21:10:03 +02:00
Thomas Gleixner
052adee668 x86/fpu/signal: Change return type of copy_fpstate_to_sigframe() to boolean
None of the call sites cares about the actual return code. Change the
return type to boolean and return 'true' on success.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.736773588@linutronix.de
2021-09-14 21:10:03 +02:00
Thomas Gleixner
4164a482a5 x86/fpu/signal: Move header zeroing out of xsave_to_user_sigframe()
There is no reason to have the header zeroing in the pagefault disabled
region. Do it upfront once.
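
A hedged sketch of the reordering (buffer handling simplified):

  /* Zero the sigframe header once, before the pagefault-disabled region. */
  if (__clear_user(&buf->header, sizeof(buf->header)))
          return false;

  pagefault_disable();
  err = xsave_to_user_sigframe(buf);  /* no longer zeroes the header */
  pagefault_enable();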

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.621674721@linutronix.de
2021-09-14 21:10:03 +02:00
Peter Collingbourne
7962c2eddb arch: remove unused function syscall_set_arguments()
This function appears to have been unused since it was first introduced in
commit 828c365cc8 ("tracehook: asm/syscall.h").

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I8ce04f002903a37c0b6c1d16e9b2a3afa716c097
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-09-14 16:06:20 +02:00
H. Peter Anvin
0507503671 x86/asm: Avoid adding register pressure for the init case in static_cpu_has()
gcc will sometimes materialize the address of boot_cpu_data in a register
as part of constant propagation. When multiple static_cpu_has() calls are
used, this may clutter the mainline code with a register load that is only
used on the fallback path, which itself is unused after initialization.

Explicitly force gcc to use immediate (rip-relative) addressing for
the fallback path, thus removing any possible register use from
static_cpu_has().

While making changes, modernize the code to use
.pushsection...popsection instead of .section...previous.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210910195910.2542662-4-hpa@zytor.com
2021-09-13 19:48:21 +02:00
H. Peter Anvin (Intel)
f87bc8dc7a x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix
Add a macro _ASM_RIP() to add a (%rip) suffix on 64 bits only. This is
useful for immediate memory references where one doesn't want gcc
to possibly use a register indirection as it may in the case of an "m"
constraint.
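
One plausible shape of the macro, as a sketch rather than the exact kernel
definition (boot_flag below is a placeholder symbol):

  #ifdef __x86_64__
  # define _ASM_RIP(x) x "(%%rip)"  /* force RIP-relative addressing */
  #else
  # define _ASM_RIP(x) x
  #endif

  /* e.g. asm("incb " _ASM_RIP("boot_flag")); rather than an "m" operand */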

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210910195910.2542662-3-hpa@zytor.com
2021-09-13 19:38:40 +02:00
Will Deacon
a69ae291e1 x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=y
Commit 865c50e1d2 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
added an optimised version of __get_user_asm() for x86 using 'asm goto'.

Like the non-optimised code, the 32-bit implementation of 64-bit
get_user() expands to a pair of 32-bit accesses.  Unlike the
non-optimised code, the _original_ pointer is incremented to copy the
high word instead of loading through a new pointer explicitly
constructed to point at a 32-bit type.  Consequently, if the pointer
points at a 64-bit type then we end up loading the wrong data for the
upper 32-bits.

This was observed as a mount() failure in Android targeting i686 after
b0cfcdd9b9 ("d_path: make 'prepend()' fill up the buffer exactly on
overflow") because the call to copy_from_kernel_nofault() from
prepend_copy() ends up in __get_kernel_nofault() and casts the source
pointer to a 'u64 __user *'.  An attempt to mount at "/debug_ramdisk"
therefore ends up failing trying to mount "/debumdismdisk".

Use the existing '__gu_ptr' source pointer, which points to unsigned int,
for the 32-bit __get_user_asm_u64() instead of the original pointer.
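
A userspace illustration of the pointer-type issue (purely illustrative,
not the kernel code):

  #include <stdint.h>

  /* Correct: view the source as two 32-bit words; p[1] is src + 4. */
  static uint64_t load_u64_as_two_u32(const uint32_t *p)
  {
          return ((uint64_t)p[1] << 32) | p[0];
  }
  /* Incrementing the original 64-bit pointer instead advances by 8 bytes,
   * so the high word would come from the wrong location. */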

Cc: Bill Wendling <morbo@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 865c50e1d2 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-13 09:59:35 -07:00
Thomas Gleixner
4339d0c63c x86/fpu/signal: Clarify exception handling in restore_fpregs_from_user()
FPU restore from a signal frame can trigger various exceptions. The
exceptions are caught with an exception table entry. The handler of this
entry stores the trap number in EAX. The FPU-specific fixup negates that
trap number to convert it into a negative error code.

Any exception other than #PF is fatal and recovery is not possible. This
relies on the fact that the #PF exception number is the same as EFAULT, but
that's not really obvious.

Remove the negation from the exception fixup as it really has no value and
check for X86_TRAP_PF at the call site.

There is still confusion due to the return code conversion for the error
case which will be cleaned up separately.
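
A sketch of the resulting call-site logic (helper name hypothetical):

  err = restore_from_user_sigframe(buf);  /* fixup reports the trap number */
  if (err) {
          /* Only #PF is recoverable. X86_TRAP_PF (14) happens to equal
           * EFAULT, which is what the old negation trick relied on. */
          if (err != X86_TRAP_PF)
                  return false;
          /* ... fault in the user pages and retry ... */
  }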

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.506192488@linutronix.de
2021-09-13 18:26:05 +02:00
Thomas Gleixner
c6304556f3 x86/fpu: Use EX_TYPE_FAULT_MCE_SAFE for exception fixups
The macros used for restoring FPU state from a user space buffer can handle
all exceptions including #MC. They need to return the trap number in the
error case as the code which invokes them needs to distinguish the cause of
the failure. It aborts the operation for anything except #PF.

Use the new EX_TYPE_FAULT_MCE_SAFE exception table fixup type to document
the nature of the fixup.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.387464538@linutronix.de
2021-09-13 18:15:30 +02:00
Thomas Gleixner
2cadf5248b x86/extable: Provide EX_TYPE_DEFAULT_MCE_SAFE and EX_TYPE_FAULT_MCE_SAFE
Provide exception fixup types which can be used to identify fixups that
allow in-kernel #MC recovery, and make them invoke the existing handlers.

These will be used at places where #MC recovery is handled correctly by the
caller.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.269689153@linutronix.de
2021-09-13 17:56:56 +02:00
Thomas Gleixner
46d28947d9 x86/extable: Rework the exception table mechanics
The exception table entries contain the instruction address, the fixup
address and the handler address. All addresses are relative. Storing the
handler address has a few downsides:

 1) Most handlers need to be exported

 2) Handlers can be defined everywhere and there is no overview about the
    handler types

 3) MCE needs to check the handler type to decide whether an in-kernel #MC
    can be recovered. The functionality of the handler itself is not in any
    way special, but for these checks there need to be separate functions
    which in the worst case have to be exported.

    Some of these 'recoverable' exception fixups are pretty obscure and
    just reuse some other handler to spare code. That obfuscates e.g. the
    #MC-safe copy functions. Cleaning that up would require more handlers
    and exports.

Rework the exception fixup mechanics by storing a fixup type number instead
of the handler address and invoke the proper handler for each fixup
type. Also teach the extable sort to leave the type field alone.

This makes most handlers static, except for special cases like the MCE
MSR fixup and the BPF fixup. It also allows adding more types to clean up
the obscure places without adding more handler code and exports.

There is a marginal code size reduction for a production config and it
removes _eight_ exported symbols.
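
A condensed sketch of the reworked dispatch (simplified from the actual
patch):

  struct exception_table_entry {
          int insn, fixup, type;  /* type number replaces the handler address */
  };

  int fixup_exception(struct pt_regs *regs, int trapnr,
                      unsigned long error_code, unsigned long fault_addr)
  {
          const struct exception_table_entry *e;

          e = search_exception_tables(regs->ip);
          if (!e)
                  return 0;

          switch (e->type) {
          case EX_TYPE_DEFAULT:
          case EX_TYPE_DEFAULT_MCE_SAFE:
                  return ex_handler_default(e, regs);
          case EX_TYPE_FAULT:
          case EX_TYPE_FAULT_MCE_SAFE:
                  return ex_handler_fault(e, regs, trapnr);
          /* ... one case per fixup type; handlers can now be static ... */
          }
          BUG();
  }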

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lkml.kernel.org/r/20210908132525.211958725@linutronix.de
2021-09-13 17:51:47 +02:00
Thomas Gleixner
32fd8b59f9 x86/extable: Get rid of redundant macros
No point in defining the identical macros twice depending on C or assembly
mode. They are still identical.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.023659534@linutronix.de
2021-09-13 12:52:38 +02:00
Arnd Bergmann
a7a08b275a arch: remove compat_alloc_user_space
All users of compat_alloc_user_space() and copy_in_user() have been
removed from the kernel; only a few functions in sparc remain, and those
can be changed to call arch_copy_in_user() instead.

Link: https://lkml.kernel.org/r/20210727144859.4150043-7-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 15:32:35 -07:00
Linus Torvalds
192ad3c27a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - Page ownership tracking between host EL1 and EL2
   - Rely on userspace page tables to create large stage-2 mappings
   - Fix incompatibility between pKVM and kmemleak
   - Fix the PMU reset state, and improve the performance of the virtual
     PMU
   - Move over to the generic KVM entry code
   - Address PSCI reset issues w.r.t. save/restore
   - Preliminary rework for the upcoming pKVM fixed feature
   - A bunch of MM cleanups
   - a vGIC fix for timer spurious interrupts
   - Various cleanups

  s390:
   - enable interpretation of specification exceptions
   - fix a vcpu_idx vs vcpu_id mixup

  x86:
   - fast (lockless) page fault support for the new MMU
   - new MMU now the default
   - increased maximum allowed VCPU count
   - allow inhibit IRQs on KVM_RUN while debugging guests
   - let Hyper-V-enabled guests run with virtualized LAPIC as long as
     they do not enable the Hyper-V "AutoEOI" feature
   - fixes and optimizations for the toggling of AMD AVIC (virtualized
     LAPIC)
   - tuning for the case when two-dimensional paging (EPT/NPT) is
     disabled
   - bugfixes and cleanups, especially with respect to vCPU reset and
     choosing a paging mode based on CR0/CR4/EFER
   - support for 5-level page table on AMD processors

  Generic:
   - MMU notifier invalidation callbacks do not take mmu_lock unless
     necessary
   - improved caching of LRU kvm_memory_slot
   - support for histogram statistics
   - add statistics for halt polling and remote TLB flush requests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
  KVM: Drop unused kvm_dirty_gfn_invalid()
  KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
  KVM: MMU: mark role_regs and role accessors as maybe unused
  KVM: MIPS: Remove a "set but not used" variable
  x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
  KVM: stats: Add VM stat for remote tlb flush requests
  KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
  KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
  KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
  Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
  KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
  kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
  kvm: x86: Increase MAX_VCPUS to 1024
  kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
  KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
  KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
  KVM: s390: index kvm->arch.idle_mask by vcpu_idx
  KVM: s390: Enable specification exception interpretation
  KVM: arm64: Trim guest debug exception handling
  KVM: SVM: Add 5-level page table support for SVM
  ...
2021-09-07 13:40:51 -07:00
Eduardo Habkost
1dbaf04cb9 kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
Support for 710 VCPUs has been tested by Red Hat since RHEL-8.4,
so increase KVM_SOFT_MAX_VCPUS to 710.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20210903211600.2002377-4-ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 06:12:05 -04:00
Eduardo Habkost
074c82c8f7 kvm: x86: Increase MAX_VCPUS to 1024
Increase KVM_MAX_VCPUS to 1024, so we can test larger VMs.

I'm not changing KVM_SOFT_MAX_VCPUS yet because I'm afraid it
might involve complicated questions around the meaning of
"supported" and "recommended" in the upstream tree.
KVM_SOFT_MAX_VCPUS will be changed in a separate patch.

For reference, visible effects of this change are:
- KVM_CAP_MAX_VCPUS will now return 1024 (of course)
- Default value for CPUID[HYPERV_CPUID_IMPLEMENT_LIMITS (0x40000005)].EAX
  will now be 1024
- KVM_MAX_VCPU_ID will change from 1151 to 4096
- Size of struct kvm will increase from 19328 to 22272 bytes
  (in x86_64)
- Size of struct kvm_ioapic will increase from 1780 to 5084 bytes
  (in x86_64)
- Bitmap stack variables that will grow:
  - At kvm_hv_flush_tlb() and kvm_hv_send_ipi(),
    vp_bitmap[] and vcpu_bitmap[] will now be 128 bytes long
  - vcpu_bitmap at ioapic_write_indirect() will be 128 bytes long
    once patch "KVM: x86: Fix stack-out-of-bounds memory access
    from ioapic_write_indirect()" is applied

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20210903211600.2002377-3-ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 06:11:55 -04:00
Eduardo Habkost
4ddacd525a kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
Instead of requiring KVM_MAX_VCPU_ID to be manually increased
every time we increase KVM_MAX_VCPUS, set it to 4*KVM_MAX_VCPUS.
This should be enough for CPU topologies where Cores-per-Package
and Packages-per-Socket are not powers of 2.

In practice, this increases KVM_MAX_VCPU_ID from 1023 to 1152.
The only side effect of this change is making some fields in
struct kvm_ioapic larger, increasing the struct size from 1628 to
1780 bytes (in x86_64).
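
In code form, the change amounts to (KVM_MAX_VCPUS was 288 at this point):

  #define KVM_MAX_VCPUS   288
  #define KVM_MAX_VCPU_ID (KVM_MAX_VCPUS * 4)  /* 1152, previously 1023 */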

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20210903211600.2002377-2-ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 06:11:39 -04:00
Linus Torvalds
c07f191907 Merge tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - make Hyper-V code arch-agnostic (Michael Kelley)

 - fix sched_clock behaviour on Hyper-V (Ani Sinha)

 - fix a fault when Linux runs as the root partition on MSHV (Praveen
   Kumar)

 - fix VSS driver (Vitaly Kuznetsov)

 - cleanup (Sonia Sharma)

* tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv_utils: Set the maximum packet size for VSS driver to the length of the receive buffer
  Drivers: hv: Enable Hyper-V code to be built on ARM64
  arm64: efi: Export screen_info
  arm64: hyperv: Initialize hypervisor on boot
  arm64: hyperv: Add panic handler
  arm64: hyperv: Add Hyper-V hypercall and register access utilities
  x86/hyperv: fix root partition faults when writing to VP assist page MSR
  hv: hyperv.h: Remove unused inline functions
  drivers: hv: Decouple Hyper-V clock/timer code from VMbus drivers
  x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0
  Drivers: hv: Move Hyper-V misc functionality to arch-neutral code
  Drivers: hv: Add arch independent default functions for some Hyper-V handlers
  Drivers: hv: Make portions of Hyper-V init code be arch neutral
  x86/hyperv: fix for unwanted manipulation of sched_clock when TSC marked unstable
  asm-generic/hyperv: Add missing #include of nmi.h
2021-09-01 18:25:20 -07:00
Linus Torvalds
477f70cd2a Merge tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "Highlights:

   - i915 has seen a lot of refactoring and uAPI cleanups due to a
     change in the upstream direction going forward

     This has all been audited with known userspace, but there may be
     some pitfalls that were missed.

   - i915 now uses common TTM to enable discrete memory on DG1/2 GPUs

   - i915 enables Jasper and Elkhart Lake by default and has preliminary
     XeHP/DG2 support

   - amdgpu adds support for Cyan Skillfish

   - lots of implicit fencing rules documented and fixed up in drivers

   - msm now uses the core scheduler

   - the irq midlayer has been removed for non-legacy drivers

   - the sysfb code now works on more than x86.

  Otherwise the usual smattering of stuff everywhere, panels, bridges,
  refactorings.

  Detailed summary:

  core:
   - extract i915 eDP backlight into core
   - DP aux bus support
   - drm_device.irq_enabled removed
   - port drivers to native irq interfaces
   - export gem shadow plane handling for vgem
   - print proper driver name in framebuffer registration
   - driver fixes for implicit fencing rules
   - ARM fixed rate compression modifier added
   - updated fb damage handling
   - rmfb ioctl logging/docs
   - drop drm_gem_object_put_locked
   - define DRM_FORMAT_MAX_PLANES
   - add gem fb vmap/vunmap helpers
   - add lockdep_assert(once) helpers
   - mark drm irq midlayer as legacy
   - use offset adjusted bo mapping conversion

  vgaarb:
   - cleanups

  fbdev:
   - extend efifb handling to all arches
   - div by 0 fixes for multiple drivers

  udmabuf:
   - add hugepage mapping support

  dma-buf:
   - non-dynamic exporter fixups
   - document implicit fencing rules

  amdgpu:
   - Initial Cyan Skillfish support
   - switch virtual DCE over to vkms based atomic
   - VCN/JPEG power down fixes
   - NAVI PCIE link handling fixes
   - AMD HDMI freesync fixes
   - Yellow Carp + Beige Goby fixes
   - Clockgating/S0ix/SMU/EEPROM fixes
   - embed hw fence in job
   - rework dma-resv handling
   - ensure eviction to system ram

  amdkfd:
   - uapi: SVM address range query added
   - sysfs leak fix
   - GPUVM TLB optimizations
   - vmfault/migration counters

  i915:
   - Enable JSL and EHL by default
   - preliminary XeHP/DG2 support
   - remove all CNL support (never shipped)
   - move to TTM for discrete memory support
   - allow mixed object mmap handling
   - GEM uAPI spring cleaning
       - add I915_MMAP_OBJECT_FIXED
       - reinstate ADL-P mmap ioctls
       - drop a bunch of unused by userspace features
       - disable and remove GPU relocations
   - revert some i915 misfeatures
   - major refactoring of GuC for Gen11+
   - execbuffer object locking separate step
   - reject caching/set-domain on discrete
   - Enable pipe DMC loading on XE-LPD and ADL-P
   - add PSF GV point support
   - Refactor and fix DDI buffer translations
   - Clean up FBC CFB allocation code
   - Finish INTEL_GEN() and friends macro conversions

  nouveau:
   - add eDP backlight support
   - implicit fence fix

  msm:
   - a680/7c3 support
   - drm/scheduler conversion

  panfrost:
   - rework GPU reset

  virtio:
   - fix fencing for planes

  ast:
   - add detect support

  bochs:
   - move to tiny GPU driver

  vc4:
   - use hotplug irqs
   - HDMI codec support

  vmwgfx:
   - use internal vmware device headers

  ingenic:
   - demidlayering irq

  rcar-du:
   - shutdown fixes
   - convert to bridge connector helpers

  zynqmp-dsub:
   - misc fixes

  mgag200:
   - convert PLL handling to atomic

  mediatek:
   - MT8133 AAL support
   - gem mmap object support
   - MT8167 support

  etnaviv:
   - NXP Layerscape LS1028A SoC support
   - GEM mmap cleanups

  tegra:
   - new user API

  exynos:
   - missing unlock fix
   - build warning fix
   - use refcount_t"

* tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm: (1318 commits)
  drm/amd/display: Move AllowDRAMSelfRefreshOrDRAMClockChangeInVblank to bounding box
  drm/amd/display: Remove duplicate dml init
  drm/amd/display: Update bounding box states (v2)
  drm/amd/display: Update number of DCN3 clock states
  drm/amdgpu: disable GFX CGCG in aldebaran
  drm/amdgpu: Clear RAS interrupt status on aldebaran
  drm/amdgpu: Add support for RAS XGMI err query
  drm/amdkfd: Account for SH/SE count when setting up cu masks.
  drm/amdgpu: rename amdgpu_bo_get_preferred_pin_domain
  drm/amdgpu: drop redundant cancel_delayed_work_sync call
  drm/amdgpu: add missing cleanups for more ASICs on UVD/VCE suspend
  drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend
  drm/amdkfd: map SVM range with correct access permission
  drm/amdkfd: check access permisson to restore retry fault
  drm/amdgpu: Update RAS XGMI Error Query
  drm/amdgpu: Add driver infrastructure for MCA RAS
  drm/amd/display: Add Logging for HDMI color depth information
  drm/amd/amdgpu: consolidate PSP TA init shared buf functions
  drm/amd/amdgpu: add name field back to ras_common_if
  drm/amdgpu: Fix build with missing pm_suspend_target_state module export
  ...
2021-09-01 11:26:46 -07:00
Linus Torvalds
9e9fb7655e Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Enable memcg accounting for various networking objects.

  BPF:

   - Introduce bpf timers.

   - Add perf link and opaque bpf_cookie which the program can read out
     again, to be used in libbpf-based USDT library.

   - Add bpf_task_pt_regs() helper to access user space pt_regs in
     kprobes, to help user space stack unwinding.

   - Add support for UNIX sockets for BPF sockmap.

   - Extend BPF iterator support for UNIX domain sockets.

   - Allow BPF TCP congestion control progs and bpf iterators to call
     bpf_setsockopt(), e.g. to switch to another congestion control
     algorithm.

  Protocols:

   - Support IOAM Pre-allocated Trace with IPv6.

   - Support Management Component Transport Protocol.

   - bridge: multicast: add vlan support.

   - netfilter: add hooks for the SRv6 lightweight tunnel driver.

   - tcp:
       - enable mid-stream window clamping (by user space or BPF)
       - allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
       - more accurate DSACK processing for RACK-TLP

   - mptcp:
       - add full mesh path manager option
       - add partial support for MP_FAIL
       - improve use of backup subflows
       - optimize option processing

   - af_unix: add OOB notification support.

   - ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
     router.

   - mac80211: Target Wake Time support in AP mode.

   - can: j1939: extend UAPI to notify about RX status.

  Driver APIs:

   - Add page frag support in page pool API.

   - Many improvements to the DSA (distributed switch) APIs.

   - ethtool: extend IRQ coalesce uAPI with timer reset modes.

   - devlink: control which auxiliary devices are created.

   - Support CAN PHYs via the generic PHY subsystem.

   - Proper cross-chip support for tag_8021q.

   - Allow TX forwarding for the software bridge data path to be
     offloaded to capable devices.

  Drivers:

   - veth: more flexible channels number configuration.

   - openvswitch: introduce per-cpu upcall dispatch.

   - Add internet mix (IMIX) mode to pktgen.

   - Transparently handle XDP operations in the bonding driver.

   - Add LiteETH network driver.

   - Renesas (ravb):
       - support Gigabit Ethernet IP

   - NXP Ethernet switch (sja1105):
       - fast aging support
       - support for "H" switch topologies
       - traffic termination for ports under VLAN-aware bridge

   - Intel 1G Ethernet
       - support getcrosststamp() with PCIe PTM (Precision Time
         Measurement) for better time sync
       - support Credit-Based Shaper (CBS) offload, enabling HW traffic
         prioritization and bandwidth reservation

   - Broadcom Ethernet (bnxt)
       - support pulse-per-second output
       - support larger Rx rings

   - Mellanox Ethernet (mlx5)
       - support ethtool RSS contexts and MQPRIO channel mode
       - support LAG offload with bridging
       - support devlink rate limit API
       - support packet sampling on tunnels

   - Huawei Ethernet (hns3):
       - basic devlink support
       - add extended IRQ coalescing support
       - report extended link state

   - Netronome Ethernet (nfp):
       - add conntrack offload support

   - Broadcom WiFi (brcmfmac):
       - add WPA3 Personal with FT to supported cipher suites
       - support 43752 SDIO device

   - Intel WiFi (iwlwifi):
       - support scanning hidden 6GHz networks
       - support for a new hardware family (Bz)

   - Xen pv driver:
       - harden netfront against malicious backends

   - Qualcomm mobile
       - ipa: refactor power management and enable automatic suspend
       - mhi: move MBIM to WWAN subsystem interfaces

  Refactor:

   - Ambient BPF run context and cgroup storage cleanup.

   - Compat rework for ndo_ioctl.

  Old code removal:

   - prism54 remove the obsoleted driver, deprecated by the p54 driver.

   - wan: remove sbni/granch driver"

* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
  net: Add depends on OF_NET for LiteX's LiteETH
  ipv6: seg6: remove duplicated include
  net: hns3: remove unnecessary spaces
  net: hns3: add some required spaces
  net: hns3: clean up a type mismatch warning
  net: hns3: refine function hns3_set_default_feature()
  ipv6: remove duplicated 'net/lwtunnel.h' include
  net: w5100: check return value after calling platform_get_resource()
  net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
  net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
  net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
  fou: remove sparse errors
  ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
  octeontx2-af: Set proper errorcode for IPv4 checksum errors
  octeontx2-af: Fix static code analyzer reported issues
  octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
  octeontx2-af: Fix loop in free and unmap counter
  af_unix: fix potential NULL deref in unix_dgram_connect()
  dpaa2-eth: Replace strlcpy with strscpy
  octeontx2-af: Use NDC TX for transmit packet data
  ...
2021-08-31 16:43:06 -07:00
Linus Torvalds
ccd8ec4a3f Merge tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 PIRQ updates from Thomas Gleixner:
 "A set of updates to support port 0x22/0x23 based PCI configuration
  space which can be found on various ALi chipsets and is also available
  on older Intel systems which expose a PIRQ router.

  While the Intel support is more or less nostalgia, the ALi chips are
  still in use on popular embedded boards used for routers"

* tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Fix typo s/ECLR/ELCR/ for the PIC register
  x86: Avoid magic number with ELCR register accesses
  x86/PCI: Add support for the Intel 82426EX PIRQ router
  x86/PCI: Add support for the Intel 82374EB/82374SB (ESC) PIRQ router
  x86/PCI: Add support for the ALi M1487 (IBC) PIRQ router
  x86: Add support for 0x22/0x23 port I/O configuration space
2021-08-30 15:20:05 -07:00
Linus Torvalds
0a096f240a Merge tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache flush updates from Thomas Gleixner:
 "A reworked version of the opt-in L1D flush mechanism.

  This is a stop gap for potential future speculation related hardware
  vulnerabilities and a mechanism for truly security paranoid
  applications.

  It allows a task to request that the L1D cache is flushed when the
  kernel switches to a different mm. This can be requested via prctl().

  Changes vs the previous versions:

   - Get rid of the software flush fallback

   - Make the handling consistent with other mitigations

   - Kill the task when it ends up on a SMT enabled core which defeats
     the purpose of L1D flushing obviously"

* tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: Add L1D flushing Documentation
  x86, prctl: Hook L1D flushing in via prctl
  x86/mm: Prepare for opt-in based L1D flush in switch_mm()
  x86/process: Make room for TIF_SPEC_L1D_FLUSH
  sched: Add task_work callback for paranoid L1D flush
  x86/mm: Refactor cond_ibpb() to support other use cases
  x86/smp: Add a per-cpu view of SMT state
2021-08-30 15:00:33 -07:00
Linus Torvalds
4a2b88eb02 Merge tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf event updates from Ingo Molnar:

 - Add support for Intel Sapphire Rapids server CPU uncore events

 - Allow the AMD uncore driver to be built as a module

 - Misc cleanups and fixes

* tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header
  perf/amd/uncore: Allow the driver to be built as a module
  x86/cpu: Add get_llc_id() helper function
  perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/
  perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL
  perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
  perf/x86/intel: Replace deprecated CPU-hotplug functions
  perf/x86: Remove unused assignment to pointer 'e'
  perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX
  perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server
  perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server
  perf/x86/intel/uncore: Factor out snr_uncore_mmio_map()
  perf/x86/intel/uncore: Add alias PMU name
  perf/x86/intel/uncore: Add Sapphire Rapids server MDF support
  perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support
  perf/x86/intel/uncore: Add Sapphire Rapids server UPI support
  perf/x86/intel/uncore: Add Sapphire Rapids server M2M support
  perf/x86/intel/uncore: Add Sapphire Rapids server IMC support
  perf/x86/intel/uncore: Add Sapphire Rapids server PCU support
  perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support
  ...
2021-08-30 13:50:20 -07:00
Linus Torvalds
8f645b4208 Merge tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS update from Borislav Petkov:
 "A single RAS change for 5.15:

   - Do not start processing MCEs logged early because the decoding
     chain is not up yet - delay that processing until everything is
     ready"

* tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Defer processing of early errors
2021-08-30 13:23:17 -07:00
Linus Torvalds
c7a5238ef6 Merge tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Heiko Carstens:

 - Improve ftrace code patching so that stop_machine is not required
   anymore. This requires a small common code patch acked by Steven
   Rostedt:

     https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/

 - Enable KCSAN for s390. This comes with a small common code change to
   fix a compile warning. Acked by Marco Elver:

     https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com

 - Add KFENCE support for s390. This also comes with a minimal x86 patch
   from Marco Elver who said also this can be carried via the s390 tree:

     https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/

 - More changes to prepare the decompressor for relocation.

 - Enable DAT also for CPU restart path.

 - Final set of register asm removal patches; leaving only three
   locations where needed and sane.

 - Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO
   support to hwcaps flags.

 - Cleanup hwcaps implementation.

 - Add new instructions to in-kernel disassembler.

 - Various QDIO cleanups.

 - Add SCLP debug feature.

 - Various other cleanups and improvements all over the place.

* tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (105 commits)
  s390: remove SCHED_CORE from defconfigs
  s390/smp: do not use nodat_stack for secondary CPU start
  s390/smp: enable DAT before CPU restart callback is called
  s390: update defconfigs
  s390/ap: fix state machine hang after failure to enable irq
  KVM: s390: generate kvm hypercall functions
  s390/sclp: add tracing of SCLP interactions
  s390/debug: add early tracing support
  s390/debug: fix debug area life cycle
  s390/debug: keep debug data on resize
  s390/diag: make restart_part2 a local label
  s390/mm,pageattr: fix walk_pte_level() early exit
  s390: fix typo in linker script
  s390: remove do_signal() prototype and do_notify_resume() function
  s390/crypto: fix all kernel-doc warnings in vfio_ap_ops.c
  s390/pci: improve DMA translation init and exit
  s390/pci: simplify CLP List PCI handling
  s390/pci: handle FH state mismatch only on disable
  s390/pci: fix misleading rc in clp_set_pci_fn()
  s390/boot: factor out offset_vmlinux_info() function
  ...
2021-08-30 13:07:15 -07:00
Kim Phillips
6a371bafe6 perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header
Add <asm/amd-ibs.h> with bitfield definitions for IBS MSRs,
and demonstrate usage within the driver.

Also move 'struct perf_ibs_data' to where it can be shared with
the perf tool that will soon be using it.

No functional changes.
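
Roughly, such a header overlays the raw MSR value with named bitfields; a
sketch (field names and widths abbreviated from the IBS fetch-control MSR,
not the header's verbatim contents):

  /* sketch: raw MSR value overlaid with named bitfields */
  union ibs_fetch_ctl_sketch {
          __u64 val;                      /* raw MSR contents */
          struct {
                  __u64 fetch_maxcnt:16,  /* sampling period, bits 15:0 */
                        fetch_cnt:16,     /* current fetch counter */
                        fetch_lat:16,     /* fetch latency of last sample */
                        fetch_en:1,       /* fetch sampling enabled */
                        fetch_val:1,      /* new sample data available */
                        reserved:14;      /* remaining bits */
          };
  };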

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-9-kim.phillips@amd.com
2021-08-26 09:14:36 +02:00
Kim Phillips
9164d9493a x86/cpu: Add get_llc_id() helper function
Factor out a helper function rather than exporting cpu_llc_id directly;
this is needed to be able to build the AMD uncore driver as a module.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
2021-08-26 09:14:36 +02:00
Borislav Petkov
3bff147b18 x86/mce: Defer processing of early errors
When a fatal machine check results in a system reset, Linux does not
clear the error(s) from machine check bank(s) - hardware preserves the
machine check banks across a warm reset.

During initialization of the kernel after the reboot, Linux reads, logs,
and clears all machine check banks.

But there is a problem. In:

  5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")

the call to mce_register_decode_chain() moved later in the boot
sequence. This means that /dev/mcelog doesn't see those early error
logs.

This was partially fixed by:

  cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")

which made sure that the logs were not lost completely by printing
to the console. But parsing console logs is error prone. Users of
/dev/mcelog should expect to find any early errors logged to standard
places.

Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early
machine check initialization to indicate that any errors found should
just be queued to genpool. When mcheck_late_init() is called it will
call mce_schedule_work() to actually log and flush any errors queued in
the genpool.
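
A sketch of the intended flow (mce_gen_pool_add() being the existing
genpool helper; exact call sites simplified):

  /* early init: collect errors, but defer logging */
  machine_check_poll(MCP_UC | MCP_QUEUE_LOG, &all_banks);

  /* inside machine_check_poll(), approximately: */
  if (flags & MCP_QUEUE_LOG)
          mce_gen_pool_add(&m);   /* queue to genpool only */
  else
          mce_log(&m);            /* normal logging path */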

 [ Based on an original patch and commit message by Tony Luck, who also
   completely productized it. ]

Fixes: 5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
2021-08-24 10:40:58 +02:00
Wei Huang
cb0f722aff KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host
When the 5-level page table CPU flag is set in the host, but the guest
has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
the shadow NPT page tables will be fixed, consisting of one pointer to
a lower-level table and 511 non-present entries.  Extend the existing
code that creates the fixed PML4 or PDP table, to provide a fixed PML5
table if needed.

This is not needed on EPT because the number of layers in the tables
is specified in the EPTP instead of depending on the host CR4.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:07:48 -04:00
Wei Huang
746700d21f KVM: x86: Allow CPU to force vendor-specific TDP level
Future AMD CPUs will require a 5-level NPT if host CR4.LA57 is set.
To prevent kvm_mmu_get_tdp_level() from incorrectly changing the NPT
level on behalf of such CPUs, add a new parameter to kvm_configure_mmu()
to force a fixed TDP level.

Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:44 -04:00
Maxim Levitsky
61e5f69ef0 KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts while the
guest is running.

This change is mostly intended for more robust single stepping
of the guest and it has the following benefits when enabled:

* Resuming from a breakpoint is much more reliable.
  When resuming execution from a breakpoint, with interrupts enabled,
  more often than not, KVM would inject an interrupt and make the CPU
  jump immediately to the interrupt handler and eventually return to
  the breakpoint, to trigger it again.

  From the user point of view it looks like the CPU never executed a
  single instruction and in some cases that can even prevent forward
  progress, for example, when the breakpoint is placed by an automated
  script (e.g lx-symbols), which does something in response to the
  breakpoint and then continues the guest automatically.
  If the script execution takes enough time for another interrupt to
  arrive, the guest will be stuck on the same breakpoint RIP forever.

* Normal single stepping is much more predictable, since it won't
  land the debugger into an interrupt handler.

* RFLAGS.TF has less chance to be leaked to the guest:

  We set that flag behind the guest's back to do single stepping
  but if single step lands us into an interrupt/exception handler
  it will be leaked to the guest in the form of being pushed
  to the stack.
  This doesn't completely eliminate this problem as exceptions
  can still happen, but at least this reduces the chances
  of this happening.
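
From userspace the flag is requested through the existing
KVM_SET_GUEST_DEBUG ioctl; a minimal sketch (error handling omitted):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* single-step a vCPU with interrupt injection blocked */
  static int enable_blockirq_singlestep(int vcpu_fd)
  {
          struct kvm_guest_debug dbg = {
                  .control = KVM_GUESTDBG_ENABLE |
                             KVM_GUESTDBG_SINGLESTEP |
                             KVM_GUESTDBG_BLOCKIRQ,
          };

          return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
  }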

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:37 -04:00
Mingwei Zhang
71f51d2c32 KVM: x86/mmu: Add detailed page size stats
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when 1GB (or larger) pages are used, the information
becomes less useful because lpages counts a mix of 1G and 2M pages.

So remove lpages, since it is easy for user space to aggregate the info.
Instead, provide comprehensive page stats for all sizes from 4K to 512G.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Vitaly Kuznetsov
0f250a6463 KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
however, possible to track whether the feature was actually used by the
guest and only inhibit APICv/AVIC when needed.

TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
Windows know that AutoEOI feature should be avoided. While it's up to
KVM userspace to set the flag, KVM can help a bit by exposing global
APICv/AVIC enablement.

Maxim:
   - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
     since this feature can be used regardless of AVIC

Paolo:
   - use arch.apicv_update_lock to protect the hv->synic_auto_eoi_used
     instead of atomic ops

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-12-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:25 -04:00
Maxim Levitsky
b0a1637f64 KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
Currently on SVM, the kvm_request_apicv_update toggles the APICv
memslot without doing any synchronization.

If there is a mismatch between that memslot state and the AVIC state,
on one of the vCPUs, an APIC mmio access can be lost:

For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: access an APIC mmio register.

Since AVIC is still disabled on VCPU1, the access will not be intercepted
by it, and neither will it cause MMIO fault, but rather it will just be
read/written from/to the dummy page mapped into the
APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

Fix that by adding a lock guarding the AVIC state changes, and carefully
order the operations of kvm_request_apicv_update to avoid this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of guest mode but
don't yet see the new AVIC state. Then, only after (4), all other vCPUs
can update their AVIC state and resume.
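
Mapped onto code, the ordering is roughly the following (a sketch assuming
a plain mutex for the new apicv_update_lock, not the function's verbatim
body):

  static void kvm_request_apicv_update_sketch(struct kvm *kvm,
                                              bool activate, ulong bit)
  {
          mutex_lock(&kvm->arch.apicv_update_lock);               /* 1 */

          kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);   /* 2 */

          /* 3: everyone is out of guest mode; flip the inhibit state */
          if (activate)
                  clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
          else
                  set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

          mutex_unlock(&kvm->arch.apicv_update_lock);             /* 4 */
  }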

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:23 -04:00
Maxim Levitsky
36222b117e KVM: x86: don't disable APICv memslot when inhibited
Thanks to the previous patches, it is now possible to keep the APICv
memslot always enabled; it will simply be invisible to the guest when
APICv is inhibited.

This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:22 -04:00
Peter Xu
4139b1972a KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
Introduce kvm_mmu_slot_lpages() to calculate lpage_info and rmap array size.
The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather
than fetching from the memslot pointer.  Start to use the latter one in
kvm_alloc_memslot_metadata().

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:04:51 -04:00
Jakub Kicinski
f444fea789 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/ptp/Kconfig:
  55c8fca1da ("ptp_pch: Restore dependency on PCI")
  e5f3155267 ("ethernet: fix PTP_1588_CLOCK dependencies")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-19 18:09:18 -07:00
Maxim Levitsky
0f923e0712 KVM: nSVM: avoid picking up unsupported bits from L2 in int_ctl (CVE-2021-3653)
* Invert the mask of bits that we pick from L2 in
  nested_vmcb02_prepare_control

* Invert and explicitly use VIRQ related bits bitmask in svm_clear_vintr

This fixes a security issue that allowed a malicious L1 to run L2 with
AVIC enabled, which allowed the L2 to exploit the uninitialized and enabled
AVIC to read/write the host physical memory at some offsets.

Fixes: 3d6368ef58 ("KVM: SVM: Add VMRUN handler")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-16 09:48:27 -04:00
Uros Bizjak
65297341d8 KVM: x86: Move declaration of kvm_spurious_fault() to x86.h
Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
it should never be called by anything other than low level KVM code.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Sean Christopherson
ad0577c375 KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot()
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
VMX and SVM instructions use asm goto to handle the fault (or in the
case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
flow that invoked it from assembly code.

Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Lai Jiangshan
34e9f86007 KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD
Commit ae561edeb4 ("KVM: x86: DR0-DR3 are not clear on reset") added code to
ensure eff_db are updated when they're modified through non-standard paths.

But there is no reason to also update hardware DRs unless hardware breakpoints
are active or DR exiting is disabled, and in those cases updating hardware is
handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.

KVM_DEBUGREG_RELOAD just causes unnecessary loads of hardware DRs, so
remove it.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
9a63b4517c Merge branch 'kvm-tdpmmu-fixes' into HEAD
Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
2021-08-13 03:35:01 -04:00
Sean Christopherson
ce25681d59 KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.

Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations.  But underflow is the best case scenario.  The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.

Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok.  Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.

For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write.  If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.

Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled.  Holding mmu_lock for read also
prevents the indirect shadow page from being freed.  But as above, keep
it simple and always take the lock.

Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP.  Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).

Alternative #2, the unsync logic could be made thread safe.  In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).

Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:32:14 -04:00
Maciej W. Rozycki
d253166168 x86: Avoid magic number with ELCR register accesses
Define PIC_ELCR1 and PIC_ELCR2 macros for accesses to the ELCR registers 
implemented by many chipsets in their embedded 8259A PIC cores, avoiding 
magic numbers that are difficult to handle, and complementing the macros 
we already have for registers originally defined with discrete 8259A PIC 
implementations.  No functional change.
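
The ELCR pair conventionally sits at I/O ports 0x4d0/0x4d1, so the change
boils down to a sketch like:

  /* named constants instead of bare port numbers */
  #define PIC_ELCR1 0x4d0   /* edge/level control, IRQs 0-7 (master) */
  #define PIC_ELCR2 0x4d1   /* edge/level control, IRQs 8-15 (slave) */

  unsigned int elcr = inb(PIC_ELCR1) | (inb(PIC_ELCR2) << 8);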

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200237300.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Maciej W. Rozycki
fb6a0408ea x86: Add support for 0x22/0x23 port I/O configuration space
Define macros and accessors for the configuration space addressed 
indirectly with an index register and a data register at the port I/O 
locations of 0x22 and 0x23 respectively.

This space is defined by the Intel MultiProcessor Specification for the 
IMCR register used to switch between the PIC and the APIC mode[1], by 
Cyrix processors for their configuration[2][3], and also some chipsets.

Given the lack of atomicity with the indirect addressing a spinlock is 
required to protect accesses, although for Cyrix processors it is enough 
if accesses are executed with interrupts locally disabled, because the 
registers are local to the accessing CPU, and IMCR is only ever poked at 
by the BSP and early enough for interrupts not to have been configured 
yet.  Therefore existing code does not have to change or use the new
spinlock, and indeed it does not.

Put the spinlock in a library file then, so that it does not get pulled
in unnecessarily for configurations that do not refer to it.

Convert Cyrix accessors to wrappers so as to retain the brevity and 
clarity of the `getCx86' and `setCx86' calls.
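
A sketch of what such an indexed accessor with the new spinlock looks like
(names illustrative):

  static DEFINE_RAW_SPINLOCK(pc_conf_lock);

  static u8 pc_conf_get(u8 reg)
  {
          unsigned long flags;
          u8 val;

          raw_spin_lock_irqsave(&pc_conf_lock, flags);
          outb(reg, 0x22);        /* select register via the index port */
          val = inb(0x23);        /* read it back via the data port */
          raw_spin_unlock_irqrestore(&pc_conf_lock, flags);

          return val;
  }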

References:

[1] "MultiProcessor Specification", Version 1.4, Intel Corporation, 
    Order Number: 242016-006, May 1997, Section 3.6.2.1 "PIC Mode", pp. 
    3-7, 3-8

[2] "5x86 Microprocessor", Cyrix Corporation, Order Number: 94192-00, 
    July 1995, Section 2.3.2.4 "Configuration Registers", p. 2-23

[3] "6x86 Processor", Cyrix Corporation, Order Number: 94175-01, March 
    1996, Section 2.4.4 "6x86 Configuration Registers", p. 2-23

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107182353140.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Paolo Bonzini
319afe6856 KVM: xen: do not use struct gfn_to_hva_cache
gfn_to_hva_cache is not thread-safe, so it is usually used only within
a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
implementation has such a cache in kvm->arch, but it is not really
used except to store the location of the shared info page.  Replace
shinfo_set and shinfo_cache with just the value that is passed via
KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
initialization value is not zero anymore and therefore kvm_xen_init_vm
needs to be introduced.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05 03:31:40 -04:00
Praveen Kumar
e5d9b714fe x86/hyperv: fix root partition faults when writing to VP assist page MSR
For the root partition the VP assist pages are pre-determined by the
hypervisor. The root kernel is not allowed to change them to different
locations. Thus we get the stack trace below, because the current
implementation has root trying to write to that specific MSR.

    [ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to write 0x0000000145ac5001) at rIP: 0xffffffff810c1084 (native_write_msr+0x4/0x30)
    [ 2.784867] Call Trace:
    [ 2.791507] hv_cpu_init+0xf1/0x1c0
    [ 2.798144] ? hyperv_report_panic+0xd0/0xd0
    [ 2.804806] cpuhp_invoke_callback+0x11a/0x440
    [ 2.811465] ? hv_resume+0x90/0x90
    [ 2.818137] cpuhp_issue_call+0x126/0x130
    [ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0
    [ 2.831427] ? hyperv_report_panic+0xd0/0xd0
    [ 2.838075] ? hyperv_report_panic+0xd0/0xd0
    [ 2.844723] ? hv_resume+0x90/0x90
    [ 2.851375] __cpuhp_setup_state+0x3d/0x90
    [ 2.858030] hyperv_init+0x14e/0x410
    [ 2.864689] ? enable_IR_x2apic+0x190/0x1a0
    [ 2.871349] apic_intr_mode_init+0x8b/0x100
    [ 2.878017] x86_late_time_init+0x20/0x30
    [ 2.884675] start_kernel+0x459/0x4fb
    [ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb

Since the hypervisor already provides the VP assist pages for the root
partition, we need to memremap that memory from the hypervisor for the
root kernel to use. The mapping is done in hv_cpu_init during bringup
and is unmapped in hv_cpu_die during teardown.
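
A sketch of the root-partition branch (MSR layout handling simplified; the
low bits of the MSR hold the enable flag, and stashing the mapping in the
per-cpu VP assist slot is elided):

  /* root maps the hypervisor-chosen page instead of allocating one */
  if (hv_root_partition) {
          u64 msr_val;
          void *hvp;

          rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr_val);
          hvp = memremap(msr_val & PAGE_MASK, PAGE_SIZE, MEMREMAP_WB);
  }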

Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com>
Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Link: https://lore.kernel.org/r/20210731120519.17154-1-kumarpraveen@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-08-04 11:56:53 +00:00
Like Xu
e79f49c37c KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces
Based on our observations, after any vm-exit associated with vPMU, two
or more perf interfaces are called for guest counter emulation, such as
perf_event_{pause, read_value, period}(), and each one will {lock,
unlock} the same perf_event_ctx. The frequency of calls becomes more
severe when the guest uses counters in a multiplexed manner.

Holding a lock once and completing the KVM request operations in the perf
context would introduce a set of impractical new interfaces. So we can
further optimize the vPMU implementation by avoiding repeated calls to
these interfaces in the KVM context for at least one pattern:

After we call perf_event_pause() once, the event will be disabled and its
internal count will be reset to 0. So there is no need to pause it again
or read its value. Once the event is paused, event period will not be
updated until the next time it's resumed or reprogrammed. And there is
also no need to call perf_event_period twice for a non-running counter,
considering the perf_event for a running counter is never paused.
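
In code, the optimization boils down to caching the paused state on the
counter, roughly:

  /* sketch: skip redundant perf calls for an already-paused counter */
  static void pmc_pause_counter(struct kvm_pmc *pmc)
  {
          if (pmc->is_paused)
                  return;         /* already paused, internal count is 0 */

          pmc->counter += perf_event_pause(pmc->perf_event, true);
          pmc->is_paused = true;
  }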

Based on this implementation, for the following common usage of
sampling 4 events using perf on a 4u8g guest:

  echo 0 > /proc/sys/kernel/watchdog
  echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
  echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
  echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
  for i in `seq 1 1 10`
  do
  taskset -c 0 perf record \
  -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
  /root/br_instr a
  done

the average latency of the guest NMI handler is reduced from
37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
Also, in addition to collecting more samples, no loss of sampling
accuracy was observed compared to before the optimization.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210728120705.6855-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2021-08-04 05:55:56 -04:00
Hamza Mahfooz
269e9552d2 KVM: const-ify all relevant uses of struct kvm_memory_slot
As alluded to in commit f36f3f2846 ("KVM: add "new" argument to
kvm_arch_commit_memory_region"), a bunch of other places where struct
kvm_memory_slot is used need to be refactored to preserve the
"const"ness of struct kvm_memory_slot across the board.

Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
[Do not touch body of slot_rmap_walk_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:04:24 -04:00
Sean Christopherson
49d8665cc2 KVM: x86: Move EDX initialization at vCPU RESET to common code
Move the EDX initialization at vCPU RESET, which is now identical between
VMX and SVM, into common code.

No functional change intended.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:52 -04:00
Peter Xu
ec1cf69c37 KVM: X86: Add per-vm stat for max rmap list size
Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
for the vm.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:37 -04:00
Marco Elver
00e67bf030 kfence, x86: only define helpers if !MODULE
x86's <asm/tlbflush.h> only declares non-module accessible functions
(such as flush_tlb_one_kernel) if !MODULE.

In preparation of including <asm/kfence.h> from the KFENCE test module,
only define the helpers if !MODULE to avoid breaking the build with
CONFIG_KFENCE_KUNIT_TEST=m.
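
The guard mirrors the one in <asm/tlbflush.h>; a sketch of the resulting
<asm/kfence.h> shape:

  #ifndef MODULE
  static inline bool kfence_protect_page(unsigned long addr, bool protect)
  {
          /* ... toggle _PAGE_PRESENT on the linear-map PTE ... */
          flush_tlb_one_kernel(addr);
          return true;
  }
  #endif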

Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/YQJdarx6XSUQ1tFZ@elver.google.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-07-30 17:09:01 +02:00
Balbir Singh
b5f06f64e2 x86/mm: Prepare for opt-in based L1D flush in switch_mm()
The goal of this is to allow tasks that want to protect sensitive
information, e.g. against the recently found snoop-assisted data sampling
vulnerabilities, to flush their L1D on being switched out. This protects
their data from being snooped or leaked via side channels after the task
has context switched out.

This could also be used to wipe L1D when an untrusted task is switched in,
but that's not a really well defined scenario while the opt-in variant is
clearly defined.

The mechanism is default disabled and can be enabled on the kernel command
line.

Prepare for the actual prctl based opt-in:

  1) Provide the necessary setup functionality similar to the other
     mitigations and enable the static branch when the command line option
     is set and the CPU provides support for hardware assisted L1D
     flushing. Software based L1D flush is not supported because it's CPU
     model specific and not really well defined.

     This does not come with a sysfs file like the other mitigations
     because it is not bound to any specific vulnerability.

     Support has to be queried via the prctl(2) interface.

  2) Add TIF_SPEC_L1D_FLUSH next to L1D_SPEC_IB so the two bits can be
     mangled into the mm pointer in one go which allows to reuse the
     existing mechanism in switch_mm() for the conditional IBPB speculation
     barrier efficiently.

  3) Add the L1D flush specific functionality which flushes L1D when the
     outgoing task opted in.

     Also check whether the incoming task has requested L1D flush and, if
     so, validate that it is not accidentally running on an SMT sibling, as
     this makes the whole exercise moot because SMT siblings share L1D,
     which opens tons of other attack vectors. If that happens, schedule
     task work which signals the incoming task with SIGBUS on return to
     user/guest, as this is part of the paranoid L1D flush contract.
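
Condensed, the switch_mm() side of the contract looks roughly like the
sketch below (l1d_flush_hw() and l1d_flush_kill are stand-ins for the
series' actual plumbing):

  /* sketch: decision logic run in the switch_mm() mitigation path */
  static void l1d_flush_on_switch(struct task_struct *next,
                                  unsigned long prev_flags)
  {
          if (prev_flags & _TIF_SPEC_L1D_FLUSH)
                  l1d_flush_hw();         /* outgoing task opted in */

          if (unlikely(test_tsk_thread_flag(next, TIF_SPEC_L1D_FLUSH) &&
                       this_cpu_read(cpu_info.smt_active))) {
                  /* SMT sibling shares L1D: enforce the paranoid contract */
                  task_work_add(next, &l1d_flush_kill, TWA_RESUME);
          }
  }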

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
2021-07-28 11:42:24 +02:00
Balbir Singh
8aacd1eab5 x86/process: Make room for TIF_SPEC_L1D_FLUSH
The upcoming support for paranoid L1D flush in switch_mm() requires that
TIF_SPEC_IB and the new TIF_SPEC_L1D_FLUSH are two consecutive bits in
thread_info::flags.

Move TIF_SPEC_FORCE_UPDATE to a spare bit to make room for the new one.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
2021-07-28 11:42:24 +02:00
Balbir Singh
371b09c6fd x86/mm: Refactor cond_ibpb() to support other use cases
cond_ibpb() has the necessary bits required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases like L1D
flushing on context switch.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-3-sblbir@amazon.com
2021-07-28 11:42:24 +02:00
Balbir Singh
c52787b590 x86/smp: Add a per-cpu view of SMT state
A new field smt_active in cpuinfo_x86 identifies if the current core/cpu
is in SMT mode or not.

This is helpful when the system has some of its cores with threads offlined
and can be used for cases where action is taken based on the state of SMT.

The upcoming support for paranoid L1D flush will make use of this information.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-2-sblbir@amazon.com
2021-07-28 11:42:23 +02:00
Arnd Bergmann
1a33b18b3b compat: make linux/compat.h available everywhere
Parts of linux/compat.h are under an #ifdef, but we end up
using more of those over time, moving things around bit by
bit.

To get it over with once and for all, make all of this file
unconditional now so it can be accessed everywhere. There
are only a few types left that are in asm/compat.h but not
yet in the asm-generic version, so add those in the process.

This requires providing a few more types in asm-generic/compat.h
that were not already there. The only tricky one is
compat_sigset_t, which needs a little help on 32-bit architectures
and for x86.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23 14:20:24 +01:00
Dave Airlie
8da49a33dd drm-misc-next for v5.15-rc1:

Merge tag 'drm-misc-next-2021-07-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for v5.15-rc1:

UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
  Previous merge is not in an rc kernel yet, so no userspace regression is possible.

Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.

Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.

Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/2d0d2fe8-01fc-e216-c3fd-38db9e69944e@linux.intel.com
2021-07-23 11:32:43 +10:00
Javier Martinez Canillas
d391c58271 drivers/firmware: move x86 Generic System Framebuffers support
The x86 architecture has generic support to register a system framebuffer
platform device. It either registers a "simple-framebuffer" if the config
option CONFIG_X86_SYSFB is enabled, or a legacy VGA/VBE/EFI FB device.

But the code is generic enough to be reused by other architectures and can
be moved out of the arch/x86 directory.

This will also allow supporting the simple{fb,drm} drivers on non-x86 EFI
platforms, such as aarch64, where these drivers are only supported with DT.

Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625130947.1803678-2-javierm@redhat.com
2021-07-21 12:04:56 +02:00
Michael Kelley
afca4d95dd Drivers: hv: Make portions of Hyper-V init code be arch neutral
The code to allocate and initialize the hv_vp_index array is
architecture neutral. Similarly, the code to allocate and
populate the hypercall input and output arg pages is architecture
neutral.  Move both sets of code out from arch/x86 and into
utility functions in drivers/hv/hv_common.c that can be shared
by Hyper-V initialization on ARM64.

No functional changes. However, the allocation of the hypercall
input and output arg pages is done differently so that the
size is always the Hyper-V page size, even if not the same as
the guest page size (such as with ARM64's 64K page size).

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-2-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-07-15 12:59:45 +00:00
Aneesh Kumar K.V
dc4875f0e7 mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *
No functional change in this patch.
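
The change is purely at the type level; schematically the signature goes
from a bare integer to a typed pointer (the pud_pgtable() rename below
follows the same pattern):

  /* before: mapped address returned as a bare integer */
  unsigned long p4d_page_vaddr(p4d_t p4d);

  /* after: a typed pointer to the PUD page table */
  pud_t *p4d_pgtable(p4d_t p4d);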

[aneesh.kumar@linux.ibm.com: m68k build error reported by kernel robot]
  Link: https://lkml.kernel.org/r/87tulxnb2v.fsf@linux.ibm.com

Link: https://lkml.kernel.org/r/20210615110859.320299-2-aneesh.kumar@linux.ibm.com
Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00
Aneesh Kumar K.V
9cf6fa2458 mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *
No functional change in this patch.

[aneesh.kumar@linux.ibm.com: fix]
  Link: https://lkml.kernel.org/r/87wnqtnb60.fsf@linux.ibm.com
[sfr@canb.auug.org.au: another fix]
  Link: https://lkml.kernel.org/r/20210619134410.89559-1-aneesh.kumar@linux.ibm.com

Link: https://lkml.kernel.org/r/20210615110859.320299-1-aneesh.kumar@linux.ibm.com
Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00
Linus Torvalds
1423e2660c Fixes and improvements for FPU handling on x86:

Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Thomas Gleixner:
 "Fixes and improvements for FPU handling on x86:

   - Prevent sigaltstack out of bounds writes.

     The kernel unconditionally writes the FPU state to the alternate
     stack without checking whether the stack is large enough to
     accommodate it.

     Check the alternate stack size before doing so and in case it's too
     small force a SIGSEGV instead of silently corrupting user space
     data.

   - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never
     been updated despite the fact that the FPU state which is stored on
     the signal stack has grown over time which causes trouble in the
     field when AVX512 is available on a CPU. The kernel does not expose
     the minimum requirements for the alternate stack size depending on
     the available and enabled CPU features.

     ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
     Add it to x86 as well.

   - A major cleanup of the x86 FPU code. The recent discoveries of
     XSTATE related issues unearthed quite some inconsistencies,
     duplicated code and other issues.

     The fine granular overhaul addresses this, makes the code more
     robust and maintainable, which allows to integrate upcoming XSTATE
     related features in sane ways"

* tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again
  x86/fpu/signal: Let xrstor handle the features to init
  x86/fpu/signal: Handle #PF in the direct restore path
  x86/fpu: Return proper error codes from user access functions
  x86/fpu/signal: Split out the direct restore code
  x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing()
  x86/fpu/signal: Sanitize the xstate check on sigframe
  x86/fpu/signal: Remove the legacy alignment check
  x86/fpu/signal: Move initial checks into fpu__restore_sig()
  x86/fpu: Mark init_fpstate __ro_after_init
  x86/pkru: Remove xstate fiddling from write_pkru()
  x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate()
  x86/fpu: Remove PKRU handling from switch_fpu_finish()
  x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
  x86/fpu: Hook up PKRU into ptrace()
  x86/fpu: Add PKRU storage outside of task XSAVE buffer
  x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
  x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
  x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs()
  x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
  ...
2021-07-07 11:12:01 -07:00
Linus Torvalds
4cad671979 asm-generic/unaligned: Unify asm/unaligned.h around struct helper

Merge tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm/unaligned.h unification from Arnd Bergmann:
 "Unify asm/unaligned.h around struct helper

  The get_unaligned()/put_unaligned() helpers are traditionally
  architecture specific, with the two main variants being the
  "access-ok.h" version that assumes unaligned pointer accesses always
  work on a particular architecture, and the "le-struct.h" version that
  casts the data to a byte aligned type before dereferencing, for
  architectures that cannot always do unaligned accesses in hardware.

  Based on the discussion linked below, it appears that the access-ok
  version is not reliable on any architecture, but the struct version
  probably has no downsides. This series changes the code to use the
  same implementation on all architectures, addressing the few
  exceptions separately"

Link: https://lore.kernel.org/lkml/75d07691-1e4f-741f-9852-38c0b4f520bc@synopsys.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
Link: https://lore.kernel.org/lkml/20210507220813.365382-14-arnd@kernel.org/
Link: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-rework-v2
Link: https://lore.kernel.org/lkml/CAHk-=whGObOKruA_bU3aPGZfoDqZM1_9wBkwREp0H0FgR-90uQ@mail.gmail.com/

* tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  asm-generic: simplify asm/unaligned.h
  asm-generic: uaccess: 1-byte access is always aligned
  netpoll: avoid put_unaligned() on single character
  mwifiex: re-fix for unaligned accesses
  apparmor: use get_unaligned() only for multi-byte words
  partitions: msdos: fix one-byte get_unaligned()
  asm-generic: unaligned always use struct helpers
  asm-generic: unaligned: remove byteshift helpers
  powerpc: use linux/unaligned/le_struct.h on LE power7
  m68k: select CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
  sh: remove unaligned access for sh4a
  openrisc: always use unaligned-struct header
  asm-generic: use asm-generic/unaligned.h for most architectures
2021-07-02 12:43:40 -07:00
Linus Torvalds
71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Andy Shevchenko
f39650de68 kernel.h: split out panic and oops helpers
kernel.h has been used as a dumping ground for all kinds of stuff for a
long time.  Here is an attempt to start cleaning it up by splitting out
panic and oops helpers.

There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain

At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.

[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
  Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com

Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Anshuman Khandual
1c2f7d14d8 mm/thp: define default pmd_pgtable()
Currently most platforms define pmd_pgtable() as pmd_page(), duplicating
the same code all over.  Instead just define a default value, i.e.
pmd_page(), for pmd_pgtable() and let platforms override it when required
via <asm/pgtable.h>.  All the existing platforms that override
pmd_pgtable() have been moved into their respective <asm/pgtable.h>
headers so that they precede the new generic definition.  This makes it
much cleaner with reduced code.
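
The generic fallback, as described, is a guarded one-liner that an arch
header can pre-empt:

  /* in <linux/pgtable.h>, after <asm/pgtable.h> has had its say */
  #ifndef pmd_pgtable
  #define pmd_pgtable(pmd) pmd_page(pmd)
  #endif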

Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Anshuman Khandual
fac7757e1f mm: define default value for FIRST_USER_ADDRESS
Currently most platforms define FIRST_USER_ADDRESS as 0UL, duplicating the
same code all over.  Instead just define a generic default value (i.e. 0UL)
for FIRST_USER_ADDRESS and let the platforms override it when required.
This makes it much cleaner with reduced code.

The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h>
when the given platform overrides its value via <asm/pgtable.h>.
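
The pattern in <linux/pgtable.h> is the usual guarded default:

  /* skipped when <asm/pgtable.h> provides its own value */
  #ifndef FIRST_USER_ADDRESS
  #define FIRST_USER_ADDRESS 0UL
  #endif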

Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Guo Ren <guoren@kernel.org>			[csky]
Acked-by: Stafford Horne <shorne@gmail.com>		[openrisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	[RISC-V]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:02 -07:00
Linus Torvalds
1dfb0f47ac X86 entry code related updates:

Merge tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 entry code related updates from Thomas Gleixner:

 - Consolidate the macros for .byte ... opcode sequences

 - Deduplicate register offset defines in include files

 - Simplify the ia32,x32 compat handling of the related syscall tables
   to get rid of #ifdeffery.

 - Clear all EFLAGS which are not required for syscall handling

 - Consolidate the syscall tables and switch the generation over to the
   generic shell script and remove the CFLAGS tweaks which are no
   longer required.

 - Use 'int' type for system call numbers to match the generic code.

 - Add more selftests for syscalls

* tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/syscalls: Don't adjust CFLAGS for syscall tables
  x86/syscalls: Remove -Wno-override-init for syscall tables
  x86/uml/syscalls: Remove array index from syscall initializers
  x86/syscalls: Clear 'offset' and 'prefix' in case they are set in env
  x86/entry: Use int everywhere for system call numbers
  x86/entry: Treat out of range and gap system calls the same
  x86/entry/64: Sign-extend system calls on entry to int
  selftests/x86/syscall: Add tests under ptrace to syscall_numbering_64
  selftests/x86/syscall: Simplify message reporting in syscall_numbering
  selftests/x86/syscall: Update and extend syscall_numbering_64
  x86/syscalls: Switch to generic syscallhdr.sh
  x86/syscalls: Use __NR_syscalls instead of __NR_syscall_max
  x86/unistd: Define X32_NR_syscalls only for 64-bit kernel
  x86/syscalls: Stop filling syscall arrays with *_sys_ni_syscall
  x86/syscalls: Switch to generic syscalltbl.sh
  x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*
2021-06-29 12:44:51 -07:00
Linus Torvalds
a22c3f615a X86 interrupt related changes:
- Consolidate the VECTOR defines and the usage sites.
 
   - Cleanup GDT/IDT related code and replace open coded ASM with proper
     native helper functions.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDbLAUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTiXEACiuisDJ2fYFqU1dmYRbWIDtWbgsJ3k
 CVABRjgCbGfviKaaJuMoHf5tbnXWWu7y8jd8Z+h9cwOlyQOzNBsZjplzPS0h8zME
 KAekAkO2VGf5G7VdWLrfMvjIY/NDuAgxj+7w01LvnyWROePGRkbeP3iH41qo+auM
 5Cj4lu333+rO4kzmdXzwJ7CHQXOa/OT0MrBL14saYFaM3qSSkCzeIXnE6/ZNapsE
 zZYOCDF19MpPm6GZT1i4qRxirhw1TLNycsYavlOxZ/Hyp0BO0t2TiNRwZtdIVz+a
 1sedm+pD9E+1qHQfB+P03P65OixxN0hArNlKgGou5LDMRF45pvfqQXEBbTsqHSxh
 vWlL/tK7Z7U5dsK7ZA0HvlZYdrunWn/cNMqWb08WDyuPLxJ0QxJjsdOB2teVEus+
 kNYsP0ZxRvPNHKtqVfTXGS8ksrNS/57lUz6UJmBA3UYhYg33UgPCfF/gQzTnpfSo
 4TzhWIeLlCOId9FPxXpXa4NjjsqXvNEOPGrTx4BY8SYHYln4HoSyffRIZQ8xl0lA
 Qfetod+Hajt+5JXGndb906kexY7i14ZOrkHEjkUtq0asNmbwJ+hVs2VaYcq/ghuS
 BmhlnarYuWw9t11yD9Ln5stoVgRJ2KEX5T9fOCtCsJZyHo+Eta/p14ocU0eLQQdh
 HbsRKB+pE+al2A==
 =eAPe
 -----END PGP SIGNATURE-----

Merge tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 interrupt related updates from Thomas Gleixner:

 - Consolidate the VECTOR defines and the usage sites.

 - Cleanup GDT/IDT related code and replace open coded ASM with proper
   native helper functions.

* tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.c
  x86: Add native_[ig]dt_invalidate()
  x86/idt: Remove address argument from idt_invalidate()
  x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS
  x86/irq: Remove unused vectors defines
2021-06-29 12:36:59 -07:00
Linus Torvalds
36824f198c ARM:
- Add MTE support in guests, complete with tag save/restore interface
 
 - Reduce the impact of CMOs by moving them in the page-table code
 
 - Allow device block mappings at stage-2
 
 - Reduce the footprint of the vmemmap in protected mode
 
 - Support the vGIC on dumb systems such as the Apple M1
 
 - Add selftest infrastructure to support multiple configuration
   and apply that to PMU/non-PMU setups
 
 - Add selftests for the debug architecture
 
 - The usual crop of PMU fixes
 
 PPC:
 
 - Support for the H_RPT_INVALIDATE hypercall
 
 - Conversion of Book3S entry/exit to C
 
 - Bug fixes
 
 S390:
 
 - new HW facilities for guests
 
 - make inline assembly more robust with KASAN and co
 
 x86:
 
 - Allow userspace to handle emulation errors (unknown instructions)
 
 - Lazy allocation of the rmap (host physical -> guest physical address)
 
 - Support for virtualizing TSC scaling on VMX machines
 
 - Optimizations to avoid shattering huge pages at the beginning of live migration
 
 - Support for initializing the PDPTRs without loading them from memory
 
 - Many TLB flushing cleanups
 
 - Refuse to load if two-stage paging is available but NX is not (this has
   been a requirement in practice for over a year)
 
 - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
   CR0/CR4/EFER, using the MMU mode everywhere once it is computed
   from the CPU registers
 
 - Use PM notifier to notify the guest about host suspend or hibernate
 
 - Support for passing arguments to Hyper-V hypercalls using XMM registers
 
 - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on
   AMD processors
 
 - Hide Hyper-V hypercalls that are not included in the guest CPUID
 
 - Fixes for live migration of virtual machines that use the Hyper-V
   "enlightened VMCS" optimization of nested virtualization
 
 - Bugfixes (not many)
 
 Generic:
 
 - Support for retrieving statistics without debugfs
 
 - Cleanups for the KVM selftests API
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV9UYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOIRgf/XX8fKLh24RnTOs2ldIu2AfRGVrT4
 QMrr8MxhmtukBAszk2xKvBt8/6gkUjdaIC3xqEnVjxaDaUvZaEtP7CQlF5JV45rn
 iv1zyxUKucXrnIOr+gCioIT7qBlh207zV35ArKioP9Y83cWx9uAs22pfr6g+7RxO
 h8bJZlJbSG6IGr3voANCIb9UyjU1V/l8iEHqRwhmr/A5rARPfD7g8lfMEQeGkzX6
 +/UydX2fumB3tl8e2iMQj6vLVdSOsCkehvpHK+Z33EpkKhan7GwZ2sZ05WmXV/nY
 QLAYfD10KegoNWl5Ay4GTp4hEAIYVrRJCLC+wnLdc0U8udbfCuTC31LK4w==
 =NcRh
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This covers all architectures (except MIPS) so I don't expect any
  other feature pull requests this merge window.

  ARM:

   - Add MTE support in guests, complete with tag save/restore interface

   - Reduce the impact of CMOs by moving them in the page-table code

   - Allow device block mappings at stage-2

   - Reduce the footprint of the vmemmap in protected mode

   - Support the vGIC on dumb systems such as the Apple M1

   - Add selftest infrastructure to support multiple configuration and
     apply that to PMU/non-PMU setups

   - Add selftests for the debug architecture

   - The usual crop of PMU fixes

  PPC:

   - Support for the H_RPT_INVALIDATE hypercall

   - Conversion of Book3S entry/exit to C

   - Bug fixes

  S390:

   - new HW facilities for guests

   - make inline assembly more robust with KASAN and co

  x86:

   - Allow userspace to handle emulation errors (unknown instructions)

   - Lazy allocation of the rmap (host physical -> guest physical
     address)

   - Support for virtualizing TSC scaling on VMX machines

   - Optimizations to avoid shattering huge pages at the beginning of
     live migration

   - Support for initializing the PDPTRs without loading them from
     memory

   - Many TLB flushing cleanups

   - Refuse to load if two-stage paging is available but NX is not (this
     has been a requirement in practice for over a year)

   - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
     CR0/CR4/EFER, using the MMU mode everywhere once it is computed
     from the CPU registers

   - Use PM notifier to notify the guest about host suspend or hibernate

   - Support for passing arguments to Hyper-V hypercalls using XMM
     registers

   - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
     on AMD processors

   - Hide Hyper-V hypercalls that are not included in the guest CPUID

   - Fixes for live migration of virtual machines that use the Hyper-V
     "enlightened VMCS" optimization of nested virtualization

   - Bugfixes (not many)

  Generic:

   - Support for retrieving statistics without debugfs (a user-space
     sketch follows below)

   - Cleanups for the KVM selftests API"
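
The debugfs-free statistics interface is file-descriptor based. A hedged
user-space sketch, assuming the uapi definitions added by this series
(vm_fd is an existing KVM VM descriptor; error handling trimmed):

	#include <linux/kvm.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	/* Fetch the binary stats file of a VM and print how many
	 * stat descriptors it exports. */
	static void dump_stats_count(int vm_fd)
	{
		struct kvm_stats_header hdr;
		int stats_fd = ioctl(vm_fd, KVM_GET_STATS_FD, NULL);

		if (stats_fd < 0) {
			perror("KVM_GET_STATS_FD");
			return;
		}
		/* The header at offset 0 describes the descriptor and
		 * data blocks that follow in the stats file. */
		if (pread(stats_fd, &hdr, sizeof(hdr), 0) == (ssize_t)sizeof(hdr))
			printf("%u stat descriptors exported\n", hdr.num_desc);
		close(stats_fd);
	}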

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
  KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
  kvm: x86: disable the narrow guest module parameter on unload
  selftests: kvm: Allows userspace to handle emulation errors.
  kvm: x86: Allow userspace to handle emulation errors
  KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
  KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
  KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
  KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
  KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
  KVM: x86: Enhance comments for MMU roles and nested transition trickiness
  KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
  KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
  KVM: x86/mmu: Use MMU's role to determine PTTYPE
  KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
  KVM: x86/mmu: Add a helper to calculate root from role_regs
  KVM: x86/mmu: Add helper to update paging metadata
  KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
  KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
  KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
  KVM: x86/mmu: Get nested MMU's root level from the MMU's role
  ...
2021-06-28 15:40:51 -07:00
Linus Torvalds
9840cfcb97 arm64 updates for 5.14
- Optimise SVE switching for CPUs with 128-bit implementations.
 
  - Fix output format from SVE selftest.
 
  - Add support for versions 1.2 and 1.3 of the SMC calling convention.
 
  - Allow Pointer Authentication to be configured independently for
    kernel and userspace.
 
  - PMU driver cleanups for managing IRQ affinity and exposing event
    attributes via sysfs.
 
  - KASAN optimisations for both hardware tagging (MTE) and out-of-line
    software tagging implementations.
 
  - Relax frame record alignment requirements to facilitate 8-byte
    alignment with KASAN and Clang.
 
  - Cleanup of page-table definitions and removal of unused memory types.
 
  - Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
 
  - Refactoring of our instruction decoding routines and addition of some
    missing encodings.
 
  - Entry code moved into C and hardened against harmful compiler
    instrumentation.
 
  - Update booting requirements for the FEAT_HCX feature, added to v8.7
    of the architecture.
 
  - Fix resume from idle when pNMI is being used.
 
  - Additional CPU sanity checks for MTE and preparatory changes for
    systems where not all of the CPUs support 32-bit EL0.
 
  - Update our kernel string routines to the latest Cortex Strings
    implementation.
 
  - Big cleanup of our cache maintenance routines, which were confusingly
    named and inconsistent in their implementations.
 
  - Tweak linker flags so that GDB can understand vmlinux when using RELR
    relocations.
 
  - Boot path cleanups to enable early initialisation of per-cpu
    operations needed by KCSAN.
 
  - Non-critical fixes and miscellaneous cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmDUh1YQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNDaUCAC+2Jy2Yopd94uBPYajGybM0rqCUgE7b5n1
 A7UzmQ6fia2hwqCPmxGG+sRabovwN7C1bKrUCc03RIbErIa7wum1edeyqmF/Aw44
 DUDY1MAOSZaFmX8L62QCvxG1hfdLPtGmHMd1hdXvxYK7PCaigEFnzbLRWTtgE+Ok
 JhdvNfsoeITJObHnvYPF3rV3NAbyYni9aNJ5AC/qb3dlf6XigEraXaMj29XHKfwc
 +vmn+25oqFkLHyFeguqIoK+vUQAy/8TjFfjX83eN3LZknNhDJgWS1Iq1Nm+Vxt62
 RvDUUecWJjAooCWgmil6pt0enI+q6E8LcX3A3cWWrM6psbxnYzkU
 =I6KS
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "There's a reasonable amount here and the juicy details are all below.

  It's worth noting that the MTE/KASAN changes strayed outside of our
  usual directories due to core mm changes and some associated changes
  to some other architectures; Andrew asked for us to carry these [1]
  rather than take them via the -mm tree.

  Summary:

   - Optimise SVE switching for CPUs with 128-bit implementations.

   - Fix output format from SVE selftest.

   - Add support for versions 1.2 and 1.3 of the SMC calling
     convention.

   - Allow Pointer Authentication to be configured independently for
     kernel and userspace.

   - PMU driver cleanups for managing IRQ affinity and exposing event
     attributes via sysfs.

   - KASAN optimisations for both hardware tagging (MTE) and out-of-line
     software tagging implementations.

   - Relax frame record alignment requirements to facilitate 8-byte
     alignment with KASAN and Clang.

   - Cleanup of page-table definitions and removal of unused memory
     types.

   - Reduction of ARCH_DMA_MINALIGN back to 64 bytes.

   - Refactoring of our instruction decoding routines and addition of
     some missing encodings.

   - Entry code moved into C and hardened against harmful compiler
     instrumentation.

   - Update booting requirements for the FEAT_HCX feature, added to v8.7
     of the architecture.

   - Fix resume from idle when pNMI is being used.

   - Additional CPU sanity checks for MTE and preparatory changes for
     systems where not all of the CPUs support 32-bit EL0.

   - Update our kernel string routines to the latest Cortex Strings
     implementation.

   - Big cleanup of our cache maintenance routines, which were
     confusingly named and inconsistent in their implementations.

   - Tweak linker flags so that GDB can understand vmlinux when using
     RELR relocations.

   - Boot path cleanups to enable early initialisation of per-cpu
     operations needed by KCSAN.

   - Non-critical fixes and miscellaneous cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
  arm64: tlb: fix the TTL value of tlb_get_level
  arm64: Restrict undef hook for cpufeature registers
  arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
  arm64: insn: avoid circular include dependency
  arm64: smp: Bump debugging information print down to KERN_DEBUG
  drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
  perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
  arm64: suspend: Use cpuidle context helpers in cpu_suspend()
  PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
  arm64: Convert cpu_do_idle() to using cpuidle context helpers
  arm64: Add cpuidle context save/restore helpers
  arm64: head: fix code comments in set_cpu_boot_mode_flag
  arm64: mm: drop unused __pa(__idmap_text_start)
  arm64: mm: fix the count comments in compute_indices
  arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
  arm64: mm: Pass original fault address to handle_mm_fault()
  arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
  arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
  arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
  arm64: Conditionally configure PTR_AUTH key of the kernel.
  ...
2021-06-28 14:04:24 -07:00
Linus Torvalds
8e4d7a78f0 Misc cleanups & removal of obsolete code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZejQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hKCg//a0wiOyJBiWLAW0uiOucF2ICVQZj5rKgi
 M4HRJZ9jNkFUFVQ/eXYI7uedSqJ6B4hwoUqU6Yp6e05CF/Jgxe2OQXnknearjtDp
 xs8yBsnLolCrtHzWvuJAZL8InXwvUYrsxu1A8kWKd1ezZQ2V2aFEI4KtYcPVoBBi
 hRNMy1JVJbUoCG5s/CbsMpTKH0ehQFGsG46rCLQJ4s9H3rcYaCv9NY2q1EYKBrha
 ZiZjPSFBKaTAVEoc3tUbqsNZAqgyuwRcBQL0K5VDI9p92fudvKgsTI7erbmp+Lij
 mLhjjoPQK1C07kj0HpCPyoGMiTbJ2piag/jZnxSEiQnNxmZjqjRUhDuDhp6uc/SE
 98CEYWPoVbU7N6QLEurHVRAfaQ/ZC7PfiR7lhkoJHizaszFY1NFRxplsU1rzTwGq
 YZdr+y49tJTCU1wIvWF2eFBZHBEgfA6fP4TRGgVsQ7r8IhugR1nCLcnTfMLYXt2t
 9Fe57M7cBgZMgNf5AgvraowugJrTLX7240YPKxHnv5yLjIBt4bulm8X4Lq/MKgc+
 UbRfB7Trd2c9T4EVDy26rQ7qk+VC8rIbzEp4kvlDpV8u7BtLYhVonxVz6qPong5b
 NxOczaFsfL5gWJmfGU+vfc+RFl2lNhQQMLo/gdEn89qZL8nxL/4byejwfCs0YfC2
 wgDXNwRJb+g=
 =YqZp
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups & removal of obsolete code"

* tag 'x86-cleanups-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Correct kernel-doc's arg name in sgx_encl_release()
  doc: Remove references to IBM Calgary
  x86/setup: Document that Windows reserves the first MiB
  x86/crash: Remove crash_reserve_low_1M()
  x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
  x86/alternative: Align insn bytes vertically
  x86: Fix leftover comment typos
  x86/asm: Simplify __smp_mb() definition
  x86/alternatives: Make the x86nops[] symbol static
2021-06-28 13:10:25 -07:00
Linus Torvalds
909489bf9f Changes for this cycle:
- Micro-optimize and standardize the do_syscall_64() calling convention
  - Make syscall entry flags clearing more conservative
  - Clean up syscall table handling
  - Clean up & standardize assembly macros, in preparation for FRED
  - Misc cleanups and fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZeG8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gHQw//fI9MAIVQbB6tVMH6GtFkQZIJLMt/bik5
 AWelEXoBUbbLFGKpugC+oWGJjsvZ026f65hfQEswuqD4n0Xx8FFPRi51LP88lLya
 XQV8nssJYUKYZAVA0EJd7NmnJchbnRc4KQmu6ekEQdP6+Nht8k7U9O2QetgQgcE5
 IYhXctoYpr/FnBpV5PmVNAakOt0cZh6mXAtpzjHfdU8lUHZ13zPIpniSXCPd4vUB
 u/a3x3l1fP+Gg8d1vpfGCBvNKRBEh5pJsjaObMlLM/qhHupsDi5Ji6y6pcJSgkcv
 2nBtRGYDjYIQ0qXx6ILhNuqGFT76i/j2p8YfwMnH4NmYk908RlT0quu7fI8wBO9E
 cKd3m9BG8wP67xbOrG/0ckdl3+y/1iW8kPY6SeO03Vvfm6ryqHdZs4oi4CmcX9lP
 bFXi5AiYdHm0vqbwQG8P9LerWotgz4yFC9z7yC1KXJDXJxSwVxDFiXvyvxepRi6E
 NZxe4RSnDp7sijEvZJa/2EA+rDVDIokfzTLgnRSMkaUuxwNsVjeNsV0b5727kiVC
 DwVkxC7NZKG9UBr6WFs9hxRPE0g6xz3EJEBXaWpk2ggBmQxTfBRTjV0Pe3ii7dqQ
 z7O3Gv8pojki3ttG4wExLepPHRxTBzjdsoV6/BHZpraYTP11bpQlgx/K7IYJZYa5
 Tt9IZ4vNd10=
 =mbmH
 -----END PGP SIGNATURE-----

Merge tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm updates from Ingo Molnar:

 - Micro-optimize and standardize the do_syscall_64() calling convention

 - Make syscall entry flags clearing more conservative

 - Clean up syscall table handling

 - Clean up & standardize assembly macros, in preparation for FRED (a
   sketch of the _ASM_BYTES() pattern follows this list)

 - Misc cleanups and fixes
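
As an illustration of the _ASM_BYTES() work in the shortlog below, here
is a user-space re-creation of the pattern; these are my stand-in
macros, not the kernel's exact definitions, which live in <asm/asm.h>
and differ in detail. A variadic stringify turns raw opcode bytes into a
single ".byte ..." string suitable for pasting into inline asm.

	#include <stdio.h>

	#define __stringify_1(...)	#__VA_ARGS__
	#define __stringify(...)	__stringify_1(__VA_ARGS__)
	#define _ASM_BYTES(...)		" .byte " __stringify(__VA_ARGS__) " ;"

	int main(void)
	{
		/* A 3-byte x86 NOP (nopl (%rax)) as raw opcode bytes;
		 * the kernel pastes such strings into inline asm. */
		puts(_ASM_BYTES(0x0f, 0x1f, 0x00));
		return 0;
	}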

* tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm: Make <asm/asm.h> valid on cross-builds as well
  x86/regs: Syscall_get_nr() returns -1 for a non-system call
  x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros
  x86/syscall: Maximize MSR_SYSCALL_MASK
  x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]
  x86/entry: Reverse arguments to do_syscall_64()
  x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h>
  x86/asm: Use _ASM_BYTES() in <asm/nops.h>
  x86/asm: Add _ASM_BYTES() macro for a .byte ... opcode sequence
  x86/asm: Have the __ASM_FORM macros handle commas in arguments
2021-06-28 12:57:11 -07:00
Linus Torvalds
e5a0fc4e20 CPU setup code changes:
- Clean up & simplify AP exception handling setup.
 
  - Consolidate the disjoint IDT setup code living in
    idt_setup_traps() and idt_setup_ist_traps() into
    a single idt_setup_traps() initialization function
    and call it before cpu_init().
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZdu0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gWgA//ecu2FqCFT2gpZP7ABdJqAhtYA8f7rg/a
 BMKNNfTha/L9Dot2PSqMq8oVi8NA++EbKjuWeFujAIoU/2vso+YrmHc84O35nOOF
 u2Zgps7UK+ffKo11Yu1Vpb911ctJClAwL0CemsC30QDbpGVHPecLuxOIgY6E6BwC
 qhNqNLp0K4bFRq0ya27O8RPiz9LjCzUILHHvWSAl5m5tWqovED8aXdjrDJcFXqwY
 u9nuuRpUpQWqCldZP9X7+pdo4Z2HZjvIBjqHD/wl3VMjV6q+k+su6AjV9p1D8hoz
 otY96i8MQjD/sgIa1H+tUc2ZusGzDls+EpYiGaPmqeXMitKEwOFpVDAaT8SelUms
 bR4VQ9IYB1NG7Qbco3NQHMV1sWuvUJcLG6ILYFWXgH0hP1EDHFn/TvOn0rfJysbE
 AmCpwmUo0b8Bj6nbKkVcXxoX1FdeqiM5+cPxHxGVgxVoR0Umz13EX4y4cBzSIRht
 eYwT6H1CxR9a4TIr8cMBsN14QsnV3f6lv/RNfVdmZEJVVr0boRI90L2xMLBB9RkP
 z03g7VvfMuSWnKyOFheP4ae9ul2qxAT380+g1oHQH0XIFtj9yIhzJHpoUCzhgCra
 Ui2Z71Dhq0R1UNpPsPfc1XkQI9chiahn8gc1u2zvN4SzZa6DZH22VvGNK0ghoIxq
 5WFho50hNIk=
 =BPbv
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 exception handling updates from Ingo Molnar:

 - Clean up & simplify AP exception handling setup.

 - Consolidate the disjoint IDT setup code living in idt_setup_traps()
   and idt_setup_ist_traps() into a single idt_setup_traps()
   initialization function and call it before cpu_init().

* tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/idt: Rework IDT setup for boot CPU
  x86/cpu: Init AP exception handling from cpu_init_secondary()
2021-06-28 12:46:30 -07:00
Linus Torvalds
54a728dc5e Scheduler updates for this cycle:
- Changes to core scheduling facilities:
 
     - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
       coordinated scheduling across SMT siblings. This is a much
       requested feature for cloud computing platforms, to allow
       the flexible utilization of SMT siblings, without exposing
       untrusted domains to information leaks & side channels, plus
       to ensure more deterministic computing performance on SMT
       systems used by heterogeneous workloads.
 
       There are new prctls to set core scheduling groups, which
       allow more flexible management of workloads that can share
       siblings.
 
     - Fix task->state access anti-patterns that may result in missed
       wakeups and rename it to ->__state in the process to catch new
       abuses.
 
  - Load-balancing changes:
 
      - Tweak newidle_balance for fair-sched, to improve
        'memcache'-like workloads.
 
      - "Age" (decay) average idle time, to better track & improve workloads
        such as 'tbench'.
 
      - Fix & improve energy-aware (EAS) balancing logic & metrics.
 
      - Fix & improve the uclamp metrics.
 
      - Fix task migration (taskset) corner case on !CONFIG_CPUSET.
 
      - Fix RT and deadline utilization tracking across policy changes
 
      - Introduce a "burstable" CFS controller via cgroups, which allows
        bursty CPU-bound workloads to borrow a bit against their future
        quota to improve overall latencies & batching. Can be tweaked
        via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
 
      - Rework asymmetric topology/capacity detection & handling.
 
  - Scheduler statistics & tooling:
 
      - Disable delayacct by default, but add a sysctl to enable
        it at runtime if tooling needs it. Use static keys and
        other optimizations to make it more palatable.
 
      - Use sched_clock() in delayacct, instead of ktime_get_ns().
 
  - Misc cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj
 vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns
 vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA
 b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc
 4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO
 Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17
 5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz
 3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92
 GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ
 ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue
 +U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO
 UmG7bt94Trk=
 =3VDr
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
      coordinated scheduling across SMT siblings. This is a much
      requested feature for cloud computing platforms, to allow the
      flexible utilization of SMT siblings, without exposing untrusted
      domains to information leaks & side channels, plus to ensure more
      deterministic computing performance on SMT systems used by
      heterogeneous workloads.

      There are new prctls to set core scheduling groups, which allow
      more flexible management of workloads that can share siblings
      (a user-space sketch follows this list).

    - Fix task->state access anti-patterns that may result in missed
      wakeups and rename it to ->__state in the process to catch new
      abuses.

 - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
      workloads.

    - "Age" (decay) average idle time, to better track & improve
      workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
      bursty CPU-bound workloads to borrow a bit against their future
      quota to improve overall latencies & batching. Can be tweaked via
      /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

 - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
      runtime if tooling needs it. Use static keys and other
      optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

 - Misc cleanups and fixes.
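
A minimal user-space sketch of the new core-scheduling prctl. The
constants are guarded in case <sys/prctl.h> predates this series; the
fourth argument 0 is assumed here to mean PIDTYPE_PID, i.e. just this
task.

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_SCHED_CORE
	#define PR_SCHED_CORE		62
	#define PR_SCHED_CORE_CREATE	1
	#endif

	int main(void)
	{
		/* Create a new core-scheduling cookie for this task;
		 * only tasks sharing the cookie may run concurrently
		 * on SMT siblings of one core. */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(), 0, 0))
			perror("PR_SCHED_CORE_CREATE (needs CONFIG_SCHED_CORE)");
		else
			puts("core scheduling cookie created");
		return 0;
	}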

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/doc: Update the CPU capacity asymmetry bits
  sched/topology: Rework CPU capacity asymmetry detection
  sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
  psi: Fix race between psi_trigger_create/destroy
  sched/fair: Introduce the burstable CFS controller
  sched/uclamp: Fix uclamp_tg_restrict()
  sched/rt: Fix Deadline utilization tracking during policy change
  sched/rt: Fix RT utilization tracking during policy change
  sched: Change task_struct::state
  sched,arch: Remove unused TASK_STATE offsets
  sched,timer: Use __set_current_state()
  sched: Add get_current_state()
  sched,perf,kvm: Fix preemption condition
  sched: Introduce task_is_running()
  sched: Unbreak wakeups
  sched/fair: Age the average idle time
  sched/cpufreq: Consider reduced CPU capacity in energy calculation
  sched/fair: Take thermal pressure into account while estimating energy
  thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  sched/fair: Return early from update_tg_cfs_load() if delta == 0
  ...
2021-06-28 12:14:19 -07:00
Linus Torvalds
28a27cbd86 Perf events updates for this cycle:
- Platform PMU driver updates:
 
      - x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
      - Fix RDPMC support
      - Fix [extended-]PEBS-via-PT support
      - Fix Sapphire Rapids event constraints
      - Fix :ppp support on Sapphire Rapids
      - Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
      - Other heterogeneous-PMU fixes
 
  - Kprobes:
 
      - Remove the unused and misguided kprobe::fault_handler callbacks.
      - Warn about kprobes taking a page fault.
      - Fix the 'nmissed' stat counter.
 
  - Misc cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZaxMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hPgw//f9SnGzFoP1uR5TBqM8j/QHulMewew/iD
 dM5lh2emdmqHWYPBeRxUHgag38K2Golr3Y+NxLA3R+RMx+OZQe8Mz/wYvPQcBvsV
 k1HHImU3GRMn4GM7GwxH3vPIottDUx3mNS2J6pzlw3kwRUVqrxUdj/0/pSY/4eJ7
 ZT4uq4yLV83Jd3qioU7o7e/u6MrdNIIcAXRpVDdE9Mm1+kWXSVN7/h3Vsiz4tj5E
 iS+UXEtSc1a2mnmekv63pYkJHHNUb6guD8jgI/wrm1KIFGjDRifM+3TV6R/kB96/
 TfD2LhCcTShfSp8KI191pgV7/NQbB/PmLdSYmff3rTBiii4cqXuCygJCHInZ09z0
 4fTSSqM6aHg7kfTQyOCp+DUQ+9vNVXWo8mxt9c6B8xA0GyCI3zhjQ4UIiSUWRpjs
 Be5ZyF0kNNuPxYrKFnGnBf8+51DURpCz3sDdYRuK4KNkj1+4ZvJo/KzGTMUUIE4B
 IDQG6wDP5Kb388eRDtKrG5X7IXg+L5F/kezin60j0QF5MwDgxirT217teN8H1lNn
 YgWMjRK8Tw0flUJsbCxa51/nl93UtByB+fIRIc88MSeLxcI6/ORW+TxBBEqkYm5Z
 6BLFtmHSuAqAXUuyZXSGLcW7XLJvIaDoHgvbDn6l4g7FMWHqPOIq6nJQY3L8ben2
 e+fQrGh4noI=
 =20Vc
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events updates from Ingo Molnar:

 - Platform PMU driver updates:

     - x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
     - Fix RDPMC support
     - Fix [extended-]PEBS-via-PT support
     - Fix Sapphire Rapids event constraints
     - Fix :ppp support on Sapphire Rapids
     - Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
     - Other heterogeneous-PMU fixes

 - Kprobes:

     - Remove the unused and misguided kprobe::fault_handler callbacks.
     - Warn about kprobes taking a page fault.
     - Fix the 'nmissed' stat counter.

 - Misc cleanups and fixes.

* tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix task context PMU for Hetero
  perf/x86/intel: Fix instructions:ppp support in Sapphire Rapids
  perf/x86/intel: Add more events requires FRONTEND MSR on Sapphire Rapids
  perf/x86/intel: Fix fixed counter check warning for some Alder Lake
  perf/x86/intel: Fix PEBS-via-PT reload base value for Extended PEBS
  perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
  kprobes: Do not increment probe miss count in the fault handler
  x86,kprobes: WARN if kprobes tries to handle a fault
  kprobes: Remove kprobe::fault_handler
  uprobes: Update uprobe_write_opcode() kernel-doc comment
  perf/hw_breakpoint: Fix DocBook warnings in perf hw_breakpoint
  perf/core: Fix DocBook warnings
  perf/core: Make local function perf_pmu_snapshot_aux() static
  perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX
  perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR
  perf/x86/intel/uncore: Generalize I/O stacks to PMON mapping procedure
  perf/x86/intel/uncore: Drop unnecessary NULL checks after container_of()
2021-06-28 12:03:20 -07:00
Linus Torvalds
a15286c63d Locking changes for this cycle:
- Core locking & atomics:
 
      - Convert all architectures to ARCH_ATOMIC: move every
        architecture to ARCH_ATOMIC, then get rid of ARCH_ATOMIC
        and all the transitory facilities and #ifdefs.
 
        Much reduction in complexity from that series:
 
            63 files changed, 756 insertions(+), 4094 deletions(-)
 
      - Self-test enhancements
 
  - Futexes:
 
      - Add the new FUTEX_LOCK_PI2 ABI, which is a variant that
        doesn't set FLAGS_CLOCKRT (i.e. uses CLOCK_MONOTONIC).
 
        [ The temptation to repurpose FUTEX_LOCK_PI's implicit
          setting of FLAGS_CLOCKRT & invert the flag's meaning
          to avoid having to introduce a new variant was
          resisted successfully. ]
 
      - Enhance futex self-tests
 
  - Lockdep:
 
      - Fix dependency path printouts
      - Optimize trace saving
      - Broaden & fix wait-context checks
 
  - Misc cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZaEYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hPdxAAiNCsxL6X1cZ8zqbWsvLefT9Zqhzgs5u6
 gdZele7PNibvbYdON26b5RUzuKfOW/hgyX6LKqr+AiNYTT9PGhcY+tycUr2PGk5R
 LMyhJWmmX5cUVPU92ky+z5hEHB2gr4XPJcvgpKKUL0XB1tBaSvy2DtgwPuhXOoT1
 1sCQfy63t71snt2RfEnibVW6xovwaA2lsqL81lLHJN4iRFWvqO498/m4+PWkylsm
 ig/+VT1Oz7t4wqu3NhTqNNZv+4K4W2asniyo53Dg2BnRm/NjhJtgg4jRibrb0ssb
 67Xdq6y8+xNBmEAKj+Re8VpMcu4aj346Ctk7d4gst2ah/Rc0TvqfH6mezH7oq7RL
 hmOrMBWtwQfKhEE/fDkng30nrVxc/98YXP0n2rCCa0ySsaF6b6T185mTcYDRDxFs
 BVNS58ub+zxrF9Zd4nhIHKaEHiL2ZdDimqAicXN0RpywjIzTQ/y11uU7I1WBsKkq
 WkPYs+FPHnX7aBv1MsuxHhb8sUXjG924K4JeqnjF45jC3sC1crX+N0jv4wHw+89V
 h4k20s2Tw6m5XGXlgGwMJh0PCcD6X22Vd9Uyw8zb+IJfvNTGR9Rp1Ec+1gMRSll+
 xsn6G6Uy9bcNU0SqKlBSfelweGKn4ZxbEPn76Jc8KWLiepuZ6vv5PBoOuaujWht9
 KAeOC5XdjMk=
 =tH//
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Core locking & atomics:

     - Convert all architectures to ARCH_ATOMIC: move every architecture
       to ARCH_ATOMIC, then get rid of ARCH_ATOMIC and all the
       transitory facilities and #ifdefs.

       Much reduction in complexity from that series:

           63 files changed, 756 insertions(+), 4094 deletions(-)

     - Self-test enhancements

 - Futexes:

     - Add the new FUTEX_LOCK_PI2 ABI, which is a variant that doesn't
       set FLAGS_CLOCKRT (i.e. uses CLOCK_MONOTONIC; a user-space sketch
       follows this list).

       [ The temptation to repurpose FUTEX_LOCK_PI's implicit setting of
         FLAGS_CLOCKRT & invert the flag's meaning to avoid having to
         introduce a new variant was resisted successfully. ]

     - Enhance futex self-tests

 - Lockdep:

     - Fix dependency path printouts

     - Optimize trace saving

     - Broaden & fix wait-context checks

 - Misc cleanups and fixes.
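
A hedged user-space sketch of FUTEX_LOCK_PI2. PI-futex timeouts are
absolute; with the new op the deadline is measured on CLOCK_MONOTONIC.
The opcode value is guarded in case <linux/futex.h> predates the series.

	#include <linux/futex.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <time.h>
	#include <unistd.h>

	#ifndef FUTEX_LOCK_PI2
	#define FUTEX_LOCK_PI2 13	/* added by this series */
	#endif

	static uint32_t lock_word;	/* 0 == unowned PI futex */

	int main(void)
	{
		struct timespec to;

		clock_gettime(CLOCK_MONOTONIC, &to);
		to.tv_sec += 1;		/* absolute deadline, 1s out */

		/* Uncontended acquisition: on success the kernel
		 * stores our TID into the futex word. */
		if (syscall(SYS_futex, &lock_word, FUTEX_LOCK_PI2, 0, &to, NULL, 0))
			perror("FUTEX_LOCK_PI2 (kernel too old?)");
		else {
			printf("acquired, owner tid = %u\n", lock_word);
			syscall(SYS_futex, &lock_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
		}
		return 0;
	}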

* tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  locking/lockdep: Correct the description error for check_redundant()
  futex: Provide FUTEX_LOCK_PI2 to support clock selection
  futex: Prepare futex_lock_pi() for runtime clock selection
  lockdep/selftest: Remove wait-type RCU_CALLBACK tests
  lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING
  lockdep: Fix wait-type for empty stack
  locking/selftests: Add a selftest for check_irq_usage()
  lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()
  locking/lockdep: Remove the unnecessary trace saving
  locking/lockdep: Fix the dep path printing for backwards BFS
  selftests: futex: Add futex compare requeue test
  selftests: futex: Add futex wait test
  seqlock: Remove trailing semicolon in macros
  locking/lockdep: Reduce LOCKDEP dependency list
  locking/lockdep,doc: Improve readability of the block matrix
  locking/atomics: atomic-instrumented: simplify ifdeffery
  locking/atomic: delete !ARCH_ATOMIC remnants
  locking/atomic: xtensa: move to ARCH_ATOMIC
  locking/atomic: sparc: move to ARCH_ATOMIC
  locking/atomic: sh: move to ARCH_ATOMIC
  ...
2021-06-28 11:45:29 -07:00
Linus Torvalds
b89c07dea1 A single ELF format fix for a section flags mismatch bug that breaks
kernel tooling such as kpatch-build.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZYv4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ipeBAAhJPS/kCQ17Y5zGyMB0/6yfCWIifODoS7
 9J+6/mqKHPDdV07yzPtOXuTTmpKV4OHPi8Yj8kaXs5L5fOmQ1uAwITwZNF5hU0a5
 CiFIsubUCJmglf9b6L9EH5pBEQ72Cq4u8zIhJ9LmZ4t625AHJAm2ikZgascc4U67
 RvVoGr5sYTo0YEsc1IDM1wUtnUhXBNjS1VwkXNnCFFTXYHju47MeY1sPHq2hvkzO
 iJGC9A+hxfM1eQt9/qC/2L/6F/XECN61gcR9Get8TkWeEGHmPG+FthmPLd4oO9Ho
 03J4JfMbmXumWosAeilYBNUkfii/M5Em78Wpv/cB94iSt67rq7Eb+8gm4D5svmfN
 l+utsPY/HYB+uWV0hy2cV/ORRiwcJnon54dEWL6912YkKz+OIb3DK/7l9ex5lW+D
 r3o8NP0s6S+RgUkOFxz5VaYK1giu6fiaFysWdKeflvwlvY/64owMepQ1QfPBbeB7
 3DTzvuYZ4Cb1x/vR6WBbFqGcuJKZ1CsZIBLCblveUs+G0wlu147K5E1qlXg/Wvq7
 5Vzznc4fmRng8np5hxAw8ieLkatWg7szyryUV/4H2Ubs/jWGcH628ZYbapaCb7EM
 Eson65xzbVfhnz16z8sN13XIF1lGe8sb0+qiFSclEfyDUnZDuhwMn6d9Ubqxrg5J
 uTULEzmY/rI=
 =MvPd
 -----END PGP SIGNATURE-----
mergetag object d33b9035e1
 type commit
 tag objtool-core-2021-06-28
 tagger Ingo Molnar <mingo@kernel.org> 1624859477 +0200
 
 The biggest change in this cycle is the new code to handle
 and rewrite variable sized jump labels - which results in
 slightly tighter code generation in hot paths, through the
 use of short(er) NOPs.
 
 Also a number of cleanups and fixes, and a change to the
 generic include/linux/compiler.h to handle a s390 GCC quirk.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZZGcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1goYg/7BxUIJXP0F5wbrMbAvJIDRgR/j3TA+ztk
 uNU1yabBGluMxCqJ87HadJ+A5d010G+GRUn/birVr7w1UuwWv8HOda78dnyG7tme
 xm78/1FlOnstuOTQxhK6rjbb2cp+QOmdsAQkq1TF4SOxArBQiwtjiOvytHjb5yNx
 7LrlbtuZ7Dtc0qd2evkG4ma4QkGoDhBS1dRogrItc27ZLuFIQoNnEd2K2QNMgczw
 a/Jx8fgNmdoJSq+vkBn9TnS/cJYUW/PAlPNtO3ac8yE857aDIVnjXFRzveAP/nTh
 rwFD6aCGnJAqyqP7A8ElNjySos5O+ebYApxe7rEx0TNLbrc55qSP9lpdIO+vgytV
 Xzy4O7z6o+lailQ4EoF8Qf+rlPeue0kLF23SsNbZY1uT0vjX1Uv70xgKbkuyPygp
 GNXAy6dOXK0AfaZYL/Wa50yVnJnkYDjes/hHr+HEam5Oad566pqIyQNP8yWSPqaf
 KHkL//1pb5C2RKwot4IYv/ftHfZB5QftoFq6bhGBc1GXUd/FiqivvGHPW/6g7rxi
 ZIrXs+Fqm/5KP9mssNONfyz5XEvbcUTD1CbeqX9eyVbiYZbLp1oWSgtogiRW9ya+
 HR7t0Dt/UFzFWbilb6EZff/Hdr1NZBZLdrfpvVDoMf5tR9J0BIOyjddTu89g/FIO
 KcfJ5yyjJBU=
 =+HAB
 -----END PGP SIGNATURE-----

Merge tags 'objtool-urgent-2021-06-28' and 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fix and updates from Ingo Molnar:
 "An ELF format fix for a section flags mismatch bug that breaks kernel
  tooling such as kpatch-build.

  The biggest change in this cycle is the new code to handle and rewrite
  variable sized jump labels - which results in slightly tighter code
  generation in hot paths, through the use of short(er) NOPs.

  Also a number of cleanups and fixes, and a change to the generic
  include/linux/compiler.h to handle a s390 GCC quirk"

* tag 'objtool-urgent-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Don't make .altinstructions writable

* tag 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Improve reloc hash size guestimate
  instrumentation.h: Avoid using inline asm operand modifiers
  compiler.h: Avoid using inline asm operand modifiers
  kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
  objtool: Reflow handle_jump_alt()
  jump_label/x86: Remove unused JUMP_LABEL_NOP_SIZE
  jump_label, x86: Allow short NOPs
  objtool: Provide stats for jump_labels
  objtool: Rewrite jump_label instructions
  objtool: Decode jump_entry::key addend
  jump_label, x86: Emit short JMP
  jump_label: Free jump_entry::key bit1 for build use
  jump_label, x86: Add variable length patching support
  jump_label, x86: Introduce jump_entry_size()
  jump_label, x86: Improve error when we fail expected text
  jump_label, x86: Factor out the __jump_table generation
  jump_label, x86: Strip ASM jump_label support
  x86, objtool: Dont exclude arch/x86/realmode/
  objtool: Rewrite hashtable sizing
2021-06-28 11:35:55 -07:00
Linus Torvalds
d04f7de0a5 - Differentiate the type of exception the #VC handler raises depending
on code executed in the guest and handle the case where failure to
 get the RIP would result in a #GP, as it should, instead of in a #PF
 
 - Disable interrupts while the per-CPU GHCB is held
 
 - Split the #VC handler depending on where the #VC exception has
 happened, and therefore provide precise context tracking in line with
 how the rest of the exception handlers deal with noinstr regions now
 
 - Add defines for the GHCB version 2 protocol so that further shared
 development with KVM can happen without merge conflicts
 
 - The usual small cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZij8ACgkQEsHwGGHe
 VUpwIQ/8CzFbGm2k2RdmO0H/VPwfF1HFSWpM9YFGSs++yOqfiyCFbyIcTcRbK4IO
 +BUIRoHSgCWPb+5pJli1Wf0J/sIdYr9D4MDWt1oRQG6e/4NE2SL3EOnYJWW5VtOT
 u1AVk01ooPOFDKIoh4OIZ7tCKAeNWBv+oe5dmP46spiEZbHHCzHIEaBuOQRzvX9C
 jSKulDHjA4iaNl/BQMF7dJL1+aPWj2NXjSj86fhMAa+m5MspDXbIaM5wMZfPzc1k
 Rj/m89JThp+mFwik46o/7g/5Q8SYtTE+Hqi1TX/65/dbyizLqbH5W3g0zwrD8TYf
 B7kHguqkoE1j1avLwOYK1yJB8ZTjtf+OXjUAR4UPzxkG7Xhelu5Qb7RD/WCJ3YqO
 KEFIFq+hsiAqvb6RkmX0aVecIJ49aqGX+onsMpLWq9pz2R4BRcH7jo81TIBcosg5
 2Kfx2aPcMec7u7RMBHqwiaC4Adp7/vmHhukawfI8xCWLd7wEjvAMP3eeePxR+C0l
 SSnn0O9COj8pctvq4eOGJAUXzPa4YtsaX+kILBs+hUdQXmQGVSxyTpakyhhUpGQ8
 YyblbHybS8JeYdGqPVS/tn0Rc2DqOSQJetjmXAGhlkEkkGY8i1Ddwe0MaamJozol
 g/wHNYcok/OQWglvVThv6EAY2pTSeWelmjUkZi1dnkYNH1VUxxE=
 =iyX+
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Differentiate the type of exception the #VC handler raises depending
   on code executed in the guest and handle the case where failure to
   get the RIP would result in a #GP, as it should, instead of in a #PF

 - Disable interrupts while the per-CPU GHCB is held

 - Split the #VC handler depending on where the #VC exception has
   happened, and therefore provide precise context tracking in line with
   how the rest of the exception handlers deal with noinstr regions now

 - Add defines for the GHCB version 2 protocol so that further shared
   development with KVM can happen without merge conflicts

 - The usual small cleanups

* tag 'x86_sev_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Use "SEV: " prefix for messages from sev.c
  x86/sev: Add defines for GHCB version 2 MSR protocol requests
  x86/sev: Split up runtime #VC handler for correct state tracking
  x86/sev: Make sure IRQs are disabled while GHCB is active
  x86/sev: Propagate #GP if getting linear instruction address failed
  x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]()
  x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip()
  x86/sev: Fix error message in runtime #VC handler
2021-06-28 11:29:12 -07:00
Linus Torvalds
2594b713c1 - New AMD models support
- Allow MONITOR/MWAIT to be used for C1 state entry on Hygon too
 
 - Use the special RAPL CPUID bit to detect the functionality on AMD and
   Hygon instead of doing family matching.
 
 - Add support for new Intel microcode deprecating TSX on some models and
 do not enable kernel workarounds for those CPUs when TSX transactions
 always abort, as a result of that microcode update.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZhzEACgkQEsHwGGHe
 VUo5ow//eRwlb1OL/D3jzLT4nTYX8+XdufaJF1HBr1Cf3mdNkiEgyu2bvsXNTpN/
 ZP7CFCHibgYeHJ7qTTkhoK1DCe4YHjj450oCgg7pv40Mv9E29Rpszie8y8e/ngkc
 g9OiAeEd4A32v8bRMAOOX0UZN4afismXBW0k4iwOAguNFiZ/usrrVYTZpJe3wG65
 /YM9FdDZ+Mt7BavJdVyGh03PpzoSMrKyEQ673CHhERQyy5oEublrDSmtt5hQJv1W
 4tgNOWpw57Gi7Vs7UYd7VvBQKwQZKeQeHJWu1TXUB6pw0lKYvULH6m0dasvc6cGb
 WtCBvbQU9MRP0LvdvYOdgmSgn400z7mEwlUWmAFJLIUlDsuRpZmVQ4C1/OUnOSdx
 amb7I3bp1z6Rqjs9ADW5h87qDA+q5OmbIZeIDvuRypQOB3yEktAEdUvWb65b1Fgm
 9CpzebxyaOUM9YRxDzDd2joZYKnfI3stF6UCrVXaZwYei+Jmzn5gc8ZOoOX9g6gO
 eX/sLW2RWRx6XxilaWZijOHJTjokVUpEnD12aGtKO6ou5QbFTwldc2Metpua42cL
 5p8wRxEYeKT/EE/GKy/qIEp624QaInSEmfyq8RFKU4em7GSaSUmoQF5151LfnoRY
 ARHkEdz+T8s5RI5xSvUZLRMNYjig9tZas3blYfbJHnU7V2+bspQ=
 =wW+k
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - New AMD models support

 - Allow MONITOR/MWAIT to be used for C1 state entry on Hygon too

 - Use the special RAPL CPUID bit to detect the functionality on AMD and
   Hygon instead of doing family matching.

 - Add support for new Intel microcode deprecating TSX on some models
   and do not enable kernel workarounds for those CPUs when TSX
   transactions always abort, as a result of that microcode update.

* tag 'x86_cpu_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tsx: Clear CPUID bits when TSX always force aborts
  x86/events/intel: Do not deploy TSX force abort workaround when TSX is deprecated
  x86/msr: Define new bits in TSX_FORCE_ABORT MSR
  perf/x86/rapl: Use CPUID bit on AMD and Hygon parts
  x86/cstate: Allow ACPI C1 FFH MWAIT use on Hygon systems
  x86/amd_nb: Add AMD family 19h model 50h PCI ids
  x86/cpu: Fix core name for Sapphire Rapids
2021-06-28 11:22:40 -07:00
Linus Torvalds
f565b20734 - Add the required information to the faked APEI-reported mem error so
that the kernel properly attempts to offline the corresponding page, as
 it does for kernel-detected correctable errors.
 
 - Fix a typo in AMD's error descriptions.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZg1cACgkQEsHwGGHe
 VUpW8BAAlnwit5Vg4UVocY7mwTi0GvP36Fz2u81kppMROpWgQhhmX35ZxoxgQoSC
 0ojKnOJTgGpOdKknmK/vom4ysxNRZxjz0zat9n+cqcfqVwP14KzhjaX1FPXnEQfE
 mPkn3v8fsML87glPTzmpELYSOZTpu6OYdiFZAzKL8Gp8aytyh4FamTV2eTxn5ClG
 +dejrN0NFiSALarliNttPnpfC5JvQ0KUJFxapYaMd27ssqL/2XMvJmBSpGC+OaZg
 lvvv7XuRrIPRZ7lU3Zipz7Rv5r8tTfPUMr33DcUuAZxpXW3zRpds153HktTYSqsv
 pZHTTLZ73GbAFVlkjqP6wcAtW2ygKW3lxsPuBSR8aIj8yU7rrrkG4wm2XsvCtrXP
 4KrTZLgqGHFQaXbp1BzJzrnLyb6dxZXkEaAnX/7ZygDz+L5aMlG/7XEk4c/R9YbS
 bg6NO/Dh1E547cf+bN6/yYNwPjNaO1lGOMU9N2IwjCiHFERzTsFGyFNjqMSGa7Ul
 34FZAB11aklqbj+0amu5IeMd8vM3unqTGnYEQCcyG09mdsa9/bjEvEgCirq5FXf3
 szjUmGpdtAsxCRZ7SzhsDu1IMT0F2D8hwgJbFSLXmtpiq5WB/EHaYbiqg8F6V36J
 bENGE3rLj3HkgWHsLpgEMX2OXh7Pzo3UqwwbtOuYEiwwhvh7CZk=
 =1Azq
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RAS updates from Borislav Petkov:

 - Add the required information to the faked APEI-reported mem error so
   that the kernel properly attempts to offline the corresponding page,
   as it does for kernel-detected correctable errors.

 - Fix a typo in AMD's error descriptions.

* tag 'ras_core_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/mce_amd: Fix typo "FIfo" -> "Fifo"
  x86/mce: Include a MCi_MISC value in faked mce logs
  x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
2021-06-28 11:19:40 -07:00
Linus Torvalds
94ca94bbbb Two more urgent FPU fixes:
- Prevent unprivileged userspace from reinitializing supervisor states
 
  - Prepare init_fpstate, which is the buffer used when initializing
  FPU state, properly in case the skip-writing-state-components XSAVE*
  variants are used.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDVltYACgkQEsHwGGHe
 VUqY9Q//c4MhJP2E15cqTWupxYk41k0UMjqPIwGmt6hRoDNKeFQm0xSgeOwe2mgk
 bbzGDJOfAi2Hxza2fw6No4wIiaB3sZIqK451aI1SM9HTDB/B/dMGBPXAp9qRlnbT
 kU/rDqQVqi7wlwunSunFoSLTwmQw0Lispmzwz9yirdQ+jVsnuTLWtPbUZM8RL/j8
 XAhVwhDNc+Wuw0OBvRsyP5Mp6k9+2ic6z2ObIgSfgp4GeDG2F/+ZQ5W5ZeHVGQda
 5QqKIdWCmAinzdz3N0iksthT3RJwLmYZ0K/qvLMrYNCvZiuUBdgrUn1Yrjo1c3lx
 W+SUMtgehlylfyBbyGn5zBbJtZJtflx+kYLHLzw58lWC+ekRfxqx2F+e7S4facXr
 Xn9IpnIAhru1/SAItSvScxXzjVW4DwZKO3tLr+/KsrRsTnS15pD6rx6OK88HHP/y
 ofjCeS0P8STb7/Gzzqj7c+7bJvSZo/h7jmF+H2y5tRhUXZogSoh1z/QGYpvcFrwP
 GOZeACREBv+D1PQNp/DN/ZiZHg6+csEg+3abtRaZSbdnfsCSpU/imXcX9GPco5vu
 XS+Gxle2aqvRmQNuJEbNr7YDfocZWWXmXnkPSKCtvqSgNdxjFjZ2v3TRTAgvHEoS
 Otpsv5Hk9g0FCep4oHG3zv8cb+Nk7Ycl2ZLZXQwE2Egane6U4K8=
 =uqQE
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "Two more urgent FPU fixes:

   - prevent unprivileged userspace from reinitializing supervisor
     states

   - prepare init_fpstate, which is the buffer used when initializing
     FPU state, properly in case the skip-writing-state-components
     XSAVE* variants are used"

* tag 'x86_urgent_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Make init_fpstate correct with optimized XSAVE
  x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()
2021-06-25 10:00:25 -07:00
Maxim Levitsky
a01b45e9d3 KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
This better reflects the purpose of this variable, since on AMD
the AVIC's memory slot can be enabled and disabled dynamically.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210623113002.111448-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:49 -04:00
Aaron Lewis
19238e75bd kvm: x86: Allow userspace to handle emulation errors
Add a fallback mechanism to the in-kernel instruction emulator that
allows userspace the opportunity to process an instruction the emulator
was unable to handle.  When the in-kernel instruction emulator fails to process
an instruction it will either inject a #UD into the guest or exit to
userspace with exit reason KVM_INTERNAL_ERROR.  This is because it does
not know how to proceed in an appropriate manner.  This feature lets
userspace get involved to see if it can figure out a better path
forward.
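
A hedged sketch of the userspace side, meant to be plugged into an
existing vCPU run loop. The struct kvm_run fields used here predate this
series; enabling the new exit behavior requires opting in via the
capability this series adds (name omitted here rather than guessed).

	#include <linux/kvm.h>
	#include <stdio.h>

	/* Called after KVM_RUN returns; 0 means the exit was not for
	 * us, -1 means stop the guest. A real VMM would attempt its
	 * own decode/emulation here before giving up. */
	static int handle_emulation_failure(struct kvm_run *run)
	{
		if (run->exit_reason != KVM_EXIT_INTERNAL_ERROR ||
		    run->internal.suberror != KVM_INTERNAL_ERROR_EMULATION)
			return 0;

		fprintf(stderr, "emulation failed, %u extra data words\n",
			run->internal.ndata);
		return -1;
	}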

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:48 -04:00
Sean Christopherson
7cd138db5c KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
Drop the pre-computed last_nonleaf_level, which is arguably wrong and at
best confusing.  Per the comment:

  Can have large pages at levels 2..last_nonleaf_level-1.

the intent of the variable would appear to be to track what levels can
_legally_ have large pages, but that intent doesn't align with reality.
The computed value will be wrong for 5-level paging, or if 1gb pages are
not supported.

The flawed code is not a problem in practice, because except for 32-bit
PSE paging, bit 7 is reserved if large pages aren't supported at the
level.  Take advantage of this invariant and simply omit the level magic
math for 64-bit page tables (including PAE).

For 32-bit paging (non-PAE), the adjustments are needed purely because
bit 7 is ignored if PSE=0.  Retain that logic as is, but make
is_last_gpte() unique per PTTYPE so that the PSE check is avoided for
PAE and EPT paging.  In the spirit of avoiding branches, bump the "last
nonleaf level" for 32-bit PSE paging by adding the PSE bit itself.

Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite
FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are
not PTEs and will blow up all the other gpte helpers.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-51-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:47 -04:00
Sean Christopherson
616007c866 KVM: x86: Enhance comments for MMU roles and nested transition trickiness
Expand the comments for the MMU roles.  The interactions with gfn_track
and PGD reuse in particular are hairy.

Regarding PGD reuse, add comments in the nested virtualization flows to
call out why kvm_init_mmu() is unconditionally called even when nested
TDP is used.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-50-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:47 -04:00
Sean Christopherson
a4c93252fe KVM: x86/mmu: Drop "nx" from MMU context now that there are no readers
Drop kvm_mmu.nx as there are no consumers left.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-39-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:44 -04:00
Sean Christopherson
167f8a5cae KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigans
Rename "nxe" to "efer_nx" so that future macro magic can use the pattern
<reg>_<bit> for all CR0, CR4, and EFER bits that are included in the role.
Using "efer_nx" also makes it clear that the role bit reflects EFER.NX,
not the NX bit in the corresponding PTE.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:41 -04:00
Sean Christopherson
6c032f12dd Revert "KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"
Drop MAXPHYADDR from mmu_role now that all MMUs have their role
invalidated after a CPUID update.  Invalidating the role forces all MMUs
to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can
only be changed through a CPUID update.

This reverts commit de3ccd26fa.

Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:36 -04:00
Sean Christopherson
63f5a1909f KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.

KVM does not correctly handle changes to paging related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held).  And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.

To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).

Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
would all need to be tracked.

In practice, there is no remotely sane use case for changing any paging
related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:36 -04:00
Sean Christopherson
49c6f8756c KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified
Invalidate all MMUs' roles after a CPUID update to force reinitialization
of the MMU context/helpers.  Despite the efforts of commit de3ccd26fa
("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
there are still a handful of CPUID-based properties that affect MMU
behavior but are not incorporated into mmu_role.  E.g. 1gb hugepage
support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
factor into the guest's reserved PTE bits.

The obvious alternative would be to add all such properties to mmu_role,
but doing so provides no benefit over simply forcing a reinitialization
on every CPUID update, as setting guest CPUID is a rare operation.

Note, reinitializing all MMUs after a CPUID update does not fix all of
KVM's woes.  Specifically, kvm_mmu_page_role doesn't track the CPUID
properties, which means that a vCPU can reuse shadow pages that should
not exist for the new vCPU model, e.g. that map GPAs that are now illegal
(due to MAXPHYADDR changes) or that set bits that are now reserved
(PAGE_SIZE for 1gb pages), etc...

Tracking the relevant CPUID properties in kvm_mmu_page_role would address
the majority of problems, but fully tracking that much state in the
shadow page role comes with an unpalatable cost as it would require a
non-trivial increase in KVM's memory footprint.  The GBPAGES case is even
worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
support in the hardware page walker, i.e. it's a virtualization hole that
can't be closed when using TDP.

In other words, resetting the MMU after a CPUID update is largely a
superficial fix.  But, it will allow reverting the tracking of MAXPHYADDR
in the mmu_role, and that case in particular needs to mostly work because
KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
is supported.  For cases where KVM botches guest behavior, the damage is
limited to that guest.  But for the shadow_root_level, a misconfigured
MMU can cause KVM to incorrectly access memory, e.g. due to walking off
the end of its shadow page tables.

Fixes: 7dcd575520 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:36 -04:00
Sean Christopherson
f71a53d118 Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"
Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested
virtualization.  When KVM (L0) is using TDP, CR4.LA57 is not reflected in
mmu_role.base.level because that tracks the shadow root level, i.e. TDP
level.  Normally, this is not an issue because LA57 can't be toggled
while long mode is active, i.e. the guest has to first disable paging,
then toggle LA57, then re-enable paging, thus ensuring an MMU
reinitialization.

But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57
without having to bounce through an unpaged section.  L1 can also load a
new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a
single entry PML5 is sufficient.  Such shenanigans are only problematic
if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets
reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode.

Note, in the L2 case with nested TDP, L1 can switch between L2s with
different LA57 settings, thus bypassing the paging requirement, but in that
case KVM's nested_mmu will track LA57 in base.level.

This reverts commit 8053f924ca.

Fixes: 8053f924ca ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:36 -04:00
Jing Zhang
0193cc908b KVM: stats: Separate generic stats from architecture specific ones
Generic KVM stats are those collected in architecture independent code
or those supported by all architectures; put all generic statistics in
a separate structure.  This ensures that they are defined the same way
in the statistics API which is being added, removing duplication among
different architectures in the declaration of the descriptors.

No functional change intended.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 11:47:56 -04:00
Thomas Gleixner
aee8c67a4f x86/fpu: Return proper error codes from user access functions
When *RSTOR from user memory raises an exception, there is no way to
differentiate the exception types. That's bad because it forces the slow
path even when the failure was not a fault. If the operation raised e.g.
#GP, then going through the slow path is pointless.

Use _ASM_EXTABLE_FAULT() which stores the trap number and let the exception
fixup return the negated trap number as error.

This allows separating the fast path so that it handles faults directly and
avoids the slow path for all other exceptions.
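
A minimal sketch of the resulting caller pattern; xrstor_from_user_sigframe()
is the (renamed) user-memory restore helper from this series, and
try_slow_path() is a hypothetical stand-in for the existing slow path:

        err = xrstor_from_user_sigframe(buf, mask);
        if (err == -X86_TRAP_PF)
                return try_slow_path();     /* only real faults retry */
        return err;                         /* #GP etc.: fail fast */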

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.601480369@linutronix.de
2021-06-23 20:04:58 +02:00
Thomas Gleixner
72a6c08c44 x86/pkru: Remove xstate fiddling from write_pkru()
The PKRU value of a task is stored in task->thread.pkru when the task is
scheduled out. PKRU is restored on schedule in from there. So keeping the
XSAVE buffer up to date is a pointless exercise.

Remove the xstate fiddling and cleanup all related functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.897372712@linutronix.de
2021-06-23 19:55:51 +02:00
Thomas Gleixner
954436989c x86/fpu: Remove PKRU handling from switch_fpu_finish()
PKRU is already updated and the xstate is no longer the proper source
of information.

 [ bp: Use cpu_feature_enabled() ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.708180184@linutronix.de
2021-06-23 19:54:49 +02:00
Thomas Gleixner
30a304a138 x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
As the PKRU state is managed separately restoring it from the xstate
buffer would be counterproductive as it might either restore a stale
value or reinit the PKRU state to 0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.606745195@linutronix.de
2021-06-23 19:47:35 +02:00
Dave Hansen
e84ba47e31 x86/fpu: Hook up PKRU into ptrace()
One nice thing about having PKRU be XSAVE-managed is that it gets naturally
exposed into the XSAVE-using ABIs.  Now that XSAVE will not be used to
manage PKRU, these ABIs need to be manually enabled to deal with PKRU.

ptrace() uses copy_uabi_xstate_to_kernel() to collect the tracee's
XSTATE. As PKRU is not in the task's XSTATE buffer, use task->thread.pkru
for filling in the ptrace buffer.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.508770763@linutronix.de
2021-06-23 19:44:24 +02:00
Dave Hansen
9782a712eb x86/fpu: Add PKRU storage outside of task XSAVE buffer
PKRU is currently partly XSAVE-managed and partly not. It has space
in the task XSAVE buffer and is context-switched by XSAVE/XRSTOR.
However, it is switched more eagerly than FPU because there may be a
need for PKRU to be up-to-date for things like copy_to/from_user() since
PKRU affects user-permission memory accesses, not just accesses from
userspace itself.

This leaves PKRU in a very odd position. XSAVE brings very little value
to the table for how Linux uses PKRU except for signal related XSTATE
handling.

Prepare to move PKRU away from being XSAVE-managed. Allocate space in
the thread_struct for it and save/restore it in the context-switch path
separately from the XSAVE-managed features. task->thread_struct.pkru
is only valid when the task is scheduled out. For the current task the
authoritative source is the hardware, i.e. it has to be retrieved via
rdpkru().
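
A minimal sketch of the intended context-switch handling; the exact hook
placement is illustrative, not the final implementation:

        if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
                prev->thread.pkru = rdpkru();   /* stash outgoing task's PKRU */
                wrpkru(next->thread.pkru);      /* hardware is authoritative for 'next' */
        }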

Leave the XSAVE code in place for now to ensure bisectability.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.399107624@linutronix.de
2021-06-23 19:37:45 +02:00
Thomas Gleixner
2ebe81c6d8 x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
switch_to() and flush_thread() write the task's PKRU value eagerly so
the PKRU value of current is always valid in the hardware.

That means there is no point in restoring PKRU on exit to user or when
reactivating the task's FPU registers in the signal frame setup path.

This allows removing all the xstate buffer updates with PKRU values once
the PKRU state is stored in the thread struct while a task is scheduled out.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.303919033@linutronix.de
2021-06-23 19:33:32 +02:00
Thomas Gleixner
65e9521021 x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
Rename it so it's clear that this is about user ABI features which can
differ from the feature set which the kernel saves and restores because the
kernel handles e.g. PKRU differently. But the user ABI (ptrace, signal
frame) expects it to be there.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.211585137@linutronix.de
2021-06-23 19:29:52 +02:00
Thomas Gleixner
1d9bffab11 x86/fpu: Move FXSAVE_LEAK quirk into __copy_kernel_to_fpregs()
copy_kernel_to_fpregs() restores all xfeatures but it is also the place
where the AMD FXSAVE_LEAK bug is handled.

That prevents fpregs_restore_userregs() from limiting the restored features,
which is required to untangle PKRU and XSTATE handling and also for the
upcoming supervisor state management.

Move the FXSAVE_LEAK quirk into __copy_kernel_to_fpregs() and deinline that
function which has become rather fat.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.114271278@linutronix.de
2021-06-23 19:26:37 +02:00
Thomas Gleixner
727d01100e x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
Rename it so that it becomes entirely clear what this function is about.
Its purpose is to restore the FPU registers to the state which was saved in
the task's FPU memory state, either at context switch or by an in-kernel
FPU user.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.018867925@linutronix.de
2021-06-23 19:23:40 +02:00
Thomas Gleixner
e7ecad17c8 x86/fpu: Rename fpu__clear_all() to fpu_flush_thread()
Make it clear what the function is about.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.827979263@linutronix.de
2021-06-23 19:20:10 +02:00
Thomas Gleixner
371071131c x86/fpu: Use pkru_write_default() in copy_init_fpstate_to_fpregs()
There is no point in using copy_init_pkru_to_fpregs() which in turn calls
write_pkru(). write_pkru() tries to fiddle with the task's xstate buffer
for nothing because the XRSTOR[S](init_fpstate) just cleared the xfeature
flag in the xstate header which makes get_xsave_addr() fail.

It's a useless exercise anyway because the reinitialization activates the
FPU, so before the task's xstate buffer can be used again an XRSTOR[S] must
happen, which in turn dumps the PKRU value.

Get rid of the now unused copy_init_pkru_to_fpregs().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.732508792@linutronix.de
2021-06-23 19:15:16 +02:00
Thomas Gleixner
ff7ebff47c x86/pkru: Provide pkru_write_default()
Provide a simple and trivial helper which just writes the PKRU default
value without trying to fiddle with the task's xsave buffer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.513729794@linutronix.de
2021-06-23 19:09:53 +02:00
Thomas Gleixner
739e2eec0f x86/pkru: Provide pkru_get_init_value()
When CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is disabled then the following
code fails to compile:

     if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
             u32 pkru = READ_ONCE(init_pkru_value);
             ...
     }

because init_pkru_value is defined as '0' which makes READ_ONCE() upset.

Provide an accessor macro to avoid #ifdeffery all over the place.
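
A minimal sketch of such an accessor, assuming a zero fallback for the
disabled case (the exact macro body is illustrative):

        #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
        #define pkru_get_init_value()   READ_ONCE(init_pkru_value)
        #else
        #define pkru_get_init_value()   0
        #endif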

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.404880646@linutronix.de
2021-06-23 19:02:49 +02:00
Thomas Gleixner
8a1dc55a3f x86/cpu: Sanitize X86_FEATURE_OSPKE
X86_FEATURE_OSPKE is enabled first on the boot CPU and the feature flag is
set. Secondary CPUs have to enable CR4.PKE as well and set their per CPU
feature flag. That's ineffective because all call sites have checks for
boot_cpu_data.

Make it smarter and force the feature flag when PKU is enabled on the boot
CPU, which then allows cpu_feature_enabled(X86_FEATURE_OSPKE) to be used all
over the place. That either compiles the code out when PKEY support is
disabled in Kconfig or uses a static_cpu_has() for the feature check, which
makes a significant difference in hotpaths, e.g. context switch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.305113644@linutronix.de
2021-06-23 18:59:44 +02:00
Thomas Gleixner
b2681e791d x86/fpu: Rename and sanitize fpu__save/copy()
Both function names are a misnomer.

fpu__save() is actually about synchronizing the hardware register state
into the task's memory state so that either coredump or a math exception
handler can inspect the state at the time where the problem happens.

The function guarantees to preserve the register state, while "save" is a
common terminology for saving the current state so it can be modified and
restored later. This is clearly not the case here.

Rename it to fpu_sync_fpstate().

fpu__copy() is used to clone the current task's FPU state when duplicating
task_struct. While the register state is a copy the rest of the FPU state
is not.

Name it accordingly and remove the really pointless @src argument along
with the warning which comes along with it.

Nothing can ever copy the FPU state of a non-current task. It's clearly
just a consequence of arch_dup_task_struct(), but it makes no sense to
proliferate that further.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.196727450@linutronix.de
2021-06-23 18:55:56 +02:00
Dave Hansen
784a46618f x86/pkeys: Move read_pkru() and write_pkru()
write_pkru() was originally used just to write to the PKRU register.  It
was mercifully short and sweet and was not out of place in pgtable.h with
some other pkey-related code.

But, later work included a requirement to also modify the task XSAVE
buffer when updating the register.  This really is more related to the
XSAVE architecture than to paging.

Move the read/write_pkru() to asm/pkru.h.  pgtable.h won't miss them.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
2021-06-23 18:52:57 +02:00
Thomas Gleixner
a75c52896b x86/fpu/xstate: Sanitize handling of independent features
The copy functions for the independent features are horribly named, and the
split between the supervisor and independent parts is overengineered.

The point is that the supplied mask has either to be a subset of the
independent features or a subset of the task->fpu.xstate managed features.

Rewrite it so it checks for invalid overlaps of these areas in the caller
supplied feature mask. Rename it so it follows the new naming convention
for these operations. Mop up the function documentation.

This allows the function to be used for other purposes as well.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20210623121455.004880675@linutronix.de
2021-06-23 18:46:20 +02:00
Andy Lutomirski
01707b6653 x86/fpu: Rename "dynamic" XSTATEs to "independent"
The salient feature of "dynamic" XSTATEs is that they are not part of the
main task XSTATE buffer.  The fact that they are dynamically allocated is
irrelevant and will become quite confusing when user math XSTATEs start
being dynamically allocated.  Rename them to "independent" because they
are independent of the main XSTATE code.

This is just a search-and-replace with some whitespace updates to keep
things aligned.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/1eecb0e4f3e07828ebe5d737ec77dc3b708fad2d.1623388344.git.luto@kernel.org
Link: https://lkml.kernel.org/r/20210623121454.911450390@linutronix.de
2021-06-23 18:42:11 +02:00
Thomas Gleixner
1c61fada30 x86/fpu: Rename copy_kernel_to_fpregs() to restore_fpregs_from_fpstate()
This is not copy functionality. It restores the register state from the
supplied kernel buffer.

No functional changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.716058365@linutronix.de
2021-06-23 18:36:42 +02:00
Thomas Gleixner
08ded2cd18 x86/fpu: Get rid of the FNSAVE optimization
The FNSAVE support requires conditionals in quite some call paths because
FNSAVE reinitializes the FPU hardware. If the save has to preserve the FPU
register state then the caller has to conditionally restore it from memory
when FNSAVE is in use.

This also requires a conditional in context switch because the restore
avoidance optimization cannot work with FNSAVE. As this only affects CPUs
that are 20+ years old, there is really no reason to keep this optimization
effective for FNSAVE. It's about time to not optimize for antiques anymore.

Just unconditionally FRSTOR the save content to the registers and clean up
the conditionals all over the place.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.617369268@linutronix.de
2021-06-23 18:29:41 +02:00
Thomas Gleixner
ebe7234b08 x86/fpu: Rename copy_fpregs_to_fpstate() to save_fpregs_to_fpstate()
A copy is guaranteed to leave the source intact, which is not the case when
FNSAVE is used as that reinitializes the registers.

Save does not make such guarantees and matches what this is about,
i.e. saving the state for a later restore.

Rename it to save_fpregs_to_fpstate().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de
2021-06-23 18:26:43 +02:00
Thomas Gleixner
1cc34413ff x86/fpu: Rename xstate copy functions which are related to UABI
Rename them to reflect that these functions deal with user space format
XSAVE buffers.

      copy_kernel_to_xstate() -> copy_uabi_from_kernel_to_xstate()
      copy_user_to_xstate()   -> copy_sigframe_from_user_to_xstate()

Again a clear statement that these functions deal with user space ABI.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.318485015@linutronix.de
2021-06-23 18:23:14 +02:00
Thomas Gleixner
6fdc908cb5 x86/fpu: Rename fregs-related copy functions
The function names for fnsave/fnrstor operations are horribly named and
a permanent source of confusion.

Rename:
	copy_kernel_to_fregs() to frstor()
	copy_fregs_to_user()   to fnsave_to_user_sigframe()
	copy_user_to_fregs()   to frstor_from_user_sigframe()

so it's clear what these are doing. All these functions are really low
level wrappers around the equally named instructions, so mapping to the
documentation is just natural.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.223594101@linutronix.de
2021-06-23 18:20:27 +02:00
Thomas Gleixner
16dcf43859 x86/fpu: Rename fxregs-related copy functions
The function names for fxsave/fxrstor operations are horribly named and
a permanent source of confusion.

Rename:
	copy_fxregs_to_kernel() to fxsave()
	copy_kernel_to_fxregs() to fxrstor()
	copy_fxregs_to_user() to fxsave_to_user_sigframe()
	copy_user_to_fxregs() to fxrstor_from_user_sigframe()

so it's clear what these are doing. All these functions are really low
level wrappers around the equally named instructions, so mapping to the
documentation is just natural.

While at it, replace the static_cpu_has(X86_FEATURE_FXSR) with
use_fxsr() to be consistent with the rest of the code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.017863494@linutronix.de
2021-06-23 18:12:30 +02:00
Thomas Gleixner
6b862ba182 x86/fpu: Rename copy_user_to_xregs() and copy_xregs_to_user()
The function names for xsave[s]/xrstor[s] operations are horribly named and
a permanent source of confusion.

Rename:
	copy_xregs_to_user() to xsave_to_user_sigframe()
	copy_user_to_xregs() to xrstor_from_user_sigframe()

so it's entirely clear what this is about. This is also a clear indicator
of the potentially different storage format because this is user ABI and
cannot use compacted format.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.924266705@linutronix.de
2021-06-23 18:01:56 +02:00
Thomas Gleixner
b16313f71c x86/fpu: Rename copy_xregs_to_kernel() and copy_kernel_to_xregs()
The function names for xsave[s]/xrstor[s] operations are horribly named and
a permanent source of confusion.

Rename:
	copy_xregs_to_kernel() to os_xsave()
	copy_kernel_to_xregs() to os_xrstor()

These are truly low level wrappers around the actual instructions
XSAVE[OPT]/XRSTOR and XSAVES/XRSTORS with the twist that the selection
based on the available CPU features happens with an alternative to avoid
conditionals all over the place and to provide the best performance for hot
paths.

The os_ prefix indicates that this is the OS-selected mechanism.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.830239347@linutronix.de
2021-06-23 17:57:57 +02:00
Thomas Gleixner
1f3171252d x86/fpu: Get rid of copy_supervisor_to_kernel()
If the fast path of restoring the FPU state on sigreturn fails or is not
taken and the current task's FPU is active then the FPU has to be
deactivated for the slow path to allow a safe update of the task's FPU
memory state.

With supervisor states enabled, this requires saving the supervisor state in
the memory state first. Supervisor states require XSAVES, so saving only the
supervisor state requires reshuffling the memory buffer because XSAVES uses
the compacted format and therefore stores the supervisor states at the
beginning of the memory state. That's just an overengineered optimization.

Get rid of it and save the full state for this case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.734561971@linutronix.de
2021-06-23 17:53:31 +02:00
Thomas Gleixner
02b93c0b00 x86/fpu: Get rid of using_compacted_format()
This function is pointlessly global and a complete misnomer because its
usage is related to both supervisor state checks and compacted format
checks. Remove it and just make the conditions check the XSAVES feature.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.425493349@linutronix.de
2021-06-23 17:49:47 +02:00
Thomas Gleixner
dbb60ac764 x86/fpu: Move fpu__write_begin() to regset
The only use case for fpu__write_begin() is the set() callback of regset, so
the function is pointlessly global.

Move it to the regset code and rename it to fpu_force_restore(), which
exactly describes what the function does.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.328652975@linutronix.de
2021-06-23 17:49:47 +02:00
Thomas Gleixner
5a32fac8db x86/fpu/regset: Move fpu__read_begin() into regset
The function can only be used safely from the regset get() callbacks, so
there is no reason to have it globally exposed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.234942936@linutronix.de
2021-06-23 17:49:47 +02:00
Thomas Gleixner
afac9e8943 x86/fpu: Remove fpstate_sanitize_xstate()
No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.124819167@linutronix.de
2021-06-23 17:49:47 +02:00
Thomas Gleixner
eb6f51723f x86/fpu: Make copy_xstate_to_kernel() usable for [x]fpregs_get()
When xsave with init state optimization is used then a component's state
in the task's xsave buffer can be stale when the corresponding feature bit
is not set.

fpregs_get() and xfpregs_get() invoke fpstate_sanitize_xstate() to update
the task's xsave buffer before retrieving the FX or FP state. That's just
duplicated code as copy_xstate_to_kernel() already handles this correctly.

Add a copy mode argument to the function which allows restricting the state
copy to the FP and SSE features.

Also rename the function to copy_xstate_to_uabi_buf() so the name reflects
what it is doing.
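
A sketch of the resulting interface, assuming an enum-based copy mode (the
exact enumerators are illustrative):

        enum xstate_copy_mode {
                XSTATE_COPY_FP,         /* FP state only */
                XSTATE_COPY_FX,         /* FP + SSE (FX) state */
                XSTATE_COPY_XSAVE,      /* full UABI xstate */
        };

        void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
                                     enum xstate_copy_mode copy_mode);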

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.805327286@linutronix.de
2021-06-23 17:49:47 +02:00
Thomas Gleixner
43be46e896 x86/fpu: Sanitize xstateregs_set()
xstateregs_set() operates on a stopped task and tries to copy the provided
buffer into the task's fpu.state.xsave buffer.

Any error while copying or invalid state detected after copying results in
wiping the target task's FPU state completely including supervisor states.

That's just wrong. The caller supplied invalid data or has a problem with
unmapped memory, so there is absolutely no justification to corrupt the
target state.

Fix this with the following modifications:

 1) If data has to be copied from userspace, allocate a buffer and copy from
    user first.

 2) Use copy_kernel_to_xstate() unconditionally so that header checking
    works correctly.

 3) Return on error without corrupting the target state.

This prevents corrupting states and lets the caller deal with the problem
it caused in the first place.
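
A condensed sketch of the fixed flow, assuming a vmalloc'ed bounce buffer
for the user-space case (error handling abbreviated):

        void *tmpbuf = NULL;

        if (ubuf) {
                tmpbuf = vmalloc(count);
                if (!tmpbuf)
                        return -ENOMEM;
                if (copy_from_user(tmpbuf, ubuf, count)) {
                        ret = -EFAULT;
                        goto out;
                }
        }

        ret = copy_kernel_to_xstate(&fpu->state.xsave, kbuf ?: tmpbuf);
out:
        vfree(tmpbuf);
        return ret;     /* on error, the target state was never touched */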

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.214903673@linutronix.de
2021-06-23 17:49:46 +02:00
Thomas Gleixner
e68524456c x86/fpu: Move inlines where they belong
They are only used in fpstate_init() and there is no point in having them in
a header just to make reading the code harder.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.023118522@linutronix.de
2021-06-23 17:49:46 +02:00
Thomas Gleixner
4098b3eef3 x86/fpu: Remove unused get_xsave_field_ptr()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.915614415@linutronix.de
2021-06-23 17:49:46 +02:00
Thomas Gleixner
ce38f038ed x86/fpu: Get rid of fpu__get_supported_xfeatures_mask()
This function is really not doing what the comment advertises:

 "Find supported xfeatures based on cpu features and command-line input.
  This must be called after fpu__init_parse_early_param() is called and
  xfeatures_mask is enumerated."

fpu__init_parse_early_param() does not exist anymore and the function just
returns a constant.

Remove it and fix the caller and get rid of further references to
fpu__init_parse_early_param().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.816404717@linutronix.de
2021-06-23 17:49:46 +02:00
Borislav Petkov
c4cf5f6198 Merge x86/urgent into x86/fpu
Pick up dependent changes which either went mainline (x86/urgent is
based on -rc7 and that contains them) as urgent fixes and the current
x86/urgent branch which contains two more urgent fixes, so that the
bigger FPU rework can be based on top.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-06-23 17:43:38 +02:00
Paolo Bonzini
c3ab0e28a4 Merge branch 'topic/ppc-kvm' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux into HEAD
- Support for the H_RPT_INVALIDATE hypercall

- Conversion of Book3S entry/exit to C

- Bug fixes
2021-06-23 07:30:41 -04:00
Brijesh Singh
310f134ed4 x86/sev: Add defines for GHCB version 2 MSR protocol requests
Add the necessary defines for supporting the GHCB version 2 protocol.
This includes defines for:

	- MSR-based AP hlt request/response
	- Hypervisor Feature request/response

This is the bare minimum of requests that need to be supported by a GHCB
version 2 implementation. There are more requests in the specification,
but those depend on Secure Nested Paging support being available.

These defines are shared between SEV host and guest support.
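
A sketch of what the new defines could look like; the numeric values follow
the GHCB specification as best recalled here and should be treated as
illustrative:

        /* MSR-based AP reset hold ("AP hlt") request/response */
        #define GHCB_MSR_AP_RESET_HOLD_REQ      0x006
        #define GHCB_MSR_AP_RESET_HOLD_RESP     0x007

        /* Hypervisor feature request/response */
        #define GHCB_MSR_HV_FT_REQ              0x080
        #define GHCB_MSR_HV_FT_RESP             0x081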

  [ bp: Fold in https://lkml.kernel.org/r/20210622144825.27588-2-joro@8bytes.org too.
        Simplify the brewing macro maze into readability. ]

Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/YNLXQIZ5e1wjkshG@8bytes.org
2021-06-23 11:25:17 +02:00
Peter Zijlstra
1f008d46f1 x86: Always inline task_size_max()
Fix:

  vmlinux.o: warning: objtool: handle_bug()+0x10: call to task_size_max() leaves .noinstr.text section

When #UD isn't a BUG, we shouldn't violate noinstr (we'll still
probably die, but that's another story).

Fixes: 025768a966 ("x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210621120120.682468274@infradead.org
2021-06-22 13:56:43 +02:00
Thomas Gleixner
f9dfb5e390 x86/fpu: Make init_fpstate correct with optimized XSAVE
The XSAVE init code initializes all enabled and supported components with
XRSTOR(S) to init state. Then it XSAVEs the state of the components back
into init_fpstate which is used in several places to fill in the init state
of components.

This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
those use the init optimization and skip writing state of components which
are in init state. So init_fpstate.xsave still contains all zeroes after
this operation.

There are two ways to solve that:

   1) Use XSAVE unconditionally, but that requires reshuffling the buffer when
      XSAVES is enabled because XSAVES uses compacted format.

   2) Save the components which are known to have a non-zero init state by other
      means.

Looking deeper, #2 is the right thing to do because all components the
kernel supports have all-zeroes init state except the legacy features (FP,
SSE). Those cannot be hard coded because the states are not identical on all
CPUs, but they can be saved with FXSAVE which avoids all conditionals.

Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
a BUILD_BUG_ON() which reminds developers to validate that a newly added
component has all zeroes init state. As a bonus remove the now unused
copy_xregs_to_kernel_booting() crutch.

The XSAVE and reshuffle method can still be implemented in the unlikely
case that components are added which have a non-zero init state and no
other means to save them. For now, FXSAVE is just simple and good enough.
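
A minimal sketch of the approach; XFEATURES_WITH_NONZERO_INIT is a
hypothetical mask used only to illustrate the BUILD_BUG_ON() reminder:

        /* Legacy FP/SSE init state differs across CPUs: capture it via FXSAVE */
        fxsave(&init_fpstate.fxsave);

        /* Remind developers: new components must have all-zeroes init state */
        BUILD_BUG_ON(XFEATURES_WITH_NONZERO_INIT &
                     ~(XFEATURE_MASK_FP | XFEATURE_MASK_SSE));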

  [ bp: Fix a typo or two in the text. ]

Fixes: 6bad06b768 ("x86, xsave: Use xsaveopt in context-switch path when supported")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
2021-06-22 11:06:21 +02:00
Joerg Roedel
be1a540886 x86/sev: Split up runtime #VC handler for correct state tracking
Split up the #VC handler code into a from-user and a from-kernel part.
This allows clean and correct state tracking, as the #VC handler needs
to enter NMI-state when raised from kernel mode and plain IRQ state when
raised from user-mode.

Fixes: 62441a1fb5 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210618115409.22735-3-joro@8bytes.org
2021-06-21 16:01:05 +02:00
Ingo Molnar
b2c0931a07 Merge branch 'sched/urgent' into sched/core, to resolve conflicts
This commit in sched/urgent moved the cfs_rq_is_decayed() function:

  a7b359fc6a: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

and this fresh commit in sched/core modified it in the old location:

  9e077b52d8: ("sched/pelt: Check that *_avg are null when *_sum are")

Merge the two variants.

Conflicts:
	kernel/sched/fair.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-18 11:31:25 +02:00
Ashish Kalra
0dbb112304 KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall
This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
or vice versa. By default all guest pages are considered encrypted.

The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 14:25:39 -04:00
Paolo Bonzini
e3cb6fa0e2 KVM: switch per-VM stats to u64
Make them the same type as vCPU stats.  There is no reason
to limit the counters to unsigned long.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 14:25:27 -04:00
Sean Christopherson
25b62c6274 KVM: nVMX: Free only guest_mode (L2) roots on INVVPID w/o EPT
When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
tag in the MMU role to identify L2+ roots.  From L1's perspective, its
own TLB entries use VPID=0, and INVVPID is not required to invalidate such
entries.  Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
but it is not required to do so.

Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
Sean Christopherson
b512910039 KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:52 -04:00
Sean Christopherson
07ffaf343e KVM: nVMX: Sync all PGDs on nested transition with shadow paging
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2.  kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.

Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.

Note, there is no change to the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT.  When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.

Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which request is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings.  And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().

Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Maxim Levitsky
158a48ecf7 KVM: x86: avoid loading PDPTRs after migration when possible
If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are part of the
migration state and are correctly restored by those ioctls.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Sean Christopherson
c7313155bf KVM: x86: Always load PDPTRs on CR3 load for SVM w/o NPT and a PAE guest
Kill off pdptrs_changed() and instead go through the full kvm_set_cr3()
for PAE guest, even if the new CR3 is the same as the current CR3.  For
VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as
unavailable after VM-Exit, i.e. the optimization is dead code except for
SVM without NPT.

In the unlikely scenario that anyone cares about SVM without NPT _and_ a
PAE guest, they've got bigger problems if their guest is loading the same
CR3 so frequently that the performance of kvm_set_cr3() is notable,
especially since KVM's fast PGD switching means reloading the same CR3
does not require a full rebuild.  Given that PAE and PCID are mutually
exclusive, i.e. a sync and flush are guaranteed in any case, the actual
benefits of the pdptrs_changed() optimization are marginal at best.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:46 -04:00
Vitaly Kuznetsov
10d7bf1e46 KVM: x86: hyper-v: Cache guest CPUID leaves determining features availability
Limiting exposed Hyper-V features requires a fast way to check if the
particular feature is exposed in guest-visible CPUIDs or not. To avoid
looping through all CPUID entries on every hypercall/MSR access, cache the
required leaves on CPUID update.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:38 -04:00
Vitaly Kuznetsov
644f706719 KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_ENFORCE_CPUID
Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
for limiting Hyper-V features to those exposed to the guest in Hyper-V
CPUIDs (0x40000003, 0x40000004, ...).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:38 -04:00
Vineeth Pillai
59d21d67f3 KVM: SVM: Software reserved fields
SVM added support for certain reserved fields to be used by
software or hypervisor. Add the following reserved fields:
  - VMCB offset 0x3e0 - 0x3ff
  - Clean bit 31
  - SVM intercept exit code 0xf0000000

Later patches will make use of this for supporting Hyper-V
nested virtualization enhancements.

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <a1f17a43a8e9e751a1a9cc0281649d71bdbf721b.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:37 -04:00
Vineeth Pillai
3c86c0d3db KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx
Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Vineeth Pillai
32431fb253 hyperv: SVM enlightened TLB flush support flag
Bit 22 of HYPERV_CPUID_FEATURES.EDX is specific to SVM and specifies
support for enlightened TLB flush. With this enlightenment enabled,
ASID invalidations flushes only gva->hpa entries. To flush TLB entries
derived from NPT, hypercalls should be used
(HvFlushGuestPhysicalAddressSpace or HvFlushGuestPhysicalAddressList)

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Message-Id: <a060f872d0df1955e52e30b877b3300485edb27c.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Krish Sadhukhan
d5a0483f9f KVM: nVMX: nSVM: Add a new VCPU statistic to show if VCPU is in guest mode
Add the following per-VCPU statistic to KVM debugfs to show if a given
VCPU is in guest mode:

	guest_mode

Also add this as a per-VM statistic to KVM debugfs to show the total number
of VCPUs that are in guest mode in a given VM.

Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-3-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Sean Christopherson
ecc513e5bb KVM: x86: Drop "pre_" from enter/leave_smm() helpers
Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers.  The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Vitaly Kuznetsov
4651fc56ba KVM: x86: Drop vendor specific functions for APICv/AVIC enablement
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov
fdf513e37a KVM: x86: Use common 'enable_apicv' variable for both APICv and AVIC
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:33 -04:00
Ilias Stamatis
1ab9287add KVM: X86: Add vendor callbacks for writing the TSC multiplier
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.

Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
edcfe54058 KVM: X86: Move write_l1_tsc_offset() logic to common code and rename it
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset; rather, it updates the current TSC offset, which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.

Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
83150f2932 KVM: X86: Add functions that calculate the nested TSC fields
When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 (L0 -> L1) and 12 (L1 -> L2) together.

The merging is done using the following equations:
  offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
  mult_02 = (mult_01 * mult_12) >> shift_bits

Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.
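
A worked sketch of the merge in C, using the kernel's mul_u64_u64_shr()
helper and shift_bits = 48 (the VMX value of
kvm_tsc_scaling_ratio_frac_bits); offsets are treated as unsigned here for
simplicity:

        #include <linux/math64.h>

        mult_02   = mul_u64_u64_shr(mult_01, mult_12, 48);
        offset_02 = mul_u64_u64_shr(offset_01, mult_12, 48) + offset_12;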

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-8-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
307a94c721 KVM: X86: Add functions for retrieving L2 TSC fields from common code
In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-7-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
fe3eb50418 KVM: X86: Add a ratio parameter to kvm_scale_tsc()
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.
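
A minimal sketch of the resulting signature and a call site, assuming
__scale_tsc() and kvm_default_tsc_scaling_ratio as in existing KVM code:

        u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, u64 ratio)
        {
                u64 _tsc = tsc;

                if (ratio != kvm_default_tsc_scaling_ratio)
                        _tsc = __scale_tsc(ratio, tsc);

                return _tsc;
        }

        /* e.g. reading the TSC from user space with L1's ratio: */
        tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio);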

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-5-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:28 -04:00
Ilias Stamatis
805d705ff8 KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Ben Gardon
d501f747ef KVM: x86/mmu: Lazily allocate memslot rmaps
If the TDP MMU is in use, wait to allocate the rmaps until the shadow MMU is
actually used (i.e. a nested VM is launched). This saves memory equal to
0.2% of guest memory in cases where the TDP MMU is used and there are no
nested guests involved.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:27 -04:00
Ben Gardon
a255740876 KVM: x86/mmu: Add a field to control memslot rmap allocation
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
operation.

No functional change expected.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Siddharth Chandrasekaran
d8f5537a88 KVM: hyper-v: Advertise support for fast XMM hypercalls
Now that kvm_hv_flush_tlb() has been patched to support XMM hypercall
inputs, we can start advertising this feature to guests.

Cc: Alexander Graf <graf@amazon.com>
Cc: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <e63fc1c61dd2efecbefef239f4f0a598bd552750.1622019134.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:24 -04:00
Siddharth Chandrasekaran
5974565bc2 KVM: x86: kvm_hv_flush_tlb use inputs from XMM registers
Hyper-V supports the use of XMM registers to perform fast hypercalls.
This allows guests to take advantage of the improved performance of the
fast hypercall interface even though a hypercall may require more than
(the current maximum of) two input registers.

The XMM fast hypercall interface uses six additional XMM registers (XMM0
to XMM5) to allow the guest to pass an input parameter block of up to
112 bytes.

Add framework to read from XMM registers in kvm_hv_hypercall() and use
the additional hypercall inputs from XMM registers in kvm_hv_flush_tlb()
when possible.

Cc: Alexander Graf <graf@amazon.com>
Co-developed-by: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <fc62edad33f1920fe5c74dde47d7d0b4275a9012.1622019134.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:24 -04:00
Kan Liang
5471eea5d3 perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
The counter value of a perf task may leak to another RDPMC task.
For example, a perf stat task as below is running on CPU 0.

    perf stat -e 'branches,cycles' -- taskset -c 0 ./workload

In the meantime, an RDPMC task, which is also running on CPU 0, may read
the GP counters periodically. (The RDPMC task creates a fixed event,
but reads four GP counters.)

    $./rdpmc_read_all_counters
    index 0x0 value 0x8001e5970f99
    index 0x1 value 0x8005d750edb6
    index 0x2 value 0x0
    index 0x3 value 0x0

    index 0x0 value 0x8002358e48a5
    index 0x1 value 0x8006bd1e3bc9
    index 0x2 value 0x0
    index 0x3 value 0x0

It is a potential security issue: once the attacker knows what the other
thread is counting, the PerfMon counter can be used as a side-channel to
attack cryptosystems.

The counter value of the perf stat task leaks to the RDPMC task because
perf never clears the counter when it's stopped.

Three methods were considered to address the issue.

 - Unconditionally reset the counter in x86_pmu_del(). It can bring extra
   overhead even when there is no RDPMC task running.

 - Only reset the un-assigned dirty counters when the RDPMC task is
   scheduled in via sched_task(). It fails for the below case.

	Thread A			Thread B

	clone(CLONE_THREAD) --->
	set_affine(0)
					set_affine(1)
					while (!event-enabled)
						;
	event = perf_event_open()
	mmap(event)
	ioctl(event, IOC_ENABLE); --->
					RDPMC

   Counters are still leaked to the thread B.

 - Only reset the un-assigned dirty counters before updating the CR4.PCE
   bit. The method is implemented here.

A dirty counter is a counter on which the assigned event has been deleted
but which has not been reset. To track the dirty counters, add a 'dirty'
variable in the struct cpu_hw_events.

The security issue can only be found with an RDPMC task. To enable
RDPMC, the CR4.PCE bit has to be updated. Add a
perf_clear_dirty_counters() right before updating the CR4.PCE bit to
clear the existing dirty counters. Only the current un-assigned dirty
counters are reset, because the RDPMC assigned dirty counters will be
updated soon.
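
A minimal sketch of the clearing step, assuming the new 'dirty' bitmap in
struct cpu_hw_events (GP counters only; fixed counters would need their own
MSR base):

        for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
                /* assigned dirty counters will be rewritten soon; skip them */
                if (test_bit(i, cpuc->active_mask))
                        continue;
                wrmsrl(x86_pmu_event_addr(i), 0);
        }
        bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);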

After applying the patch,

        $ ./rdpmc_read_all_counters
        index 0x0 value 0x0
        index 0x1 value 0x0
        index 0x2 value 0x0
        index 0x3 value 0x0

        index 0x0 value 0x0
        index 0x1 value 0x0
        index 0x2 value 0x0
        index 0x3 value 0x0

Performance

The performance of a context switch is only impacted when there are two
or more perf users and one of the users is an RDPMC user. In other
cases, there is no performance impact.

The worst-case occurs when there are two users: the RDPMC user only
uses one counter; while the other user uses all available counters.
When the RDPMC task is scheduled in, all the counters, other than the
RDPMC assigned one, have to be reset.

Test results for the worst case, using a modified lat_ctx as measured
on an Ice Lake platform, which has 8 GP and 3 fixed counters (ignoring
SLOTS).

    lat_ctx -s 128K -N 1000 processes 2

Without the patch:
  The context switch time is 4.97 us

With the patch:
  The context switch time is 5.16 us

There is ~4% performance drop for the context switching time in the
worst-case.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1623693582-187370-1-git-send-email-kan.liang@linux.intel.com
2021-06-17 14:11:47 +02:00
Pawan Gupta
1348924ba8 x86/msr: Define new bits in TSX_FORCE_ABORT MSR
Intel client processors that support the IA32_TSX_FORCE_ABORT MSR
related to perf counter interaction [1] received a microcode update that
deprecates the Transactional Synchronization Extension (TSX) feature.
The bit FORCE_ABORT_RTM now defaults to 1, writes to this bit are
ignored. A new bit TSX_CPUID_CLEAR clears the TSX related CPUID bits.

The summary of changes to the IA32_TSX_FORCE_ABORT MSR are:

  Bit 0: FORCE_ABORT_RTM (legacy bit, new default=1) Status bit that
  indicates if RTM transactions are always aborted. This bit is
  essentially !SDV_ENABLE_RTM (Bit 2). Writes to this bit are ignored.

  Bit 1: TSX_CPUID_CLEAR (new bit, default=0) When set, CPUID.HLE = 0
  and CPUID.RTM = 0.

  Bit 2: SDV_ENABLE_RTM (new bit, default=0) When clear, XBEGIN will
  always abort with EAX code 0. When set, XBEGIN will not be forced to
  abort (but will always abort in SGX enclaves). This bit is intended to
  be used on developer systems. If this bit is set, transactional
  atomicity correctness is not certain. SDV = Software Development
  Vehicle (SDV), i.e. developer systems.

Performance monitoring counter 3 is usable in all cases, regardless of
the value of the above bits.
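
A sketch of using the new bit to deprecate TSX, assuming MSR index 0x10f for
IA32_TSX_FORCE_ABORT (illustrative; the authoritative values live in
msr-index.h):

        #define MSR_TSX_FORCE_ABORT     0x0000010f
        #define MSR_TFA_TSX_CPUID_CLEAR BIT(1)

        u64 msr;

        rdmsrl(MSR_TSX_FORCE_ABORT, msr);
        msr |= MSR_TFA_TSX_CPUID_CLEAR; /* CPUID.HLE = 0, CPUID.RTM = 0 */
        wrmsrl(MSR_TSX_FORCE_ABORT, msr);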

Add support for a new CPUID bit - CPUID.RTM_ALWAYS_ABORT (CPUID 7.EDX[11])
 - to indicate the status of always abort behavior.

[1] [ bp: Look for document ID 604224, "Performance Monitoring Impact
      of Intel Transactional Synchronization Extension Memory". Since
      there's no way for us to have stable links to documents... ]

 [ bp: Massage and extend commit message. ]

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/9add61915b4a4eedad74fbd869107863a28b428e.1623704845.git-series.pawan.kumar.gupta@linux.intel.com
2021-06-15 17:23:15 +02:00
Thomas Gleixner
510b80a6a0 x86/pkru: Write hardware init value to PKRU when xstate is init
When user space brings PKRU into its init state, the kernel handling
is broken:

  T1 user space
     xsave(state)
     state.header.xfeatures &= ~XFEATURE_MASK_PKRU;
     xrstor(state)

  T1 -> kernel
     schedule()
       XSAVE(S) -> T1->xsave.header.xfeatures[PKRU] == 0
       T1->flags |= TIF_NEED_FPU_LOAD;

       wrpkru();

     schedule()
       ...
       pk = get_xsave_addr(&T1->fpu->state.xsave, XFEATURE_PKRU);
       if (pk)
	 wrpkru(pk->pkru);
       else
	 wrpkru(DEFAULT_PKRU);

Because the xfeatures bit is 0 and therefore the value in the xsave
storage is not valid, get_xsave_addr() returns NULL and switch_to()
writes the default PKRU. -> FAIL #1!

So that wrecks any copy_to/from_user() on the way back to user space
which hits memory which is protected by the default PKRU value.

Assuming that this does not fail (pure luck), T1 goes back to user
space, and because TIF_NEED_FPU_LOAD is set it ends up in

  switch_fpu_return()
      __fpregs_load_activate()
        if (!fpregs_state_valid()) {
  	 load_XSTATE_from_task();
        }

But if nothing touched the FPU between T1 scheduling out and back in,
then the fpregs_state is still valid which means switch_fpu_return()
does nothing and just clears TIF_NEED_FPU_LOAD. Back to user space with
DEFAULT_PKRU loaded. -> FAIL #2!

The fix is simple: if get_xsave_addr() returns NULL then set the
PKRU value to 0 instead of the restrictive default PKRU value in
init_pkru_value.
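
A sketch of the fixed switch path, per the description above (simplified,
not the literal patch):

  pk = get_xsave_addr(&next->thread.fpu.state.xsave, XFEATURE_PKRU);
  if (pk)
          wrpkru(pk->pkru);
  else
          wrpkru(0);	/* xstate is init: hardware init value, not init_pkru_value */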

 [ bp: Massage in minor nitpicks from folks. ]

Fixes: 0cecca9d03 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144346.045616965@linutronix.de
2021-06-09 12:12:45 +02:00
Thomas Gleixner
12f7764ac6 x86/process: Check PF_KTHREAD and not current->mm for kernel threads
switch_fpu_finish() checks current->mm as indicator for kernel threads.
That's wrong because kernel threads can temporarily use a mm of a user
process via kthread_use_mm().

Check the task flags for PF_KTHREAD instead.
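
A sketch of the change (illustrative, not the literal hunk):

  /* Before - wrong, kthread_use_mm() gives a kernel thread a ->mm: */
  if (!current->mm)
          return;

  /* After - test the task flag directly: */
  if (current->flags & PF_KTHREAD)
          return;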

Fixes: 0cecca9d03 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144345.912645927@linutronix.de
2021-06-09 10:39:04 +02:00
Mike Rapoport
23721c8e92 x86/crash: Remove crash_reserve_low_1M()
The entire memory range under 1M is unconditionally reserved in
setup_arch(), so there is no need for crash_reserve_low_1M() anymore.

Remove this function.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210601075354.5149-4-rppt@kernel.org
2021-06-07 12:14:45 +02:00
Borislav Petkov
0a5f38c81e Linux 5.13-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmC9UH8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGRDYH/3WgnRz5DfVhjmlD
 Lg38mPmbZWhFibXghrYrpbVpTyhjGFRuNtXAt2p7/nYnM71wzI6Qkx6cRKZeB5HE
 /SqeksPWUEgJaUuoXeQBrBaG7q/+9ph7Rgaf2wP7k+E00RI3E4pbMubuqFAUeikr
 itKFD9aTUsgT5XbG2hH5Ddwh5hBD2C/1PVt3jpLnJkXRCn91uEh+R7SHXP/fsjAd
 ZaGOVbAGm+jePCQDBXpVUn+8fJdxvQg7rxWVRRRhi5LXG+pnAezbkGl746zBwaSw
 K6lmVSA+eAiVkKu6nR4HJv9Hax1juFbp9xpcCo4jzxO5NJF4jsmytjLEaYFdi4NX
 G542808=
 =BPDL
 -----END PGP SIGNATURE-----

Merge tag 'v5.13-rc5' into x86/cleanups

Pick up dependent changes in order to base further cleanups ontop.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-06-07 11:02:30 +02:00
Linus Torvalds
773ac53bbf - Fix out-of-spec hardware (1st gen Hygon) which does not implement
MSR_AMD64_SEV even though the spec clearly states so, and check CPUID
 bits first.
 
 - Send only one signal to a task when it is a SEGV_PKUERR si_code type.
 
 - Do away with all the wankery of reserving X amount of memory in
 the first megabyte to prevent BIOS corrupting it and simply and
 unconditionally reserve the whole first megabyte.
 
 - Make alternatives NOP optimization work at an arbitrary position
 within the patched sequence because the compiler can put single-byte
 NOPs for alignment anywhere in the sequence (32-bit retpoline), vs our
 previous assumption that the NOPs are only appended.
 
 - Force-disable ENQCMD[S] instructions support and remove update_pasid()
 because of insufficient protection against FPU state modification in an
 interrupt context, among other xstate horrors which are being addressed
 at the moment. This one limits the fallout until proper enablement.
 
 - Use cpu_feature_enabled() in the idxd driver so that it can be
 build-time disabled through the defines in .../asm/disabled-features.h.
 
 - Fix LVT thermal setup for SMI delivery mode by making sure the APIC
 LVT value is read before APIC initialization so that softlockups during
 boot do not happen at least on one machine.
 
 - Mark all legacy interrupts as legacy vectors when the IO-APIC is
 disabled and when all legacy interrupts are routed through the PIC.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmC8fdEACgkQEsHwGGHe
 VUqO5A/+IbIo8myl8VPjw6HRnHgY8rsYRjxdtmVhbaMi5XOmTMfVA9zJ6QALxseo
 Mar8bmWcezEs0/FmNvk1vEOtIgZvRVy5RqXbu3W2EgWICuzRWbj822q+KrkbY0tH
 1GWjcZQO8VlgeuQsukyj5QHaBLffpn3Fh1XB8r0cktZvwciM+LRNMnK8d6QjqxNM
 ctTX4wdI6kc076pOi7MhKxSe+/xo5Wnf27lClLMOcsO/SS42KqgeRM5psWqxihhL
 j6Y3Oe+Nm+7GKF8y841PUSlwjgWmlZa6UkR6DBTP7DGnHDa5hMpzxYvHOquq/SbA
 leV9OLqI0iWs56kSzbEcXo7do1kld62KjsA2KtUhJfVAtm+igQLh5G0jESBwrWca
 TBWaE5kt6s8wP7LXeg26o4U8XD8vqEH88Tmsjlgqb/t/PKDV9PMGvNpF00dPZFo6
 Jhj2yntJYjLQYoAQLuQm5pfnKhZy3KKvk7ViGcnp3iN9i4eU9HzawIiXnliNOrTI
 ohQ9KoRhy1Cx0UfLkR+cdK4ks0u26DC2/Ewt0CE5AP/CQ1rX6Zbv2gFLjSpy7yQo
 6A99HEpbaLuy3kDt5vn91viPNUlOveuIXIdHp6u+zgFfx88eLUoEvfR135aV/Gyh
 p5PJm/BO99KByQzFCnilkp7nBeKtnKYSmUojA6JsZKjzJimSPYo=
 =zRI1
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "A bunch of x86/urgent stuff accumulated for the last two weeks so
  lemme unload it to you.

  It should be all totally risk-free, of course. :-)

   - Fix out-of-spec hardware (1st gen Hygon) which does not implement
     MSR_AMD64_SEV even though the spec clearly states so, and check
     CPUID bits first.

   - Send only one signal to a task when it is a SEGV_PKUERR si_code
     type.

   - Do away with all the wankery of reserving X amount of memory in the
     first megabyte to prevent BIOS corrupting it and simply and
     unconditionally reserve the whole first megabyte.

   - Make alternatives NOP optimization work at an arbitrary position
     within the patched sequence because the compiler can put
     single-byte NOPs for alignment anywhere in the sequence (32-bit
     retpoline), vs our previous assumption that the NOPs are only
     appended.

   - Force-disable ENQCMD[S] instructions support and remove
     update_pasid() because of insufficient protection against FPU state
     modification in an interrupt context, among other xstate horrors
     which are being addressed at the moment. This one limits the
     fallout until proper enablement.

   - Use cpu_feature_enabled() in the idxd driver so that it can be
     build-time disabled through the defines in disabled-features.h.

   - Fix LVT thermal setup for SMI delivery mode by making sure the APIC
     LVT value is read before APIC initialization so that softlockups
     during boot do not happen at least on one machine.

   - Mark all legacy interrupts as legacy vectors when the IO-APIC is
     disabled and when all legacy interrupts are routed through the PIC"

* tag 'x86_urgent_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Check SME/SEV support in CPUID first
  x86/fault: Don't send SIGSEGV twice on SEGV_PKUERR
  x86/setup: Always reserve the first 1M of RAM
  x86/alternative: Optimize single-byte NOPs at an arbitrary position
  x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
  dmaengine: idxd: Use cpu_feature_enabled()
  x86/thermal: Fix LVT thermal setup for SMI delivery mode
  x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
2021-06-06 12:25:43 -07:00
Peter Collingbourne
92638b4e1b mm: arch: remove indirection level in alloc_zeroed_user_highpage_movable()
In an upcoming change we would like to add a flag to
GFP_HIGHUSER_MOVABLE so that it would no longer be an OR
of GFP_HIGHUSER and __GFP_MOVABLE. This poses a problem for
alloc_zeroed_user_highpage_movable() which passes __GFP_MOVABLE
into an arch-specific __alloc_zeroed_user_highpage() hook which ORs
in GFP_HIGHUSER.

Since __alloc_zeroed_user_highpage() is only ever called from
alloc_zeroed_user_highpage_movable(), we can remove one level
of indirection here. Remove __alloc_zeroed_user_highpage(),
make alloc_zeroed_user_highpage_movable() the hook, and use
GFP_HIGHUSER_MOVABLE in the hook implementations so that they will
pick up the new flag that we are going to add.
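
A sketch of what an arch hook can look like after the change (the exact
shape per architecture may differ):

  static inline struct page *
  alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
                                     unsigned long vaddr)
  {
          return alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr);
  }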

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/Ic6361c657b2cdcd896adbe0cf7cb5a7fbb1ed7bf
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210602235230.3928842-2-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-06-04 19:32:21 +01:00
Ingo Molnar
a9e906b71f Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03 19:00:49 +02:00
Thomas Gleixner
9bfecd0583 x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
While digesting the XSAVE-related horrors which got introduced with
the supervisor/user split, the recent addition of ENQCMD-related
functionality got on the radar and turned out to be similarly broken.

update_pasid(), which is only required when X86_FEATURE_ENQCMD is
available, is invoked from two places:

 1) From switch_to() for the incoming task

 2) Via an SMP function call from the IOMMU/SVM code

#1 is halfway correct as it hacks around the brokenness of get_xsave_addr()
   by enforcing the state to be 'present', but all the conditionals in that
   code are completely pointless for that.

   Also the invocation is just useless overhead because at that point
   it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
   and all of this can be handled at return to user space.

#2 is broken beyond repair. The comment in the code claims that it is safe
   to invoke this in an IPI, but that's just wishful thinking.

   FPU state of a running task is protected by fregs_lock() which is
   nothing else than a local_bh_disable(). As BH-disabled regions run
   usually with interrupts enabled, the IPI can hit a code section which
   modifies FPU state, and there is absolutely no guarantee that any of
   the assumptions made for the IPI case are true.

   Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
   invoked with a NULL pointer argument, so it can hit a completely
   unrelated task and unconditionally force an update for nothing.
   Worse, it can hit a kernel thread which operates on a user space
   address space and set a random PASID for it.

The offending commit does not cleanly revert, but it's sufficient to
force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
code to make this dysfunctional all over the place. Anything more
complex would require more surgery and none of the related functions
outside of the x86 core code are blatantly wrong, so removing those
would be overkill.

As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
required to make this actually work, this cannot result in a regression
except for related out of tree train-wrecks, but they are broken already
today.

Fixes: 20f0afd1fb ("x86/mmu: Allocate/free a PASID")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
2021-06-03 16:33:09 +02:00
Andrew Cooper
cbcddaa33d perf/x86/rapl: Use CPUID bit on AMD and Hygon parts
AMD and Hygon CPUs have a CPUID bit for RAPL.  Drop the fam17h suffix as
it is stale already.

Make use of this instead of a model check to work more nicely in virtual
environments where RAPL typically isn't available.

 [ bp: drop the ../cpu/powerflags.c hunk which is superfluous as the
   "rapl" bit name appears already in flags. ]

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210514135920.16093-1-andrew.cooper3@citrix.com
2021-06-01 21:10:33 +02:00
Borislav Petkov
9a90ed065a x86/thermal: Fix LVT thermal setup for SMI delivery mode
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):

  [    0.033858] read lvtthmr: 0x330, val: 0x200

which means:

 - bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
 - bits [10:8] have 010b which means SMI delivery mode.

Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:

  setup_local_APIC:

  	...

	/*
	 * If this comes from kexec/kcrash the APIC might be enabled in
	 * SPIV. Soft disable it before doing further initialization.
	 */
	value = apic_read(APIC_SPIV);
	value &= ~APIC_SPIV_APIC_ENABLED;
	apic_write(APIC_SPIV, value);

which means (from the SDM):

"10.4.7.2 Local APIC State After It Has Been Software Disabled

...

* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."

And this happens too:

  [    0.124111] APIC: Switch to symmetric I/O mode setup
  [    0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
  [    0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0

This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, the SMM code gets confused by the APIC
being disabled, the thermal interrupt doesn't fire at all, leaving
CPU 0 stuck in SMM forever...

Now, before

  4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")

due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.

But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.

Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.

Fixes: 4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
2021-05-31 22:32:26 +02:00
Linus Torvalds
224478289c ARM fixes:
* Another state update on exit to userspace fix
 
 * Prevent the creation of mixed 32/64 VMs
 
 * Fix regression with irqbypass not restarting the guest on failed connect
 
 * Fix regression with debug register decoding resulting in overlapping access
 
 * Commit exception state on exit to userspace
 
 * Fix the MMU notifier return values
 
 * Add missing 'static' qualifiers in the new host stage-2 code
 
 x86 fixes:
 * fix guest missed wakeup with assigned devices
 
 * fix WARN reported by syzkaller
 
 * do not use BIT() in UAPI headers
 
 * make the kvm_amd.avic parameter bool
 
 PPC fixes:
 * make halt polling heuristics consistent with other architectures
 
 selftests:
 * various fixes
 
 * new performance selftest memslot_perf_test
 
 * test UFFD minor faults in demand_paging_test
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCyF0MUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOHSgf/Q4Hm5e12Bj2xJy6A+iShnrbbT8PW
 hcIIOA7zGWXfjVYcBV7anbj7CcpzfIz0otcRBABa5mkhj+fb3YmPEb0EzCPi4Hru
 zxpcpB2w7W7WtUOIKe2EmaT+4Pk6/iLcfr8UMHMqx460akE9OmIg10QNWai3My/3
 RIOeakSckBI9e/1TQZbxH66dsLwCT0lLco7i7AWHdFxkzUQyoA34HX5pczOCBsO5
 3nXH+/txnRVhqlcyzWLVVGVzFqmpHtBqkIInDOXfUqIoxo/gOhOgF1QdMUEKomxn
 5ZFXlL5IXNtr+7yiI67iHX7CWkGZE9oJ04TgPHn6LR6wRnVvc3JInzcB5Q==
 =ollO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM fixes:

   - Another state update on exit to userspace fix

   - Prevent the creation of mixed 32/64 VMs

   - Fix regression with irqbypass not restarting the guest on failed
     connect

   - Fix regression with debug register decoding resulting in
     overlapping access

   - Commit exception state on exit to userspace

   - Fix the MMU notifier return values

   - Add missing 'static' qualifiers in the new host stage-2 code

  x86 fixes:

   - fix guest missed wakeup with assigned devices

   - fix WARN reported by syzkaller

   - do not use BIT() in UAPI headers

   - make the kvm_amd.avic parameter bool

  PPC fixes:

   - make halt polling heuristics consistent with other architectures

  selftests:

   - various fixes

   - new performance selftest memslot_perf_test

   - test UFFD minor faults in demand_paging_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
  selftests: kvm: fix overlapping addresses in memslot_perf_test
  KVM: X86: Kill off ctxt->ud
  KVM: X86: Fix warning caused by stale emulation context
  KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
  KVM: x86/mmu: Fix comment mentioning skip_4k
  KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
  KVM: x86: add start_assignment hook to kvm_x86_ops
  KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
  selftests: kvm: do only 1 memslot_perf_test run by default
  KVM: X86: Use _BITUL() macro in UAPI headers
  KVM: selftests: add shared hugetlbfs backing source type
  KVM: selftests: allow using UFFD minor faults for demand paging
  KVM: selftests: create alias mappings when using shared memory
  KVM: selftests: add shmem backing source type
  KVM: selftests: refactor vm_mem_backing_src_type flags
  KVM: selftests: allow different backing source types
  KVM: selftests: compute correct demand paging size
  KVM: selftests: simplify setup_demand_paging error handling
  KVM: selftests: Print a message if /dev/kvm is missing
  ...
2021-05-29 06:02:25 -10:00
Thomas Gleixner
7d65f9e806 x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
PIC interrupts do not support affinity setting and they can end up on
any online CPU. Therefore, it's required to mark the associated vectors
as system-wide reserved. Otherwise, the corresponding irq descriptors
are copied to the secondary CPUs but the vectors are not marked as
assigned or reserved. This works correctly for the IO/APIC case.

When the IO/APIC is disabled via config, kernel command line or lack of
enumeration then all legacy interrupts are routed through the PIC, but
nothing marks them as system-wide reserved vectors.

As a consequence, a subsequent allocation on a secondary CPU can result in
allocating one of these vectors, which triggers the BUG() in
apic_update_vector() because the interrupt descriptor slot is not empty.

Imran tried to work around that by marking those interrupts as allocated
when a CPU comes online. But that's wrong in case that the IO/APIC is
available and one of the legacy interrupts, e.g. IRQ0, has been switched to
PIC mode because then marking them as allocated will fail as they are
already marked as system vectors.

Stay consistent and update the legacy vectors after attempting IO/APIC
initialization and mark them as system vectors in case that no IO/APIC is
available.

Fixes: 69cde0004a ("x86/vector: Use matrix allocator for vector assignment")
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
2021-05-29 11:41:14 +02:00
Muralidhara M K
94a311ce24 x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
Add the (HWID, MCATYPE) tuples and names for new SMCA bank types.

Also, add their respective error descriptions to the MCE decoding module
edac_mce_amd. Also while at it, optimize the string names for some SMCA
banks.

 [ bp: Drop repeated comments, explain why UMC_V2 is a separate entry. ]

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi  <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20210526164601.66228-1-nchatrad@amd.com
2021-05-27 20:08:14 +02:00
Marcelo Tosatti
57ab87947a KVM: x86: add start_assignment hook to kvm_x86_ops
Add a start_assignment hook to kvm_x86_ops, which is called when
kvm_arch_start_assignment is done.

The hook is required to update the wakeup vector of a sleeping vCPU
when a device is assigned to the guest.
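
Sketch of the hook as described (signature assumed):

  struct kvm_x86_ops {
          ...
          void (*start_assignment)(struct kvm *kvm);
  };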

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20210525134321.254128742@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:50:13 -04:00
Mark Rutland
9be85de977 locking/atomic: make ARCH_ATOMIC a Kconfig symbol
Subsequent patches will move architectures over to the ARCH_ATOMIC API,
after preparing the asm-generic atomic implementations to function with
or without ARCH_ATOMIC.

As some architectures use the asm-generic implementations exclusively
(and don't have a local atomic.h), and to avoid the risk that
ARCH_ATOMIC isn't defined in some cases we expect, let's make the
ARCH_ATOMIC macro a Kconfig symbol instead, so that we can guarantee it
is consistently available where needed.

There should be no functional change as a result of this patch.
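
With a Kconfig symbol the preprocessor check is guaranteed to work, e.g.
(sketch):

  #ifdef CONFIG_ARCH_ATOMIC
  /* the architecture provides arch_atomic_*() operations */
  #else
  /* fall back to the asm-generic implementations */
  #endif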

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210525140232.53872-2-mark.rutland@arm.com
2021-05-26 13:20:49 +02:00
H. Peter Anvin (Intel)
2978996f62 x86/entry: Use int everywhere for system call numbers
System call numbers are defined as int, so use int everywhere for system
call numbers. This is strictly a cleanup; it should not change anything
user visible; all ABI changes have been done in the preceding patches.

[ tglx: Replaced the unsigned long cast ]

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210518191303.4135296-7-hpa@zytor.com
2021-05-25 10:07:00 +02:00
H. Peter Anvin (Intel)
283fa3b648 x86: Add native_[ig]dt_invalidate()
In some places, the native forms of descriptor table invalidation is
required. Rather than open-coding them, add explicitly native functions to
invalidate the GDT and IDT.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210519212154.511983-6-hpa@zytor.com
2021-05-21 12:36:45 +02:00
H. Peter Anvin (Intel)
8ec9069a43 x86/idt: Remove address argument from idt_invalidate()
There is no reason to specify any specific address to idt_invalidate(). It
looks mostly like an artifact of unifying code done differently by
accident. The most "sensible" address to set here is a NULL pointer -
virtual address zero, just as a visual marker.

This also makes it possible to mark the struct desc_ptr in idt_invalidate()
as static const.
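
The resulting helper then looks roughly like this (a sketch, not
necessarily the final hunk):

  void idt_invalidate(void)
  {
          static const struct desc_ptr idt = { .address = 0, .size = 0 };

          load_idt(&idt);
  }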

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210519212154.511983-5-hpa@zytor.com
2021-05-21 12:36:45 +02:00
H. Peter Anvin (Intel)
ff85100388 x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS
Add defines for the number of external vectors and number of system
vectors instead of requiring the use of (FIRST_SYSTEM_VECTOR -
FIRST_EXTERNAL_VECTOR) and (NR_VECTORS - FIRST_SYSTEM_VECTOR)
respectively. Clean up the usage sites.
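
The new defines follow directly from the text:

  #define NR_EXTERNAL_VECTORS	(FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
  #define NR_SYSTEM_VECTORS	(NR_VECTORS - FIRST_SYSTEM_VECTOR)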

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20210519212154.511983-3-hpa@zytor.com
2021-05-21 12:36:44 +02:00
H. Peter Anvin (Intel)
f1b7d45d3f x86/irq: Remove unused vectors defines
UV_BAU_MESSAGE is defined but not used anywhere in the kernel. Presumably
this is a stale vector number that can be reclaimed.

MCE_VECTOR is not an actual vector: #MC is an exception, not an interrupt
vector, and as such is correctly described as X86_TRAP_MC. MCE_VECTOR is
not used anywhere in the kernel.

Note that NMI_VECTOR *is* used; specifically it is the vector number
programmed into the APIC LVT when an NMI interrupt is configured. While
it is currently always numerically identical to X86_TRAP_NMI, that is
not necessarily going to be the case indefinitely.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20210519212154.511983-4-hpa@zytor.com
2021-05-21 12:36:44 +02:00
Masahiro Yamada
49f731f197 x86/syscalls: Use __NR_syscalls instead of __NR_syscall_max
__NR_syscall_max is only used by x86 and UML. In contrast, __NR_syscalls is
widely used by all the architectures.

Convert __NR_syscall_max to __NR_syscalls and adjust the usage sites.

This prepares x86 to switch to the generic syscallhdr.sh script.
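
The two names differ by one - the highest syscall number vs. the table
size - so the conversion amounts to (sketch):

  #define __NR_syscalls	(__NR_syscall_max + 1)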

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-6-masahiroy@kernel.org
2021-05-20 15:03:59 +02:00
Masahiro Yamada
f63815eb1d x86/unistd: Define X32_NR_syscalls only for 64-bit kernel
X32_NR_syscalls is needed only when building a 64-bit kernel.

Move it under the proper #ifdef guard.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-5-masahiroy@kernel.org
2021-05-20 15:03:59 +02:00
Masahiro Yamada
6218d0f6b8 x86/syscalls: Switch to generic syscalltbl.sh
Many architectures duplicate similar shell scripts.

Convert x86 and UML to use scripts/syscalltbl.sh. The generic script
generates separate headers for x86/64 and x86/x32 syscalls, while the
x86-specific script coalesced them into one. Adjust the code accordingly.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-3-masahiroy@kernel.org
2021-05-20 15:03:58 +02:00
Masahiro Yamada
2e958a8a51 x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*
The SYSCALL macros are mapped to symbols as follows:

  __SYSCALL_COMMON(nr, sym)  -->  __x64_<sym>
  __SYSCALL_X32(nr, sym)     -->  __x32_<sym>

Originally, the syscalls in the x32 special range (512-547) were all
compat.

This assumption is now broken after the following commits:

  55db9c0e85 ("net: remove compat_sys_{get,set}sockopt")
  5f764d624a ("fs: remove the compat readv/writev syscalls")
  598b3cec83 ("fs: remove compat_sys_vmsplice")
  c3973b401e ("mm: remove compat_process_vm_{readv,writev}")

Those commits redefined __x32_sys_* to __x64_sys_* because there is no stub
like __x32_sys_*.

Defining them as follows is more sensible and cleaner.

  __SYSCALL_COMMON(nr, sym)  -->  __x64_<sym>
  __SYSCALL_X32(nr, sym)     -->  __x64_<sym>

This works because both x86_64 and x32 use the same ABI (RDI, RSI, RDX,
R10, R8, R9).

The ugly #define __x32_sys_* will go away.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-2-masahiroy@kernel.org
2021-05-20 15:03:58 +02:00
Chang S. Bae
1c33bb0507 x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ
Historically, signal.h defines MINSIGSTKSZ (2KB) and SIGSTKSZ (8KB), for
use by all architectures with sigaltstack(2). Over time, the hardware state
size grew, but these constants did not evolve. Today, literal use of these
constants on several architectures may result in signal stack overflow, and
thus user data corruption.

A few years ago, the ARM team addressed this issue by establishing
getauxval(AT_MINSIGSTKSZ). This enables the kernel to supply a value
at runtime that is an appropriate replacement on current and future
hardware.

Add getauxval(AT_MINSIGSTKSZ) support to x86, analogous to the support
added for ARM in

  94b07c1f8c ("arm64: signal: Report signal frame size to userspace via auxv").

Also, add documentation describing the x86-specific auxiliary vectors.
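
A user-space sketch of sizing an alternate signal stack with the new
aux vector (falling back to the legacy constant on older kernels):

  #include <signal.h>
  #include <sys/auxv.h>

  static size_t sigstack_size(void)
  {
          unsigned long sz = getauxval(AT_MINSIGSTKSZ);	/* 0 if unsupported */

          return sz ? sz : SIGSTKSZ;
  }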

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210518200320.17239-4-chang.seok.bae@intel.com
2021-05-19 12:18:45 +02:00
Chang S. Bae
939ef71329 x86/signal: Introduce helpers to get the maximum signal frame size
Signal frames do not have a fixed format and can vary in size when a number
of things change: supported XSAVE features, 32 vs. 64-bit apps, etc.

Add support for a runtime method for userspace to dynamically discover
how large a signal stack needs to be.

Introduce a new variable, max_frame_size, and helper functions for the
calculation to be used in a new user interface. Set max_frame_size to a
system-wide worst-case value, instead of storing multiple app-specific
values.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lkml.kernel.org/r/20210518200320.17239-3-chang.seok.bae@intel.com
2021-05-19 11:46:27 +02:00
Thomas Gleixner
1dcc917a0e x86/idt: Rework IDT setup for boot CPU
A basic IDT setup for the boot CPU has to be done before invoking
cpu_init() because that might trigger #GP when accessing certain MSRs. This
setup cannot install the IST variants on 64-bit because the TSS setup which
is required for ISTs to work happens in cpu_init(). That leaves a
theoretical window where an NMI would invoke the ASM entry point which
relies on IST being enabled on the kernel stack, which is undefined
behaviour.

This setup logic has never worked correctly, but on the other hand a NMI
hitting the boot CPU before it has fully set up the IDT would be fatal
anyway. So the small window between the wrong NMI gate and the IST based
NMI gate is not really adding a substantial amount of risk.

But the setup logic is nevertheless more convoluted than necessary. The
recent separation of the TSS setup into a separate function makes it
possible to set up the TSS first, then initialize the IDT with the IST
variants before invoking cpu_init(), and to get rid of the post-cpu_init()
IST setup.

Move the invocation of cpu_init_exception_handling() ahead of
idt_setup_traps() and merge the IST setup into the default setup table.

Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210507114000.569244755@linutronix.de
2021-05-18 14:49:21 +02:00
Borislav Petkov
b1efd0ff4b x86/cpu: Init AP exception handling from cpu_init_secondary()
SEV-ES guests require a properly set up task register, with which the TSS
descriptor in the GDT can be located, so that the IST-type #VC exception
handler which they need to function properly can be executed.

This setup needs to happen before attempting to load microcode in
ucode_cpu_init() on secondary CPUs which can cause such #VC exceptions.

Simplify the machinery by running that exception setup from a new function
cpu_init_secondary() and explicitly call cpu_init_exception_handling() for
the boot CPU before cpu_init(). The latter prepares for fixing and
simplifying the exception/IST setup on the boot CPU.

There should be no functional changes resulting from this patch.

[ tglx: Reworked it so cpu_init_exception_handling() stays separate ]

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de
2021-05-18 14:49:21 +02:00
Paolo Bonzini
a4345a7cec KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
 - Fix regression with debug register decoding resulting in overlapping access
 - Commit exception state on exit to userspace
 - Fix the MMU notifier return values
 - Add missing 'static' qualifiers in the new host stage-2 code
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
 1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
 51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
 9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
 RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
 lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
 oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
 kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
 ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
 0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
 lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
 sP2LC4iv
 =57tg
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.13, take #1

- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to userspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
2021-05-17 09:55:12 +02:00
Linus Torvalds
8ce3648158 Two fixes for timers:
- Use the ALARM feature check in the alarmtimer core code instead of
     the old method of checking for the set_alarm() callback. Drivers
     can have that callback set but the feature bit cleared. If such
     an RTC device is selected then alarms won't work.
 
   - Use a proper define to let the preprocessor check whether Hyper-V VDSO
     clocksource should be active. The code used a constant in an enum with
     #ifdef, which evaluates to always false and disabled the clocksource
     for VDSO.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmChLI8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUJMD/wOQ/R7jXe/EWti3+w11TATvkP+ZzDv
 LcAfZ/ZP8wgrUTbjLqTTyeOFoI9q39emnq3FvCoRsF+rdHRbnZNAB3kWQmh/i1tL
 j8BuGogzvVLkBmriQIzVxYgEroCZVySWkO27B7ToBq64IeI4IBVB4jQiJis614m7
 5wTHKgN0MkAtWUmwDqkqycFDuWyZNPkR3Ht26zk46Lvk0dmIPh14zbVzezfFEtq4
 9DBeGuLDLVtzaBNLWUvnpXL7wxuFB+E8euO5otbmgRNz7CXaE6e6zy6zspK2ahmp
 FRq+nrG6yK6ucoFhGFABfKZCGorhh1ghhniPUXQKP9B29z146pN6TLFAVAutBk4z
 RoRdyGb9npoO1pB0f2tl0U65TBBlMCnLnDB3hcQ/eyMG7AC8ABHalBIFUjzEPB4b
 3eDa+ZxfkW8/oiSLTssQiJ6TJW1EQNaVja1TuHvtPi5RdasbS4LEkQnDaePQ3/nl
 tDLekfsDF4KxetZehIlRDqyN9cqIHVphs3pTysyWR7+aOTduWWF58ZtgR7SvTCVu
 7Zu+PhP06A1MtEugnwcAcpG5XYCsAXdZXinuQhPndXqazN4wMJkanXNk03z//JmQ
 wG//lFAC+9EfA8i9RDr2DeE6JISD2g+jj2Di9bjjxelp5Mi0bNZ0zdIiww6EJjRg
 v4F0vCp3By8SQg==
 =TruV
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two fixes for timers:

   - Use the ALARM feature check in the alarmtimer core code instead of
     the old method of checking for the set_alarm() callback.

     Drivers can have that callback set but the feature bit cleared. If
     such an RTC device is selected then alarms won't work.

   - Use a proper define to let the preprocessor check whether Hyper-V
     VDSO clocksource should be active.

     The code used a constant in an enum with #ifdef, which evaluates to
     always false and disabled the clocksource for VDSO"

* tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86
  alarmtimer: Check RTC features instead of ops
2021-05-16 09:42:13 -07:00
Linus Torvalds
ccb013c29d - Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCg1XQACgkQEsHwGGHe
 VUpRKA//dwzDD1QU16JucfhgFlv/9OTm48ukSwAb9lZjDEy4H1CtVL3xEHFd7L3G
 LJp0LTW+OQf0/0aGlQp/cP6sBF6G9Bf4mydx70Id4SyCQt8eZDodB+ZOOWbeteWq
 p92fJPbX8CzAglutbE+3v/MD8CCAllTiLZnJZPVj4Kux2/wF6EryDgF1+rb5q8jp
 ObTT9817mHVwWVUYzbgceZtd43IocOlKZRmF1qivwScMGylQTe1wfMjunpD5pVt8
 Zg4UDNknNfYduqpaG546E6e1zerGNaJK7SHnsuzHRUVU5icNqtgBk061CehP9Ksq
 DvYXLUl4xF16j6xJAqIZPNrBkJGdQf4q1g5x2FiBm7rSQU5owzqh5rkVk4EBFFzn
 UtzeXpqbStbsZHXycyxBNdq2HXxkFPf2NXZ+bkripPg+DifOGots1uwvAft+6iAE
 GudK6qxAvr8phR1cRyy6BahGtgOStXbZYEz0ZdU6t7qFfZMz+DomD5Jimj0kAe6B
 s6ras5xm8q3/Py87N/KNjKtSEpgsHv/7F+idde7ODtHhpRL5HCBqhkZOSRkMMZqI
 ptX1oSTvBXwRKyi5x9YhkKHUFqfFSUTfJhiRFCWK+IEAv3Y7SipJtfkqxRbI6fEV
 FfCeueKDDdViBtseaRceVLJ8Tlr6Qjy27fkPPTqJpthqPpCdoZ0=
 =ENfF
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "The three SEV commits are not really urgent material. But we figured
  since getting them in now will avoid a huge amount of conflicts
  between future SEV changes touching tip, the kvm and probably other
  trees, sending them to you now would be best.

  The idea is that the tip, kvm etc branches for 5.14 will all base
  ontop of -rc2 and thus everything will be peachy. What is more, those
  changes are purely mechanical and defines movement so they should be
  fine to go now (famous last words).

  Summary:

   - Enable -Wundef for the compressed kernel build stage

   - Reorganize SEV code to streamline and simplify future development"

* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/compressed: Enable -Wundef
  x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
  x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
  x86/sev-es: Rename sev-es.{ch} to sev.{ch}
2021-05-16 09:31:06 -07:00
Vitaly Kuznetsov
3486d2c9be clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86
Mohammed reports (https://bugzilla.kernel.org/show_bug.cgi?id=213029)
the commit e4ab4658f1 ("clocksource/drivers/hyper-v: Handle vDSO
differences inline") broke vDSO on x86. The problem appears to be that
VDSO_CLOCKMODE_HVCLOCK is an enum value in 'enum vdso_clock_mode' and
the '#ifdef VDSO_CLOCKMODE_HVCLOCK' branch evaluates to false (it is not
a define).

Use a dedicated HAVE_VDSO_CLOCKMODE_HVCLOCK define instead.
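
Sketch of the problem and the fix:

  /* Never true: an enum value is invisible to the preprocessor. */
  #ifdef VDSO_CLOCKMODE_HVCLOCK
  ...
  #endif

  /* A companion macro next to the enum entry makes the check work: */
  #define HAVE_VDSO_CLOCKMODE_HVCLOCK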

Fixes: e4ab4658f1 ("clocksource/drivers/hyper-v: Handle vDSO differences inline")
Reported-by: Mohammed Gamal <mgamal@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210513073246.1715070-1-vkuznets@redhat.com
2021-05-14 14:55:13 +02:00
Andi Kleen
28188cc461 x86/cpu: Fix core name for Sapphire Rapids
Sapphire Rapids uses Golden Cove, not Willow Cove.

Fixes: 53375a5a21 ("x86/cpu: Resort and comment Intel models")
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210513163904.3083274-1-ak@linux.intel.com
2021-05-14 14:31:14 +02:00
Ingo Molnar
41f45fb045 x86/asm: Make <asm/asm.h> valid on cross-builds as well
Stephen Rothwell reported that the objtool cross-build breaks on
non-x86 hosts:

  > tools/arch/x86/include/asm/asm.h:185:24: error: invalid register name for 'current_stack_pointer'
  >   185 | register unsigned long current_stack_pointer asm(_ASM_SP);
  >       |                        ^~~~~~~~~~~~~~~~~~~~~

The PowerPC host obviously doesn't know much about x86 register names.

Protect the kernel-specific bits of <asm/asm.h>, so that it can be
included by tooling and cross-built.
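
A sketch of the guard (assuming __KERNEL__ is the discriminator used):

  #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
  register unsigned long current_stack_pointer asm(_ASM_SP);
  #endif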

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-14 08:50:28 +02:00
Huang Rui
3743d55b28 x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
Some AMD Ryzen generations have a different calculation method for
maximum performance. 255 is not correct for all ASICs; some specific
generations should use 166 as the maximum performance value. Otherwise,
an incorrect frequency value is reported, like below:

  ~ → lscpu | grep MHz
  CPU MHz:                         3400.000
  CPU max MHz:                     7228.3198
  CPU min MHz:                     2200.0000

[ mingo: Tidied up whitespace use. ]
[ Alexander Monakov <amonakov@ispras.ru>: fix 225 -> 255 typo. ]

Fixes: 41ea667227 ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Fixed-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425073451.2557394-1-ray.huang@amd.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211791
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-13 12:10:24 +02:00
Ingo Molnar
c43426334b x86: Fix leftover comment typos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-12 20:00:51 +02:00
Ingo Molnar
6f0d271d21 Merge branch 'linus' into x86/cleanups, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-12 19:59:37 +02:00
Peter Zijlstra
ab3257042c jump_label, x86: Allow short NOPs
Now that objtool is able to rewrite jump_label instructions, have the
compiler emit a JMP, such that it can decide on the optimal encoding,
and set jump_entry::key bit1 to indicate that objtool should rewrite
the instruction to a matching NOP.

For x86_64-allyesconfig this gives:

  jl\     NOP     JMP
  short:  22997   124
  long:   30874   90

IOW, we save (22997+124) * 3 = 69363 bytes (~68 KiB) of kernel text in
hotpaths - 3 bytes for each entry converted to the short form.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194158.216763632@infradead.org
2021-05-12 14:54:56 +02:00
Peter Zijlstra
e7bf1ba97a jump_label, x86: Emit short JMP
Now that we can patch short JMP/NOP, allow the compiler/assembler to
emit short JMP instructions.

There is no way to have the assembler emit short NOPs based on the
potential displacement, so leave those long for now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.967034497@infradead.org
2021-05-12 14:54:55 +02:00
Peter Zijlstra
fa5e5dc396 jump_label, x86: Introduce jump_entry_size()
This allows architectures to have variable sized jumps.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.786777050@infradead.org
2021-05-12 14:54:55 +02:00
Peter Zijlstra
e1aa35c4c4 jump_label, x86: Factor out the __jump_table generation
Both arch_static_branch() and arch_static_branch_jump() have the same
blurb to generate the __jump_table entry, share it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.663132781@infradead.org
2021-05-12 14:54:55 +02:00
Peter Zijlstra
8bfafcdccb jump_label, x86: Strip ASM jump_label support
In preparation for variable-size jump_label support, remove all ASM
bits, which are currently unused.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.599716762@infradead.org
2021-05-12 14:54:55 +02:00
Valentin Schneider
f1a0a376ca sched/core: Initialize the idle task with preemption disabled
As pointed out by commit

  de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.

As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> init_idle().

Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

  @begone@
  @@

  -preempt_disable();
  ...
  cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12 13:01:45 +02:00
Borislav Petkov
1bc67873d4 x86/asm: Simplify __smp_mb() definition
Drop the bitness ifdeffery in favor of using _ASM_SP,
which is the helper macro for the rSP register specification
for 32 and 64 bit depending on the build.
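
The unified definition then reads roughly:

  #define __smp_mb() \
          asm volatile("lock; addl $0,-4(%%" _ASM_SP ")" ::: "memory", "cc")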

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512093310.5635-1-bp@alien8.de
2021-05-12 12:22:57 +02:00
H. Peter Anvin (Intel)
dce0aa3b2e x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]
Even if these APIs are disabled, and the arrays therefore do not
exist, having the prototypes allows us to use IS_ENABLED() rather than
using #ifdefs.

If something ends up trying to actually *use* these arrays a linker
error will ensue.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-4-hpa@zytor.com
2021-05-12 10:49:15 +02:00
H. Peter Anvin (Intel)
3e5e7f7736 x86/entry: Reverse arguments to do_syscall_64()
Reverse the order of arguments to do_syscall_64() so that the first
argument is the pt_regs pointer. This is not only consistent with
*all* other entry points from assembly, but it actually makes the
compiled code slightly better.
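
Sketch of the signature change:

  /* Before: */
  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs);

  /* After - pt_regs pointer first, like every other entry point: */
  __visible void do_syscall_64(struct pt_regs *regs, unsigned long nr);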

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-3-hpa@zytor.com
2021-05-12 10:49:14 +02:00
Linus Torvalds
0aa099a312 * Lots of bug fixes.
* Fix virtualization of RDPID
 
 * Virtualization of DR6_BUS_LOCK, which on bare metal is new in
   the 5.13 merge window
 
 * More nested virtualization migration fixes (nSVM and eVMCS)
 
 * Fix for KVM guest hibernation
 
 * Fix for warning in SEV-ES SRCU usage
 
 * Block KVM from loading on AMD machines with 5-level page tables,
   due to the APM not mentioning how host CR4.LA57 exactly impacts
   the guest.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCZWwgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE9wgAk7Io8cuvnhC9ogVqzZWrPweWqFg8
 fJcPMB584JRnMqYHBVYbkTPGe8SsCHKR2MKsNdc4cEP111cyr3suWsxOdmjJn58i
 7ahy6PcKx7wWeWwEt7O599l6CeoX5XB9ExvA6eiXAv7iZeOJHFa+Ny2GlWgauy6Y
 DELryEomx1r4IUkZaSR+2fYjzvOWTXQixwU/jwx8NcTJz0DrzknzLE7XOciPBfn0
 t0Q2rCXdL2nF1uPksZbntx8Qoa6t6GDVIyrH/ZCPQYJtAX6cjxNAh3zwCe+hMnOd
 fW8ntBH1nZRiNnberA4IICAzqnUokgPWdKBrZT2ntWHBK+aqxXHznrlPJA==
 =e+gD
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Lots of bug fixes.

 - Fix virtualization of RDPID

 - Virtualization of DR6_BUS_LOCK, which on bare metal is new to this
   release

 - More nested virtualization migration fixes (nSVM and eVMCS)

 - Fix for KVM guest hibernation

 - Fix for warning in SEV-ES SRCU usage

 - Block KVM from loading on AMD machines with 5-level page tables, due
   to the APM not mentioning how host CR4.LA57 exactly impacts the
   guest.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (48 commits)
  KVM: SVM: Move GHCB unmapping to fix RCU warning
  KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers
  kvm: Cap halt polling at kvm->max_halt_poll_ns
  tools/kvm_stat: Fix documentation typo
  KVM: x86: Prevent deadlock against tk_core.seq
  KVM: x86: Cancel pvclock_gtod_work on module removal
  KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
  KVM: X86: Expose bus lock debug exception to guest
  KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
  KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
  KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed
  KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
  KVM: x86: Move uret MSR slot management to common x86
  KVM: x86: Export the number of uret MSRs to vendor modules
  KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
  KVM: VMX: Use common x86's uret MSR list as the one true list
  KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
  KVM: VMX: Configure list of user return MSRs at module init
  KVM: x86: Add support for RDPID without RDTSCP
  KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in host
  ...
2021-05-10 12:30:45 -07:00
Arnd Bergmann
637be9183e asm-generic: use asm-generic/unaligned.h for most architectures
There are several architectures that just duplicate the contents
of asm-generic/unaligned.h, so change those over to use the
file directly, to make future modifications easier.

The exceptions are:

- arm32 sets HAVE_EFFICIENT_UNALIGNED_ACCESS, but wants the
  unaligned-struct version

- ppc64le disables HAVE_EFFICIENT_UNALIGNED_ACCESS but includes
  the access-ok version

- most m68k also uses the access-ok version without setting
  HAVE_EFFICIENT_UNALIGNED_ACCESS.

- sh4a has a custom inline asm version

- openrisc is the only one using the memmove version that
  generally leads to worse code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
2021-05-10 17:43:15 +02:00
H. Peter Anvin (Intel)
eef23e72b7 x86/asm: Use _ASM_BYTES() in <asm/nops.h>
Use the new generalized _ASM_BYTES() macro from <asm/asm.h> instead of
the "home grown" _ASM_MK_NOP() in <asm/nops.h>.

Add <asm/asm.h> and update <asm/nops.h> in the tools directory...

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510090940.924953-4-hpa@zytor.com
2021-05-10 12:33:28 +02:00
H. Peter Anvin (Intel)
d88be187a6 x86/asm: Add _ASM_BYTES() macro for a .byte ... opcode sequence
Make it easy to create a sequence of bytes that can be used either in
assembly proper or in a C asm() statement.
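
A sketch of the macro and one use (the exact definition may differ):

  #define _ASM_BYTES(x, ...) __ASM_FORM(.byte x,##__VA_ARGS__ ;)

  /* usable both in .S files and inside a C asm() statement: */
  #define _ASM_NOP1 _ASM_BYTES(0x90)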

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510090940.924953-3-hpa@zytor.com
2021-05-10 12:33:28 +02:00
H. Peter Anvin (Intel)
be5bb8021c x86/asm: Have the __ASM_FORM macros handle commas in arguments
The __ASM_FORM macros are really useful, but using them to define
instructions via .byte directives breaks because of the necessary
commas. Change the macros to handle commas correctly.

[ mingo: Removed stray whitespaces & aligned the definitions vertically. ]

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510090940.924953-2-hpa@zytor.com
2021-05-10 12:33:28 +02:00
Brijesh Singh
059e5c321a x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
2021-05-10 07:51:38 +02:00
Brijesh Singh
b81fc74d53 x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
The guest and the hypervisor contain separate macros to get and set
the GHCB MSR protocol and NAE event fields. Consolidate the GHCB
protocol definitions and helper macros in one place.

Leave the supported protocol version define in separate files to keep
the guest and hypervisor flexibility to support different GHCB version
in the same release.

There is no functional change intended.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-3-brijesh.singh@amd.com
2021-05-10 07:46:39 +02:00
Brijesh Singh
e759959fe3 x86/sev-es: Rename sev-es.{ch} to sev.{ch}
SEV-SNP builds upon the SEV-ES functionality while adding new hardware
protection. Version 2 of the GHCB specification adds new NAE events that
are SEV-SNP specific. Rename the sev-es.{ch} to sev.{ch} so that all
SEV* functionality can be consolidated in one place.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-2-brijesh.singh@amd.com
2021-05-10 07:40:27 +02:00
Sean Christopherson
03ca4589fa KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
Disallow loading KVM SVM if 5-level paging is supported.  In theory, NPT
for L1 should simply work, but there are unknowns with respect to how the
guest's MAXPHYADDR will be handled by hardware.

Nested NPT is more problematic, as running an L1 VMM that is using
2-level page tables requires stacking single-entry PDP and PML4 tables in
KVM's NPT for L2, as there are no equivalent entries in L1's NPT to
shadow.  Barring hardware magic, for 5-level paging, KVM would need to
stack another layer to handle PML5.

Opportunistically rename the lm_root pointer, which is used for the
aforementioned stacking when shadowing 2-level L1 NPT, to pml4_root to
call out that it's specifically for PML4.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210505204221.1934471-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:21 -04:00
Chenyi Qiang
e8ea85fb28 KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
The bus lock debug exception introduces a new bit, DR6_BUS_LOCK (bit 11
of DR6), to indicate that a bus lock #DB exception was generated.
Setting and clearing DR6_BUS_LOCK works like DR6_RTM: the processor
clears DR6_BUS_LOCK when the exception is generated; for all other #DB
exceptions the processor sets this bit to 1. The software #DB handler
should set this bit before returning to the interrupted task.

In the VMM, to avoid breaking CPUs without bus lock #DB exception
support, activate DR6_BUS_LOCK conditionally in the DR6_FIXED_1 bits.
When the #DB exception caused by a bus lock is intercepted, bit 11 of
the exit qualification is set to identify it. The VMM should emulate the
exception by clearing bit 11 of the guest DR6.
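
A minimal sketch of the active-low emulation described above (constant
and variable names illustrative):

    #define DR6_BUS_LOCK    (1UL << 11)

    /* Bus-lock #DB intercepted: mirror the CPU's behavior in guest DR6. */
    if (exit_qualification & DR6_BUS_LOCK)
            vcpu->arch.dr6 &= ~DR6_BUS_LOCK;   /* bus lock: bit reads as 0 */
    else
            vcpu->arch.dr6 |= DR6_BUS_LOCK;    /* other #DB: bit reads as 1 */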

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:20 -04:00
Sean Christopherson
61a05d444d KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior.  While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.

Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX.  On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.
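
A hedged sketch of what tying the MSR to the guest CPU model means in
practice (placement of the check is illustrative):

    /* Fault unless the guest's CPUID advertises RDTSCP or RDPID. */
    if (!guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP) &&
        !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
            return 1;    /* emulate #GP on RDMSR/WRMSR of MSR_TSC_AUX */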

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:19 -04:00
Sean Christopherson
e5fda4bbad KVM: x86: Move uret MSR slot management to common x86
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:19 -04:00
Sean Christopherson
9cc39a5a43 KVM: x86: Export the number of uret MSRs to vendor modules
Split out and export the number of configured user return MSRs so that
VMX can iterate over the set of MSRs without having to do its own tracking.
Keep the list itself internal to x86 so that vendor code still has to go
through the "official" APIs to add/modify entries.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
Sean Christopherson
8ea8b8d6f8 KVM: VMX: Use common x86's uret MSR list as the one true list
Drop VMX's global list of user return MSRs now that VMX doesn't re-sort
said list to isolate "active" MSRs, i.e. now that VMX's list and x86's
list have the same MSRs in the same order.

In addition to eliminating the redundant list, this will also allow moving
more of the list management into common x86.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:18 -04:00
Sean Christopherson
5104d7ffcf KVM: VMX: Disable preemption when probing user return MSRs
Disable preemption when probing a user return MSR via RDMSR/WRMSR.  If
the MSR holds a different value per logical CPU, the WRMSR could corrupt
the host's value if KVM is preempted between the RDMSR and WRMSR, and
then rescheduled on a different CPU.

Opportunistically land the helper in common x86; SVM will use the helper
in a future commit.
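
A sketch of the probe pattern being described, assuming the usual x86
MSR helpers (function name illustrative):

    /* Keep the RDMSR/WRMSR pair on one logical CPU. */
    static bool probe_user_return_msr(u32 msr)
    {
            u64 val;
            int ret;

            preempt_disable();
            ret = rdmsrl_safe(msr, &val);
            if (!ret)
                    ret = wrmsrl_safe(msr, val);   /* write back unchanged */
            preempt_enable();

            return ret == 0;
    }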

Fixes: 4be5341026 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation")
Cc: stable@vger.kernel.org
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-6-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:16 -04:00
Vitaly Kuznetsov
3d6b84132d x86/kvm: Disable all PV features on crash
The crash shutdown handler only disables kvmclock and steal time; other
PV features remain active, so we risk corrupting memory or getting
side effects in the kdump kernel. Move the crash handler to kvm.c and
unify it with the CPU offline path.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:10 -04:00
Vitaly Kuznetsov
c02027b574 x86/kvm: Disable kvmclock on all CPUs on shutdown
Currently, we disable kvmclock from the machine_shutdown() hook, and this
only happens for the boot CPU. We need to disable it for all CPUs to
guard against memory corruption, e.g. on restore from hibernate.

Note, writing '0' to the kvmclock MSR doesn't clear the memory location;
it just prevents the hypervisor from updating the location. So for the
short while after the write, and while the CPU is still alive, the clock
remains usable and correct, and we don't need to switch to some other
clocksource.
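
A minimal sketch of the per-CPU disable, assuming the standard kvmclock
system-time MSR (helper name illustrative):

    /* Tell the hypervisor to stop updating this CPU's clock page. */
    static void kvmclock_disable(void)
    {
            native_write_msr(msr_kvm_system_time, 0, 0);
    }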

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:10 -04:00
Lai Jiangshan
a217a6593c KVM/VMX: Invoke NMI non-IST entry instead of IST entry
In VMX, the host NMI handler needs to be invoked after NMI VM-Exit.
Before commit 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect
call instead of INTn"), this was done by INTn ("int $2"). But INTn
microcode is relatively expensive, so the commit reworked NMI VM-Exit
handling to invoke the kernel handler by function call.

But this missed a detail. The NMI entry point for direct invocation is
fetched from the IDT table and called on the kernel stack.  But on 64-bit
the NMI entry installed in the IDT expects to be invoked on the IST stack.
It relies on the "NMI executing" variable on the IST stack to work
correctly, which is at a fixed position in the IST stack.  When the entry
point is unexpectedly called on the kernel stack, the RSP-addressed "NMI
executing" variable is obviously also on the kernel stack and is
"uninitialized" and can cause the NMI entry code to run in the wrong way.

Provide a non-IST entry point for VMX which shares the C function with
the regular NMI entry, and invoke the new asm entry point instead.

On 32-bit this just maps to the regular NMI entry point as 32-bit has no
ISTs and is not affected.
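
A heavily simplified sketch of the resulting dispatch, treating
asm_exc_nmi_noist as the new non-IST entry symbol:

    /* Pick the NMI entry point for direct invocation after NMI VM-Exit. */
    unsigned long entry = IS_ENABLED(CONFIG_X86_64)
            ? (unsigned long)asm_exc_nmi_noist   /* runs on kernel stack */
            : (unsigned long)asm_exc_nmi;        /* 32-bit has no ISTs */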

[ tglx: Made it independent for backporting, massaged changelog ]

Fixes: 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lai Jiangshan <laijs@linux.alibaba.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87r1imi8i1.ffs@nanos.tec.linutronix.de
2021-05-05 22:54:10 +02:00
Sean Christopherson
fc48a6d1fa x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers
Drop write_tsc() and write_rdtscp_aux(); the former has no users, and the
latter has only a single user and is slightly misleading since the only
in-kernel consumer of MSR_TSC_AUX is RDPID, not RDTSCP.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210504225632.1532621-3-seanjc@google.com
2021-05-05 21:50:14 +02:00
Alexey Dobriyan
790d1ce71d x86: Delete UD0, UD1 traces
Neither instruction is used by the kernel.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YIHHYNKbiSf5N7+o@localhost.localdomain
2021-05-05 21:50:13 +02:00
Linus Torvalds
025768a966 x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant
We used to generate this constant with static jumps, which certainly
works, but generates some quite unreadable and horrid code, and extra
jumps.

It's actually much simpler to just use our alternative_asm()
infrastructure to generate a simple alternative constant, making the
generated code much more obvious (and straight-line rather than "jump
around to load the right constant").
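
A sketch of the alternative-constant pattern, assuming 4- vs 5-level
paging limits (details may differ from the final patch):

    static inline unsigned long task_size_max(void)
    {
            unsigned long ret;

            /* Patched at boot: straight-line load of the right constant. */
            alternative_io("movq %[small],%0", "movq %[large],%0",
                           X86_FEATURE_LA57,
                           "=r" (ret),
                           [small] "i" ((1ul << 47) - PAGE_SIZE),
                           [large] "i" ((1ul << 56) - PAGE_SIZE));

            return ret;
    }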

Acked-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
2021-05-05 08:52:31 +02:00
Maxim Levitsky
c74ad08f33 KVM: nSVM: fix few bugs in the vmcb02 caching logic
* Define and use an invalid GPA (all ones) as the init value for the
  last and current nested vmcb physical addresses (see the sketch
  below).

* Reset the current vmcb12 gpa to the invalid value when leaving
  the nested mode, similar to what is done on nested vmexit.

* Reset the last seen vmcb12 address when disabling nested SVM,
  as it relies on vmcb02 fields which are freed at that point.
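
A minimal sketch of the init/reset value (macro name illustrative; the
point is that an all-ones GPA can never match a real vmcb12 address):

    #define INVALID_GPA     (~0ULL)

    svm->nested.last_vmcb12_gpa = INVALID_GPA;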

Fixes: 4995a3685f ("KVM: SVM: Use a separate vmcb for the nested L2 guest")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210503125446.1353307-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-03 11:25:37 -04:00
Linus Torvalds
152d32aa84 ARM:
- Stage-2 isolation for the host kernel when running in protected mode
 
 - Guest SVE support when running in nVHE mode
 
 - Force W^X hypervisor mappings in nVHE mode
 
 - ITS save/restore for guests using direct injection with GICv4.1
 
 - nVHE panics now produce readable backtraces
 
 - Guest support for PTP using the ptp_kvm driver
 
 - Performance improvements in the S2 fault handler
 
 x86:
 
 - Optimizations and cleanup of nested SVM code
 
 - AMD: Support for virtual SPEC_CTRL
 
 - Optimizations of the new MMU code: fast invalidation,
   zap under read lock, enable/disable dirty page logging under
   read lock
 
 - /dev/kvm API for AMD SEV live migration (guest API coming soon)
 
 - support SEV virtual machines sharing the same encryption context
 
 - support SGX in virtual machines
 
 - add a few more statistics
 
 - improved directed yield heuristics
 
 - Lots and lots of cleanups
 
 Generic:
 
 - Rework of MMU notifier interface, simplifying and optimizing
 the architecture-specific code
 
 - Some selftests improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
 y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
 c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
 Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
 +2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
 M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
 =AXUi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This is a large update by KVM standards, including AMD PSP (Platform
  Security Processor, aka "AMD Secure Technology") and ARM CoreSight
  (debug and trace) changes.

  ARM:

   - CoreSight: Add support for ETE and TRBE

   - Stage-2 isolation for the host kernel when running in protected
     mode

   - Guest SVE support when running in nVHE mode

   - Force W^X hypervisor mappings in nVHE mode

   - ITS save/restore for guests using direct injection with GICv4.1

   - nVHE panics now produce readable backtraces

   - Guest support for PTP using the ptp_kvm driver

   - Performance improvements in the S2 fault handler

  x86:

   - AMD PSP driver changes

   - Optimizations and cleanup of nested SVM code

   - AMD: Support for virtual SPEC_CTRL

   - Optimizations of the new MMU code: fast invalidation, zap under
     read lock, enable/disable dirty page logging under read lock

   - /dev/kvm API for AMD SEV live migration (guest API coming soon)

   - support SEV virtual machines sharing the same encryption context

   - support SGX in virtual machines

   - add a few more statistics

   - improved directed yield heuristics

   - Lots and lots of cleanups

  Generic:

   - Rework of MMU notifier interface, simplifying and optimizing the
     architecture-specific code

   - a handful of "Get rid of oprofile leftovers" patches

   - Some selftests improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
  KVM: selftests: Speed up set_memory_region_test
  selftests: kvm: Fix the check of return value
  KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
  KVM: SVM: Skip SEV cache flush if no ASIDs have been used
  KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
  KVM: SVM: Drop redundant svm_sev_enabled() helper
  KVM: SVM: Move SEV VMCB tracking allocation to sev.c
  KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
  KVM: SVM: Unconditionally invoke sev_hardware_teardown()
  KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
  KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
  KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
  KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
  KVM: SVM: Move SEV module params/variables to sev.c
  KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
  KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
  KVM: SVM: Zero out the VMCB array used to track SEV ASID association
  x86/sev: Drop redundant and potentially misleading 'sev_enabled'
  KVM: x86: Move reverse CPUID helpers to separate header file
  KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
  ...
2021-05-01 10:14:08 -07:00
Linus Torvalds
d42f323a7d Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "A few misc subsystems and some of MM.

  175 patches.

  Subsystems affected by this patch series: ia64, kbuild, scripts, sh,
  ocfs2, kfifo, vfs, kernel/watchdog, and mm (slab-generic, slub,
  kmemleak, debug, pagecache, msync, gup, memremap, memcg, pagemap,
  mremap, dma, sparsemem, vmalloc, documentation, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (175 commits)
  mm/memory-failure: unnecessary amount of unmapping
  mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
  mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
  net: page_pool: use alloc_pages_bulk in refill code path
  net: page_pool: refactor dma_map into own function page_pool_dma_map
  SUNRPC: refresh rq_pages using a bulk page allocator
  SUNRPC: set rq_page_end differently
  mm/page_alloc: inline __rmqueue_pcplist
  mm/page_alloc: optimize code layout for __alloc_pages_bulk
  mm/page_alloc: add an array-based interface to the bulk page allocator
  mm/page_alloc: add a bulk page allocator
  mm/page_alloc: rename alloced to allocated
  mm/page_alloc: duplicate include linux/vmalloc.h
  mm, page_alloc: avoid page_to_pfn() in move_freepages()
  mm/Kconfig: remove default DISCONTIGMEM_MANUAL
  mm: page_alloc: dump migrate-failed pages
  mm/mempolicy: fix mpol_misplaced kernel-doc
  mm/mempolicy: rewrite alloc_pages_vma documentation
  mm/mempolicy: rewrite alloc_pages documentation
  mm/mempolicy: rename alloc_pages_current to alloc_pages
  ...
2021-04-30 14:38:01 -07:00
Linus Torvalds
c70a4be130 powerpc updates for 5.13
- Enable KFENCE for 32-bit.
 
  - Implement EBPF for 32-bit.
 
  - Convert 32-bit to do interrupt entry/exit in C.
 
  - Convert 64-bit BookE to do interrupt entry/exit in C.
 
  - Changes to our signal handling code to use user_access_begin/end() more extensively.
 
  - Add support for time namespaces (CONFIG_TIME_NS)
 
  - A series of fixes that allow us to reenable STRICT_KERNEL_RWX.
 
  - Other smaller features, fixes & cleanups.
 
 Thanks to: Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V,
   Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le Goater, Chen Huang, Chris
   Packham, Christophe Leroy, Christopher M. Riedl, Colin Ian King, Dan Carpenter, Daniel
   Axtens, Daniel Henrique Barboza, David Gibson, Davidlohr Bueso, Denis Efremov,
   dingsenjie, Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
   Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren Myneni, He Ying,
   Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee Jones, Leonardo Bras, Li Huafei,
   Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Nathan Chancellor, Nathan
   Lynch, Nicholas Piggin, Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi
   Bangoria, Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
   Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen Rothwell, Thadeu Lima
   de Souza Cascardo, Thomas Gleixner, Tony Ambardar, Tyrel Datwyler, Vaibhav Jain,
   Vincenzo Frascino, Xiongwei Song, Yang Li, Yu Kuai, Zhang Yunkai.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmCLV1kTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLUyD/4jrTolG4sVec211hYO+0VuJzoqN4Cf
 j2CA2Ju39butnSMiq4LJUPRB7QRZY1OofkoNFpZeDQspjfZXPz2ulpYAz+SxHWE2
 ReHPmWH1rOABlUPXFboePF4OLwmAs9eR5mN2z9HpKXbT3k78HaToLqiONyB4fVCr
 Q5TkJeRn/Y7ZJLdyPLTpczHHleQ8KoM6kT7ncXnTm6p97JOBJSrGaJ5N/8X5a4+e
 6jtgB7Pvw8jNDShSr8BDLBgBZZcmoTiuG8KfgwRZ+m+mKB1yI2X8S/a54w/lDi9g
 UcSv3jQcFLJuW+T/pYe4R330uWDYa0cwjJOtMmsJ98S4EYOevoe9fZuL97qNshme
 xtBr4q1i03G1icYOJJ8dXtvabG2rUzj8t1SCDpwYfrynzTWVRikiQYTXUBhRSFoK
 nsoklvKd2IZa485XYJ2ljSyClMy8S4yJJ9RuzZ94DTXDSJUesKuyRWGnso4mhkcl
 wvl4wwMTJvnCMKVo6dsJyV24QWfd6dABxzm04uPA94CKhG33UwK8252jXVeaohSb
 WSO7qWBONgDXQLJ0mXRcEYa9NHvFS4Jnp6APbxnHr1gS+K+PNkD4gPBf34FoyN0E
 9s27kvEYk5vr8APUclETF6+FkbGUD5bFbusjt3hYloFpAoHQ/k5pFVDsOZNPA8sW
 fDIRp05KunDojw==
 =dfKL
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Enable KFENCE for 32-bit.

 - Implement EBPF for 32-bit.

 - Convert 32-bit to do interrupt entry/exit in C.

 - Convert 64-bit BookE to do interrupt entry/exit in C.

 - Changes to our signal handling code to use user_access_begin/end()
   more extensively.

 - Add support for time namespaces (CONFIG_TIME_NS)

 - A series of fixes that allow us to reenable STRICT_KERNEL_RWX.

 - Other smaller features, fixes & cleanups.

Thanks to Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh
Kumar K.V, Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le
Goater, Chen Huang, Chris Packham, Christophe Leroy, Christopher M.
Riedl, Colin Ian King, Dan Carpenter, Daniel Axtens, Daniel Henrique
Barboza, David Gibson, Davidlohr Bueso, Denis Efremov, dingsenjie,
Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren
Myneni, He Ying, Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee
Jones, Leonardo Bras, Li Huafei, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Nathan Chancellor, Nathan Lynch, Nicholas Piggin,
Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi Bangoria,
Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen
Rothwell, Thadeu Lima de Souza Cascardo, Thomas Gleixner, Tony Ambardar,
Tyrel Datwyler, Vaibhav Jain, Vincenzo Frascino, Xiongwei Song, Yang Li,
Yu Kuai, and Zhang Yunkai.

* tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (302 commits)
  powerpc/signal32: Fix erroneous SIGSEGV on RT signal return
  powerpc: Avoid clang uninitialized warning in __get_user_size_allowed
  powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe
  powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n
  powerpc/kasan: Fix shadow start address with modules
  powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
  powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
  powerpc/44x: fix spelling mistake in Kconfig "varients" -> "variants"
  powerpc/iommu: Annotate nested lock for lockdep
  powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
  powerpc/iommu: Allocate it_map by vmalloc
  selftests/powerpc: remove unneeded semicolon
  powerpc/64s: remove unneeded semicolon
  powerpc/eeh: remove unneeded semicolon
  powerpc/selftests: Add selftest to test concurrent perf/ptrace events
  powerpc/selftests/perf-hwbreak: Add testcases for 2nd DAWR
  powerpc/selftests/perf-hwbreak: Coalesce event creation code
  powerpc/selftests/ptrace-hwbreak: Add testcases for 2nd DAWR
  powerpc/configs: Add IBMVNIC to some 64-bit configs
  selftests/powerpc: Add uaccess flush test
  ...
2021-04-30 12:22:28 -07:00
Nicholas Piggin
6f680e70b6 mm/vmalloc: provide fallback arch huge vmap support functions
If an architecture doesn't support a particular page table level as a huge
vmap page size then allow it to skip defining the support query function.
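
A sketch of the fallback pattern this enables, one level shown (the
guard mirrors an arch-provided override):

    /* Arch did not define the query: treat that level as unsupported. */
    #ifndef arch_vmap_p4d_supported
    static inline bool arch_vmap_p4d_supported(pgprot_t prot)
    {
            return false;
    }
    #endif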

Link: https://lkml.kernel.org/r/20210317062402.533919-11-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin
97dc2a1548 x86: inline huge vmap supported functions
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Link: https://lkml.kernel.org/r/20210317062402.533919-10-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin
bbc180a5ad mm: HUGE_VMAP arch support cleanup
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query.  This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
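
A sketch of the per-call query with the new prot argument (caller-side
helper is illustrative):

    /* Ask the arch, per mapping and protection, if a huge PMD is usable. */
    static bool want_pmd_mapping(unsigned long size, pgprot_t prot)
    {
            return arch_vmap_pmd_supported(prot) && size >= PMD_SIZE;
    }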

Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Linus Torvalds
635de956a7 The x86 MM changes in this cycle were:
- Implement concurrent TLB flushes, which overlaps the local TLB flush with the
    remote TLB flush. In testing this improved sysbench performance measurably by
    a couple of percentage points, especially if TLB-heavy security mitigations
    are active.
 
  - Further micro-optimizations to improve the performance of TLB flushes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCKbNcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hjYBAAsyNUa/gOu0g6/Cx8R86w9HtHHmm5vso/
 6nJjWj2fd2qJ9JShlddxvXEMeXtPTYabVWQkiiriFMuofk6JeKnlHm1Jzl6keABX
 OQFwjIFeNASPRcdXvuuYPOVWAJJdr2oL9QUr6OOK1ccQJTz/Cd0zA+VQ5YqcsCon
 yaWbkxELwKXpgql+qt66eAZ6Q2Y1TKXyrTW7ZgxQi0yeeWqMaEOub0/oyS7Ax1Rg
 qEJMwm1prb76NPzeqR/G3e4KTrDZfQ/B/KnSsz36GTJpl4eye6XqWDUgm1nAGNIc
 5dbc4Vx7JtZsUOuC0AmzWb3hsDyzVcN/lQvijdZ2RsYR3gvuYGaBhKqExqV0XH6P
 oqaWOKWCz+LqWbsgJmxCpqkt1LZl5+VUOcfJ97WkIS7DyIPtSHTzQXbBMZqKLeat
 mn5UcKYB2Gi7wsUPv6VC2ChKbDqN0VT8G86XbYylGo4BE46KoZKPUNY/QWKLUPd6
 0UKcVeNM2HFyf1C73p/tO/z7hzu3qLuMMnsphP6/c2pKLpdgawEXgbnVKNId1B/c
 NrzyhTvVaMt+Um28bBRhHONIlzPJwWcnZbdY7NqMnu+LBKQ68cL/h4FOIV/RDLNb
 GJLgfAr8fIw/zIpqYuFHiiMNo9wWqVtZko1MvXhGceXUL69QuzTra2XR/6aDxkPf
 6gQVesetTvo=
 =3Cyp
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tlb updates from Ingo Molnar:
 "The x86 MM changes in this cycle were:

   - Implement concurrent TLB flushes, which overlaps the local TLB
     flush with the remote TLB flush.

     In testing this improved sysbench performance measurably by a
     couple of percentage points, especially if TLB-heavy security
     mitigations are active.

   - Further micro-optimizations to improve the performance of TLB
     flushes"

* tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Micro-optimize smp_call_function_many_cond()
  smp: Inline on_each_cpu_cond() and on_each_cpu()
  x86/mm/tlb: Remove unnecessary uses of the inline keyword
  cpumask: Mark functions as pure
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  smp: Run functions concurrently in smp_call_function_many_cond()
2021-04-29 11:41:43 -07:00
Linus Torvalds
0080665fbd Devicetree updates for v5.13:
- Refactoring powerpc and arm64 kexec DT handling to common code. This
   enables IMA on arm64.
 
 - Add kbuild support for applying DT overlays at build time. The first
   user are the DT unittests.
 
 - Fix kerneldoc formatting and W=1 warnings in drivers/of/
 
 - Fix handling 64-bit flag on PCI resources
 
 - Bump dtschema version required to v2021.2.1
 
 - Enable undocumented compatible checks for dtbs_check. This allows
   tracking of missing binding schemas.
 
 - DT docs improvements. Regroup the DT docs and add the example schema
   and DT kernel ABI docs to the doc build.
 
 - Convert Broadcom Bluetooth and video-mux bindings to schema
 
 - Add QCom sm8250 Venus video codec binding schema
 
 - Add vendor prefixes for AESOP, YIC System Co., Ltd, and Siliconfile
   Technologies Inc.
 
 - Cleanup of DT schema type references on common properties and
   standard unit properties
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmCIYdgQHHJvYmhAa2Vy
 bmVsLm9yZwAKCRD6+121jbxhw/PKEACkOCWDnLSY9U7w1uGDHr6UgXIWOY9j8bYy
 2pTvDrVa6KZphT6yGU/hxrOk8Mqh5AMd2vUhO2OCoyyl/priTv+Ktqo+bikvJZLa
 MQm3JnrLpPy/GetdmVD8wq1l+FoeOSTnRIJqRxInsd8UFVpZImtP22ELox6KgGiv
 keVHIrjsHU/HpafK3w8wHCLikCZk+1Gl6pL/QgFDv2FaaCTKW16Dt64dPqYm49Xk
 j7YMMQWl+3NJ9ywZV0+PMbl9udi3EjGm5Ap5VfKzpj53Nh07QObg/QtH/1sj0HPo
 apyW7jAyQFyLytbjxzFL/tljtOeW/5rZos1GWThZ326e+Y0mTKUTDZShvNplfjIf
 e26FvVi7gndWlRSr30Ia5gdNFAx72IkpJUAuypBXgb+qNPchBJjAXLn9tcIcg/k+
 2R6BIB7SkVLpgTnJ1Bq1+PRqkKM+ggACdJNJIUApj44xoiG01vtGDGRaFuIio+Ch
 HT4aBbic4kLvagm8VzuiIF/sL7af5pntzArcyOfQTaZ92DyGI2C0j90rK3yPRIYM
 u9qX/24t1SXiUji74QpoQFzt/+Egy5hYXMJOJJSywUjKf7DBhehqklTjiJRQHKm6
 0DJ/n8q4lNru8F0Y4keKSuYTfHBstF7fS3UTH/rUmBAbfEwkvZe6B29KQbs+7aph
 GTw+jeoR5Q==
 =rF27
 -----END PGP SIGNATURE-----

Merge tag 'devicetree-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree updates from Rob Herring:

 - Refactor powerpc and arm64 kexec DT handling to common code. This
   enables IMA on arm64.

 - Add kbuild support for applying DT overlays at build time. The first
   user are the DT unittests.

 - Fix kerneldoc formatting and W=1 warnings in drivers/of/

 - Fix handling 64-bit flag on PCI resources

 - Bump dtschema version required to v2021.2.1

 - Enable undocumented compatible checks for dtbs_check. This allows
   tracking of missing binding schemas.

 - DT docs improvements. Regroup the DT docs and add the example schema
   and DT kernel ABI docs to the doc build.

 - Convert Broadcom Bluetooth and video-mux bindings to schema

 - Add QCom sm8250 Venus video codec binding schema

 - Add vendor prefixes for AESOP, YIC System Co., Ltd, and Siliconfile
   Technologies Inc.

 - Cleanup of DT schema type references on common properties and
   standard unit properties

* tag 'devicetree-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (64 commits)
  powerpc: If kexec_build_elf_info() fails return immediately from elf64_load()
  powerpc: Free fdt on error in elf64_load()
  of: overlay: Fix kerneldoc warning in of_overlay_remove()
  of: linux/of.h: fix kernel-doc warnings
  of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses
  dt-bindings: bcm4329-fmac: add optional brcm,ccode-map
  docs: dt: update writing-schema.rst references
  dt-bindings: media: venus: Add sm8250 dt schema
  of: base: Fix spelling issue with function param 'prop'
  docs: dt: Add DT API documentation
  of: Add missing 'Return' section in kerneldoc comments
  of: Fix kerneldoc output formatting
  docs: dt: Group DT docs into relevant sub-sections
  docs: dt: Make 'Devicetree' wording more consistent
  docs: dt: writing-schema: Include the example schema in the doc build
  docs: dt: writing-schema: Remove spurious indentation
  dt-bindings: Fix reference in submitting-patches.rst to the DT ABI doc
  dt-bindings: ddr: Add optional manufacturer and revision ID to LPDDR3
  dt-bindings: media: video-interfaces: Drop the example
  devicetree: bindings: clock: Minor typo fix in the file armada3700-tbg-clock.txt
  ...
2021-04-28 15:50:24 -07:00
Linus Torvalds
fc05860628 for-5.13/drivers-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJYcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpieWD/92qbtWl/z+9oCY212xV+YMoMqj/vGROX+U
 9i/FQJ3AIC/AUoNjZeW3NIbiaNqde5mrLlUSCHgn6RLsHK7p0GQJ4ohpbIGFG5+i
 2+Efm+vjlCxLVGrkeZEwMtsht7w/NbOYDr1Rgv9b4lQ6iWI11Mg8E337Whl1me1k
 h6bEXaioK9yqxYtsLgcn9I1qQ2p7gok0HX7zFU/XxEUZylqH6E4vQhj2+NL8UUqE
 7siFHADZE99Z7LXtOkl8YyOlGU52RCUzqDHWydvkipKjgYBi95HLXGT64Z+WCEvz
 HI54oVDRWr+uWdqDFfy+ncHm8pNeP0GV9JPhDz4ELRTSndoxB2il7wRLvp6wxV9d
 8Y4j7vb30i+8GGbM0c79dnlG76D9r5ivbTKixcXFKB128NusQR6JymIv1pKlSKhk
 H871/iOarrepAAUwVR5CtldDDJCy/q1Hks+7UXbaM3F9iNitxsJNZryQq9xdTu/N
 ThFOTz+VECG4RJLxIwmsWGiLgwr52/ybAl2MBcn+s7uC4jM/TFKpdQBfQnOAiINb
 MLlfuYRRSMg1Osb2fYZneR2ifmSNOMRdDJb+tsZGz4xWmZcj0uL4QgqcsOvuiOEQ
 veF/Ky50qw57hWtiEhvqa7/WIxzNF3G3wejqqA8hpT9Qifu0QawYTnXGUttYNBB1
 mO9R3/ccaw==
 =c0x4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - MD changes via Song:
        - raid5 POWER fix
        - raid1 failure fix
        - UAF fix for md cluster
        - mddev_find_or_alloc() clean up
        - Fix NULL pointer deref with external bitmap
        - Performance improvement for raid10 discard requests
        - Fix missing information of /proc/mdstat

 - rsxx const qualifier removal (Arnd)

 - Expose allocated brd pages (Calvin)

 - rnbd via Gioh Kim:
        - Change maintainer
        - Change domain address of maintainers' email
        - Add polling IO mode and document update
        - Fix memory leak and some bug detected by static code analysis
          tools
        - Code refactoring

 - Series of floppy cleanups/fixes (Denis)

 - s390 dasd fixes (Julian)

 - kerneldoc fixes (Lee)

 - null_blk double free (Lv)

 - null_blk virtual boundary addition (Max)

 - Remove xsysace driver (Michal)

 - umem driver removal (Davidlohr)

 - ataflop fixes (Dan)

 - Revalidate disk removal (Christoph)

 - Bounce buffer cleanups (Christoph)

 - Mark lightnvm as deprecated (Christoph)

 - mtip32xx init cleanups (Shixin)

 - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang)

* tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits)
  async_xor: increase src_offs when dropping destination page
  drivers/block/null_blk/main: Fix a double free in null_init.
  md/raid1: properly indicate failure when ending a failed write request
  md-cluster: fix use-after-free issue when removing rdev
  nvme: introduce generic per-namespace chardev
  nvme: cleanup nvme_configure_apst
  nvme: do not try to reconfigure APST when the controller is not live
  nvme: add 'kato' sysfs attribute
  nvme: sanitize KATO setting
  nvmet: avoid queuing keep-alive timer if it is disabled
  brd: expose number of allocated pages in debugfs
  ataflop: fix off by one in ataflop_probe()
  ataflop: potential out of bounds in do_format()
  drbd: Fix fall-through warnings for Clang
  block/rnbd: Use strscpy instead of strlcpy
  block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name
  block/rnbd-clt: Remove max_segment_size
  block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes
  block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev
  Documentation/ABI/rnbd-clt: Add description for nr_poll_queues
  ...
2021-04-28 14:39:37 -07:00
Linus Torvalds
42dec9a936 Perf events changes in this cycle were:
- Improve Intel uncore PMU support:
 
      - Parse uncore 'discovery tables' - a new hardware capability enumeration method
        introduced on the latest Intel platforms. This table is in a well-defined PCI
        namespace location and is read via MMIO. It is organized in an rbtree.
 
        These uncore tables will allow the discovery of standard counter blocks, but
        fancier counters still need to be enumerated explicitly.
 
      - Add Alder Lake support
 
      - Improve IIO stacks to PMON mapping support on Skylake servers
 
  - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs
    and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived)
    cores.
 
    The CPU-side feature set is entirely symmetrical - but on the PMU side there's
    core type dependent PMU functionality.
 
  - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by
    fixing the AUX allocation watermark logic.
 
  - Improve ring buffer allocation on NUMA systems
 
  - Put 'struct perf_event' into its own separate kmem_cache pool
 
  - Add support for synchronous signals for select perf events. The immediate motivation
    is to support low-overhead sampling-based race detection for user-space code. The
    feature consists of the following main changes:
 
     - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits
       inheritance of events to CLONE_THREAD.
 
     - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec.
 
     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with a u64
       ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.
 
    The siginfo support is adequate for breakpoints right now - but the new field can be used
    to introduce support for other types of metadata passed over siginfo as well.
 
  - Misc fixes, cleanups and smaller updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJGpERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j9zBAAuVbG2snV6SBSdXLhQcM66N3NckOXvSY5
 QjjhQcuwJQEK/NJB3266K5d8qSmdyRBsWf3GCsrmyBT67P1V28K44Pu7oCV0UDtf
 mpVRjEP0oR7hNsANSSgo8Fa4ZD7H5waX7dK7925Tvw8By3mMoZoddiD/84WJHhxO
 NDF+GRFaRj+/dpbhV8cdCoXTjYdkC36vYuZs3b9lu0tS9D/AJgsNy7TinLvO02Cs
 5peP+2y29dgvCXiGBiuJtEA6JyGnX3nUJCvfOZZ/DWDc3fdduARlRrc5Aiq4n/wY
 UdSkw1VTZBlZ1wMSdmHQVeC5RIH3uWUtRoNqy0Yc90lBm55AQ0EENwIfWDUDC5zy
 USdBqWTNWKMBxlEilUIyqKPQK8LW/31TRzqy8BWKPNcZt5yP5YS1SjAJRDDjSwL/
 I+OBw1vjLJamYh8oNiD5b+VLqNQba81jFASfv+HVWcULumnY6ImECCpkg289Fkpi
 BVR065boifJDlyENXFbvTxyMBXQsZfA+EhtxG7ju2Ni+TokBbogyCb3L2injPt9g
 7jjtTOqmfad4gX1WSc+215iYZMkgECcUd9E+BfOseEjBohqlo7yNKIfYnT8mE/Xq
 nb7eHjyvLiE8tRtZ+7SjsujOMHv9LhWFAbSaxU/kEVzpkp0zyd6mnnslDKaaHLhz
 goUMOL/D0lg=
 =NhQ7
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - Improve Intel uncore PMU support:

     - Parse uncore 'discovery tables' - a new hardware capability
       enumeration method introduced on the latest Intel platforms. This
       table is in a well-defined PCI namespace location and is read via
       MMIO. It is organized in an rbtree.

       These uncore tables will allow the discovery of standard counter
       blocks, but fancier counters still need to be enumerated
       explicitly.

     - Add Alder Lake support

     - Improve IIO stacks to PMON mapping support on Skylake servers

 - Add Intel Alder Lake PMU support - which requires the introduction of
   'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
   and Gracemont ('small' - Atom derived) cores.

   The CPU-side feature set is entirely symmetrical - but on the PMU
   side there's core type dependent PMU functionality.

 - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
   profiling, by fixing the AUX allocation watermark logic.

 - Improve ring buffer allocation on NUMA systems

 - Put 'struct perf_event' into its own separate kmem_cache pool

 - Add support for synchronous signals for select perf events. The
   immediate motivation is to support low-overhead sampling-based race
   detection for user-space code. The feature consists of the following
   main changes:

     - Add thread-only event inheritance via
       perf_event_attr::inherit_thread, which limits inheritance of
       events to CLONE_THREAD.

     - Add the ability for events to not leak through exec(), via
       perf_event_attr::remove_on_exec.

     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
       extend siginfo with a u64 ::si_perf, and add the breakpoint
       information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.

   The siginfo support is adequate for breakpoints right now - but the
   new field can be used to introduce support for other types of
   metadata passed over siginfo as well.

 - Misc fixes, cleanups and smaller updates.

* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  signal, perf: Add missing TRAP_PERF case in siginfo_layout()
  signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
  perf/x86: Allow for 8<num_fixed_counters<16
  perf/x86/rapl: Add support for Intel Alder Lake
  perf/x86/cstate: Add Alder Lake CPU support
  perf/x86/msr: Add Alder Lake CPU support
  perf/x86/intel/uncore: Add Alder Lake support
  perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
  perf/x86/intel: Add Alder Lake Hybrid support
  perf/x86: Support filter_match callback
  perf/x86/intel: Add attr_update for Hybrid PMUs
  perf/x86: Add structures for the attributes of Hybrid PMUs
  perf/x86: Register hybrid PMUs
  perf/x86: Factor out x86_pmu_show_pmu_cap
  perf/x86: Remove temporary pmu assignment in event_init
  perf/x86/intel: Factor out intel_pmu_check_extra_regs
  perf/x86/intel: Factor out intel_pmu_check_event_constraints
  perf/x86/intel: Factor out intel_pmu_check_num_counters
  perf/x86: Hybrid PMU support for extra_regs
  perf/x86: Hybrid PMU support for event constraints
  ...
2021-04-28 13:03:44 -07:00
Linus Torvalds
0ff0edb550 Locking changes for this cycle were:
- rtmutex cleanup & spring cleaning pass that removes ~400 lines of code
  - Futex simplifications & cleanups
  - Add debugging to the CSD code, to help track down a tenacious race (or hw problem)
  - Add lockdep_assert_not_held(), to allow code to require a lock to not be held,
    and propagate this into the ath10k driver
  - Misc LKMM documentation updates
  - Misc KCSAN updates: cleanups & documentation updates
  - Misc fixes and cleanups
  - Fix locktorture bugs with ww_mutexes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJDn0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hPrRAAryS4zPnuDsfkVk0smxo7a0lK5ljbH2Xo
 28QUZXOl6upnEV8dzbjwG7eAjt5ZJVI5tKIeG0PV0NUJH2nsyHwESdtULGGYuPf/
 4YUzNwZJa+nI/jeBnVsXCimLVxxnNCRdR7yOVOHm4ukEwa+YTNt1pvlYRmUd4YyH
 Q5cCrpb3THvLka3AAamEbqnHnAdGxHKuuHYVRkODpMQ+zrQvtN8antYsuk8kJsqM
 m+GZg/dVCuLEPah5k+lOACtcq/w7HCmTlxS8t4XLvD52jywFZLcCPvi1rk0+JR+k
 Vd9TngC09GJ4jXuDpr42YKkU9/X6qy2Es39iA/ozCvc1Alrhspx/59XmaVSuWQGo
 XYuEPx38Yuo/6w16haSgp0k4WSay15A4uhCTQ75VF4vli8Bqgg9PaxLyQH1uG8e2
 xk8U90R7bDzLlhKYIx1Vu5Z0t7A1JtB5CJtgpcfg/zQLlzygo75fHzdAiU5fDBDm
 3QQXSU2Oqzt7c5ZypioHWazARk7tL6th38KGN1gZDTm5zwifpaCtHi7sml6hhZ/4
 ATH6zEPzIbXJL2UqumSli6H4ye5ORNjOu32r7YPqLI4IDbzpssfoSwfKYlQG4Tvn
 4H1Ukirzni0gz5+wbleItzf2aeo1rocs4YQTnaT02j8NmUHUz4AzOHGOQFr5Tvh0
 wk/P4MIoSb0=
 =cOOk
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - rtmutex cleanup & spring cleaning pass that removes ~400 lines of
   code

 - Futex simplifications & cleanups

 - Add debugging to the CSD code, to help track down a tenacious race
   (or hw problem)

 - Add lockdep_assert_not_held(), to allow code to require a lock to not
   be held, and propagate this into the ath10k driver

 - Misc LKMM documentation updates

 - Misc KCSAN updates: cleanups & documentation updates

 - Misc fixes and cleanups

 - Fix locktorture bugs with ww_mutexes

* tag 'locking-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  kcsan: Fix printk format string
  static_call: Relax static_call_update() function argument type
  static_call: Fix unused variable warn w/o MODULE
  locking/rtmutex: Clean up signal handling in __rt_mutex_slowlock()
  locking/rtmutex: Restrict the trylock WARN_ON() to debug
  locking/rtmutex: Fix misleading comment in rt_mutex_postunlock()
  locking/rtmutex: Consolidate the fast/slowpath invocation
  locking/rtmutex: Make text section and inlining consistent
  locking/rtmutex: Move debug functions as inlines into common header
  locking/rtmutex: Decrapify __rt_mutex_init()
  locking/rtmutex: Remove pointless CONFIG_RT_MUTEXES=n stubs
  locking/rtmutex: Inline chainwalk depth check
  locking/rtmutex: Move rt_mutex_debug_task_free() to rtmutex.c
  locking/rtmutex: Remove empty and unused debug stubs
  locking/rtmutex: Consolidate rt_mutex_init()
  locking/rtmutex: Remove output from deadlock detector
  locking/rtmutex: Remove rtmutex deadlock tester leftovers
  locking/rtmutex: Remove rt_mutex_timed_lock()
  MAINTAINERS: Add myself as futex reviewer
  locking/mutex: Remove repeated declaration
  ...
2021-04-28 12:37:53 -07:00
Linus Torvalds
c6536676c7 - turn the stack canary into a normal __percpu variable on 32-bit which
gets rid of the LAZY_GS stuff and a lot of code.
 
 - Add an insn_decode() API which all users of the instruction decoder
 should preferably use. Its goal is to keep the details of the
 instruction decoder away from its users and simplify and streamline how
 one decodes insns in the kernel. Convert its users to it.
 
 - kprobes improvements and fixes
 
 - Set the maximum DIE per package variable on Hygon
 
 - Rip out the dynamic NOP selection and simplify all the machinery around
 selecting NOPs. Use the simplified NOPs in objtool now too.
 
 - Add Xeon Sapphire Rapids to list of CPUs that support PPIN
 
 - Simplify the retpolines by folding the entire thing into an
 alternative now that objtool can handle alternatives with stack
 ops. Then, have objtool rewrite the call to the retpoline with the
 alternative which then will get patched at boot time.
 
 - Document Intel uarch per models in intel-family.h
 
 - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
 exception on Intel.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCHyJQACgkQEsHwGGHe
 VUpjiRAAwPZdwwp08ypZuMHR4EhLNru6gYhbAoALGgtYnQjLtn5onQhIeieK+R4L
 cmZpxHT9OFp5dXHk4kwygaQBsD4pPOiIpm60kye1dN3cSbOORRdkwEoQMpKMZ+5Y
 kvVsmn7lrwRbp600KdE4G6L5+N6gEgr0r6fMFWWGK3mgVAyCzPexVHgydcp131ch
 iYMo6/pPDcNkcV/hboVKgx7GISdQ7L356L1MAIW/Sxtw6uD/X4qGYW+kV2OQg9+t
 nQDaAo7a8Jqlop5W5TQUdMLKQZ1xK8SFOSX/nTS15DZIOBQOGgXR7Xjywn1chBH/
 PHLwM5s4XF6NT5VlIA8tXNZjWIZTiBdldr1kJAmdDYacrtZVs2LWSOC0ilXsd08Z
 EWtvcpHfHEqcuYJlcdALuXY8xDWqf6Q2F7BeadEBAxwnnBg+pAEoLXI/1UwWcmsj
 wpaZTCorhJpYo2pxXckVdHz2z0LldDCNOXOjjaWU8tyaOBKEK6MgAaYU7e0yyENv
 mVc9n5+WuvXuivC6EdZ94Pcr/KQsd09ezpJYcVfMDGv58YZrb6XIEELAJIBTu2/B
 Ua8QApgRgetx+1FKb8X6eGjPl0p40qjD381TADb4rgETPb1AgKaQflmrSTIik+7p
 O+Eo/4x/GdIi9jFk3K+j4mIznRbUX0cheTJgXoiI4zXML9Jv94w=
 =bm4S
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 updates from Borislav Petkov:

 - Turn the stack canary into a normal __percpu variable on 32-bit which
   gets rid of the LAZY_GS stuff and a lot of code.

 - Add an insn_decode() API which all users of the instruction decoder
   should preferrably use. Its goal is to keep the details of the
   instruction decoder away from its users and simplify and streamline
   how one decodes insns in the kernel. Convert its users to it.

 - kprobes improvements and fixes

 - Set the maximum DIE per package variable on Hygon

 - Rip out the dynamic NOP selection and simplify all the machinery
   around selecting NOPs. Use the simplified NOPs in objtool now too.

 - Add Xeon Sapphire Rapids to list of CPUs that support PPIN

 - Simplify the retpolines by folding the entire thing into an
   alternative now that objtool can handle alternatives with stack ops.
   Then, have objtool rewrite the call to the retpoline with the
   alternative which then will get patched at boot time.

 - Document Intel uarch per models in intel-family.h

 - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
   exception on Intel.

* tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  x86, sched: Treat Intel SNC topology as default, COD as exception
  x86/cpu: Comment Skylake server stepping too
  x86/cpu: Resort and comment Intel models
  objtool/x86: Rewrite retpoline thunk calls
  objtool: Skip magical retpoline .altinstr_replacement
  objtool: Cache instruction relocs
  objtool: Keep track of retpoline call sites
  objtool: Add elf_create_undef_symbol()
  objtool: Extract elf_symbol_add()
  objtool: Extract elf_strtab_concat()
  objtool: Create reloc sections implicitly
  objtool: Add elf_create_reloc() helper
  objtool: Rework the elf_rebuild_reloc_section() logic
  objtool: Fix static_call list generation
  objtool: Handle per arch retpoline naming
  objtool: Correctly handle retpoline thunk calls
  x86/retpoline: Simplify retpolines
  x86/alternatives: Optimize optimize_nops()
  x86: Add insn_decode_kernel()
  x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration
  ...
2021-04-27 17:45:09 -07:00
Linus Torvalds
4d480dbf21 hyperv-next for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmCG9+oTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXqo5CACQrfupoIeawVUMZQOGPOKW56zcmo+l
 kwgEYdukleYebJzES3zxdAod2k45WnAJ3aMQJaL2DxZ5SZdTJG1zIK08wlP87ui8
 m80Htq/8c3fBM90gjUSjShxHw9SaWwwSQUVBKrm0doS7o0iUq0PPHHE6gvJHMX/w
 IcHug294c6ArCz0qNR5aiBxPNGixXBX7S7/5ubdjxszU2BVAzrfFLWYOWU4HzHyN
 g68BDY6F2K9+F3XOVO0zhcCdhzvIzb5Bh0V06VBKl9HRWnk28h0/Y7fBq9HVzCZu
 k7k5+o6lJUyyFkXR8MlcBKRlWnFXSHc5wIdJ/gcXTzEMsqrJlQ1vrGog
 =pGet
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull Hyper-V updates from Wei Liu:

 - VMBus enhancement

 - Free page reporting support for Hyper-V balloon driver

 - Some patches for running Linux as Arm64 Hyper-V guest

 - A few misc clean-up patches

* tag 'hyperv-next-signed-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (30 commits)
  drivers: hv: Create a consistent pattern for checking Hyper-V hypercall status
  x86/hyperv: Move hv_do_rep_hypercall to asm-generic
  video: hyperv_fb: Add ratelimit on error message
  Drivers: hv: vmbus: Increase wait time for VMbus unload
  Drivers: hv: vmbus: Initialize unload_event statically
  Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
  Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
  Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
  Drivers: hv: vmbus: Use after free in __vmbus_open()
  Drivers: hv: vmbus: remove unused function
  Drivers: hv: vmbus: Remove unused linux/version.h header
  x86/hyperv: remove unused linux/version.h header
  x86/Hyper-V: Support for free page reporting
  x86/hyperv: Fix unused variable 'hi' warning in hv_apic_read
  x86/hyperv: Fix unused variable 'msr_val' warning in hv_qlock_wait
  hv: hyperv.h: a few mundane typo fixes
  drivers: hv: Fix EXPORT_SYMBOL and tab spaces issue
  Drivers: hv: vmbus: Drop error message when 'No request id available'
  asm-generic/hyperv: Add missing function prototypes per -W1 warnings
  clocksource/drivers/hyper-v: Move handling of STIMER0 interrupts
  ...
2021-04-26 10:44:16 -07:00
Linus Torvalds
64f8e73de0 Support for enhanced split lock detection:
Newer CPUs provide a second mechanism to detect operations with lock
   prefix which go across a cache line boundary. Such operations have to
   take a bus lock, which causes a system-wide performance degradation when
   these operations happen frequently.
 
   The new mechanism is not using the #AC exception. It triggers #DB and is
   restricted to operations in user space. Kernel side split lock access can
   only be detected by the #AC based variant. Contrary to the #AC based
   mechanism the #DB based variant triggers _after_ the instruction was
   executed. The mechanism is CPUID enumerated and contrary to the #AC
   version which is based on the magic TEST_CTRL_MSR and model/family based
   enumeration on the way to become architectural.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGkr8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodUKD/9tUXhInR7+1ykEHpMvdmSp48vqY3nc
 sKmT22pPl+OchnJ62mw3T8gKpBYVleJmcCaY2qVx7hfaVcWApLGJvX4tmfXmv422
 XDSJ6b8Os6wfgx5FR//I17z8ZtXnnuKkPrTMoRsQUw2qLq31y6fdQv+GW/cc1Kpw
 mengjmPE+HnpaKbtuQfPdc4a+UvLjvzBMAlDZPTBPKYrP4FFqYVnUVwyTg5aLVDY
 gHz4V8+b502RS/zPfTAtE3J848od+NmcUPdFlcG9DVA+hR0Rl0thvruCTFiD2vVh
 i9DJ7INof5FoJDEzh0dGsD7x+MB6OY8GZyHdUMeGgIRPtWkqrG52feQQIn2YYlaL
 fB3DlpNv7NIJ/0JMlALvh8S0tEoOcYdHqH+M/3K/zbzecg/FAo+lVo8WciGLPqWs
 ykUG5/f/OnlTvgB8po1ebJu0h0jHnoK9heWWXk9zWIRVDPXHFOWKW3kSbTTb3icR
 9hfjP/SNejpmt9Ju1OTwsgnV7NALIdVX+G5jyIEsjFl31Co1RZNYhHLFvi11FWlQ
 /ssvFK9O5ZkliocGCAN9+yuOnM26VqWSCE4fis6/2aSgD2Y4Gpvb//cP96SrcNAH
 u8eXNvGLlniJP3F3JImWIfIPQTrpvQhcU4eZ6NtviXqj/utQXX6c9PZ1PLYpcvUh
 9AWF8rwhT8X4oA==
 =lmi8
 -----END PGP SIGNATURE-----

Merge tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 bus lock detection updates from Thomas Gleixner:
 "Support for enhanced split lock detection:

  Newer CPUs provide a second mechanism to detect operations with lock
  prefix which go across a cache line boundary. Such operations have to
  take a bus lock, which causes a system-wide performance degradation when
  these operations happen frequently.

  The new mechanism is not using the #AC exception. It triggers #DB and
  is restricted to operations in user space. Kernel side split lock
  access can only be detected by the #AC based variant.

  Contrary to the #AC based mechanism the #DB based variant triggers
  _after_ the instruction was executed. The mechanism is CPUID
  enumerated and contrary to the #AC version which is based on the magic
  TEST_CTRL_MSR and model/family based enumeration on the way to become
  architectural"

* tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/admin-guide: Change doc for split_lock_detect parameter
  x86/traps: Handle #DB for bus lock
  x86/cpufeatures: Enumerate #DB for bus lock detection
2021-04-26 10:09:38 -07:00
Linus Torvalds
eea2647e74 Entry code update:
Provide support for randomized stack offsets per syscall to make
  stack-based attacks that rely on the deterministic stack layout harder.
 
  The feature is based on the original idea of PaX's RANDSTACK feature, but
  uses a significantly different implementation.
 
  The offset does not affect the pt_regs location on the task stack as this
  was agreed on to be of dubious value. The offset is applied before the
  actual syscall is invoked.
 
  The offset is stored per cpu and the randomization happens at the end of
  the syscall which is less predictable than on syscall entry.
 
  The mechanism to apply the offset is via alloca(), i.e. abusing the
  despised VLAs. This comes with the drawback that stack-clash-protection
  has to be disabled for the affected compilation units and there is also
  a negative interaction with stack-protector.
 
  Those downsides are traded with the advantage that this approach does not
  require any intrusive changes to the low level assembly entry code, does
  not affect the unwinder and the correct stack alignment is handled
  automatically by the compiler.
 
  The feature is guarded with a static branch which avoids the overhead when
  disabled.
 
  Currently this is supported for X86 and ARM64.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGjz8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWsvD/4tGnPAurd6lbzxWzRjW7jOOVyzkODM
 UXtIxxICaj7o6MNcloaGe1QtJ8+QOCw3yPQfLG/SoWHse5+oUKQRL9dmWVeJyRSt
 JZ1pirkKqWrB+OmPbJKUiO3/TsZ2Z/vO41JVgVTL5/HWhOECSDzZsJkuvF/H+qYD
 ReDzd7FUNd76pwVOsXq/cxXclRa81/wMNZRVwmyAwFYE2XoPtQyTERTLrfj6aQKF
 P0txr9fEjYlPPwYOk1kjBAoJfDltNm48BBL7CGZtRlsqpNpdsJ1MkeGffhodb6F0
 pJYQMlQJHXABZb5GF+v93+iASDpRFn0EvPmLkCxQUfZYLOkRsnuEF2S/fsYX/WPo
 uin/wQKwLVdeQq9d9BwlZUKEgsQuV7Q0GVN+JnEQerwD6cWTxv4a1RIUH+K/4Wo5
 nTeJVRKcs6m7UkGQRm8JbqnUP0vCV+PSiWWB8J9CmjYeCPbkGjt6mBIsmPaDZ9VL
 4i+UX5DJayoREF/rspOBcJftUmExize49p9860UI9N6fd7DsDt7Dq9Ai+ADtZa4C
 9BPbF4NWzJq8IWLqBi+PpKBAT3JMX9qQi7s9sbrRxpxtew9Keu5qggKZJYumX71V
 qgUMk+xB86HZOrtF6F3oY0zxYv3haPvDydsDgqojtqNGk4PdAdgDYJQwMlb8QSly
 SwIWPHIfvP4R9w==
 =GMlJ
 -----END PGP SIGNATURE-----

Merge tag 'x86-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull entry code update from Thomas Gleixner:
 "Provide support for randomized stack offsets per syscall to make
   stack-based attacks that rely on the deterministic stack
   layout harder.

  The feature is based on the original idea of PaX's RANDSTACK feature,
  but uses a significantly different implementation.

  The offset does not affect the pt_regs location on the task stack as
  this was agreed on to be of dubious value. The offset is applied
  before the actual syscall is invoked.

  The offset is stored per cpu and the randomization happens at the end
  of the syscall which is less predictable than on syscall entry.

  The mechanism to apply the offset is via alloca(), i.e. abusing the
  despised VLAs. This comes with the drawback that
  stack-clash-protection has to be disabled for the affected compilation
  units and there is also a negative interaction with stack-protector.

  Those downsides are traded with the advantage that this approach does
  not require any intrusive changes to the low level assembly entry
  code, does not affect the unwinder and the correct stack alignment is
  handled automatically by the compiler.

  The feature is guarded with a static branch which avoids the overhead
  when disabled.

  Currently this is supported for X86 and ARM64"

* tag 'x86-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arm64: entry: Enable random_kstack_offset support
  lkdtm: Add REPORT_STACK for checking stack offsets
  x86/entry: Enable random_kstack_offset support
  stack: Optionally randomize kernel stack offset each syscall
  init_on_alloc: Optimize static branches
  jump_label: Provide CONFIG-driven build state defaults
2021-04-26 10:02:09 -07:00
Linus Torvalds
ea5bc7b977 Trivial cleanups and fixes all over the place.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
 VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
 +HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
 /uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
 4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
 lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
 LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
 sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
 kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
 dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
 wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
 Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
 =bO1i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 cleanups from Borislav Petkov:
 "Trivial cleanups and fixes all over the place"

* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Remove me from IDE/ATAPI section
  x86/pat: Do not compile stubbed functions when X86_PAT is off
  x86/asm: Ensure asm/proto.h can be included stand-alone
  x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
  x86/msr: Make locally used functions static
  x86/cacheinfo: Remove unneeded dead-store initialization
  x86/process/64: Move cpu_current_top_of_stack out of TSS
  tools/turbostat: Unmark non-kernel-doc comment
  x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
  x86/fpu/math-emu: Fix function cast warning
  x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
  x86: Fix various typos in comments, take #2
  x86: Remove unusual Unicode characters from comments
  x86/kaslr: Return boolean values from a function returning bool
  x86: Fix various typos in comments
  x86/setup: Remove unused RESERVE_BRK_ARRAY()
  stacktrace: Move documentation for arch_stack_walk_reliable() to header
  x86: Remove duplicate TSC DEADLINE MSR definitions
2021-04-26 09:25:47 -07:00
Linus Torvalds
81a489790a Add the guest side of SGX support in KVM guests. Work by Sean
Christopherson, Kai Huang and Jarkko Sakkinen. Along with the usual
 fixes, cleanups and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGlgYACgkQEsHwGGHe
 VUqbYA/+IgX7uBkATndzTBL6l/D3QQaMRUkOk5nO9sOzQaYJ/Qwarfakax61CZrl
 dZFdF07T/kSpMXQ6HIjzEaRx6j12xMYksrm8xBBSfXjtkIYu4auVloX2ldKhHwaK
 OyiKS+R0O/Q7XvozEiPsQCf7XwraZFO+iMJ0jMxbPO7ZvxDXDBv0Fx3d9yzPx9Qg
 BbJuIEKMoFPR3P39CWw0cOXr12Z9mmFReBKoSV4dZbZMRmv7FrA/Qlc+uS+RNZFK
 /5sCn7x27qVx8Ha/Lh42kQf+yqv1l3437aqmG2vAbHQPmnbDmBeApZ6jhaoX3jhD
 9ylkcpWFFf26oSbYAdmztZENLXRWLH6OIPxtmbf2HMsROiNR/cV0s4d2aduN/dHz
 s1VnaDFayoub9CPWtiv0RJJnwmB6d+wF2JbQGh+kPZMX3VaxVPwTVLWQdsAVaB8Y
 y7A2vZeWWHvP1a7ATbTFRDlTKKV3qDpMTD1B+hFELLNjMvyDU5c/1GhrIh0o1Jo3
 jGrauylSInMxDkpDTDhQqU+/CSnV03zdzq1DSzxgig2Q0Es6pKxQHbL0honTf0GJ
 l+8nefsQqRguZ1rVeuuSYvGPF++eqfyOiTZgN4fWdtZWJKMabsPNUbc4U3sP0/Sn
 oe3Ixo2F41E9++MODF1G80DKLD/mVLYxdzC91suOmgfB2gbRhSg=
 =KFYo
 -----END PGP SIGNATURE-----

Merge tag 'x86_sgx_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SGX updates from Borislav Petkov:
 "Add the guest side of SGX support in KVM guests. Work by Sean
  Christopherson, Kai Huang and Jarkko Sakkinen.

  Along with the usual fixes, cleanups and improvements"

* tag 'x86_sgx_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86/sgx: Mark sgx_vepc_vm_ops static
  x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()
  x86/sgx: Move provisioning device creation out of SGX driver
  x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
  x86/sgx: Add helper to update SGX_LEPUBKEYHASHn MSRs
  x86/sgx: Add encls_faulted() helper
  x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)
  x86/sgx: Move ENCLS leaf definitions to sgx.h
  x86/sgx: Expose SGX architectural definitions to the kernel
  x86/sgx: Initialize virtual EPC driver even when SGX driver is disabled
  x86/cpu/intel: Allow SGX virtualization without Launch Control support
  x86/sgx: Introduce virtual EPC for use by KVM guests
  x86/sgx: Add SGX_CHILD_PRESENT hardware error code
  x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
  x86/cpufeatures: Add SGX1 and SGX2 sub-features
  x86/cpufeatures: Make SGX_LC feature bit depend on SGX bit
  x86/sgx: Remove unnecessary kmap() from sgx_ioc_enclave_init()
  selftests/sgx: Use getauxval() to simplify test code
  selftests/sgx: Improve error detection and messages
  x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
  ...
2021-04-26 09:15:56 -07:00
Linus Torvalds
2c5ce2dba2 First big cleanup to the paravirt infra to use alternatives and thus
eliminate custom code patching. For that, the alternatives infra is
 extended to accommodate paravirt's needs and, as a result, a lot of
 paravirt patching code goes away, leading to a sizeable cleanup and
 simplification. Work by Juergen Gross.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGiXQACgkQEsHwGGHe
 VUocbw/+OkFzphK6zlNA8O3RJ24u2csXUWWUtpGlZ2220Nn/Bgyso2+fyg/NEeQg
 EmEttaY3JG/riCDfHk5Xm2saeVtsbPXN4f0sJm/Io/djF7Cm03WS0eS0aA2Rnuca
 MhmvvkrzYqZXAYVaxKkIH6sNlPgyXX7vDNPbTd/0ZCOb3ZKIyXwL+SaLatMCtE5o
 ou7e8Bj8xPSwcaCyK6sqjrT6jdpPjoTrxxrwENW8AlRu5lCU1pIY03GGhARPVoEm
 fWkZsIPn7DxhpyIqzJtEMX8EK1xN96E+NGkNuSAtJGP9HRb+3j5f4s3IUAfXiLXq
 r7NecFw8zHhPKl9J0pPCiW7JvMrCMU5xGwyeUmmhKyK2BxwvvAC173ohgMlCfB2Q
 FPIsQWemat17tSue8LIA8SmlSDQz6R+tTdUFT+vqmNV34PxOIEeSdV7HG8rs87Ec
 dYB9ENUgXqI+h2t7atE68CpTLpWXzNDcq2olEsaEUXenky2hvsi+VxNkWpmlKQ3I
 NOMU/AyH8oUzn5O0o3oxdPhDLmK5ItEFxjYjwrgLfKFQ+Y8vIMMq3LrKQGwOj+ZU
 n9qC7JjOwDKZGjd3YqNNRhnXp+w0IJvUHbyr3vIAcp8ohQwEKgpUvpZzf/BKUvHh
 nJgJSJ53GFJBbVOJMfgVq+JcFr+WO8MDKHaw6zWeCkivFZdSs4g=
 =h+km
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 alternatives/paravirt updates from Borislav Petkov:
 "First big cleanup to the paravirt infra to use alternatives and thus
  eliminate custom code patching.

  For that, the alternatives infrastructure is extended to accommodate
  paravirt's needs and, as a result, a lot of paravirt patching code
  goes away, leading to a sizeable cleanup and simplification.

  Work by Juergen Gross"

* tag 'x86_alternatives_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/paravirt: Have only one paravirt patch function
  x86/paravirt: Switch functions with custom code to ALTERNATIVE
  x86/paravirt: Add new PVOP_ALT* macros to support pvops in ALTERNATIVEs
  x86/paravirt: Switch iret pvops to ALTERNATIVE
  x86/paravirt: Simplify paravirt macros
  x86/paravirt: Remove no longer needed 32-bit pvops cruft
  x86/paravirt: Add new features for paravirt patching
  x86/alternative: Use ALTERNATIVE_TERNARY() in _static_cpu_has()
  x86/alternative: Support ALTERNATIVE_TERNARY
  x86/alternative: Support not-feature
  x86/paravirt: Switch time pvops functions to use static_call()
  static_call: Add function to query current function
  static_call: Move struct static_call_key definition to static_call_types.h
  x86/alternative: Merge include files
  x86/alternative: Drop unused feature parameter from ALTINSTR_REPLACEMENT()
2021-04-26 09:01:29 -07:00
Sean Christopherson
4daf2a1c45 x86/sev: Drop redundant and potentially misleading 'sev_enabled'
Drop the sev_enabled flag and switch its one user over to sev_active().
sev_enabled was made redundant with the introduction of sev_status in
commit b57de6cd16 ("x86/sev-es: Add SEV-ES Feature Detection").
sev_enabled and sev_active() are guaranteed to be equivalent, as each is
true iff 'sev_status & MSR_AMD64_SEV_ENABLED' is true, and are only ever
written in tandem (ignoring compressed boot's version of sev_status).

Removing sev_enabled avoids confusion over whether it refers to the guest
or the host, and will also allow KVM to usurp "sev_enabled" for its own
purposes.

No functional change intended.
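
For reference, the surviving predicate boils down to a single check of
sev_status (a minimal sketch, not the full mem_encrypt.c context):

	bool sev_active(void)
	{
		return sev_status & MSR_AMD64_SEV_ENABLED;
	}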

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-7-seanjc@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:14 -04:00
Paolo Bonzini
fd49e8ee70 Merge branch 'kvm-sev-cgroup' into HEAD 2021-04-22 13:19:01 -04:00
Nathan Tempelman
54526d1fd5 KVM: x86: Support KVM VMs sharing SEV context
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.

The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).

This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.

This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.

For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.
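
From userspace, the capability is enabled on the mirror VM with the
source VM's fd as the argument (a hedged sketch of the ioctl usage;
the fd variable names are illustrative):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
		.args = { (__u64)source_vm_fd },
	};

	if (ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap))
		err(1, "KVM_ENABLE_CAP");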

Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-21 12:20:02 -04:00
Joseph Salisbury
753ed9c95c drivers: hv: Create a consistent pattern for checking Hyper-V hypercall status
There is not a consistent pattern for checking Hyper-V hypercall status.
Existing code uses a number of variants.  The variants work, but a consistent
pattern would improve the readability of the code, and be more conformant
to what the Hyper-V TLFS says about hypercall status.

Implemented new helper functions hv_result(), hv_result_success(), and
hv_repcomp().  Changed the places where hv_do_hypercall() and related variants
are used to use the helper functions.
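
A minimal sketch of the helper pattern, assuming the TLFS convention
that the hypercall status lives in the low 16 bits of the 64-bit result
and the "reps completed" count in bits 43:32 (mask values shown here
are illustrative):

	#define HV_HYPERCALL_RESULT_MASK	GENMASK_ULL(15, 0)
	#define HV_HYPERCALL_REP_COMP_MASK	GENMASK_ULL(43, 32)
	#define HV_HYPERCALL_REP_COMP_OFFSET	32

	static inline int hv_result(u64 status)
	{
		return status & HV_HYPERCALL_RESULT_MASK;
	}

	static inline bool hv_result_success(u64 status)
	{
		return hv_result(status) == HV_STATUS_SUCCESS;
	}

	static inline unsigned int hv_repcomp(u64 status)
	{
		/* Bits [43:32] of the status carry 'reps completed'. */
		return (status & HV_HYPERCALL_REP_COMP_MASK) >>
			HV_HYPERCALL_REP_COMP_OFFSET;
	}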

Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1618620183-9967-2-git-send-email-joseph.salisbury@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-04-21 09:49:19 +00:00
Joseph Salisbury
6523592cee x86/hyperv: Move hv_do_rep_hypercall to asm-generic
This patch makes no functional changes.  It simply moves hv_do_rep_hypercall()
out of arch/x86/include/asm/mshyperv.h and into asm-generic/mshyperv.h

hv_do_rep_hypercall() is architecture independent, so it makes sense that it
should be in the architecture independent mshyperv.h, not in the x86-specific
mshyperv.h.

This is done in preparation for a follow-up patch which creates a consistent
pattern for checking Hyper-V hypercall status.

Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1618620183-9967-1-git-send-email-joseph.salisbury@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-04-21 09:49:19 +00:00
Colin Ian King
b53002e035 floppy: remove redundant assignment to variable st
The variable st is being assigned a value that is never read and
it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Denis Efremov <efremov@linux.com>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210415130020.1959951-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-20 08:59:03 -06:00
Sean Christopherson
70210c044b KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions
Add an ECREATE handler that will be used to intercept ECREATE for the
purpose of enforcing an enclave's MISCSELECT, ATTRIBUTES and XFRM, i.e.
to allow userspace to restrict SGX features via CPUID.  ECREATE will be
intercepted when any of the aforementioned masks diverges from hardware
in order to enforce the desired CPUID model, i.e. inject #GP if the
guest attempts to set a bit that hasn't been enumerated as allowed-1 in
CPUID.

Note, access to the PROVISIONKEY is not yet supported.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <c3a97684f1b71b4f4626a1fc3879472a95651725.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:55 -04:00
Sean Christopherson
3c0c2ad1ae KVM: VMX: Add basic handling of VM-Exit from SGX enclave
Add support for handling VM-Exits that originate from a guest SGX
enclave.  In SGX, an "enclave" is a new CPL3-only execution environment,
wherein the CPU and memory state is protected by hardware to make the
state inaccessible to code running outside of the enclave.  When exiting
an enclave due to an asynchronous event (from the perspective of the
enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
is automatically saved and scrubbed (the CPU loads synthetic state), and
then reloaded when re-entering the enclave.  E.g. after an instruction
based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
of the enclave instruction that triggered VM-Exit, but will instead point
to a RIP in the enclave's untrusted runtime (the guest userspace code
that coordinates entry/exit to/from the enclave).

To help a VMM recognize and handle exits from enclaves, SGX adds bits to
existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR.  Define the
new architectural bits, and add a boolean to struct vcpu_vmx to cache
VMX_EXIT_REASON_FROM_ENCLAVE.  Clear the bit in exit_reason so that
checks against exit_reason do not need to account for SGX, e.g.
"if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.

KVM is largely a passive observer of the new bits, e.g. KVM needs to
account for the bits when propagating information to a nested VMM, but
otherwise doesn't need to act differently for the majority of VM-Exits
from enclaves.

The one scenario that is directly impacted is emulation, which is for
all intents and purposes impossible[1] since KVM does not have access to
the RIP or instruction stream that triggered the VM-Exit.  The inability
to emulate is a non-issue for KVM, as most instructions that might
trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
check).  For the few instructions that conditionally #UD, KVM either never
sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
if the feature is not exposed to the guest in order to inject a #UD,
e.g. RDRAND_EXITING.

But, because it is still possible for a guest to trigger emulation,
e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
from an enclave.  This is architecturally accurate for instruction
VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
to killing the VM.  In practice, only broken or particularly stupid
guests should ever encounter this behavior.

Add a WARN in skip_emulated_instruction to detect any attempt to
modify the guest's RIP during an SGX enclave VM-Exit as all such flows
should either be unreachable or must handle exits from enclaves before
getting to skip_emulated_instruction.

[1] Impossible for all practical purposes.  Not truly impossible
    since KVM could implement some form of para-virtualization scheme.

[2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
    CPL3, so we also don't need to worry about that interaction.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:54 -04:00
Sean Christopherson
00e7646c35 KVM: x86: Define new #PF SGX error code bit
Page faults that are signaled by the SGX Enclave Page Cache Map (EPCM),
as opposed to the traditional IA32/EPT page tables, set an SGX bit in
the error code to indicate that the #PF was induced by SGX.  KVM will
need to emulate this behavior as part of its trap-and-execute scheme for
virtualizing SGX Launch Control, e.g. to inject SGX-induced #PFs if
EINIT faults in the host, and to support live migration.
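
A sketch of the new bit, following the existing PFERR_*_BIT convention
(bit 15, per the SDM's definition of the SGX bit in the #PF error code):

	#define PFERR_SGX_BIT	15
	#define PFERR_SGX_MASK	(1U << PFERR_SGX_BIT)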

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <e170c5175cb9f35f53218a7512c9e3db972b97a2.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:54 -04:00
Keqian Zhu
d90b15edbe KVM: x86: Remove unused function declaration
kvm_mmu_slot_largepage_remove_write_access() is declared but not used,
just remove it.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210406063504.17552-1-zhukeqian1@huawei.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:50 -04:00
Wanpeng Li
4a7132efff KVM: X86: Count attempted/successful directed yield
To analyze some performance issues with lock contention and scheduling,
it is nice to know when directed yields succeed or fail.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:46 -04:00
Kan Liang
d0946a882e perf/x86/intel: Hybrid PMU support for perf capabilities
Some platforms, e.g. Alder Lake, have a hybrid architecture. Although most
PMU capabilities are the same, there are still some unique PMU
capabilities for different hybrid PMUs. Perf should register a dedicated
pmu for each hybrid PMU.

Add a new struct x86_hybrid_pmu, which saves the dedicated pmu and
capabilities for each hybrid PMU.

The architecture MSR, MSR_IA32_PERF_CAPABILITIES, only indicates the
architecture features which are available on all hybrid PMUs. The
architecture features are stored in the global x86_pmu.intel_cap.

For Alder Lake, the model-specific features are perf metrics and
PEBS-via-PT. The corresponding bits of the global x86_pmu.intel_cap
should be 0 for these two features. Perf should not use the global
intel_cap to check the features on a hybrid system.
Add a dedicated intel_cap in the x86_hybrid_pmu to store the
model-specific capabilities. Use the dedicated intel_cap to replace
the global intel_cap for these two features. The dedicated intel_cap
will be set in the following "Add Alder Lake Hybrid support" patch.

Add is_hybrid() to distinguish a hybrid system. ADL may have an
alternative configuration. With that configuration, the
X86_FEATURE_HYBRID_CPU is not set. Perf cannot rely on the feature bit.
Add a new static_key_false, perf_is_hybrid, to indicate a hybrid system.
It will be assigned in the following "Add Alder Lake Hybrid support"
patch as well.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-5-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:24 +02:00
Ricardo Neri
250b3c0d79 x86/cpu: Add helper function to get the type of the current hybrid CPU
On processors with Intel Hybrid Technology (i.e., one having more than
one type of CPU in the same package), all CPUs support the same
instruction set and enumerate the same features on CPUID. Thus, all
software can run on any CPU without restrictions. However, there may be
model-specific differences among types of CPUs. For instance, each type
of CPU may support a different number of performance counters. Also,
machine check error banks may be wired differently. Even though most
software will not care about these differences, kernel subsystems
dealing with these differences must know.

Add and expose a new helper function get_this_hybrid_cpu_type() to query
the type of the current hybrid CPU. The function will be used later in
the perf subsystem.

The Intel Software Developer's Manual defines the CPU type as an 8-bit
identifier.
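
A sketch of the helper, assuming the type identifier sits in bits 31:24
of CPUID leaf 0x1A, EAX:

	#define X86_HYBRID_CPU_TYPE_ID_SHIFT	24

	u8 get_this_hybrid_cpu_type(void)
	{
		if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
			return 0;

		return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
	}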

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:23 +02:00
Ricardo Neri
a161545ab5 x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
Add feature enumeration to identify a processor with Intel Hybrid
Technology: one in which CPUs of more than one type are in the same package.
On a hybrid processor, all CPUs support the same homogeneous (i.e.,
symmetric) instruction set. All CPUs enumerate the same features in CPUID.
Thus, software (user space and kernel) can run and migrate to any CPU in
the system as well as utilize any of the enumerated features without any
change or special provisions. The main difference among CPUs in a hybrid
processor are power and performance properties.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-2-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:23 +02:00
Ben Gardon
c0e64238ac KVM: x86/mmu: Protect the tdp_mmu_roots list with RCU
Protect the contents of the TDP MMU roots list with RCU in preparation
for a future patch which will allow the iterator macro to be used under
the MMU lock in read mode.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:06:01 -04:00
Sean Christopherson
b4c5936c47 KVM: Kill off the old hva-based MMU notifier callbacks
Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:08 -04:00
Sean Christopherson
3039bcc744 KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.

In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.

The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.

Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.

Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.

MIPS, PPC, and arm64 will be converted one at a time in future patches.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:06 -04:00
Maxim Levitsky
7e582ccbbd KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
Store the supported bits into the KVM_GUESTDBG_VALID_MASK
macro, similar to how arm does this.
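
On x86 this ends up as a mask of the supported KVM_GUESTDBG_* flags
(a sketch of the definition):

	#define KVM_GUESTDBG_VALID_MASK			\
		(KVM_GUESTDBG_ENABLE |			\
		 KVM_GUESTDBG_SINGLESTEP |		\
		 KVM_GUESTDBG_USE_HW_BP |		\
		 KVM_GUESTDBG_USE_SW_BP |		\
		 KVM_GUESTDBG_INJECT_BP |		\
		 KVM_GUESTDBG_INJECT_DB)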

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:02 -04:00
Sean Christopherson
5f7c292b89 KVM: Move prototypes for MMU notifier callbacks to generic code
Move the prototypes for the MMU notifier callbacks out of arch code and
into common code.  There is no benefit to having each arch replicate the
prototypes since any deviation from the invocation in common code will
explode.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:55 -04:00
Paolo Bonzini
d9bd0082e2 Merge remote-tracking branch 'tip/x86/sgx' into kvm-next
Pull generic x86 SGX changes needed to support SGX in virtual machines.
2021-04-17 08:29:47 -04:00
Christophe Leroy
808094fcbf lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()
For the same reason as commit e876f0b69d ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculation of relative position to code.

As the timens_vdso_data is next page to vdso_data, provide
vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in following patches.
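
On an architecture where the timens data really is the page following
vdso_data, the hook can be as simple as this hypothetical sketch:

	static __always_inline
	const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
	{
		/* Timens data lives in the page after vdso_data. */
		return (void *)vd + PAGE_SIZE;
	}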

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/539c4204b1baa77c55f758904a1ea239abbc7a5c.1617209142.git.christophe.leroy@csgroup.eu
2021-04-14 23:04:44 +10:00
Jan Kiszka
f7b21a0e41 x86/asm: Ensure asm/proto.h can be included stand-alone
Fix:

  ../arch/x86/include/asm/proto.h:14:30: warning: ‘struct task_struct’ declared \
    inside parameter list will not be visible outside of this definition or declaration
  long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
                               ^~~~~~~~~~~

  .../arch/x86/include/asm/proto.h:40:34: warning: ‘struct task_struct’ declared \
    inside parameter list will not be visible outside of this definition or declaration
   long do_arch_prctl_common(struct task_struct *task, int option,
                                    ^~~~~~~~~~~

if linux/sched.h hasn't been included previously. This fixes a build error
when this header is used outside of the kernel tree.
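
The fix is a forward declaration, which is enough because the header
only uses pointers to the struct (a sketch):

	/* Forward declaration; linux/sched.h is not needed for pointers. */
	struct task_struct;

	long do_arch_prctl_64(struct task_struct *task, int option,
			      unsigned long arg2);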

 [ bp: Massage commit message. ]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/b76b4be3-cf66-f6b2-9a6c-3e7ef54f9845@web.de
2021-04-12 13:12:46 +02:00
Andrew Cooper
99cb64de36 x86/cpu: Comment Skylake server stepping too
Further to

  53375a5a21 ("x86/cpu: Resort and comment Intel models"),

CascadeLake and CooperLake are steppings of Skylake, and make up the 1st
to 3rd generation "Xeon Scalable Processor" line.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210409121027.16437-1-andrew.cooper3@citrix.com
2021-04-10 11:14:33 +02:00
Linus Torvalds
adb2c4174f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 patches.

  Subsystems affected by this patch series: mm (kasan, gup, pagecache,
  and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2, ia64, and lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
  kfence, x86: fix preemptible warning on KPTI-enabled systems
  lib/test_kasan_module.c: suppress unused var warning
  kasan: fix conflict with page poisoning
  fs: direct-io: fix missing sdio->boundary
  ia64: fix user_stack_pointer() for ptrace()
  ocfs2: fix deadlock between setattr and dio_end_io_write
  gcov: re-fix clang-11+ support
  nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
  mm/gup: check page posion status for coredump.
  .mailmap: fix old email addresses
  mailmap: update email address for Jordan Crouse
  treewide: change my e-mail address, fix my name
  MAINTAINERS: update CZ.NIC's Turris information
2021-04-09 17:06:32 -07:00
Marco Elver
6a77d38efc kfence, x86: fix preemptible warning on KPTI-enabled systems
On systems with KPTI enabled, we can currently observe the following
warning:

  BUG: using smp_processor_id() in preemptible
  caller is invalidate_user_asid+0x13/0x50
  CPU: 6 PID: 1075 Comm: dmesg Not tainted 5.12.0-rc4-gda4a2b1a5479-kfence_1+ #1
  Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012
  Call Trace:
   dump_stack+0x7f/0xad
   check_preemption_disabled+0xc8/0xd0
   invalidate_user_asid+0x13/0x50
   flush_tlb_one_kernel+0x5/0x20
   kfence_protect+0x56/0x80
   ...

While it normally makes sense to require preemption to be off, so that
the expected CPU's TLB is flushed and not another, in our case it really
is best-effort (see comments in kfence_protect_page()).

Avoid the warning by disabling preemption around flush_tlb_one_kernel().
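
The fix amounts to bracketing the flush (sketched here as a hypothetical
helper; in the tree it sits inline in kfence_protect_page()):

	static void kfence_flush_tlb(unsigned long addr)
	{
		/*
		 * Best-effort flush: which CPU's TLB gets flushed does not
		 * matter here, so just silence the preemption warning.
		 */
		preempt_disable();
		flush_tlb_one_kernel(addr);
		preempt_enable();
	}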

Link: https://lore.kernel.org/lkml/YGIDBAboELGgMgXy@elver.google.com/
Link: https://lkml.kernel.org/r/20210330065737.652669-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-09 14:54:23 -07:00
Peter Zijlstra
53375a5a21 x86/cpu: Resort and comment Intel models
The INTEL_FAM6 list has become a mess again. Try and bring some sanity
back into it.

Where previously we had one microarch per year and a number of SKUs
within that, this no longer seems to be the case. We now get different
uarch names that share a 'core' design.

Add the core name starting at skylake and reorder to keep the cores
in chronological order. Furthermore, Intel marketed the names {Amber,
Coffee, Whiskey} Lake, but those are in fact steppings of Kaby Lake; add
comments for them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/YE+HhS8i0gshHD3W@hirez.programming.kicks-ass.net
2021-04-08 14:22:10 +02:00
Kees Cook
fe950f6020 x86/entry: Enable random_kstack_offset support
Allow for a randomized stack offset on a per-syscall basis, with roughly
5-6 bits of entropy, depending on compiler and word size. Since the
method of offsetting uses macros, this cannot live in the common entry
code (the stack offset needs to be retained for the life of the syscall,
which means it needs to happen at the actual entry point).
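
The core of the mechanism, as described, looks roughly like the sketch
below (static branch guard plus a compiler-generated alloca(); names
follow the random_kstack_offset series):

	#define add_random_kstack_offset() do {				\
		if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
					&randomize_kstack_offset)) {	\
			u32 offset = raw_cpu_read(kstack_offset);	\
			u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset)); \
			/* Keep the allocation alive for the syscall. */ \
			asm volatile("" :: "r"(ptr) : "memory");	\
		}							\
	} while (0)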

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-5-keescook@chromium.org
2021-04-08 14:05:20 +02:00
Vitaly Kuznetsov
fa26d0c778 ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit 8cdddd182b ("ACPI: processor: Fix CPU0 wakeup in
acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying
wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()/mwait_play_dead()
into acpi_idle_play_dead(). The problem is that these functions are not
exported to modules, so the build fails when CONFIG_ACPI_PROCESSOR=m.

The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0()
(the latter from assembly), but it seems putting the whole pattern into a
new function and exporting it instead is better.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8cdddd182b ("CPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-07 19:02:43 +02:00
Sean Christopherson
b3754e5d3d x86/sgx: Move provisioning device creation out of SGX driver
And extract sgx_set_attribute() out of sgx_ioc_enclave_provision() and
export it as a symbol for KVM to use.

The provisioning key is sensitive. The SGX driver only allows creating
an enclave which can access the provisioning key when the enclave
creator has permission to open /dev/sgx_provision. It should apply to
a VM as well, as the provisioning key is platform-specific, thus an
unrestricted VM can also potentially compromise the provisioning key.

Move the provisioning device creation out of sgx_drv_init() to
sgx_init() as a preparation for adding SGX virtualization support,
so that even if the SGX driver is not enabled due to flexible launch
control not being available, SGX virtualization can still be enabled
and used to restrict a VM's ability to access the
provisioning key.

 [ bp: Massage commit message. ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/0f4d044d621561f26d5f4ef73e8dc6cd18cc7e79.1616136308.git.kai.huang@intel.com
2021-04-06 19:18:46 +02:00
Sean Christopherson
d155030b1e x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
The host kernel must intercept ECREATE to impose policies on guests, and
intercept EINIT to be able to write the guest's virtual SGX_LEPUBKEYHASH
MSR values to hardware before running the guest's EINIT so it can run
correctly according to hardware behavior.

Provide wrappers around __ecreate() and __einit() to hide the ugliness
of overloading the ENCLS return value to encode multiple error formats
in a single int.  KVM will trap-and-execute ECREATE and EINIT as part
of SGX virtualization, and reflect the ENCLS execution result to the
guest by setting up the guest's GPRs or, on an exception, injecting the
correct fault based on the return value of __ecreate() and __einit().

Use host userspace addresses (provided by KVM based on guest physical
address of ENCLS parameters) to execute ENCLS/EINIT when possible.
Accesses to both EPC and memory originating from ENCLS are subject to
segmentation and paging mechanisms.  It's also possible to generate
kernel mappings for ENCLS parameters by resolving PFN but using
__uaccess_xx() is simpler.

 [ bp: Return early if the __user memory accesses fail, use
   cpu_feature_enabled(). ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20e09daf559aa5e9e680a0b4b5fba940f1bad86e.1616136308.git.kai.huang@intel.com
2021-04-06 19:18:27 +02:00
Sean Christopherson
32ddda8e44 x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)
Define the ENCLS leafs that are available with SGX2, also referred to as
Enclave Dynamic Memory Management (EDMM).  The leafs will be used by KVM
to conditionally expose SGX2 capabilities to guests.
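
The new leaves are just additional function numbers in the ENCLS
encoding (values per the SDM; shown here as a sketch of the enum
additions):

	enum sgx_encls_function {
		/* ... SGX1 leaves elided ... */
		EAUG	= 0x0D,
		EMODPR	= 0x0E,
		EMODT	= 0x0F,
	};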

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/5f0970c251ebcc6d5add132f0d750cc753b7060f.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:42 +02:00
Sean Christopherson
9c55c78a73 x86/sgx: Move ENCLS leaf definitions to sgx.h
Move the ENCLS leaf definitions to sgx.h so that they can be used by
KVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/2e6cd7c5c1ced620cfcd292c3c6c382827fde6b2.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson
8ca52cc38d x86/sgx: Expose SGX architectural definitions to the kernel
Expose SGX architectural structures, as KVM will use many of the
architectural constants and structs to virtualize SGX.

Name the new header file as asm/sgx.h, rather than asm/sgx_arch.h, to
have a single header providing SGX facilities to share with other kernel
components. Also update MAINTAINERS to include asm/sgx.h.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/6bf47acd91ab4d709e66ad1692c7803e4c9063a0.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Peter Zijlstra
9bc0bb5072 objtool/x86: Rewrite retpoline thunk calls
When the compiler emits: "CALL __x86_indirect_thunk_\reg" for an
indirect call, have objtool rewrite it to:

	ALTERNATIVE "call __x86_indirect_thunk_\reg",
		    "call *%reg", ALT_NOT(X86_FEATURE_RETPOLINE)

Additionally, in order to not emit endless identical
.altinst_replacement chunks, use a global symbol for them, see
__x86_indirect_alt_*.

This also avoids objtool from having to do code generation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/20210326151300.320177914@infradead.org
2021-04-02 12:47:28 +02:00
Peter Zijlstra
119251855f x86/retpoline: Simplify retpolines
Due to:

  c9c324dc22 ("objtool: Support stack layout changes in alternatives")

it is now possible to simplify the retpolines.

Currently our retpolines consist of 2 symbols:

 - __x86_indirect_thunk_\reg: the compiler target
 - __x86_retpoline_\reg:  the actual retpoline.

Both are consecutive in code and aligned such that for any one register
they both live in the same cacheline:

  0000000000000000 <__x86_indirect_thunk_rax>:
   0:   ff e0                   jmpq   *%rax
   2:   90                      nop
   3:   90                      nop
   4:   90                      nop

  0000000000000005 <__x86_retpoline_rax>:
   5:   e8 07 00 00 00          callq  11 <__x86_retpoline_rax+0xc>
   a:   f3 90                   pause
   c:   0f ae e8                lfence
   f:   eb f9                   jmp    a <__x86_retpoline_rax+0x5>
  11:   48 89 04 24             mov    %rax,(%rsp)
  15:   c3                      retq
  16:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)

The thunk is an alternative_2, where one option is a JMP to the
retpoline. This was done so that objtool didn't need to deal with
alternatives with stack ops. But that problem has been solved, so now
it is possible to fold the entire retpoline into the alternative to
simplify and consolidate unused bytes:

  0000000000000000 <__x86_indirect_thunk_rax>:
   0:   ff e0                   jmpq   *%rax
   2:   90                      nop
   3:   90                      nop
   4:   90                      nop
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   90                      nop
   9:   90                      nop
   a:   90                      nop
   b:   90                      nop
   c:   90                      nop
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop
  10:   90                      nop
  11:   66 66 2e 0f 1f 84 00 00 00 00 00        data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   0f 1f 40 00             nopl   0x0(%rax)

Notice that since the longest alternative sequence is now:

   0:   e8 07 00 00 00          callq  c <.altinstr_replacement+0xc>
   5:   f3 90                   pause
   7:   0f ae e8                lfence
   a:   eb f9                   jmp    5 <.altinstr_replacement+0x5>
   c:   48 89 04 24             mov    %rax,(%rsp)
  10:   c3                      retq

17 bytes, we have 15 bytes NOP at the end of our 32 byte slot. (IOW, if
we can shrink the retpoline by 1 byte we can pack it more densely).

 [ bp: Massage commit message. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210326151259.506071949@infradead.org
2021-04-02 12:42:04 +02:00
Peter Zijlstra
23c1ad538f x86/alternatives: Optimize optimize_nops()
Currently, optimize_nops() scans to see if the alternative starts with
NOPs. However, the emit pattern is:

  141:	\oldinstr
  142:	.skip (len-(142b-141b)), 0x90

That is, when 'oldinstr' is short, the tail is padded with NOPs. This case
never gets optimized.

Rewrite optimize_nops() to replace any trailing string of NOPs inside
the alternative to larger NOPs. Also run it irrespective of patching,
replacing NOPs in both the original and replaced code.

A direct consequence is that 'padlen' becomes superfluous, so remove it.

 [ bp:
   - Adjust commit message
   - remove a stale comment about needing to pad
   - add a comment in optimize_nops()
   - exit early if the NOP verif. loop catches a mismatch - function
     should not add NOPs in that case
   - fix the "optimized NOPs" offsets output ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210326151259.442992235@infradead.org
2021-04-02 12:41:17 +02:00
Ingo Molnar
b1f480bc06 Merge branch 'x86/cpu' into WIP.x86/core, to merge the NOP changes & resolve a semantic conflict
Conflict-merge this main commit in essence:

  a89dfde3dc: ("x86: Remove dynamic NOP selection")

With this upstream commit:

  b908297047: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG")

Semantic merge conflict:

  arch/x86/net/bpf_jit_comp.c

  - memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
  + memcpy(prog, x86_nops[5], X86_PATCH_SIZE);

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02 12:36:30 +02:00
Ingo Molnar
e855e80d00 Linux 5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBhB7AeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCPUH+KKkSoOlN2YNu1oc
 iy2nznwZoSQTk5ZLz7PypO/WWmmtgzudkObG7yqIURdrncsAkHR17Wu2P7rdBr1j
 Ma+VhF9MQ+xx+r86upH7c3gYfhyfdUMvzuLy0rwLQ1Yrzrb7xFcVkj3BHk54TAQA
 w05sRPuVJ3/c/HPYV2iXkkdnnMbXSTCebeDDwjFb9D3qagr4vcd/PjDHmGbfNF8R
 o6gLpbK5Ly6ww1nth9gGGUjzrW95yVItvcroP6vQWljxhuy+NE1lXRm8LsGhxqtW
 foFFptJup5nhSNJXWtQt/U3huVD6mZ3W3y9cOThPjXZRy2wva3I1IpBKoEFReUpG
 /Tq8EA==
 =tPUY
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc5' into WIP.x86/core, to pick up recent NOP related changes

In particular we want to have this upstream commit:

  b908297047: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG")

... before merging in x86/cpu changes and the removal of the NOP optimizations, and
applying PeterZ's !retpoline objtool series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02 12:33:16 +02:00
Vitaly Kuznetsov
8cdddd182b ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
Commit 496121c021 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g r5b.xlarge but other
instance types are affected as well):

 # echo 0 > /sys/devices/system/cpu/cpu0/online
 # echo 1 > /sys/devices/system/cpu/cpu0/online
 <10 seconds delay>
 -bash: echo: write error: Input/output error

In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, an NMI is used to wake up a CPU, and the
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:

	/*
	 * If NMI wants to wake up CPU0, start CPU0.
	 */
	if (wakeup_cpu0())
		start_cpu0();

cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
 - NMI is sent to CPU0
 - wakeup_cpu0_nmi() works as expected
 - we get back to while (1) loop in acpi_idle_play_dead()
 - safe_halt() puts CPU0 to sleep again.

The straightforward/minimal fix is to add special handling for CPU0 on x86
and that's what the patch is doing.
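
A minimal sketch of the special handling described above, mirroring the
hlt_play_dead() pattern inside acpi_idle_play_dead()'s loop (x86 with
CPU hotplug; other entry methods elided):

	while (1) {
		safe_halt();

		/* If NMI wants to wake up CPU0, start CPU0. */
		if (wakeup_cpu0())
			start_cpu0();
	}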

Fixes: 496121c021 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-01 13:37:55 +02:00
Borislav Petkov
f2ac256b9a Merge 'x86/alternatives'
Pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-03-31 18:04:19 +02:00
Peter Zijlstra
52fa82c21f x86: Add insn_decode_kernel()
Add a helper to decode kernel instructions; there's no point in
endlessly repeating those last two arguments.
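
The helper is a thin wrapper (a sketch; the wrapped insn_decode() takes
the buffer size and the decode mode as its last two arguments):

	#define insn_decode_kernel(_insn, _ptr) \
		insn_decode((_insn), (_ptr), MAX_INSN_SIZE, INSN_MODE_KERN)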

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210326151259.379242587@infradead.org
2021-03-31 16:20:22 +02:00
Ingo Molnar
feecb81732 Linux 5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBhB7AeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCPUH+KKkSoOlN2YNu1oc
 iy2nznwZoSQTk5ZLz7PypO/WWmmtgzudkObG7yqIURdrncsAkHR17Wu2P7rdBr1j
 Ma+VhF9MQ+xx+r86upH7c3gYfhyfdUMvzuLy0rwLQ1Yrzrb7xFcVkj3BHk54TAQA
 w05sRPuVJ3/c/HPYV2iXkkdnnMbXSTCebeDDwjFb9D3qagr4vcd/PjDHmGbfNF8R
 o6gLpbK5Ly6ww1nth9gGGUjzrW95yVItvcroP6vQWljxhuy+NE1lXRm8LsGhxqtW
 foFFptJup5nhSNJXWtQt/U3huVD6mZ3W3y9cOThPjXZRy2wva3I1IpBKoEFReUpG
 /Tq8EA==
 =tPUY
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-29 15:56:48 +02:00
Fenghua Yu
ebb1064e7c x86/traps: Handle #DB for bus lock
Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.

#AC for split lock is already implemented. Add code to use the #DB for
bus lock feature to cover additional situations with new options to
mitigate.

split_lock_detect=
		#AC for split lock		#DB for bus lock

off		Do nothing			Do nothing

warn		Kernel OOPs			Warn once per task
		Warn once per task and		and continue to run.
		disable future checking
		When both features are
		supported, warn in #AC

fatal		Kernel OOPs			Send SIGBUS to user.
		Send SIGBUS to user
		When both features are
		supported, fatal in #AC

ratelimit:N	Do nothing			Limit bus lock rate to
						N per second in the
						current non-root user.

Default option is "warn".

Hardware only generates #DB for bus lock detection when CPL > 0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled,
so there is no need to handle #DB for bus locks detected in the kernel.

#DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL MSR
while #AC for split lock is enabled by split lock detection bit 29 in
TEST_CTRL MSR.

Both breakpoint and bus lock in the same instruction can trigger one #DB.
The bus lock is handled before the breakpoint in the #DB handler.

Delivery of #DB for bus lock in userspace clears DR6[11], which is set by
the #DB handler right after reading DR6.
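
A hedged sketch of what the warn/fatal handling in the #DB path can
look like (sld_state follows the existing split lock detection code;
the ratelimit case is omitted):

	static void handle_bus_lock(struct pt_regs *regs)
	{
		switch (sld_state) {
		case sld_off:
			break;
		case sld_warn:
			pr_warn_ratelimited("#DB: %s/%d took a bus_lock trap at address: 0x%lx\n",
					    current->comm, current->pid, regs->ip);
			break;
		case sld_fatal:
			force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
			break;
		}
	}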

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com
2021-03-28 22:52:15 +02:00
Fenghua Yu
f21d4d3b97 x86/cpufeatures: Enumerate #DB for bus lock detection
A bus lock is acquired through either a split locked access to writeback
(WB) memory or any locked access to non-WB memory. This is typically >1000
cycles slower than an atomic operation within a cache line. It also
disrupts performance on other cores.

Some CPUs have the ability to notify the kernel by a #DB trap after a user
instruction acquires a bus lock and is executed. This allows the kernel to
enforce user application throttling or mitigation. Both breakpoint and bus
lock can trigger the #DB trap in the same instruction and the ordering of
handling them is the kernel #DB handler's choice.

The CPU feature flag to be shown in /proc/cpuinfo will be "bus_lock_detect".

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-2-fenghua.yu@intel.com
2021-03-28 22:52:14 +02:00
Lai Jiangshan
1591584e2e x86/process/64: Move cpu_current_top_of_stack out of TSS
cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed
through the cpu_entry_area which is visible with user CR3 when PTI is
enabled and active.

This makes it a coveted fruit for attackers.  An attacker can fetch the
kernel stack top from it and continue next steps of actions based on the
kernel stack.

But it actually does not need to be stored in the TSS.  It is only
accessed after the entry code switched to kernel CR3 and kernel GS_BASE
which means it can be in any regular percpu variable.

The reason why it is in TSS is historical (pre PTI) because TSS is also
used as scratch space in SYSCALL_64 and therefore cache hot.

A syscall also needs the per CPU variable current_task and eventually
__preempt_count, so placing cpu_current_top_of_stack next to them makes it
likely that they end up in the same cache line which should avoid
performance regressions. This is not enforced as the compiler is free to
place these variables, so these entry relevant variables should move into
a data structure to make this enforceable.

The seccomp_benchmark doesn't show any performance loss in the "getpid
native" test result.  Actually, the result changes from 93ns before to 92ns
with this change when KPTI is disabled. The test is very stable and
although the test doesn't show a higher degree of precision it gives enough
confidence that moving cpu_current_top_of_stack does not cause a
regression.
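
After the change the accessor reads a plain percpu variable instead of
TSS.sp1 (a sketch):

	DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);

	static __always_inline unsigned long current_top_of_stack(void)
	{
		return this_cpu_read_stable(cpu_current_top_of_stack);
	}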

[ tglx: Removed unneeded export. Massaged changelog ]

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
2021-03-28 22:40:10 +02:00
Linus Torvalds
6c20f6df61 xen: branch for v5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYF37OgAKCRCAXGG7T9hj
 vp8hAP4h7mvjfkntbFXagrJK9pi2xVC9d/YO5nfa8/K3LcGVnQD/fKcU9ggPN9vI
 GLnhyprGLcCA4aTL6Ogb37o9fDd4Yws=
 =joIg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "This contains a small series with a more elegant fix of a problem
  which was originally fixed in rc2"

* tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
  xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
2021-03-26 11:15:25 -07:00
Sean Christopherson
b8921dccf3 x86/cpufeatures: Add SGX1 and SGX2 sub-features
Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
features, since adding a new leaf for only two bits would be wasteful.
As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
guest, and to do so correctly needs to query hardware and kernel support
for SGX1 and SGX2.

Suppress both SGX1 and SGX2 from /proc/cpuinfo. SGX1 basically means
SGX, and for SGX2 there is no concrete use case of using it in
/proc/cpuinfo.
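
As scattered features, the two bits become table entries rather than a
dedicated leaf (a sketch of the cpuid_bits[] additions, assuming the
scattered.c entry format of feature/register/bit/leaf/sub-leaf):

	static const struct cpuid_bit cpuid_bits[] = {
		/* ... */
		{ X86_FEATURE_SGX1,	CPUID_EAX,  0, 0x00000012, 0 },
		{ X86_FEATURE_SGX2,	CPUID_EAX,  1, 0x00000012, 0 },
		/* ... */
	};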

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d787827dbfca6b3210ac3e432e3ac1202727e786.1616136308.git.kai.huang@intel.com
2021-03-25 17:33:11 +01:00
Masahiro Yamada
7dfe553aff x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
Building kernel/sys_ni.c with W=1 emits tons of -Wmissing-prototypes warnings:

  $ make W=1 kernel/sys_ni.o
    [ snip ]
    CC      kernel/sys_ni.o
     ./arch/x86/include/asm/syscall_wrapper.h:83:14: warning: no previous prototype for '__ia32_sys_io_setup' [-Wmissing-prototypes]
     ...

The problem is in __COND_SYSCALL(): the __SYS_STUB0() and __SYS_STUBx() macros
defined a few lines above already have forward declarations.

Let's do likewise for __COND_SYSCALL() to fix the warnings.
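
The fix adds a forward declaration ahead of the weak definition, roughly:

	#define __COND_SYSCALL(abi, name)					\
		__weak long __##abi##_##name(const struct pt_regs *regs);	\
		__weak long __##abi##_##name(const struct pt_regs *regs)	\
		{								\
			return sys_ni_syscall();				\
		}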

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Mickaël Salaün <mic@linux.microsoft.com>
Link: https://lore.kernel.org/r/20210301131533.64671-2-masahiroy@kernel.org
2021-03-25 16:20:41 +01:00
Roger Pau Monne
af44a387e7 Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
This partially reverts commit 882213990d ("xen: fix p2m size in dom0
for disabled memory hotplug case")

There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.

Note that XEN_UNPOPULATED_ALLOC depends on ZONE_DEVICE which pulls in
MEMORY_HOTPLUG.

Leave the check added to __set_phys_to_machine and the adjusted
comment about EXTRA_MEM_RATIO.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210324122424.58685-3-roger.pau@citrix.com

[boris: fixed formatting issues]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2021-03-24 18:33:36 -05:00
Masami Hiramatsu
6256e668b7 x86/kprobes: Use int3 instead of debug trap for single-step
Use int3 instead of the debug trap exception for single-stepping the
probed instructions. Some instructions which change the ip register
or modify the IF flag are emulated, because they cannot be
single-stepped by int3 or may allow interrupts while single-stepping.

This actually changes the kprobes behavior.

- kprobes can not probe following instructions; int3, iret,
  far jmp/call which get absolute address as immediate,
  indirect far jmp/call, indirect near jmp/call with addressing
  by memory (register-based indirect jmp/call are OK), and
  vmcall/vmlaunch/vmresume/vmxoff.

- If the kprobe post_handler doesn't set before registering,
  it may not be called in some case even if you set it afterwards.
  (IOW, kprobe booster is enabled at registration, user can not
   change it)

But both are rare issue, unsupported instructions will not be
used in the kernel (or rarely used), and post_handlers are
rarely used (I don't see it except for the test code).
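
In usage terms the post_handler point means setting up the whole
struct kprobe before registration; a minimal module sketch (the probe
target and handler bodies are illustrative):

  #include <linux/kprobes.h>
  #include <linux/module.h>

  static int pre(struct kprobe *p, struct pt_regs *regs)
  {
          return 0;       /* continue with single-stepping */
  }

  static void post(struct kprobe *p, struct pt_regs *regs,
                   unsigned long flags)
  {
  }

  static struct kprobe kp = {
          .symbol_name  = "do_sys_openat2",  /* illustrative target */
          .pre_handler  = pre,
          /* Set before register_kprobe(): the boost decision is made
           * at registration time and can not be changed afterwards.
           */
          .post_handler = post,
  };

  static int __init kp_init(void)
  {
          return register_kprobe(&kp);
  }

  static void __exit kp_exit(void)
  {
          unregister_kprobe(&kp);
  }

  module_init(kp_init);
  module_exit(kp_exit);
  MODULE_LICENSE("GPL");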

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/161469874601.49483.11985325887166921076.stgit@devnote2
2021-03-23 16:07:56 +01:00
Ingo Molnar
163b099146 x86: Fix various typos in comments, take #2
Fix another ~42 single-word typos in arch/x86/ code comments,
missed a few in the first pass, in particular in .S files.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-21 23:50:28 +01:00
Ingo Molnar
c681df88dc x86: Remove unusual Unicode characters from comments
We've accumulated a few unusual Unicode characters in arch/x86/
over the years, substitute them with their proper ASCII equivalents.

A few of them were a whitespace equivalent: ' ' - the use was harmless.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
2021-03-21 23:50:07 +01:00
Ingo Molnar
ca8778c45e Merge branch 'linus' into x86/cleanups, to resolve conflict
Conflicts:
	arch/x86/kernel/kprobes/ftrace.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-21 22:16:08 +01:00
Linus Torvalds
5e3ddf96e7 Merge tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "The freshest pile of shiny x86 fixes for 5.12:

   - Add the arch-specific mapping between physical and logical CPUs to
     fix devicetree-node lookups

   - Restore the IRQ2 ignore logic

   - Fix get_nr_restart_syscall() to return the correct restart syscall
     number. Split in a 4-patches set to avoid kABI breakage when
     backporting to dead kernels"

* tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/of: Fix CPU devicetree-node lookups
  x86/ioapic: Ignore IRQ2 again
  x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
  x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
  x86: Move TS_COMPAT back to asm/thread_info.h
  kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
2021-03-21 11:04:20 -07:00
Ingo Molnar
01438749e3 Merge branch 'locking/urgent' into locking/core, to pick up dependent commits
We are applying further, lower-prio fixes on top of two ww_mutex fixes in locking/urgent.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-19 12:10:49 +01:00
Sean Christopherson
b318e8decf KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
Fix a plethora of issues with MSR filtering by installing the resulting
filter as an atomic bundle instead of updating the live filter one range
at a time.  The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
the hardware MSR bitmaps won't be updated until the next VM-Enter, but
the relevant software struct is atomically updated, which is what KVM
really needs.

Similar to the approach used for modifying memslots, make arch.msr_filter
a SRCU-protected pointer, do all the work configuring the new filter
outside of kvm->lock, and then acquire kvm->lock only when the new filter
has been vetted and created.  That way vCPU readers either see the old
filter or the new filter in their entirety, not some half-baked state.

Yuan Yao pointed out a use-after-free in kvm_msr_allowed() due to a
TOCTOU bug, but that's just the tip of the iceberg...

  - Nothing is __rcu annotated, making it nigh impossible to audit the
    code for correctness.
  - kvm_add_msr_filter() has an unpaired smp_wmb().  Violation of kernel
    coding style aside, the lack of an smp_rmb() anywhere casts all code
    into doubt.
  - kvm_clear_msr_filter() has a double-free TOCTOU bug, as it grabs
    count before taking the lock.
  - kvm_clear_msr_filter() also has a memory leak due to the same
    TOCTOU bug.

The entire approach of updating the live filter is also flawed.  While
installing a new filter is inherently racy if vCPUs are running, fixing
the above issues also makes it trivial to ensure certain behavior is
deterministic, e.g. KVM can provide deterministic behavior for MSRs with
identical settings in the old and new filters.  An atomic update of the
filter also prevents KVM from getting into a half-baked state, e.g. if
installing a filter fails, the existing approach would leave the filter
in a half-baked state, having already committed whatever bits of the
filter were already processed.
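
The resulting update sequence can be sketched as follows (struct and
helper names are illustrative; the locking pattern is the point): build
and vet the new filter outside kvm->lock, publish it under the lock,
then reclaim the old copy after an SRCU grace period:

  static int install_msr_filter(struct kvm *kvm,
                                struct kvm_x86_msr_filter *new)
  {
          struct kvm_x86_msr_filter *old;

          mutex_lock(&kvm->lock);
          old = rcu_replace_pointer(kvm->arch.msr_filter, new,
                                    mutex_is_locked(&kvm->lock));
          mutex_unlock(&kvm->lock);

          /* vCPU readers hold kvm->srcu; wait them out before freeing. */
          synchronize_srcu(&kvm->srcu);
          kfree(old);
          return 0;
  }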

[*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com

Fixes: 1a155254ff ("KVM: x86: Introduce MSR filtering")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210316184436.2544875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 13:55:14 -04:00
Ingo Molnar
d9f6e12fb0 x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments.

Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-18 15:31:53 +01:00
Ingo Molnar
14ff3ed86e Merge tag 'v5.12-rc3' into x86/cleanups, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-18 15:27:03 +01:00
Vitaly Kuznetsov
cc9cfddb04 KVM: x86: hyper-v: Track Hyper-V TSC page status
Create an infrastructure for tracking Hyper-V TSC page status, i.e. if it
was updated from guest/host side or if we've failed to set it up (because
e.g. guest wrote some garbage to HV_X64_MSR_REFERENCE_TSC) and there's no
need to retry.

Also, in a hypothetical situation when we are in 'always catchup' mode for
TSC we can now avoid contending 'hv->hv_lock' on every guest enter by
setting the state to HV_TSC_PAGE_BROKEN after compute_tsc_page_parameters()
returns false.

Check for HV_TSC_PAGE_SET state instead of '!hv->tsc_ref.tsc_sequence' in
get_time_ref_counter() to properly handle the situation when we failed to
write the updated TSC page values to the guest.
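
Such tracking boils down to a small state machine; a sketch of what the
states could look like (only HV_TSC_PAGE_SET and HV_TSC_PAGE_BROKEN are
named above, the remaining states are illustrative):

  enum hv_tsc_page_status {
          HV_TSC_PAGE_UNSET = 0,      /* not set up yet */
          HV_TSC_PAGE_GUEST_CHANGED,  /* guest wrote HV_X64_MSR_REFERENCE_TSC */
          HV_TSC_PAGE_HOST_CHANGED,   /* host clocksource parameters changed */
          HV_TSC_PAGE_SET,            /* up to date and usable */
          HV_TSC_PAGE_BROKEN,         /* setup failed, do not retry */
  };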

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 08:02:46 -04:00
Oleg Nesterov
b2e9df850c x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
Save the current_thread_info()->status of X86 in the new
restart_block->arch_data field so TS_COMPAT_RESTART can be removed again.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210201174716.GA17898@redhat.com
2021-03-16 22:13:11 +01:00
Oleg Nesterov
8c150ba2fb x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
The comment in get_nr_restart_syscall() says:

	 * The problem is that we can get here when ptrace pokes
	 * syscall-like values into regs even if we're not in a syscall
	 * at all.

Yes, but if not in a syscall then the

	status & (TS_COMPAT|TS_I386_REGS_POKED)

check below can't really help:

	- TS_COMPAT can't be set

	- TS_I386_REGS_POKED is only set if regs->orig_ax was changed by
	  32bit debugger; and even in this case get_nr_restart_syscall()
	  is only correct if the tracee is 32bit too.

Suppose that a 64bit debugger plays with a 32bit tracee and

	* Tracee calls sleep(2)	// TS_COMPAT is set
	* User interrupts the tracee by CTRL-C after 1 sec and does
	  "(gdb) call func()"
	* gdb saves the regs by PTRACE_GETREGS
	* does PTRACE_SETREGS to set %rip='func' and %orig_rax=-1
	* PTRACE_CONT		// TS_COMPAT is cleared
	* func() hits int3.
	* Debugger catches SIGTRAP.
	* Restore original regs by PTRACE_SETREGS.
	* PTRACE_CONT

get_nr_restart_syscall() wrongly returns __NR_restart_syscall==219, so
the tracee calls ia32_sys_call_table[219] == sys_madvise.

Add the sticky TS_COMPAT_RESTART flag which survives the return to user
mode. It is going to be removed again in the next step by storing the
information in the restart block. As a further cleanup it might be
possible to also remove TS_I386_REGS_POKED with that.

Test-case:

  $ cvs -d :pserver:anoncvs:anoncvs@sourceware.org:/cvs/systemtap co ptrace-tests
  $ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c -m32
  $ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil
  $ ./erestartsys-trap-debugger
  Unexpected: retval 1, errno 22
  erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421

Fixes: 609c19a385 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174709.GA17895@redhat.com
2021-03-16 22:13:11 +01:00
Oleg Nesterov
66c1b6d74c x86: Move TS_COMPAT back to asm/thread_info.h
Move TS_COMPAT back to asm/thread_info.h, close to TS_I386_REGS_POKED.

It was moved to asm/processor.h by b9d989c721 ("x86/asm: Move the
thread_info::status field to thread_struct"), then later 37a8f7c383
("x86/asm: Move 'status' from thread_struct to thread_info") moved the
'status' field back but TS_COMPAT was forgotten.

This is a preparatory patch to fix the COMPAT case for
get_nr_restart_syscall().

Fixes: 609c19a385 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174649.GA17880@redhat.com
2021-03-16 22:13:11 +01:00
Peter Zijlstra
a89dfde3dc x86: Remove dynamic NOP selection
This ensures that a NOP is a NOP and not a random other instruction that
is also a NOP. It allows simplification of dynamic code patching that
wants to verify existing code before writing new instructions (ftrace,
jump_label, static_call, etc..).

Differentiating on NOPs is not a feature.

This pessimises 32bit (DONTCARE) and 32bit on 64bit CPUs (CARELESS).
32bit is not a performance target.

Everything x86_64 since AMD K10 (2007) and Intel IvyBridge (2012) is
fine with using NOPL (as opposed to prefix NOP). And per FEATURE_NOPL
being required for x86_64, all x86_64 CPUs can use NOPL. So stop
caring about NOPs, simplify things and get on with life.

[ The problem seems to be that some uarchs can only decode NOPL on a
single front-end port while others have severe decode penalties for
excessive prefixes. All modern uarchs can handle both, except Atom,
which has prefix penalties. ]

[ Also, I doubt you can actually measure any of this on normal
workloads. ]

After this, FEATURE_NOPL is unused except for required-features for
x86_64. FEATURE_K8 is only used for PTI.

 [ bp: Kernel build measurements showed ~0.3s slowdown on Sandybridge
   which is hardly a slowdown. Get rid of X86_FEATURE_K7, while at it. ]
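
The remaining NOP table then consists of the Intel-recommended NOPL
encodings; a sketch of the first few entries (see
arch/x86/include/asm/nops.h for the authoritative table):

  static const unsigned char x86_nops[][6] = {
          { 0x90 },                               /* nop */
          { 0x66, 0x90 },                         /* osp nop */
          { 0x0f, 0x1f, 0x00 },                   /* nopl (%rax) */
          { 0x0f, 0x1f, 0x40, 0x00 },             /* nopl 0x0(%rax) */
          { 0x0f, 0x1f, 0x44, 0x00, 0x00 },       /* nopl 0x0(%rax,%rax,1) */
          { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 }, /* osp nopl 0x0(%rax,%rax,1) */
  };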

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> # bpf
Acked-by: Linus Torvalds <torvalds@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20210312115749.065275711@infradead.org
2021-03-15 16:24:59 +01:00
Borislav Petkov
f935178b5c x86/insn: Make insn_complete() static
... and move it above the only place it is used.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210304174237.31945-22-bp@alien8.de
2021-03-15 13:03:46 +01:00
Borislav Petkov
404b639e51 x86/insn: Remove kernel_insn_init()
Now that it is not needed anymore, drop it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210304174237.31945-21-bp@alien8.de
2021-03-15 12:58:36 +01:00
Borislav Petkov
93281c4a96 x86/insn: Add an insn_decode() API
Users of the instruction decoder should use this to decode instruction
bytes. For that, have insn*() helpers return an int value to denote
success/failure. When there's an error fetching the next insn byte and
the insn falls short, return -ENODATA to denote that.

While at it, make insn_get_opcode() stricter as to whether what it has
seen so far is a valid insn, and bail out if not.

Copy linux/kconfig.h for the tools-version of the decoder so that it can
use IS_ENABLED().

Also, cast the INSN_MODE_KERN dummy define value to (enum insn_mode)
for tools use of the decoder because perf tool builds with -Werror and
errors out with -Werror=sign-compare otherwise.
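
Intended usage is then a single call plus a return-value check; a
sketch (the wrapper and buffer names are illustrative):

  static int decode_kernel_insn(const void *kaddr)
  {
          struct insn insn;
          int ret;

          ret = insn_decode(&insn, kaddr, MAX_INSN_SIZE, INSN_MODE_64);
          if (ret < 0)
                  return ret;   /* e.g. -ENODATA: the insn fell short */

          return insn.length;   /* decode succeeded */
  }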

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20210304174237.31945-5-bp@alien8.de
2021-03-15 11:05:47 +01:00
Borislav Petkov
d30c7b820b x86/insn: Add a __ignore_sync_check__ marker
Add an explicit __ignore_sync_check__ marker which will be used to mark
lines which are supposed to be ignored by the file synchronization
check scripts. Its advantage is that it explicitly denotes such lines
in the code.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20210304174237.31945-4-bp@alien8.de
2021-03-15 11:00:57 +01:00
Borislav Petkov
9e761296c5 x86/insn: Rename insn_decode() to insn_decode_from_regs()
Rename insn_decode() to insn_decode_from_regs() to denote that it
receives regs as param and uses registers from there during decoding.
Free the former name for a more generic version of the function.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210304174237.31945-2-bp@alien8.de
2021-03-15 10:53:10 +01:00
Borislav Petkov
aa7680f6fe Merge tag 'v5.12-rc3' into x86/core

Pick up dependent SEV-ES urgent changes to base new work on top.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-03-15 10:49:00 +01:00
Sean Christopherson
e83bc09caf KVM: x86: Get active PCID only when writing a CR3 value
Retrieve the active PCID only when writing a guest CR3 value, i.e. don't
get the PCID when using EPT or NPT.  The PCID is especially problematic
for EPT as the bits have a different meaning, and so the PCID must be
manually stripped, which is annoying and unnecessary.  And on VMX,
getting the active PCID also involves reading the guest's CR3 and
CR4.PCIDE, i.e. may add pointless VMREADs.

Opportunistically rename the pgd/pgd_level params to root_hpa and
root_level to better reflect their new roles.  Keep the function names,
as "load the guest PGD" is still accurate/correct.

Last, and probably least, pass root_hpa as a hpa_t/u64 instead of an
unsigned long.  The EPTP holds a 64-bit value, even in 32-bit mode, so
in theory EPT could support HIGHMEM for 32-bit KVM.  Never mind that
doing so would require changing the MMU page allocators and reworking
the MMU to use kmap().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:56 -04:00
Sean Christopherson
e7b7bdea77 KVM: x86/mmu: Move logic for setting SPTE masks for EPT into the MMU proper
Let the MMU deal with the SPTE masks to avoid splitting the logic and
knowledge across the MMU and VMX.

The SPTE masks that are used for EPT are very, very tightly coupled to
the MMU implementation.  The use of available bits, the existence of A/D
types, the fact that shadow_x_mask even exists, and so on and so forth
are all baked into the MMU implementation.  Cross referencing the params
to the masks is also a nightmare, as pretty much every param is a u64.

A future patch will make the location of the MMU_WRITABLE and
HOST_WRITABLE bits MMU specific, to free up bit 11 for a MMU_PRESENT bit.
Doing that change with the current kvm_mmu_set_mask_ptes() would be an
absolute mess.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:48 -04:00
Babu Moger
d00b99c514 KVM: SVM: Add support for Virtual SPEC_CTRL
Newer AMD processors have a feature to virtualize the use of the
SPEC_CTRL MSR. Presence of this feature is indicated via CPUID
function 0x8000000A_EDX[20]: GuestSpecCtrl. Hypervisors are not
required to enable this feature since it is automatically enabled on
processors that support it.

A hypervisor may wish to impose speculation controls on guest
execution or a guest may want to impose its own speculation controls.
Therefore, the processor implements both host and guest
versions of SPEC_CTRL.

When in host mode, the host SPEC_CTRL value is in effect and writes
update only the host version of SPEC_CTRL. On a VMRUN, the processor
loads the guest version of SPEC_CTRL from the VMCB. When the guest
writes SPEC_CTRL, only the guest version is updated. On a VMEXIT,
the guest version is saved into the VMCB and the processor returns
to only using the host SPEC_CTRL for speculation control. The guest
SPEC_CTRL is located at offset 0x2E0 in the VMCB.

The effective SPEC_CTRL setting is the guest SPEC_CTRL setting or'ed
with the hypervisor SPEC_CTRL setting. This allows the hypervisor to
ensure a minimum SPEC_CTRL if desired.

This support also fixes an issue where a guest may sometimes see an
inconsistent value for the SPEC_CTRL MSR on processors that support
this feature. With the current SPEC_CTRL support, the first write to
SPEC_CTRL is intercepted and the virtualized version of the SPEC_CTRL
MSR is not updated. When the guest reads back the SPEC_CTRL MSR, it
will be 0x0, instead of the actual expected value. There isn't a
security concern here, because the host SPEC_CTRL value is or'ed with
the guest SPEC_CTRL value to generate the effective SPEC_CTRL value.
KVM writes the guest's virtualized SPEC_CTRL value to the SPEC_CTRL
MSR just before the VMRUN, so it will always have the actual value
even though it doesn't appear that way in the guest. The guest will
only see the proper value for the SPEC_CTRL register if it were to
write to the SPEC_CTRL register again. With Virtual SPEC_CTRL support,
the save area spec_ctrl is properly saved and restored, so the guest
will always see the proper value when it is read back.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <161188100955.28787.11816849358413330720.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:25 -04:00
Babu Moger
f333374e10 x86/cpufeatures: Add the Virtual SPEC_CTRL feature
Newer AMD processors have a feature to virtualize the use of the
SPEC_CTRL MSR. Presence of this feature is indicated via CPUID
function 0x8000000A_EDX[20]: GuestSpecCtrl. When present, the
SPEC_CTRL MSR is automatically virtualized.
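
The presence check can be sketched from user space with GCC's cpuid.h
helpers (bit position as stated above; the wrapper name is
illustrative):

  #include <cpuid.h>

  static int has_virt_spec_ctrl(void)
  {
          unsigned int eax, ebx, ecx, edx;

          /* CPUID Fn8000_000A: SVM feature identification. */
          if (!__get_cpuid(0x8000000a, &eax, &ebx, &ecx, &edx))
                  return 0;

          return (edx >> 20) & 1;   /* EDX[20]: GuestSpecCtrl */
  }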

Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Message-Id: <161188100272.28787.4097272856384825024.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:25 -04:00
Sean Christopherson
c483c45471 KVM: x86: Move RDPMC emulation to common code
Move the entirety of the accelerated RDPMC emulation to x86.c, and assign
the common handler directly to the exit handler array for VMX.  SVM has
bizarre nrips behavior that prevents it from directly invoking the common
handler.  The nrips goofiness will be addressed in a future patch.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:20 -04:00
Sean Christopherson
5ff3a351f6 KVM: x86: Move trivial instruction-based exit handlers to common code
Move the trivial exit handlers, e.g. for instructions that KVM
"emulates" as nops, to common x86 code.  Assign the common handlers
directly to the exit handler arrays and drop the vendor trampolines.

Opportunistically use pr_warn_once() where appropriate.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:19 -04:00
Sean Christopherson
92f9895c14 KVM: x86: Move XSETBV emulation to common code
Move the entirety of XSETBV emulation to x86.c, and assign the
function directly to both VMX's and SVM's exit handlers, i.e. drop the
unnecessary trampolines.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:18 -04:00
Sean Christopherson
cb6a32c2b8 KVM: x86: Handle triple fault in L2 without killing L1
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1.  Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_intercepted by L1.  E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.

Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:15 -04:00
Sean Christopherson
61a1773e2e KVM: x86/mmu: Unexport MMU load/unload functions
Unexport the MMU load and unload helpers now that they are no longer
used (incorrectly) in vendor code.

Opportunistically move the kvm_mmu_sync_roots() declaration into mmu.h;
it should not be exposed to vendor code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:23 -04:00
Dongli Zhang
43c11d91fb KVM: x86: to track if L1 is running L2 VM
The new per-cpu stat 'nested_run' is introduced in order to track if
an L1 VM is running, or has been used to run, an L2 VM.

An example usage of 'nested_run' is to help the host administrator
easily track whether any L1 VM has been used to run an L2 VM. If an
issue that may be related to nested virtualization shows up, the
administrator will be able to easily narrow down and confirm whether
it is due to nested virtualization via 'nested_run'. For example,
whether a fix like commit 88dddc11a8 ("KVM: nVMX: do not use dangling
shadow VMCS after guest reset") is required.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210305225747.7682-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:28:02 -04:00
Linus Torvalds
19469d2ada Merge tag 'objtool-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fix from Thomas Gleixner:
 "A single objtool fix to handle the PUSHF/POPF validation correctly for
  the paravirt changes which modified arch_local_irq_restore not to use
  popf"

* tag 'objtool-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool,x86: Fix uaccess PUSHF/POPF validation
2021-03-14 13:15:55 -07:00
Linus Torvalds
0a7c10df49 Merge tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - A couple of SEV-ES fixes and robustifications: verify usermode stack
   pointer in NMI is not coming from the syscall gap, correctly track
   IRQ states in the #VC handler and access user insn bytes atomically
   in same handler as latter cannot sleep.

 - Balance 32-bit fast syscall exit path to do the proper work on exit
   and thus not confuse audit and ptrace frameworks.

 - Two fixes for the ORC unwinder going "off the rails" into KASAN
   redzones and when ORC data is missing.

* tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Use __copy_from_user_inatomic()
  x86/sev-es: Correctly track IRQ states in runtime #VC handler
  x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stack
  x86/sev-es: Introduce ip_within_syscall_gap() helper
  x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  x86/unwind/orc: Silence warnings caused by missing ORC data
  x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
2021-03-14 12:48:10 -07:00
Linus Torvalds
9d0c8e793f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "More fixes for ARM and x86"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: LAPIC: Advancing the timer expiration on guest initiated write
  KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode
  KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged
  kvm: x86: annotate RCU pointers
  KVM: arm64: Fix exclusive limit for IPA size
  KVM: arm64: Reject VM creation when the default IPA size is unsupported
  KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
  KVM: arm64: Don't use cbz/adr with external symbols
  KVM: arm64: Fix range alignment when walking page tables
  KVM: arm64: Workaround firmware wrongly advertising GICv2-on-v3 compatibility
  KVM: arm64: Rename __vgic_v3_get_ich_vtr_el2() to __vgic_v3_get_gic_config()
  KVM: arm64: Don't access PMSELR_EL0/PMUSERENR_EL0 when no PMU is available
  KVM: arm64: Turn kvm_arm_support_pmu_v3() into a static key
  KVM: arm64: Fix nVHE hyp panic host context restore
  KVM: arm64: Avoid corrupting vCPU context register in guest exit
  KVM: arm64: nvhe: Save the SPE context early
  kvm: x86: use NULL instead of using plain integer as pointer
  KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled'
  KVM: x86: Ensure deadline timer has truly expired before posting its IRQ
2021-03-14 12:35:02 -07:00
Muhammad Usama Anjum
6fcd9cbc6a kvm: x86: annotate RCU pointers
Add the missing __rcu annotations to fix the following sparse errors:
arch/x86/kvm//x86.c:8147:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//x86.c:8147:15:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//x86.c:8147:15:    struct kvm_apic_map *
arch/x86/kvm//x86.c:10628:16: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//x86.c:10628:16:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//x86.c:10628:16:    struct kvm_apic_map *
arch/x86/kvm//x86.c:10629:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//x86.c:10629:15:    struct kvm_pmu_event_filter [noderef] __rcu *
arch/x86/kvm//x86.c:10629:15:    struct kvm_pmu_event_filter *
arch/x86/kvm//lapic.c:267:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//lapic.c:267:15:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//lapic.c:267:15:    struct kvm_apic_map *
arch/x86/kvm//lapic.c:269:9: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//lapic.c:269:9:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//lapic.c:269:9:    struct kvm_apic_map *
arch/x86/kvm//lapic.c:637:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//lapic.c:637:15:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//lapic.c:637:15:    struct kvm_apic_map *
arch/x86/kvm//lapic.c:994:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//lapic.c:994:15:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//lapic.c:994:15:    struct kvm_apic_map *
arch/x86/kvm//lapic.c:1036:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//lapic.c:1036:15:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//lapic.c:1036:15:    struct kvm_apic_map *
arch/x86/kvm//lapic.c:1173:15: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//lapic.c:1173:15:    struct kvm_apic_map [noderef] __rcu *
arch/x86/kvm//lapic.c:1173:15:    struct kvm_apic_map *
arch/x86/kvm//pmu.c:190:18: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//pmu.c:190:18:    struct kvm_pmu_event_filter [noderef] __rcu *
arch/x86/kvm//pmu.c:190:18:    struct kvm_pmu_event_filter *
arch/x86/kvm//pmu.c:251:18: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//pmu.c:251:18:    struct kvm_pmu_event_filter [noderef] __rcu *
arch/x86/kvm//pmu.c:251:18:    struct kvm_pmu_event_filter *
arch/x86/kvm//pmu.c:522:18: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//pmu.c:522:18:    struct kvm_pmu_event_filter [noderef] __rcu *
arch/x86/kvm//pmu.c:522:18:    struct kvm_pmu_event_filter *
arch/x86/kvm//pmu.c:522:18: error: incompatible types in comparison expression (different address spaces):
arch/x86/kvm//pmu.c:522:18:    struct kvm_pmu_event_filter [noderef] __rcu *
arch/x86/kvm//pmu.c:522:18:    struct kvm_pmu_event_filter *
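
The fix pattern is the standard one; a sketch using the field names
from the errors above (variable names illustrative):

  /* Declaration side: mark the pointer as RCU-protected. */
  struct kvm_apic_map __rcu *apic_map;

  /* Access side: go through the RCU accessors so the address spaces
   * match for sparse.
   */
  map = rcu_dereference(kvm->arch.apic_map);     /* reader, under RCU */
  rcu_assign_pointer(kvm->arch.apic_map, new);   /* updater */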

Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Message-Id: <20210305191123.GA497469@LEGION>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-12 13:17:41 -05:00
Peter Zijlstra
ba08abca66 objtool,x86: Fix uaccess PUSHF/POPF validation
Commit ab234a260b ("x86/pv: Rework arch_local_irq_restore() to not
use popf") replaced "push %reg; popf" with something like: "test
$0x200, %reg; jz 1f; sti; 1:", which breaks the pushf/popf symmetry
that commit ea24213d80 ("objtool: Add UACCESS validation") relies
on.

The result is:

  drivers/gpu/drm/amd/amdgpu/si.o: warning: objtool: si_common_hw_init()+0xf36: PUSHF stack exhausted

Meanwhile, commit c9c324dc22 ("objtool: Support stack layout changes
in alternatives") makes that we can actually use stack-ops in
alternatives, which means we can revert 1ff865e343 ("x86,smap: Fix
smap_{save,restore}() alternatives").

That in turn means we can limit the PUSHF/POPF handling of
ea24213d80 to those instructions that are in alternatives.
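
For reference, the popf-free restore that broke the symmetry boils
down to something like this sketch (the real code uses the
arch_irqs_disabled_flags()/arch_local_irq_enable() helpers):

  static __always_inline void arch_local_irq_restore(unsigned long flags)
  {
          if (flags & X86_EFLAGS_IF)   /* 0x200: interrupts were enabled */
                  asm volatile("sti" ::: "memory");
  }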

Fixes: ab234a260b ("x86/pv: Rework arch_local_irq_restore() to not use popf")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/YEY4rIbQYa5fnnEp@hirez.programming.kicks-ass.net
2021-03-12 09:15:49 +01:00
Juergen Gross
054ac8ad5e x86/paravirt: Have only one paravirt patch function
There is no need any longer to have different paravirt patch functions
for native and Xen. Eliminate native_patch() and rename
paravirt_patch_default() to paravirt_patch().

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210311142319.4723-15-jgross@suse.com
2021-03-11 20:11:09 +01:00
Juergen Gross
fafe5e7422 x86/paravirt: Switch functions with custom code to ALTERNATIVE
Instead of using paravirt patching for custom code sequences use
ALTERNATIVE for the functions with custom code replacements.

Instead of patching an ud2 instruction for unpopulated vector entries
into the caller site, use a simple function just calling BUG() as a
replacement.

Simplify the register defines for assembler paravirt calling, as there
isn't much usage left.
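
As a sketch of the mechanism (function and feature names illustrative):
ALTERNATIVE patches the replacement sequence over the default at boot
when the feature bit is set, removing the pvops indirection from the
hot path:

  static inline void example_op(void)
  {
          asm_inline volatile(ALTERNATIVE("call native_example",
                                          "call xen_example",
                                          X86_FEATURE_XENPV)
                              ::: "memory");
  }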

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210311142319.4723-14-jgross@suse.com
2021-03-11 20:07:01 +01:00
Juergen Gross
00aa3193ab x86/paravirt: Add new PVOP_ALT* macros to support pvops in ALTERNATIVEs
Instead of using paravirt patching for custom code sequences add
support for using ALTERNATIVE handling combined with paravirt call
patching.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210311142319.4723-13-jgross@suse.com
2021-03-11 20:05:44 +01:00
Juergen Gross
ae755b5a45 x86/paravirt: Switch iret pvops to ALTERNATIVE
The iret paravirt op is rather special as it is using a jmp instead
of a call instruction. Switch it to ALTERNATIVE.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210311142319.4723-12-jgross@suse.com
2021-03-11 19:58:54 +01:00