Commit Graph

17015 Commits

Author SHA1 Message Date
Thomas Gleixner
72f40a2823 x86/softirq/64: Inline do_softirq_own_stack()
There is no reason to have this as a seperate function for a single caller.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002513.382806685@linutronix.de
2021-02-10 23:34:17 +01:00
Thomas Gleixner
db1cc7aede softirq: Move do_softirq_own_stack() to generic asm header
To avoid include recursion hell move the do_softirq_own_stack() related
content into a generic asm header and include it from all places in arch/
which need the prototype.

This allows architectures to provide an inline implementation of
do_softirq_own_stack() without introducing a lot of #ifdeffery all over the
place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002513.289960691@linutronix.de
2021-02-10 23:34:16 +01:00
Thomas Gleixner
624db9eabc x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
Now that all invocations of irq_exit_rcu() happen on the irq stack, turn on
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK which causes the core code to invoke
__do_softirq() directly without going through do_softirq_own_stack().

That means do_softirq_own_stack() is only invoked from task context which
means it can't be on the irq stack. Remove the conditional from
run_softirq_on_irqstack_cond() and rename the function accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002513.068033456@linutronix.de
2021-02-10 23:34:16 +01:00
Thomas Gleixner
52d743f3b7 x86/softirq: Remove indirection in do_softirq_own_stack()
Use the new inline stack switching and remove the old ASM indirect call
implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.972714001@linutronix.de
2021-02-10 23:34:15 +01:00
Thomas Gleixner
5b51e1db9b x86/entry: Convert device interrupts to inline stack switching
Convert device interrupts to inline stack switching by replacing the
existing macro implementation with the new inline version. Tweak the
function signature of the actual handler function to have the vector
argument as u32. That allows the inline macro to avoid extra intermediates
and lets the compiler be smarter about the whole thing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.769728139@linutronix.de
2021-02-10 23:34:15 +01:00
Thomas Gleixner
3c5e0267ec x86/apic: Split out spurious handling code
sysvec_spurious_apic_interrupt() calls into the handling body of
__spurious_interrupt() which is not obvious as that function is declared
inside the DEFINE_IDTENTRY_IRQ(spurious_interrupt) macro.

As __spurious_interrupt() is currently always inlined this ends up with two
copies of the same code for no reason.

Split the handling function out and invoke it from both entry points.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.469379641@linutronix.de
2021-02-10 23:34:14 +01:00
Thomas Gleixner
951c2a51ae x86/irq/64: Adjust the per CPU irq stack pointer by 8
The per CPU hardirq_stack_ptr contains the pointer to the irq stack in the
form that it is ready to be assigned to [ER]SP so that the first push ends
up on the top entry of the stack.

But the stack switching on 64 bit has the following rules:

    1) Store the current stack pointer (RSP) in the top most stack entry
       to allow the unwinder to link back to the previous stack

    2) Set RSP to the top most stack entry

    3) Invoke functions on the irq stack

    4) Pop RSP from the top most stack entry (stored in #1) so it's back
       to the original stack.

That requires all stack switching code to decrement the stored pointer by 8
in order to be able to store the current RSP and then set RSP to that
location. That's a pointless exercise.

Do the -8 adjustment right when storing the pointer and make the data type
a void pointer to avoid confusion vs. the struct irq_stack data type which
is on 64bit only used to declare the backing store. Move the definition
next to the inuse flag so they likely end up in the same cache
line. Sticking them into a struct to enforce it is a seperate change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.354260928@linutronix.de
2021-02-10 23:34:14 +01:00
Thomas Gleixner
e7f8900179 x86/irq: Sanitize irq stack tracking
The recursion protection for hard interrupt stacks is an unsigned int per
CPU variable initialized to -1 named __irq_count. 

The irq stack switching is only done when the variable is -1, which creates
worse code than just checking for 0. When the stack switching happens it
uses this_cpu_add/sub(1), but there is no reason to do so. It simply can
use straight writes. This is a historical leftover from the low level ASM
code which used inc and jz to make a decision.

Rename it to hardirq_stack_inuse, make it a bool and use plain stores.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de
2021-02-10 23:34:13 +01:00
Peter Zijlstra
87ccc826bf x86/unwind/orc: Change REG_SP_INDIRECT
Currently REG_SP_INDIRECT is unused but means (%rsp + offset),
change it to mean (%rsp) + offset.

The reason is that we're going to swizzle stack in the middle of a C
function with non-trivial stack footprint. This means that when the
unwinder finds the ToS, it needs to dereference it (%rsp) and then add
the offset to the next frame, resulting in: (%rsp) + offset

This is somewhat unfortunate, since REG_BP_INDIRECT is used (by DRAP)
and thus needs to retain the current (%rbp + offset).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
2021-02-10 20:53:51 +01:00
Juergen Gross
ab234a260b x86/pv: Rework arch_local_irq_restore() to not use popf
POPF is a rather expensive operation, so don't use it for restoring
irq flags. Instead, test whether interrupts are enabled in the flags
parameter and enable interrupts via STI in that case.

This results in the restore_fl paravirt op to be no longer needed.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210120135555.32594-7-jgross@suse.com
2021-02-10 12:36:45 +01:00
Juergen Gross
afd30525a6 x86/xen: Drop USERGS_SYSRET64 paravirt call
USERGS_SYSRET64 is used to return from a syscall via SYSRET, but
a Xen PV guest will nevertheless use the IRET hypercall, as there
is no sysret PV hypercall defined.

So instead of testing all the prerequisites for doing a sysret and
then mangling the stack for Xen PV again for doing an iret just use
the iret exit from the beginning.

This can easily be done via an ALTERNATIVE like it is done for the
sysenter compat case already.

It should be noted that this drops the optimization in Xen for not
restoring a few registers when returning to user mode, but it seems
as if the saved instructions in the kernel more than compensate for
this drop (a kernel build in a Xen PV guest was slightly faster with
this patch applied).

While at it remove the stale sysret32 remnants.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210120135555.32594-6-jgross@suse.com
2021-02-10 12:32:07 +01:00
Juergen Gross
53c9d92409 x86/pv: Switch SWAPGS to ALTERNATIVE
SWAPGS is used only for interrupts coming from user mode or for
returning to user mode. So there is no reason to use the PARAVIRT
framework, as it can easily be replaced by an ALTERNATIVE depending
on X86_FEATURE_XENPV.

There are several instances using the PV-aware SWAPGS macro in paths
which are never executed in a Xen PV guest. Replace those with the
plain swapgs instruction. For SWAPGS_UNSAFE_STACK the same applies.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210120135555.32594-5-jgross@suse.com
2021-02-10 12:25:49 +01:00
Andy Shevchenko
1b79fc4f2b x86/apb_timer: Remove driver for deprecated platform
Intel Moorestown and Medfield are quite old Intel Atom based
32-bit platforms, which were in limited use in some Android phones,
tablets and consumer electronics more than eight years ago.

There are no bugs or problems ever reported outside from Intel
for breaking any of that platforms for years. It seems no real
users exists who run more or less fresh kernel on it. Commit
05f4434bc1 ("ASoC: Intel: remove mfld_machine") is also in align
with this theory.

Due to above and to reduce a burden of supporting outdated drivers,
remove the support for outdated platforms completely.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-09 15:28:37 +01:00
Yin Fengwei
ebbfc978f3 x86/acrn: Introduce acrn_cpuid_base() and hypervisor feature bits
ACRN Hypervisor reports hypervisor features via CPUID leaf 0x40000001
which is similar to KVM. A VM can check if it's the privileged VM using
the feature bits. The Service VM is the only privileged VM by design.

Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengwei Yin <fengwei.yin@intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Yu Wang <yu1.wang@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Shuo Liu <shuo.a.liu@intel.com>
Link: https://lore.kernel.org/r/20210207031040.49576-4-shuo.a.liu@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-09 10:58:18 +01:00
Shuo Liu
7995700e65 x86/acrn: Introduce acrn_{setup, remove}_intr_handler()
The ACRN Hypervisor builds an I/O request when a trapped I/O access
happens in User VM. Then, ACRN Hypervisor issues an upcall by sending
a notification interrupt to the Service VM. HSM in the Service VM needs
to hook the notification interrupt to handle I/O requests.

Notification interrupts from ACRN Hypervisor are already supported and
a, currently uninitialized, callback called.

Export two APIs for HSM to setup/remove its callback.

Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengwei Yin <fengwei.yin@intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Yu Wang <yu1.wang@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Originally-by: Yakui Zhao <yakui.zhao@intel.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Shuo Liu <shuo.a.liu@intel.com>
Link: https://lore.kernel.org/r/20210207031040.49576-3-shuo.a.liu@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-09 10:58:18 +01:00
Jarkko Sakkinen
2ade0d6093 x86/sgx: Maintain encl->refcount for each encl->mm_list entry
This has been shown in tests:

[  +0.000008] WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100

This is essentially a use-after free, although SRCU notices it as
an SRCU cleanup in an invalid context.

== Background ==

SGX has a data structure (struct sgx_encl_mm) which keeps per-mm SGX
metadata.  This is separate from struct sgx_encl because, in theory,
an enclave can be mapped from more than one mm.  sgx_encl_mm includes
a pointer back to the sgx_encl.

This means that sgx_encl must have a longer lifetime than all of the
sgx_encl_mm's that point to it.  That's usually the case: sgx_encl_mm
is freed only after the mmu_notifier is unregistered in sgx_release().

However, there's a race.  If the process is exiting,
sgx_mmu_notifier_release() can be called in parallel with sgx_release()
instead of being called *by* it.  The mmu_notifier path keeps encl_mm
alive past when sgx_encl can be freed.  This inverts the lifetime rules
and means that sgx_mmu_notifier_release() can access a freed sgx_encl.

== Fix ==

Increase encl->refcount when encl_mm->encl is established. Release
this reference when encl_mm is freed. This ensures that encl outlives
encl_mm.

 [ bp: Massage commit message. ]

Fixes: 1728ab54b4 ("x86/sgx: Add a page reclaimer")
Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20210207221401.29933-1-jarkko@kernel.org
2021-02-08 19:11:30 +01:00
Rafael J. Wysocki
d11a1d08a0 cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
If the maximum performance level taken for computing the
arch_max_freq_ratio value used in the x86 scale-invariance code is
higher than the one corresponding to the cpuinfo.max_freq value
coming from the acpi_cpufreq driver, the scale-invariant utilization
falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly
faster, which causes the schedutil governor to select a frequency
below cpuinfo.max_freq.  That frequency corresponds to a frequency
table entry below the maximum performance level necessary to get to
the "boost" range of CPU frequencies which prevents "boost"
frequencies from being used in some workloads.

While this issue is related to scale-invariance, it may be amplified
by commit db865272d9 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from the 5.10 development cycle which
made it extremely easy to default to schedutil even if the preferred
driver is acpi_cpufreq as long as intel_pstate is built too, because
the mere presence of the latter effectively removes the ondemand
governor from the defaults.  Distro kernels are likely to include
both intel_pstate and acpi_cpufreq on x86, so their users who cannot
use intel_pstate or choose to use acpi_cpufreq may easily be
affectecd by this issue.

If CPPC is available, it can be used to address this issue by
extending the frequency tables created by acpi_cpufreq to cover the
entire available frequency range (including "boost" frequencies) for
each CPU, but if CPPC is not there, acpi_cpufreq has no idea what
the maximum "boost" frequency is and the frequency tables created by
it cannot be extended in a meaningful way, so in that case make it
ask the arch scale-invariance code to to use the "nominal" performance
level for CPU utilization scaling in order to avoid the issue at hand.

Fixes: db865272d9 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-02-08 13:45:51 +01:00
Borislav Petkov
9223d0dccb thermal: Move therm_throt there from x86/mce
This functionality has nothing to do with MCE, move it to the thermal
framework and untangle it from MCE.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/20210202121003.GD18075@zn.tnic
2021-02-08 11:43:20 +01:00
Borislav Petkov
4f432e8bb1 x86/mce: Get rid of mcheck_intel_therm_init()
Move the APIC_LVTTHMR read which needs to happen on the BSP, to
intel_init_thermal(). One less boot dependency.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
2021-02-08 11:28:30 +01:00
Linus Torvalds
c6792d44d8 - For syscall user dispatch, separate ptctl operation from syscall
redirection range specification before the API has been made official in 5.11.
 
 - Ensure tasks using the generic syscall code do trap after returning
 from a syscall when single-stepping is requested.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmAfz7gACgkQEsHwGGHe
 VUp+8hAAlNdy5EJVBVEBT8U6K9ZxHJ2Mnk/uPteD8Sq9o37dndfJ5utrXd52h9om
 JFfcsIVO7Ej2i7bKNVzM1FgUeO5UqtwGoZyJxuyT4ma+MZIjFibaem0+ousovJiU
 MhB6Vl+jkEBIEJXg2z9btoLTa86SPJM77u+gtJXaeQegcNJENY1jpUHYlV22q90/
 b3b3MTVNNbw3bQty5hwWSU9G6PEXa888CJ+lEeuSjMQrVTmQ5i5oSMfYbUMCZIwm
 RQGcC/8qlDFfECBP9qMfq6sSoGnJ9uYmcT2Dzo7NiZHvBhtkzoWP4myjVF5g1oc/
 H5nUwrG2EXem73xuAdxbPe1nqVoU2byd658GjZ0St/Zcb5usanNEOkgJa3f+O3X5
 eRT5u9PFzhaTo2UDcLo02DlEqi/4Ed7bXJ2gxryHHxVi91Dr4G1uR+PL04MXJ6r8
 8YCf10c5qOrQ8u5DJ7/yq7uZkNpecdwzvEpQWkR7SmEjY0hNo2yt0Lt8JcD6eFcv
 Jx27bETAseUTrynnJJmyG7y+HvDds5M+t1gj8NPPs7vA/XkdEFRUdKoDGCJE+p6+
 y+cvRemx5p9YTiiTIEaiG187jR3M460DOvmT54xHcIWEWoJz3WfcRfXUqkx4xWOB
 TdJW5qTUnIkPr8XvHVcJUl6o9HIODclJCgZ7F7ceUP8XF2s2ATw=
 =l5j7
 -----END PGP SIGNATURE-----

Merge tag 'core_urgent_for_v5.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull syscall entry fixes from Borislav Petkov:

 - For syscall user dispatch, separate prctl operation from syscall
   redirection range specification before the API has been made official
   in 5.11.

 - Ensure tasks using the generic syscall code do trap after returning
   from a syscall when single-stepping is requested.

* tag 'core_urgent_for_v5.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Use different define for selector variable in SUD
  entry: Ensure trap after single-step on system call return
2021-02-07 10:16:24 -08:00
Linus Torvalds
e24f9c5f6e - Remove superfluous EFI PGD range checks which lead to those assertions failing
with certain kernel configs and LLVM.
 
 - Disable setting breakpoints on facilities involved in #DB exception handling
 to avoid infinite loops.
 
 - Add extra serialization to non-serializing MSRs (IA32_TSC_DEADLINE and
 x2 APIC MSRs) to adhere to SDM's recommendation and avoid any theoretical
 issues.
 
 - Re-add the EPB MSR reading on turbostat so that it works on older
 kernels which don't have the corresponding EPB sysfs file.
 
 - Add Alder Lake to the list of CPUs which support split lock.
 
 - Fix %dr6 register handling in order to be able to set watchpoints with gdb
 again.
 
 - Disable CET instrumentation in the kernel so that gcc doesn't add
 ENDBR64 to kernel code and thus confuse tracing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmAfwqsACgkQEsHwGGHe
 VUroWA//fVOzuJxG51vAh4QEFmV0QX5V3T5If1acDVhtg9hf+iHBiD0jwhl9l5lu
 CN3AmSBUzb1WFujRED/YD7ahW1IFuRe3nIXAEQ8DkMP4y8b9ry48LKPAVQkBX5Tq
 gCEUotRXBdUafLt1rnLUGVLKcL8pn65zRJc6nYTJfPYTd79wBPUlm89X6c0GJk7+
 Zjv/Zt3r+SUe5f3e/M0hhphqKntpWwwvqcj2NczJxods/9lbhvw9jnDrC1FeN+Q9
 d1gK56e1DY/iqezxU9B5V4jOmLtp3B7WpyrnyKEkQTUjuYryaiXaegxPrQ9Qv1Ej
 ZcsusN8LG/TeWrIF7mWhBDraO05Sgw0n+d9i4h89XUtRFB/DwQdNRN/l8YPknQW8
 3b0AYxpAcvlZhA20N1NQc/uwqsOtb06LQ29BeZCTDA4JFG3qUAzKNaWBptoUFIA/
 t/tq7DogJbcvKWKxyWeQq280w6uxDjki+ntY0Om95ZK2NgltpQuoiBHG0YjpbI4I
 DkuL/3Yck/aaM1TBVSab6145ki8vg+zIydvEmAH7JXkDiOZbIZAV2mtqN8NE7cuS
 PVZU3dt7GHhSc/xQW4EoRtqtgiRzADPGrrlDWPwwRVgvaMkjxpk+N3ycsFuPk7hL
 qQb26YJ5u14ntjvtfq0u53HQhriYGsa6JqwBHiNAZaN5Azo+1ws=
 =XwH4
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "I hope this is the last batch of x86/urgent updates for this round:

   - Remove superfluous EFI PGD range checks which lead to those
     assertions failing with certain kernel configs and LLVM.

   - Disable setting breakpoints on facilities involved in #DB exception
     handling to avoid infinite loops.

   - Add extra serialization to non-serializing MSRs (IA32_TSC_DEADLINE
     and x2 APIC MSRs) to adhere to SDM's recommendation and avoid any
     theoretical issues.

   - Re-add the EPB MSR reading on turbostat so that it works on older
     kernels which don't have the corresponding EPB sysfs file.

   - Add Alder Lake to the list of CPUs which support split lock.

   - Fix %dr6 register handling in order to be able to set watchpoints
     with gdb again.

   - Disable CET instrumentation in the kernel so that gcc doesn't add
     ENDBR64 to kernel code and thus confuse tracing"

* tag 'x86_urgent_for_v5.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/efi: Remove EFI PGD build time checks
  x86/debug: Prevent data breakpoints on cpu_dr7
  x86/debug: Prevent data breakpoints on __per_cpu_offset
  x86/apic: Add extra serialization for non-serializing MSRs
  tools/power/turbostat: Fallback to an MSR read for EPB
  x86/split_lock: Enable the split lock feature on another Alder Lake CPU
  x86/debug: Fix DR6 handling
  x86/build: Disable CET instrumentation in the kernel
2021-02-07 09:40:47 -08:00
Gabriel Krisman Bertazi
6342adcaa6 entry: Ensure trap after single-step on system call return
Commit 2991552447 ("entry: Drop usage of TIF flags in the generic syscall
code") introduced a bug on architectures using the generic syscall entry
code, in which processes stopped by PTRACE_SYSCALL do not trap on syscall
return after receiving a TIF_SINGLESTEP.

The reason is that the meaning of TIF_SINGLESTEP flag is overloaded to
cause the trap after a system call is executed, but since the above commit,
the syscall call handler only checks for the SYSCALL_WORK flags on the exit
work.

Split the meaning of TIF_SINGLESTEP such that it only means single-step
mode, and create a new type of SYSCALL_WORK to request a trap immediately
after a syscall in single-step mode.  In the current implementation, the
SYSCALL_WORK flag shadows the TIF_SINGLESTEP flag for simplicity.

Update x86 to flip this bit when a tracer enables single stepping.

Fixes: 2991552447 ("entry: Drop usage of TIF flags in the generic syscall code")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com>
Link: https://lore.kernel.org/r/87h7mtc9pr.fsf_-_@collabora.com
2021-02-06 00:21:42 +01:00
Lai Jiangshan
3943abf2db x86/debug: Prevent data breakpoints on cpu_dr7
local_db_save() is called at the start of exc_debug_kernel(), reads DR7 and
disables breakpoints to prevent recursion.

When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save() reads the
per-cpu variable cpu_dr7 to check whether a breakpoint is active or not
before it accesses DR7.

A data breakpoint on cpu_dr7 therefore results in infinite #DB recursion.

Disallow data breakpoints on cpu_dr7 to prevent that.

Fixes: 84b6a3491567a("x86/entry: Optimize local_db_save() for virt")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-2-jiangshanlai@gmail.com
2021-02-05 20:13:12 +01:00
Lai Jiangshan
c4bed4b969 x86/debug: Prevent data breakpoints on __per_cpu_offset
When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU GSBASE value
via __per_cpu_offset or pcpu_unit_offsets.

When a data breakpoint is set on __per_cpu_offset[cpu] (read-write
operation), the specific CPU will be stuck in an infinite #DB loop.

RCU will try to send an NMI to the specific CPU, but it is not working
either since NMI also relies on paranoid_entry(). Which means it's
undebuggable.

Fixes: eaad981291ee3("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-1-jiangshanlai@gmail.com
2021-02-05 20:13:11 +01:00
Daniel Vetter
dc9b7be557 x86/sgx: Drop racy follow_pfn() check
PTE insertion is fundamentally racy, and this check doesn't do anything
useful. Quoting Sean:

  "Yeah, it can be whacked. The original, never-upstreamed code asserted
  that the resolved PFN matched the PFN being installed by the fault
  handler as a sanity check on the SGX driver's EPC management. The
  WARN assertion got dropped for whatever reason, leaving that useless
  chunk."

Jason stumbled over this as a new user of follow_pfn(), and I'm trying
to get rid of unsafe callers of that function so it can be locked down
further.

This is independent prep work for the referenced patch series:

  https://lore.kernel.org/dri-devel/20201127164131.2244124-1-daniel.vetter@ffwll.ch/

Fixes: 947c6e11fa ("x86/sgx: Add ptrace() support for the SGX driver")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210204184519.2809313-1-daniel.vetter@ffwll.ch
2021-02-05 10:45:11 +01:00
Dave Hansen
25a068b8e9 x86/apic: Add extra serialization for non-serializing MSRs
Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a plain
MFENCE while the Intel SDM (10.12.3 MSR Access in x2APIC Mode) calls for
MFENCE; LFENCE.

Short summary: we have special MSRs that have weaker ordering than all
the rest. Add fencing consistent with current SDM recommendations.

This is not known to cause any issues in practice, only in theory.

Longer story below:

The reason the kernel uses a different semantic is that the SDM changed
(roughly in late 2017). The SDM changed because folks at Intel were
auditing all of the recommended fences in the SDM and realized that the
x2apic fences were insufficient.

Why was the pain MFENCE judged insufficient?

WRMSR itself is normally a serializing instruction. No fences are needed
because the instruction itself serializes everything.

But, there are explicit exceptions for this serializing behavior written
into the WRMSR instruction documentation for two classes of MSRs:
IA32_TSC_DEADLINE and the X2APIC MSRs.

Back to x2apic: WRMSR is *not* serializing in this specific case.
But why is MFENCE insufficient? MFENCE makes writes visible, but
only affects load/store instructions. WRMSR is unfortunately not a
load/store instruction and is unaffected by MFENCE. This means that a
non-serializing WRMSR could be reordered by the CPU to execute before
the writes made visible by the MFENCE have even occurred in the first
place.

This means that an x2apic IPI could theoretically be triggered before
there is any (visible) data to process.

Does this affect anything in practice? I honestly don't know. It seems
quite possible that by the time an interrupt gets to consume the (not
yet) MFENCE'd data, it has become visible, mostly by accident.

To be safe, add the SDM-recommended fences for all x2apic WRMSRs.

This also leaves open the question of the _other_ weakly-ordered WRMSR:
MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as
the x2APIC MSRs, it seems substantially less likely to be a problem in
practice. While writes to the in-memory Local Vector Table (LVT) might
theoretically be reordered with respect to a weakly-ordered WRMSR like
TSC_DEADLINE, the SDM has this to say:

  In x2APIC mode, the WRMSR instruction is used to write to the LVT
  entry. The processor ensures the ordering of this write and any
  subsequent WRMSR to the deadline; no fencing is required.

But, that might still leave xAPIC exposed. The safest thing to do for
now is to add the extra, recommended LFENCE.

 [ bp: Massage commit message, fix typos, drop accidentally added
   newline to tools/arch/x86/include/asm/barrier.h. ]

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200305174708.F77040DD@viggo.jf.intel.com
2021-02-04 19:36:31 +01:00
Mike Rapoport
5c279c4cf2 Revert "x86/setup: don't remove E820_TYPE_RAM for pfn 0"
This reverts commit bde9cfa3af.

Changing the first memory page type from E820_TYPE_RESERVED to
E820_TYPE_RAM makes it a part of "System RAM" resource rather than a
reserved resource and this in turn causes devmem_is_allowed() to treat
is as area that can be accessed but it is filled with zeroes instead of
the actual data as previously.

The change in /dev/mem output causes lilo to fail as was reported at
slakware users forum, and probably other legacy applications will
experience similar problems.

Link: https://www.linuxquestions.org/questions/slackware-14/slackware-current-lilo-vesa-warnings-after-recent-updates-4175689617/#post6214439
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-04 10:26:53 -08:00
Andy Lutomirski
f22fecaf39 x86/ptrace: Clean up PTRACE_GETREGS/PTRACE_PUTREGS regset selection
task_user_regset_view() has nonsensical semantics, but those semantics
appear to be relied on by existing users of PTRACE_GETREGSET and
PTRACE_SETREGSET.  (See added comments below for details.)

It shouldn't be used for PTRACE_GETREGS or PTRACE_SETREGS, though. A
native 64-bit ptrace() call and an x32 ptrace() call using GETREGS
or SETREGS wants the 64-bit regset views, and a 32-bit ptrace() call
(native or compat) should use the 32-bit regset.

task_user_regset_view() almost does this except that it will
malfunction if a ptracer is itself ptraced and the outer ptracer
modifies CS on entry to a ptrace() syscall.  Hopefully that has never
happened.  (The compat ptrace() code already hardcoded the 32-bit
regset, so this change has no effect on that path.)

Improve the situation and deobfuscate the code by hardcoding the
64-bit view in the x32 ptrace() and selecting the view based on the
kernel config in the native ptrace().

I tried to figure out the history behind this API. I naïvely assumed
that PTRAGE_GETREGSET and PTRACE_SETREGSET were ancient APIs that
predated compat, but no. They were introduced by

  2225a122ae ("ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET")

in 2010, and they are simply a poor design.  ELF core dumps have the
ELF e_machine field and a bunch of register sets in ELF notes, and the
pair (e_machine, NT_XXX) indicates the format of the regset blob.  But
the new PTRACE_GET/SETREGSET API coopted the NT_XXX numbering without
any way to specify which e_machine was in effect.  This is especially
bad on x86, where a process can freely switch between 32-bit and
64-bit mode, and, in fact, the PTRAGE_SETREGSET call itself can cause
this switch to happen.  Oops.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/9daa791d0c7eaebd59c5bc2b2af1b0e7bebe707d.1612375698.git.luto@kernel.org
2021-02-04 12:33:15 +01:00
Sean Christopherson
ed72736183 x86/reboot: Force all cpus to exit VMX root if VMX is supported
Force all CPUs to do VMXOFF (via NMI shootdown) during an emergency
reboot if VMX is _supported_, as VMX being off on the current CPU does
not prevent other CPUs from being in VMX root (post-VMXON).  This fixes
a bug where a crash/panic reboot could leave other CPUs in VMX root and
prevent them from being woken via INIT-SIPI-SIPI in the new kernel.

Fixes: d176720d34 ("x86: disable VMX on all CPUs on reboot")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David P. Reed <dpreed@deepplum.com>
[sean: reworked changelog and further tweaked comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:31 -05:00
Sean Christopherson
db7d8e4768 x86/apic: Export x2apic_mode for use by KVM in "warm" path
Export x2apic_mode so that KVM can query whether x2APIC is active
without having to incur the RDMSR in x2apic_enabled().  When Posted
Interrupts are in use for a guest with an assigned device, KVM ends up
checking for x2APIC at least once every time a vCPU halts.  KVM could
obviously snapshot x2apic_enabled() to avoid the RDMSR, but that's
rather silly given that x2apic_mode holds the exact info needed by KVM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115220354.434807-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:22 -05:00
Fenghua Yu
8acf417805 x86/split_lock: Enable the split lock feature on another Alder Lake CPU
Add Alder Lake mobile processor to CPU list to enumerate and enable the
split lock feature.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210201190007.4031869-1-fenghua.yu@intel.com
2021-02-01 21:34:51 +01:00
Peter Zijlstra
9ad22e1659 x86/debug: Fix DR6 handling
Tom reported that one of the GDB test-cases failed, and Boris bisected
it to commit:

  d53d9bc0cf ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")

The debugging session led us to commit:

  6c0aca288e ("x86: Ignore trap bits on single step exceptions")

It turns out that TF and data breakpoints are both traps and will be
merged, while instruction breakpoints are faults and will not be merged.
This means 6c0aca288e is wrong, only TF and instruction breakpoints
need to be excluded while TF and data breakpoints can be merged.

 [ bp: Massage commit message. ]

Fixes: d53d9bc0cf ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
Fixes: 6c0aca288e ("x86: Ignore trap bits on single step exceptions")
Reported-by: Tom de Vries <tdevries@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YBMAbQGACujjfz%2Bi@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20210128211627.GB4348@worktop.programming.kicks-ass.net
2021-02-01 15:49:02 +01:00
Will Deacon
8cf55f24ce x86/ldt: Use tlb_gather_mmu_fullmm() when freeing LDT page-tables
free_ldt_pgtables() uses the MMU gather API for batching TLB flushes
over the call to free_pgd_range(). However, tlb_gather_mmu() expects
to operate on user addresses and so passing LDT_{BASE,END}_ADDR will
confuse the range setting logic in __tlb_adjust_range(), causing the
gather to identify a range starting at TASK_SIZE. Such a large range
will be converted into a 'fullmm' flush by the low-level invalidation
code, so change the caller to invoke tlb_gather_mmu_fullmm() directly.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-7-will@kernel.org
2021-01-29 20:02:29 +01:00
Will Deacon
a72afd8730 tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
needed now that there is a separate function for 'fullmm' flushing.

Remove the unused arguments and update all callers.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
2021-01-29 20:02:29 +01:00
Will Deacon
ae8eba8b5d tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
Since commit 7a30df49f6 ("mm: mmu_gather: remove __tlb_reset_range()
for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
are no longer used, since we flush the whole mm in case of a nested
invalidation.

Remove the unused arguments and update all callers.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
2021-01-29 20:02:28 +01:00
Yejune Deng
0a74d61c7d x86/fpu/xstate: Use sizeof() instead of a constant
Use sizeof() instead of a constant in fpstate_sanitize_xstate().
Remove use of the address of the 0th array element of ->st_space and
->xmm_space which is equivalent to the array address itself:

No code changed:

  # arch/x86/kernel/fpu/xstate.o:

   text    data     bss     dec     hex filename
   9694     899       4   10597    2965 xstate.o.before
   9694     899       4   10597    2965 xstate.o.after

md5:
   5a43fc70bad8e2a1784f67f01b71aabb  xstate.o.before.asm
   5a43fc70bad8e2a1784f67f01b71aabb  xstate.o.after.asm

 [ bp: Massage commit message. ]

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210122071925.41285-1-yejune.deng@gmail.com
2021-01-29 12:33:17 +01:00
Viresh Kumar
a6a0683b71 arch: x86: Remove CONFIG_OPROFILE support
The "oprofile" user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.

Remove the old oprofile's architecture specific support.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2021-01-29 10:05:51 +05:30
Sean Christopherson
fb35d30fe5 x86/cpufeatures: Assign dedicated feature word for CPUID_0x8000001F[EAX]
Collect the scattered SME/SEV related feature flags into a dedicated
word.  There are now five recognized features in CPUID.0x8000001F.EAX,
with at least one more on the horizon (SEV-SNP).  Using a dedicated word
allows KVM to use its automagic CPUID adjustment logic when reporting
the set of supported features to userspace.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/20210122204047.2860075-2-seanjc@google.com
2021-01-28 17:41:24 +01:00
Fangrui Song
bb73d07148 x86/build: Treat R_386_PLT32 relocation as R_386_PC32
This is similar to commit

  b21ebf2fb4 ("x86: Treat R_X86_64_PLT32 as R_X86_64_PC32")

but for i386. As far as the kernel is concerned, R_386_PLT32 can be
treated the same as R_386_PC32.

R_386_PLT32/R_X86_64_PLT32 are PC-relative relocation types which
can only be used by branches. If the referenced symbol is defined
externally, a PLT will be used.

R_386_PC32/R_X86_64_PC32 are PC-relative relocation types which can be
used by address taking operations and branches. If the referenced symbol
is defined externally, a copy relocation/canonical PLT entry will be
created in the executable.

On x86-64, there is no PIC vs non-PIC PLT distinction and an
R_X86_64_PLT32 relocation is produced for both `call/jmp foo` and
`call/jmp foo@PLT` with newer (2018) GNU as/LLVM integrated assembler.
This avoids canonical PLT entries (st_shndx=0, st_value!=0).

On i386, there are 2 types of PLTs, PIC and non-PIC. Currently,
the GCC/GNU as convention is to use R_386_PC32 for non-PIC PLT and
R_386_PLT32 for PIC PLT. Copy relocations/canonical PLT entries
are possible ABI issues but GCC/GNU as will likely keep the status
quo because (1) the ABI is legacy (2) the change will drop a GNU
ld diagnostic for non-default visibility ifunc in shared objects.

clang-12 -fno-pic (since [1]) can emit R_386_PLT32 for compiler
generated function declarations, because preventing canonical PLT
entries is weighed over the rare ifunc diagnostic.

Further info for the more interested:

  https://github.com/ClangBuiltLinux/linux/issues/1210
  https://sourceware.org/bugzilla/show_bug.cgi?id=27169
  a084c0388e [1]

 [ bp: Massage commit message. ]

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lkml.kernel.org/r/20210127205600.1227437-1-maskray@google.com
2021-01-28 12:24:06 +01:00
Misono Tomohiro
02a16aa135 x86/MSR: Filter MSR writes through X86_IOC_WRMSR_REGS ioctl too
Commit

  a7e1f67ed2 ("x86/msr: Filter MSR writes")

introduced a module parameter to disable writing to the MSR device file
and tainted the kernel upon writing. As MSR registers can be written by
the X86_IOC_WRMSR_REGS ioctl too, the same filtering and tainting should
be applied to the ioctl as well.

 [ bp: Massage commit message and space out statements. ]

Fixes: a7e1f67ed2 ("x86/msr: Filter MSR writes")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210127122456.13939-1-misono.tomohiro@jp.fujitsu.com
2021-01-27 19:06:47 +01:00
Josh Poimboeuf
aeb818fcc9 x86/acpi: Support objtool validation in wakeup_64.S
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file.  File-level ignores won't work when validating vmlinux.o.

Instead, tell objtool to ignore do_suspend_lowlevel() directly with the
STACK_FRAME_NON_STANDARD annotation.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/269eda576c53bc9ecc8167c211989111013a67aa.1611263462.git.jpoimboe@redhat.com
2021-01-26 11:33:03 -06:00
Josh Poimboeuf
f83d1a0190 x86/acpi: Annotate indirect branch as safe
This indirect jump is harmless; annotate it to keep objtool's retpoline
validation happy.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/a7288e7043265d95c1a5d64f9fd751ead4854bdc.1611263462.git.jpoimboe@redhat.com
2021-01-26 11:33:03 -06:00
Josh Poimboeuf
7cae4b1cf1 x86/ftrace: Support objtool vmlinux.o validation in ftrace_64.S
With objtool vmlinux.o validation of return_to_handler(), now that
objtool has visibility inside the retpoline, jumping from EMPTY state to
a proper function state results in a stack state mismatch.

return_to_handler() is actually quite normal despite the underlying
magic.  Just annotate it as a normal function.

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/14f48e623f61dbdcd84cf27a56ed8ccae73199ef.1611263462.git.jpoimboe@redhat.com
2021-01-26 11:33:02 -06:00
Josh Poimboeuf
b735bd3e68 objtool: Combine UNWIND_HINT_RET_OFFSET and UNWIND_HINT_FUNC
The ORC metadata generated for UNWIND_HINT_FUNC isn't actually very
func-like.  With certain usages it can cause stack state mismatches
because it doesn't set the return address (CFI_RA).

Also, users of UNWIND_HINT_RET_OFFSET no longer need to set a custom
return stack offset.  Instead they just need to specify a func-like
situation, so the current ret_offset code is hacky for no good reason.

Solve both problems by simplifying the RET_OFFSET handling and
converting it into a more useful UNWIND_HINT_FUNC.

If we end up needing the old 'ret_offset' functionality again in the
future, we should be able to support it pretty easily with the addition
of a custom 'sp_offset' in UNWIND_HINT_FUNC.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/db9d1f5d79dddfbb3725ef6d8ec3477ad199948d.1611263462.git.jpoimboe@redhat.com
2021-01-26 11:12:00 -06:00
Josh Poimboeuf
18660698a3 x86/ftrace: Add UNWIND_HINT_FUNC annotation for ftrace_stub
Prevent an unreachable objtool warning after the sibling call detection
gets improved.  ftrace_stub() is basically a function, annotate it as
such.

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/6845e1b2fb0723a95740c6674e548ba38c5ea489.1611263461.git.jpoimboe@redhat.com
2021-01-26 11:11:59 -06:00
Caz Yokoyama
c6bba9e5fe x86/gpu: Add Alderlake-S stolen memory support
Alderlake-S is a Gen 12 based hybrid processor architecture. As it
belongs to Gen 12 family, it uses the same GTT stolen memory settings
like its predecessors - ICL(Gen 11) and TGL(Gen 12). Inherit gen11
and gen 9 quirks for determining base and size of stolen memory.

Bspec: 52055
Bspec: 49589
Bspec: 49636

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: x86@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Caz Yokoyama <caz.yokoyama@intel.com>
Signed-off-by: Aditya Swarup <aditya.swarup@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210126001744.29442-2-aditya.swarup@intel.com
2021-01-26 06:55:09 -08:00
Linus Torvalds
5130680642 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "18 patches.

  Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
  memory-failure, and highmem), ubsan, proc, and MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add a couple more files to the Clang/LLVM section
  proc_sysctl: fix oops caused by incorrect command parameters
  powerpc/mm/highmem: use __set_pte_at() for kmap_local()
  mips/mm/highmem: use set_pte() for kmap_local()
  mm/highmem: prepare for overriding set_pte_at()
  sparc/mm/highmem: flush cache and TLB
  mm: fix page reference leak in soft_offline_page()
  ubsan: disable unsigned-overflow check for i386
  kasan, mm: fix resetting page_alloc tags for HW_TAGS
  kasan, mm: fix conflicts with init_on_alloc/free
  kasan: fix HW_TAGS boot parameters
  kasan: fix incorrect arguments passing in kasan_add_zero_shadow
  kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
  mm: fix numa stats for thp migration
  mm: memcg: fix memcg file_dirty numa stat
  mm: memcg/slab: optimize objcg stock draining
  mm: fix initialization of struct page for holes in memory layout
  x86/setup: don't remove E820_TYPE_RAM for pfn 0
2021-01-24 12:16:34 -08:00
Linus Torvalds
24c56ee06c - Correct the marking of kthreads which are supposed to run on a specific,
single CPU vs such which are affine to only one CPU, mark per-cpu workqueue
  threads as such and make sure that marking "survives" CPU hotplug. Fix CPU
  hotplug issues with such kthreads.
 
  - A fix to not push away tasks on CPUs coming online.
 
  - Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity
    on CPU offlining so that pending workers can finish on newly arrived onlined
    CPUs too.
 
  - Dump tasks which haven't vacated a CPU which is currently being unplugged.
 
  - Register a special scale invariance callback which gets called on resume
  from RAM to read out APERF/MPERF after resume and thus make the schedutil
  scaling governor more precise.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANYCAACgkQEsHwGGHe
 VUo+OBAAjfqkijDlXiGX6lrT5gRx5NZICpeMgbWa7J13XHT1ysD/b0fMGFIUyF6k
 aszDLTl8U/S1/qGAYlzTSPAFcdZ+ENiFqQ48ozMk4jZC3p0quHTjs/PdiSG6kYBi
 +e4smht+bSyLKxsG8hN0kJ+mLEd+uIQ13kP4YkxPgWbJ9WNP/U6HHGBo0rBchtSe
 Kn6bdd8CfwmC6rSazp7kdQoFoWeQaoMI1ODX3VphK1GtL1wq8WSICzRhpg3caeyG
 3lCIddoNW9mCA9Nkc6R6HeV3uW9JGkPAjnmtTIEHDbg9pib7xNT978ieTQuqNDCi
 DlAHDGumzoaiVJZhD/1fj/RXMJr2YUHxtrXWNsXpiKJ9g8Tn+WC0UW/4+Mx2L/km
 0RSoXJlMs1fGopS2I/fObZ6RPhmg4D+gJsMCdaHQzX4NgxZAGhNNPxMckZ0IM8A0
 2NNXSHUZHVTHeJEW0E/glOcpWb5hG+vDwiBMNEWfTwYpTfrw2EEOZaKniZE7WlSL
 4ItM9rkLGl1KToJzAH4A0oUtSy3vtSCo8B1noGlc09Lj+oCIBlr81z9+C79a2oxG
 qE7Xd4X7y7Qs3JeCbRZWQa7/2Kf1v4XnjELrJJeCZC85r0ZqJDwRX8w7lkmW2XPU
 m4J2prr/DDZSqrRh23/xC1fsU+vcBKSfKUFKAH4Lg2VIaUfSUEk=
 =2DAF
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Correct the marking of kthreads which are supposed to run on a
   specific, single CPU vs such which are affine to only one CPU, mark
   per-cpu workqueue threads as such and make sure that marking
   "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.

 - A fix to not push away tasks on CPUs coming online.

 - Have workqueue CPU hotplug code use cpu_possible_mask when breaking
   affinity on CPU offlining so that pending workers can finish on newly
   arrived onlined CPUs too.

 - Dump tasks which haven't vacated a CPU which is currently being
   unplugged.

 - Register a special scale invariance callback which gets called on
   resume from RAM to read out APERF/MPERF after resume and thus make
   the schedutil scaling governor more precise.

* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Relax the set_cpus_allowed_ptr() semantics
  sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  sched: Prepare to use balance_push in ttwu()
  workqueue: Restrict affinity change to rescuer
  workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  kthread: Extract KTHREAD_IS_PER_CPU
  sched: Don't run cpu-online with balance_push() enabled
  workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
  sched/core: Print out straggler tasks in sched_cpu_dying()
  x86: PM: Register syscore_ops for scale invariance
2021-01-24 10:09:20 -08:00
Linus Torvalds
17b6c49da3 - Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when the FPU
    is used in-kernel and fix a boot crash on K7 due to early MXCSR access before
    CR4.OSFXSR is even set.
 
  - A couple of noinstr annotation fixes
 
  - Correct die ID setting on AMD for users of topology information which need
    the correct die ID
 
  - A SEV-ES fix to handle string port IO to/from kernel memory properly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANUr0ACgkQEsHwGGHe
 VUos4hAAlBik/z+y+DaZGJyxtpST2YQaEbwbW3UMqyLsdVnLTTRnKzC1T+fEfD2Q
 SxtCPYH5iuPbCgOOoQboWt6Aa53JlX9bRBZ/87Ub/ELJ9NgMxMQFXAiaDZAAY6Zy
 L2B13KpoGOifPjrGDgksnafyqYv1CYesiArfOffHgvC3/0j7ONdda2SRDQ697TBw
 FSV/WfUjCo0+JdXRRaP6YH5t9MxFerHxVH38xTDFwXikS9CVyddosLo5EP2wAQvi
 5+160i2jB25vyMEsFBr5wE0xDpWLUdClVpzHXXPG2i0P+NHATiBcreTMPzeYOUXu
 Hfc/y4ukOVDoMGlHLNKHq89alI87soMJIEjm2sAG1ZIypKyMJw7YUXQNRR3TcP0U
 c7/C3W1mCWD1+8nLtlIMM0Z20DacQOf9YWko95+uh08+S52KpTOgnx+mpoZjK1PQ
 Wv9HxPJKycrgRNhfverN5FSiOEW/DdvqNfVHTjuuzNLyKdM1NoZ/YTIyABk4RfFq
 USUnC5rk4GqvCYdaLTEKkAJvLCmRKgVYd75Rc4/pPKILS6kv82vpj3BjClBaH0h1
 yrvpafvXzOhwKP/J5q0vm57NJdqPZwuW4Ah+74tptmQL4rga84U4FOs3JpNJq0uu
 1mj6xSFD8ZyI11BSkYbZAHTy1eNERze+azftCSPq/6EifYvqnsE=
 =3rZM
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add a new Intel model number for Alder Lake

 - Differentiate which aspects of the FPU state get saved/restored when
   the FPU is used in-kernel and fix a boot crash on K7 due to early
   MXCSR access before CR4.OSFXSR is even set.

 - A couple of noinstr annotation fixes

 - Correct die ID setting on AMD for users of topology information which
   need the correct die ID

 - A SEV-ES fix to handle string port IO to/from kernel memory properly

* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add another Alder Lake CPU to the Intel family
  x86/mmx: Use KFPU_387 for MMX string operations
  x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
  x86/topology: Make __max_die_per_package available unconditionally
  x86: __always_inline __{rd,wr}msr()
  x86/mce: Remove explicit/superfluous tracing
  locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
  locking/lockdep: Cure noinstr fail
  x86/sev: Fix nonistr violation
  x86/entry: Fix noinstr fail
  x86/cpu/amd: Set __max_die_per_package on AMD
  x86/sev-es: Handle string port IO to kernel memory properly
2021-01-24 09:46:05 -08:00
Mike Rapoport
bde9cfa3af x86/setup: don't remove E820_TYPE_RAM for pfn 0
Patch series "mm: fix initialization of struct page for holes in  memory layout", v3.

Commit 73a6e474cb ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.

Initially there were crashes during compaction that Qian Cai reported
back in April [1].  It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].

[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org

This patch (of 2):

The first 4Kb of memory is a BIOS owned area and to avoid its allocation
for the kernel it was not listed in e820 tables as memory.  As the result,
pfn 0 was never recognised by the generic memory management and it is not
a part of neither node 0 nor ZONE_DMA.

If set_pfnblock_flags_mask() would be ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

Along with reserving the first 4Kb in e820 tables, several first pages are
reserved with memblock in several places during setup_arch().  These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.

Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.

Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Andy Lutomirski
8ece53ef7f x86/vm86/32: Remove VM86_SCREEN_BITMAP support
The implementation was rather buggy.  It unconditionally marked PTEs
read-only, even for VM_SHARED mappings.  I'm not sure whether this is
actually a problem, but it certainly seems unwise.  More importantly, it
released the mmap lock before flushing the TLB, which could allow a racing
CoW operation to falsely believe that the underlying memory was not
writable.

I can't find any users at all of this mechanism, so just remove it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Stas Sergeev <stsp2@yandex.ru>
Link: https://lkml.kernel.org/r/f3086de0babcab36f69949b5780bde851f719bc8.1611078018.git.luto@kernel.org
2021-01-21 20:08:53 +01:00
Sami Tolvanen
31bf928817 x86/sgx: Fix the return type of sgx_init()
device_initcall() expects a function of type initcall_t, which returns
an integer. Change the signature of sgx_init() to match.

Fixes: e7e0545299 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210113232311.277302-1-samitolvanen@google.com
2021-01-21 14:04:06 +01:00
Andy Lutomirski
e45122893a x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
Currently, requesting kernel FPU access doesn't distinguish which parts of
the extended ("FPU") state are needed.  This is nice for simplicity, but
there are a few cases in which it's suboptimal:

 - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
   not use legacy 387 state.  These users want MXCSR initialized but don't
   care about the FPU control word.  Skipping FNINIT would save time.
   (Empirically, FNINIT is several times slower than LDMXCSR.)

 - Code that wants MMX doesn't want or need MXCSR initialized.
   _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
   initializing MXCSR will fail because LDMXCSR generates an #UD when the
   aforementioned CR4 bit is not set.

 - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
   dynamic states will need special handling.

Add a more specific API that allows callers to specify exactly what they
want.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Krzysztof Piotr Olędzki <ole@ans.pl>
Link: https://lkml.kernel.org/r/aff1cac8b8fc7ee900cf73e8f2369966621b053f.1611205691.git.luto@kernel.org
2021-01-21 12:07:28 +01:00
Rafael J. Wysocki
9c7d9017a4 x86: PM: Register syscore_ops for scale invariance
On x86 scale invariace tends to be disabled during resume from
suspend-to-RAM, because the MPERF or APERF MSR values are not as
expected then due to updates taking place after the platform
firmware has been invoked to complete the suspend transition.

That, of course, is not desirable, especially if the schedutil
scaling governor is in use, because the lack of scale invariance
causes it to be less reliable.

To counter that effect, modify init_freq_invariance() to register
a syscore_ops object for scale invariance with the ->resume callback
pointing to init_counter_refs() which will run on the CPU starting
the resume transition (the other CPUs will be taken care of the
"online" operations taking place later).

Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
2021-01-19 17:04:03 +01:00
Tom Rix
b86cb29287 x86: Remove definition of DEBUG
Defining DEBUG should only be done in development. So remove it.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20210114212827.47584-1-trix@redhat.com
2021-01-15 08:23:10 +01:00
Borislav Petkov
1eb8f690bc x86/topology: Make __max_die_per_package available unconditionally
Move it outside of CONFIG_SMP in order to avoid ifdeffery at the usage
sites.

Fixes: 76e2fc63ca ("x86/cpu/amd: Set __max_die_per_package on AMD")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210114111814.5346-1-bp@alien8.de
2021-01-14 12:18:36 +01:00
Peter Zijlstra
737495361d x86/mce: Remove explicit/superfluous tracing
There's some explicit tracing left in exc_machine_check_kernel(),
remove it, as it's already implied by irqentry_nmi_enter().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210106144017.719310466@infradead.org
2021-01-12 21:10:59 +01:00
Peter Zijlstra
a1d5c98aac x86/sev: Fix nonistr violation
When the compiler fails to inline, it violates nonisntr:

  vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xc7: call to sev_es_wr_ghcb_msr() leaves .noinstr.text section

Fixes: 4ca68e023b ("x86/sev-es: Handle NMI State")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210106144017.532902065@infradead.org
2021-01-12 21:10:58 +01:00
Yazen Ghannam
76e2fc63ca x86/cpu/amd: Set __max_die_per_package on AMD
Set the maximum DIE per package variable on AMD using the
NodesPerProcessor topology value. This will be used by RAPL, among
others, to determine the maximum number of DIEs on the system in order
to do per-DIE manipulations.

 [ bp: Productize into a proper patch. ]

Fixes: 028c221ed1 ("x86/CPU/AMD: Save AMD NodeId as cpu_die_id")
Reported-by: Johnathan Smithinovic <johnathan.smithinovic@gmx.at>
Reported-by: Rafael Kitover <rkitover@gmail.com>
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Johnathan Smithinovic <johnathan.smithinovic@gmx.at>
Tested-by: Rafael Kitover <rkitover@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=210939
Link: https://lkml.kernel.org/r/20210106112106.GE5729@zn.tnic
Link: https://lkml.kernel.org/r/20210111101455.1194-1-bp@alien8.de
2021-01-12 12:21:01 +01:00
Linus Torvalds
f1ee3e150b hyperv-fixes for 5.11-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAl/8PSkTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXvCTB/4gs46EYFB5if10OjV/K8YgfDkcrkHD
 pu/e0VrlqFQn5DS1hh4lsnEZ8UJ0oL9ctG/QewnnNgaM6786+IrFn0XWHKQWZ+xz
 DlKbnMjQPsmTtY+MyAw1VeJrC91jCVenAuXnRTlm9eieOtMfHC+0VOoba2Ih8ZZz
 2b3Yic7IqfaMMJgq5lOIFmhVTygUY75Gnh+hu1pBatKpTG4P4DWui/E+QZx7x6FD
 05RuMWvo2ZtTkMLd1TlRjdNJt23zW3EdkhfyEWwCRVdn8WSwAz10baDvZvqwYsCn
 rucix6p9ZXLpdCdSpal4P1WkPN28yoGpwCrD0Af/jaBj8296ssKViCoc
 =gmx1
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20210111' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

  - fix kexec panic/hang (Dexuan Cui)

  - fix occasional crashes when flushing TLB (Wei Liu)

* tag 'hyperv-fixes-signed-20210111' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: check cpu mask after interrupt has been disabled
  x86/hyperv: Fix kexec panic/hang issues
2021-01-11 11:28:58 -08:00
Hyunwook (Wooky) Baek
7024f60d65 x86/sev-es: Handle string port IO to kernel memory properly
Don't assume dest/source buffers are userspace addresses when manually
copying data for string I/O or MOVS MMIO, as {get,put}_user() will fail
if handed a kernel address and ultimately lead to a kernel panic.

When invoking INSB/OUTSB instructions in kernel space in a
SEV-ES-enabled VM, the kernel crashes with the following message:

  "SEV-ES: Unsupported exception in #VC instruction emulation - can't continue"

Handle that case properly.

 [ bp: Massage commit message. ]

Fixes: f980f9c31a ("x86/sev-es: Compile early handler code into kernel image")
Signed-off-by: Hyunwook (Wooky) Baek <baekhw@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Link: https://lkml.kernel.org/r/20210110071102.2576186-1-baekhw@google.com
2021-01-11 20:01:52 +01:00
Valentin Schneider
6d3b47ddff x86/resctrl: Apply READ_ONCE/WRITE_ONCE to task_struct.{rmid,closid}
A CPU's current task can have its {closid, rmid} fields read locally
while they are being concurrently written to from another CPU.
This can happen anytime __resctrl_sched_in() races with either
__rdtgroup_move_task() or rdt_move_group_tasks().

Prevent load / store tearing for those accesses by giving them the
READ_ONCE() / WRITE_ONCE() treatment.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/9921fda88ad81afb9885b517fbe864a2bc7c35a9.1608243147.git.reinette.chatre@intel.com
2021-01-11 11:43:23 +01:00
Reinette Chatre
e0ad6dc896 x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI
James reported in [1] that there could be two tasks running on the same CPU
with task_struct->on_cpu set. Using task_struct->on_cpu as a test if a task
is running on a CPU may thus match the old task for a CPU while the
scheduler is running and IPI it unnecessarily.

task_curr() is the correct helper to use. While doing so move the #ifdef
check of the CONFIG_SMP symbol to be a C conditional used to determine
if this helper should be used to ensure the code is always checked for
correctness by the compiler.

[1] https://lore.kernel.org/lkml/a782d2f3-d2f6-795f-f4b1-9462205fd581@arm.com

Reported-by: James Morse <james.morse@arm.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/e9e68ce1441a73401e08b641cc3b9a3cf13fe6d4.1608243147.git.reinette.chatre@intel.com
2021-01-11 11:34:45 +01:00
Tom Rix
3ff4ec0e28 x86/resctrl: Add printf attribute to log function
Mark the function with the __printf attribute to allow the compiler to
more thoroughly typecheck its arguments against a format string with
-Wformat and similar flags.

 [ bp: Massage commit message. ]

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20201221160009.3752017-1-trix@redhat.com
2021-01-11 11:20:36 +01:00
Paul E. McKenney
7bb39313cd x86/mce: Make mce_timed_out() identify holdout CPUs
The

  "Timeout: Not all CPUs entered broadcast exception handler"

message will appear from time to time given enough systems, but this
message does not identify which CPUs failed to enter the broadcast
exception handler. This information would be valuable if available,
for example, in order to correlate with other hardware-oriented error
messages.

Add a cpumask of CPUs which maintains which CPUs have entered this
handler, and print out which ones failed to enter in the event of a
timeout.

 [ bp: Massage. ]

Reported-by: Jonathan Lemon <bsd@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
2021-01-08 18:00:09 +01:00
Fenghua Yu
a0195f314a x86/resctrl: Don't move a task to the same resource group
Shakeel Butt reported in [1] that a user can request a task to be moved
to a resource group even if the task is already in the group. It just
wastes time to do the move operation which could be costly to send IPI
to a different CPU.

Add a sanity check to ensure that the move operation only happens when
the task is not already in the resource group.

[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/

Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/962ede65d8e95be793cb61102cca37f7bb018e66.1608243147.git.reinette.chatre@intel.com
2021-01-08 09:08:03 +01:00
Fenghua Yu
ae28d1aae4 x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
Currently, when moving a task to a resource group the PQR_ASSOC MSR is
updated with the new closid and rmid in an added task callback. If the
task is running, the work is run as soon as possible. If the task is not
running, the work is executed later in the kernel exit path when the
kernel returns to the task again.

Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running is the right thing to do. Queueing work for a task that is
not running is unnecessary (the PQR_ASSOC MSR is already updated when
the task is scheduled in) and causing system resource waste with the way
in which it is implemented: Work to update the PQR_ASSOC register is
queued every time the user writes a task id to the "tasks" file, even if
the task already belongs to the resource group.

This could result in multiple pending work items associated with a
single task even if they are all identical and even though only a single
update with most recent values is needed. Specifically, even if a task
is moved between different resource groups while it is sleeping then it
is only the last move that is relevant but yet a work item is queued
during each move.

This unnecessary queueing of work items could result in significant
system resource waste, especially on tasks sleeping for a long time.
For example, as demonstrated by Shakeel Butt in [1] writing the same
task id to the "tasks" file can quickly consume significant memory. The
same problem (wasted system resources) occurs when moving a task between
different resource groups.

As pointed out by Valentin Schneider in [2] there is an additional issue
with the way in which the queueing of work is done in that the task_struct
update is currently done after the work is queued, resulting in a race with
the register update possibly done before the data needed by the update is
available.

To solve these issues, update the PQR_ASSOC MSR in a synchronous way
right after the new closid and rmid are ready during the task movement,
only if the task is running. If a moved task is not running nothing
is done since the PQR_ASSOC MSR will be updated next time the task is
scheduled. This is the same way used to update the register when tasks
are moved as part of resource group removal.

[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com

 [ bp: Massage commit message and drop the two update_task_closid_rmid()
   variants. ]

Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243147.git.reinette.chatre@intel.com
2021-01-08 09:03:36 +01:00
Ying-Tsun Huang
cb7f4a8b1f x86/mtrr: Correct the range check before performing MTRR type lookups
In mtrr_type_lookup(), if the input memory address region is not in the
MTRR, over 4GB, and not over the top of memory, a write-back attribute
is returned. These condition checks are for ensuring the input memory
address region is actually mapped to the physical memory.

However, if the end address is just aligned with the top of memory,
the condition check treats the address is over the top of memory, and
write-back attribute is not returned.

And this hits in a real use case with NVDIMM: the nd_pmem module tries
to map NVDIMMs as cacheable memories when NVDIMMs are connected. If a
NVDIMM is the last of the DIMMs, the performance of this NVDIMM becomes
very low since it is aligned with the top of memory and its memory type
is uncached-minus.

Move the input end address change to inclusive up into
mtrr_type_lookup(), before checking for the top of memory in either
mtrr_type_lookup_{variable,fixed}() helpers.

 [ bp: Massage commit message. ]

Fixes: 0cc705f56e ("x86/mm/mtrr: Clean up mtrr_type_lookup()")
Signed-off-by: Ying-Tsun Huang <ying-tsun.huang@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201215070721.4349-1-ying-tsun.huang@amd.com
2021-01-06 13:01:13 +01:00
Adrian Huang
91a8f6cb06 x86/mm: Refine mmap syscall implementation
It is unnecessary to use the local variable 'error' in the mmap syscall
implementation function - just return -EINVAL directly and get rid of
the local variable altogether.

 [ bp: Massage commit message. ]

Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20201217052648.24656-1-adrianhuang0701@gmail.com
2021-01-05 19:07:42 +01:00
Peter Gonda
a8f7e08a81 x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
The IN and OUT instructions with port address as an immediate operand
only use an 8-bit immediate (imm8). The current VC handler uses the
entire 32-bit immediate value but these instructions only set the first
bytes.

Cast the operand to an u8 for that.

 [ bp: Massage commit message. ]

Fixes: 25189d08e5 ("x86/sev-es: Add support for handling IOIO exceptions")
Signed-off-by: Peter Gonda <pgonda@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Link: https://lkml.kernel.org/r/20210105163311.221490-1-pgonda@google.com
2021-01-05 18:55:00 +01:00
Dexuan Cui
dfe94d4086 x86/hyperv: Fix kexec panic/hang issues
Currently the kexec kernel can panic or hang due to 2 causes:

1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
old VP Assist Pages when the kexec kernel runs. The same issue is fixed
for hibernation in commit 421f090c81 ("x86/hyperv: Suspend/resume the
VP assist page for hibernation"). Now fix it for kexec.

2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
can still try to access the hypercall page and cause panic. The workaround
"hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliabe. Move
hyperv_cleanup() to a better place.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20201222065541.24312-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-01-05 17:52:04 +00:00
Masami Hiramatsu
abd82e533d x86/kprobes: Do not decode opcode in resume_execution()
Currently, kprobes decodes the opcode right after single-stepping in
resume_execution(). But the opcode was already decoded while preparing
arch_specific_insn in arch_copy_kprobe().

Decode the opcode in arch_copy_kprobe() instead of in resume_execution()
and set some flags which classify the opcode for the resuming process.

 [ bp: Massage commit message. ]

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/160830072561.349576.3014979564448023213.stgit@devnote2
2021-01-05 15:42:30 +01:00
Borislav Petkov
c769dcd423 x86/microcode: Make microcode_init() static
No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201230122147.26938-1-bp@alien8.de
2020-12-31 11:44:00 +01:00
Heiner Kallweit
4b2d8ca920 x86/reboot: Add Zotac ZBOX CI327 nano PCI reboot quirk
On this system the M.2 PCIe WiFi card isn't detected after reboot, only
after cold boot. reboot=pci fixes this behavior. In [0] the same issue
is described, although on another system and with another Intel WiFi
card. In case it's relevant, both systems have Celeron CPUs.

Add a PCI reboot quirk on affected systems until a more generic fix is
available.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=202399

 [ bp: Massage commit message. ]

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1524eafd-f89c-cfa4-ed70-0bde9e45eec9@gmail.com
2020-12-30 18:38:39 +01:00
Zheng Yongjun
3052636aa9 x86/mtrr: Convert comma to semicolon
Replace a comma between expression statements with a semicolon.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201216131159.14393-1-zhengyongjun3@huawei.com
2020-12-30 08:56:35 +01:00
Linus Torvalds
3913d00ac5 A treewide cleanup of interrupt descriptor (ab)use with all sorts of racy
accesses, inefficient and disfunctional code. The goal is to remove the
 export of irq_to_desc() to prevent these things from creeping up again.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/ifgsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYm6EACAo8sObkuY3oWLagtGj1KHxon53oGZ
 VfDw2LYKM+rgJjDWdiyocyxQU5gtm6loWCrIHjH2adRQ4EisB5r8hfI8NZHxNMyq
 8khUi822NRBfFN6SCpO8eW9o95euscNQwCzqi7gV9/U/BAKoDoSEYzS4y0YmJlup
 mhoikkrFiBuFXplWI0gbP4ihb8S/to2+kTL6o7eBoJY9+fSXIFR3erZ6f3fLjYZG
 CQUUysTywdDhLeDkC9vaesXwgdl2XnaPRwcQqmK8Ez0QYNYpawyILUHLD75cIHDu
 bHdK2ZoDv/wtad/3BoGTK3+wChz20a/4/IAnBIUVgmnSLsPtW8zNEOPWNNc0aGg+
 rtafi5bvJ1lMoSZhkjLWQDOGU6vFaXl9NkC2fpF+dg1skFMT2CyLC8LD/ekmocon
 zHAPBva9j3m2A80hI3dUH9azo/IOl1GHG8ccM6SCxY3S/9vWSQChNhQDLe25xBEO
 VtKZS7DYFCRiL8mIy9GgwZWof8Vy2iMua2ML+W9a3mC9u3CqSLbCFmLMT/dDoXl1
 oHnMdAHk1DRatA8pJAz83C75RxbAS2riGEqtqLEQ6OaNXn6h0oXCanJX9jdKYDBh
 z6ijWayPSRMVktN6FDINsVNFe95N4GwYcGPfagIMqyMMhmJDic6apEzEo7iA76lk
 cko28MDqTIK4UQ==
 =BXv+
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "This is the second attempt after the first one failed miserably and
  got zapped to unblock the rest of the interrupt related patches.

  A treewide cleanup of interrupt descriptor (ab)use with all sorts of
  racy accesses, inefficient and disfunctional code. The goal is to
  remove the export of irq_to_desc() to prevent these things from
  creeping up again"

* tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  genirq: Restrict export of irq_to_desc()
  xen/events: Implement irq distribution
  xen/events: Reduce irq_info:: Spurious_cnt storage size
  xen/events: Only force affinity mask for percpu interrupts
  xen/events: Use immediate affinity setting
  xen/events: Remove disfunct affinity spreading
  xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
  net/mlx5: Use effective interrupt affinity
  net/mlx5: Replace irq_to_desc() abuse
  net/mlx4: Use effective interrupt affinity
  net/mlx4: Replace irq_to_desc() abuse
  PCI: mobiveil: Use irq_data_get_irq_chip_data()
  PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
  NTB/msi: Use irq_has_action()
  mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
  pinctrl: nomadik: Use irq_has_action()
  drm/i915/pmu: Replace open coded kstat_irqs() copy
  drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
  s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
  parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
  ...
2020-12-24 13:50:23 -08:00
Linus Torvalds
4a1106afee EFI updates collected by Ard Biesheuvel:
- Don't move BSS section around pointlessly in the x86 decompressor
  - Refactor helper for discovering the EFI secure boot mode
  - Wire up EFI secure boot to IMA for arm64
  - Some fixes for the capsule loader
  - Expose the RT_PROP table via the EFI test module
  - Relax DT and kernel placement restrictions on ARM
 
 + followup fixes:
 
  - fix the build breakage on IA64 caused by recent capsule loader changes
  - suppress a type mismatch build warning in the expansion of
        EFI_PHYS_ALIGN on ARM
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/kWCMACgkQEsHwGGHe
 VUqVlxAAg3jSS5w5fuaXON2xYZmgKdlRB0fjbklo1ZrWS6sEHrP+gmVmrJSWGZP+
 qFleQ6AxaYK57UiBXxS6Xfn7hHRToqdOAGnaSYzIg1aQIofRoLxvm3YHBMKllb+g
 x73IBS/Hu9/kiH8EVDrJSkBpVdbPwDnw+FeW4ZWUMF9GVmV8oA6Zx23BVSVsbFda
 jat/cEsJQS3GfECJ/Fg5ae+c/2zn5NgbaVtLxVnMnJfAwEpoPz3ogKoANSskdZg3
 z6pA1aMFoHr+lnlzcsM5zdboQlwZRKPHvFpsXPexESBy5dPkYhxFnHqgK4hSZglC
 c3QoO9Gn+KOJl4KAKJWNzCrd3G9kKY5RXkoei4bH9wGMjW2c68WrbFyXgNsO3vYR
 v5CKpq3+jlwGo03GiLJgWQFdgqX0EgTVHHPTcwFpt8qAMi9/JIPSIeTE41p2+AjZ
 cW5F0IlikaR+N8vxc2TDvQTuSsroMiLcocvRWR61oV/48pFlEjqiUjV31myDsASg
 gGkOxZOOz2iBJfK8lCrKp5p9JwGp0M0/GSHTxlYQFy+p4SrcOiPX4wYYdLsWxioK
 AbVhvOClgB3kN7y7TpLvdjND00ciy4nKEC0QZ5p5G59jSLnpSBM/g6av24LsSQwo
 S1HJKhQPbzcI1lhaPjo91HQoOOMZHWLes0SqK4FGNIH+0imHliA=
 =n7Gc
 -----END PGP SIGNATURE-----

Merge tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI updates from Borislav Petkov:
 "These got delayed due to a last minute ia64 build issue which got
  fixed in the meantime.

  EFI updates collected by Ard Biesheuvel:

   - Don't move BSS section around pointlessly in the x86 decompressor

   - Refactor helper for discovering the EFI secure boot mode

   - Wire up EFI secure boot to IMA for arm64

   - Some fixes for the capsule loader

   - Expose the RT_PROP table via the EFI test module

   - Relax DT and kernel placement restrictions on ARM

  with a few followup fixes:

   - fix the build breakage on IA64 caused by recent capsule loader
     changes

   - suppress a type mismatch build warning in the expansion of
     EFI_PHYS_ALIGN on ARM"

* tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: arm: force use of unsigned type for EFI_PHYS_ALIGN
  efi: ia64: disable the capsule loader
  efi: stub: get rid of efi_get_max_fdt_addr()
  efi/efi_test: read RuntimeServicesSupported
  efi: arm: reduce minimum alignment of uncompressed kernel
  efi: capsule: clean scatter-gather entries from the D-cache
  efi: capsule: use atomic kmap for transient sglist mappings
  efi: x86/xen: switch to efi_get_secureboot_mode helper
  arm64/ima: add ima_arch support
  ima: generalize x86/EFI arch glue for other EFI architectures
  efi: generalize efi_get_secureboot
  efi/libstub: EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER should not default to yes
  efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()
  efi/libstub/x86: simplify efi_is_native()
2020-12-24 12:40:07 -08:00
Linus Torvalds
1375b9803e Merge branch 'akpm' (patches from Andrew)
Merge KASAN updates from Andrew Morton.

This adds a new hardware tag-based mode to KASAN.  The new mode is
similar to the existing software tag-based KASAN, but relies on arm64
Memory Tagging Extension (MTE) to perform memory and pointer tagging
(instead of shadow memory and compiler instrumentation).

By Andrey Konovalov and Vincenzo Frascino.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (60 commits)
  kasan: update documentation
  kasan, mm: allow cache merging with no metadata
  kasan: sanitize objects when metadata doesn't fit
  kasan: clarify comment in __kasan_kfree_large
  kasan: simplify assign_tag and set_tag calls
  kasan: don't round_up too much
  kasan, mm: rename kasan_poison_kfree
  kasan, mm: check kasan_enabled in annotations
  kasan: add and integrate kasan boot parameters
  kasan: inline (un)poison_range and check_invalid_free
  kasan: open-code kasan_unpoison_slab
  kasan: inline random_tag for HW_TAGS
  kasan: inline kasan_reset_tag for tag-based modes
  kasan: remove __kasan_unpoison_stack
  kasan: allow VMAP_STACK for HW_TAGS mode
  kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK
  kasan: introduce set_alloc_info
  kasan: rename get_alloc/free_info
  kasan: simplify quarantine_put call site
  kselftest/arm64: check GCR_EL1 after context switch
  ...
2020-12-22 13:38:17 -08:00
Andi Kleen
e14fd4ba8f x86/split-lock: Avoid returning with interrupts enabled
When a split lock is detected always make sure to disable interrupts
before returning from the trap handler.

The kernel exit code assumes that all exits run with interrupts
disabled, otherwise the SWAPGS sequence can race against interrupts and
cause recursing page faults and later panics.

The problem will only happen on CPUs with split lock disable
functionality, so Icelake Server, Tiger Lake, Snow Ridge, Jacobsville.

Fixes: ca4c6a9858 ("x86/traps: Make interrupt enable/disable symmetric in C code")
Fixes: bce9b042ec ("x86/traps: Disable interrupts in exc_aligment_check()") # v5.8+
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-22 13:06:10 -08:00
Andrey Konovalov
d56a9ef84b kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK
There's a config option CONFIG_KASAN_STACK that has to be enabled for
KASAN to use stack instrumentation and perform validity checks for
stack variables.

There's no need to unpoison stack when CONFIG_KASAN_STACK is not enabled.
Only call kasan_unpoison_task_stack[_below]() when CONFIG_KASAN_STACK is
enabled.

Note, that CONFIG_KASAN_STACK is an option that is currently always
defined when CONFIG_KASAN is enabled, and therefore has to be tested
with #if instead of #ifdef.

Link: https://lkml.kernel.org/r/d09dd3f8abb388da397fd11598c5edeaa83fe559.1606162397.git.andreyknvl@google.com
Link: https://linux-review.googlesource.com/id/If8a891e9fe01ea543e00b576852685afec0887e3
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-22 12:55:08 -08:00
Linus Torvalds
6a447b0e31 ARM:
* PSCI relay at EL2 when "protected KVM" is enabled
 * New exception injection code
 * Simplification of AArch32 system register handling
 * Fix PMU accesses when no PMU is enabled
 * Expose CSV3 on non-Meltdown hosts
 * Cache hierarchy discovery fixes
 * PV steal-time cleanups
 * Allow function pointers at EL2
 * Various host EL2 entry cleanups
 * Simplification of the EL2 vector allocation
 
 s390:
 * memcg accouting for s390 specific parts of kvm and gmap
 * selftest for diag318
 * new kvm_stat for when async_pf falls back to sync
 
 x86:
 * Tracepoints for the new pagetable code from 5.10
 * Catch VFIO and KVM irqfd events before userspace
 * Reporting dirty pages to userspace with a ring buffer
 * SEV-ES host support
 * Nested VMX support for wait-for-SIPI activity state
 * New feature flag (AVX512 FP16)
 * New system ioctl to report Hyper-V-compatible paravirtualization features
 
 Generic:
 * Selftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
 lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
 BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
 XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
 bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
 drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
 =ISud
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Much x86 work was pushed out to 5.12, but ARM more than made up for it.

  ARM:
   - PSCI relay at EL2 when "protected KVM" is enabled
   - New exception injection code
   - Simplification of AArch32 system register handling
   - Fix PMU accesses when no PMU is enabled
   - Expose CSV3 on non-Meltdown hosts
   - Cache hierarchy discovery fixes
   - PV steal-time cleanups
   - Allow function pointers at EL2
   - Various host EL2 entry cleanups
   - Simplification of the EL2 vector allocation

  s390:
   - memcg accouting for s390 specific parts of kvm and gmap
   - selftest for diag318
   - new kvm_stat for when async_pf falls back to sync

  x86:
   - Tracepoints for the new pagetable code from 5.10
   - Catch VFIO and KVM irqfd events before userspace
   - Reporting dirty pages to userspace with a ring buffer
   - SEV-ES host support
   - Nested VMX support for wait-for-SIPI activity state
   - New feature flag (AVX512 FP16)
   - New system ioctl to report Hyper-V-compatible paravirtualization features

  Generic:
   - Selftest improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
  KVM: SVM: fix 32-bit compilation
  KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
  KVM: SVM: Provide support to launch and run an SEV-ES guest
  KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
  KVM: SVM: Provide support for SEV-ES vCPU loading
  KVM: SVM: Provide support for SEV-ES vCPU creation/loading
  KVM: SVM: Update ASID allocation to support SEV-ES guests
  KVM: SVM: Set the encryption mask for the SVM host save area
  KVM: SVM: Add NMI support for an SEV-ES guest
  KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
  KVM: SVM: Do not report support for SMM for an SEV-ES guest
  KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
  KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
  KVM: SVM: Add support for EFER write traps for an SEV-ES guest
  KVM: SVM: Support string IO operations for an SEV-ES guest
  KVM: SVM: Support MMIO for an SEV-ES guest
  KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
  KVM: SVM: Create trace events for VMGEXIT processing
  ...
2020-12-20 10:44:05 -08:00
Linus Torvalds
09c0796adf Tracing updates for 5.11
The major update to this release is that there's a new arch config option called:
 CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS. Currently, only x86_64 enables it.
 All the ftrace callbacks now take a struct ftrace_regs instead of a struct
 pt_regs. If the architecture has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then
 the ftrace_regs will have enough information to read the arguments of the
 function being traced, as well as access to the stack pointer. This way, if
 a user (like live kernel patching) only cares about the arguments, then it
 can avoid using the heavier weight "regs" callback, that puts in enough
 information in the struct ftrace_regs to simulate a breakpoint exception
 (needed for kprobes).
 
 New config option that audits the timestamps of the ftrace ring buffer at
 most every event recorded.  The "check_buffer()" calls will conflict with
 mainline, because I purposely added the check without including the fix that
 it caught, which is in mainline. Running a kernel built from the commit of
 the added check will trigger it.
 
 Ftrace recursion protection has been cleaned up to move the protection to
 the callback itself (this saves on an extra function call for those
 callbacks).
 
 Perf now handles its own RCU protection and does not depend on ftrace to do
 it for it (saving on that extra function call).
 
 New debug option to add "recursed_functions" file to tracefs that lists all
 the places that triggered the recursion protection of the function tracer.
 This will show where things need to be fixed as recursion slows down the
 function tracer.
 
 The eval enum mapping updates done at boot up are now offloaded to a work
 queue, as it caused a noticeable pause on slow embedded boards.
 
 Various clean ups and last minute fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX9uq8xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtrwAQCHevqWMjKc1Q76bnCgwB0AbFKB6vqy
 5b6g/co5+ihv8wD/eJPWlZMAt97zTVW7bdp5qj/GTiCDbAsODMZ597LsxA0=
 =rZEz
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The major update to this release is that there's a new arch config
  option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.

  Currently, only x86_64 enables it. All the ftrace callbacks now take a
  struct ftrace_regs instead of a struct pt_regs. If the architecture
  has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
  have enough information to read the arguments of the function being
  traced, as well as access to the stack pointer.

  This way, if a user (like live kernel patching) only cares about the
  arguments, then it can avoid using the heavier weight "regs" callback,
  that puts in enough information in the struct ftrace_regs to simulate
  a breakpoint exception (needed for kprobes).

  A new config option that audits the timestamps of the ftrace ring
  buffer at most every event recorded.

  Ftrace recursion protection has been cleaned up to move the protection
  to the callback itself (this saves on an extra function call for those
  callbacks).

  Perf now handles its own RCU protection and does not depend on ftrace
  to do it for it (saving on that extra function call).

  New debug option to add "recursed_functions" file to tracefs that
  lists all the places that triggered the recursion protection of the
  function tracer. This will show where things need to be fixed as
  recursion slows down the function tracer.

  The eval enum mapping updates done at boot up are now offloaded to a
  work queue, as it caused a noticeable pause on slow embedded boards.

  Various clean ups and last minute fixes"

* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
  tracing: Offload eval map updates to a work queue
  Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
  ring-buffer: Add rb_check_bpage in __rb_allocate_pages
  ring-buffer: Fix two typos in comments
  tracing: Drop unneeded assignment in ring_buffer_resize()
  tracing: Disable ftrace selftests when any tracer is running
  seq_buf: Avoid type mismatch for seq_buf_init
  ring-buffer: Fix a typo in function description
  ring-buffer: Remove obsolete rb_event_is_commit()
  ring-buffer: Add test to validate the time stamp deltas
  ftrace/documentation: Fix RST C code blocks
  tracing: Clean up after filter logic rewriting
  tracing: Remove the useless value assignment in test_create_synth_event()
  livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
  ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
  ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
  MAINTAINERS: assign ./fs/tracefs to TRACING
  tracing: Fix some typos in comments
  ftrace: Remove unused varible 'ret'
  ring-buffer: Add recording of ring buffer recursion into recursed_functions
  ...
2020-12-17 13:22:17 -08:00
Linus Torvalds
007c74e16c Merge branch 'stable/for-linus-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb update from Konrad Rzeszutek Wilk:
 "A generic (but for right now engaged only with AMD SEV) mechanism to
  adjust a larger size SWIOTLB based on the total memory of the SEV
  guests which right now require the bounce buffer for interacting with
  the outside world.

  Normal knobs (swiotlb=XYZ) still work"

* 'stable/for-linus-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests
2020-12-16 13:51:34 -08:00
Linus Torvalds
ac73e3dc8a Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few random little subsystems

 - almost all of the MM patches which are staged ahead of linux-next
   material. I'll trickle to post-linux-next work in as the dependents
   get merged up.

Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
  mm: cleanup kstrto*() usage
  mm: fix fall-through warnings for Clang
  mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
  mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
  mm:backing-dev: use sysfs_emit in macro defining functions
  mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
  mm: use sysfs_emit for struct kobject * uses
  mm: fix kernel-doc markups
  zram: break the strict dependency from lzo
  zram: add stat to gather incompressible pages since zram set up
  zram: support page writeback
  mm/process_vm_access: remove redundant initialization of iov_r
  mm/zsmalloc.c: rework the list_add code in insert_zspage()
  mm/zswap: move to use crypto_acomp API for hardware acceleration
  mm/zswap: fix passing zero to 'PTR_ERR' warning
  mm/zswap: make struct kernel_param_ops definitions const
  userfaultfd/selftests: hint the test runner on required privilege
  userfaultfd/selftests: fix retval check for userfaultfd_open()
  userfaultfd/selftests: always dump something in modes
  userfaultfd: selftests: make __{s,u}64 format specifiers portable
  ...
2020-12-15 12:53:37 -08:00
Dmitry Safonov
cd544fd1dc mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio
As kernel expect to see only one of such mappings, any further operations
on the VMA-copy may be unexpected by the kernel.  Maybe it's being on the
safe side, but there doesn't seem to be any expected use-case for this, so
restrict it now.

Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
Fixes: commit e346b38130 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:41 -08:00
Jason Gunthorpe
57efa1fe59 mm/gup: prevent gup_fast from racing with COW during fork
Since commit 70e806e4e6 ("mm: Do early cow for pinned pages during
fork() for ptes") pages under a FOLL_PIN will not be write protected
during COW for fork.  This means that pages returned from
pin_user_pages(FOLL_WRITE) should not become write protected while the pin
is active.

However, there is a small race where get_user_pages_fast(FOLL_PIN) can
establish a FOLL_PIN at the same time copy_present_page() is write
protecting it:

        CPU 0                             CPU 1
   get_user_pages_fast()
    internal_get_user_pages_fast()
                                       copy_page_range()
                                         pte_alloc_map_lock()
                                           copy_present_page()
                                             atomic_read(has_pinned) == 0
					     page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                             pte = pte_wrprotect(pte)
	                                     set_pte_at();
                                         pte_unmap_unlock()
      // GUP now returns with a write protected page

The first attempt to resolve this by using the write protect caused
problems (and was missing a barrrier), see commit f3c64eda3e ("mm: avoid
early COW write protect games during fork()")

Instead wrap copy_p4d_range() with the write side of a seqcount and check
the read side around gup_pgd_range().  If there is a collision then
get_user_pages_fast() fails and falls back to slow GUP.

Slow GUP is safe against this race because copy_page_range() is only
called while holding the exclusive side of the mmap_lock on the src
mm_struct.

[akpm@linux-foundation.org: coding style fixes]
  Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
Fixes: f3c64eda3e ("mm: avoid early COW write protect games during fork()")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de>	[seqcount_t parts]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:39 -08:00
Paolo Bonzini
722e039d9a KVM/arm64 updates for Linux 5.11
- PSCI relay at EL2 when "protected KVM" is enabled
 - New exception injection code
 - Simplification of AArch32 system register handling
 - Fix PMU accesses when no PMU is enabled
 - Expose CSV3 on non-Meltdown hosts
 - Cache hierarchy discovery fixes
 - PV steal-time cleanups
 - Allow function pointers at EL2
 - Various host EL2 entry cleanups
 - Simplification of the EL2 vector allocation
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl/XoggPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDsRYP/3ZtGWsyBc1sKdaTBIwQdnrPQHL+7o1Mmjnl
 b+YqRMWcJW4g3O81GW6IA+vM0A1UMJxVOjzkZd8KulGv3RCZiqQmWJClWFlYbwLj
 e+HHx+Zo/qsmDrwcVoFI8/n+iC/a5fIaCbSWMSPaKHrOMxBiHQk0qlaq4AZ8gb7a
 /eHYqI/hISJQb1ZVFHmwlp8FoMnB2M6/FDpCf8oeGKjpF2hjghIPugJ0oRlPLZjB
 o3Q6ELEScJV1wBy7d1+5rkm52t9j8gpGhXxja0QwypADNzk5KHEzghXq+rTWUh1S
 et9OfqkflMtKMsh0qNwe5ZFbqtsH69qtYMAj4ok7rZOwQcbJ97VSrP5ka7VVzSdC
 AgcQU9c9LoyQ7rk0dbs3t0cd8hMgVu50guZ/iHfW88CcdykN9M0nnSPRAYpNbW85
 xndBQ5k/a4FoufwoY4e0hS28HIiRfLoEA68mps+yoMiiKh27HO2v4GFRIJoCNxzp
 YQ01zOBp9FKYTsxj0h7mMf+5EEyo9E4X/kJOfZpOVVbVKy82wPAGLJpDEnbnoJUe
 j1jBmiV/trkn+nTnWmDoXcw2ljuIF9dBm2M8r8yGKdNEHptnN8tMVRlCRImVVWW0
 BbZGAzoK0tpKXPIlUh4aXS3mtV9qlohs9rzjVyKfGnaRRbRGANM8qrH5aKuDFinM
 RugpMWyk
 =hf4L
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for Linux 5.11

- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
2020-12-15 12:48:24 -05:00
Thomas Gleixner
a313357e70 genirq: Move irq_has_action() into core code
This function uses irq_to_desc() and is going to be used by modules to
replace the open coded irq_to_desc() (ab)usage. The final goal is to remove
the export of irq_to_desc() so driver cannot fiddle with it anymore.

Move it into the core code and fixup the usage sites to include the proper
header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201210194042.548936472@linutronix.de
2020-12-15 16:19:30 +01:00
Linus Torvalds
148842c98a Yet another large set of x86 interrupt management updates:
- Simplification and distangling of the MSI related functionality
 
    - Let IO/APIC construct the RTE entries from an MSI message instead of
      having IO/APIC specific code in the interrupt remapping drivers
 
    - Make the retrieval of the parent interrupt domain (vector or remap
      unit) less hardcoded and use the relevant irqdomain callbacks for
      selection.
 
    - Allow the handling of more than 255 CPUs without a virtualized IOMMU
      when the hypervisor supports it. This has made been possible by the
      above modifications and also simplifies the existing workaround in the
      HyperV specific virtual IOMMU.
 
    - Cleanup of the historical timer_works() irq flags related
      inconsistencies.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxd8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYpOD/9C5TppNlPMUyx2SflH6bxt37pJEpln
 +hYTKsk+jSThntr5mfj+GifGvgmHOVBTGnlDUnUnrpN7TQmLFBzwTOtnBLW53AO2
 16/u0+Xci4LNCtEkaymf0Rq4MfsfriXHPJr0A/CnZ0tpHSf5QKHAiitSiGujdMlb
 gbq43+zXd+jNkH7vkOLPX/7dZVI1hNASQEevJu2tRR4xYTuXFdBxvLgYkHtYKKrK
 R1sbs6nI6yIzye2u4m4xGu29SxgUft+zdUf+UehJKM3yFmf51d9qpkX+kLaTWuaL
 VPsMItbn0kdvxwXQWO6DYnIAAnVKCklyHQJTZCoNq9Fe91OoByak1CEVspSOa1av
 JmycNSch4IYWasR4vVCB1gbb+V9SejcKu5SV3CDrEDqwkOIpfiqpriUXSCJTLlFd
 QOEDOLuuk/79Qs//J/tb/nJ4IuKv8WPudDfIlMro8wUsAr67DjD4mnXprZ+svwWx
 Ct/0/Memk+BSa0cw6pvg24BUZGN6zrufkBu2HKT9GOXRUdNkdLkiPhT8mK4T/O0l
 f90QCLjPSOJ/K/pLEWdUHEPmgC5Q9RsXOmwVGqX+RbjfP7mYTJXlmWnBb+cFNch0
 xFIH3SxVGylxxT06NX3SkvinrHj10CoAlmneefBlLtx6dF+2P84DAMZSF0OFToVI
 c2KMg5zoesI4bg==
 =8Gfs
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic updates from Thomas Gleixner:
 "Yet another large set of x86 interrupt management updates:

   - Simplification and distangling of the MSI related functionality

   - Let IO/APIC construct the RTE entries from an MSI message instead
     of having IO/APIC specific code in the interrupt remapping drivers

   - Make the retrieval of the parent interrupt domain (vector or remap
     unit) less hardcoded and use the relevant irqdomain callbacks for
     selection.

   - Allow the handling of more than 255 CPUs without a virtualized
     IOMMU when the hypervisor supports it. This has made been possible
     by the above modifications and also simplifies the existing
     workaround in the HyperV specific virtual IOMMU.

   - Cleanup of the historical timer_works() irq flags related
     inconsistencies"

* tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  x86/ioapic: Cleanup the timer_works() irqflags mess
  iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
  iommu/amd: Fix IOMMU interrupt generation in X2APIC mode
  iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled
  iommu/amd: Fix union of bitfields in intcapxt support
  x86/ioapic: Correct the PCI/ISA trigger type selection
  x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
  x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
  x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
  iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available
  x86/apic: Support 15 bits of APIC ID in MSI where available
  x86/ioapic: Handle Extended Destination ID field in RTE
  iommu/vt-d: Simplify intel_irq_remapping_select()
  x86: Kill all traces of irq_remapping_get_irq_domain()
  x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
  x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
  iommu/hyper-v: Implement select() method on remapping irqdomain
  iommu/vt-d: Implement select() method on remapping irqdomain
  iommu/amd: Implement select() method on remapping irqdomain
  x86/apic: Add select() method on vector irqdomain
  ...
2020-12-14 18:59:53 -08:00
Linus Torvalds
edd7ab7684 The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic implementation
     which builds the base for the kmap_local() API and make the
     kmap_atomic() interface wrappers which handle the disabling/enabling of
     preemption and pagefaults.
 
   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.
 
   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a mapping
     is established. It has to disable migration instead to guarantee that
     the virtual address of the mapped slot is the same accross preemption.
 
   - Provide better debug facilities: guard pages and enforced utilization
     of the mapping mechanics on 64bit systems when the architecture allows
     it.
 
   - Provide the new kmap_local() API which can now be used to cleanup the
     kmap_atomic() usage sites all over the place. Most of the usage sites
     do not require the implicit disabling of preemption and pagefaults so
     the penalty on 64bit and 32bit non-highmem systems is removed and quite
     some of the code can be simplified. A wholesale conversion is not
     possible because some usage depends on the implicit side effects and
     some need to be cleaned up because they work around these side effects.
 
     The migrate disable side effect is only effective on highmem systems
     and when enforced debugging is enabled. On 64bit and 32bit non-highmem
     systems the overhead is completely avoided.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
 0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
 4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
 p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
 RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
 +x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
 ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
 Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
 XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
 FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
 HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
 +jlfoJhMbtx5Gg==
 =n71I
 -----END PGP SIGNATURE-----

Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull kmap updates from Thomas Gleixner:
 "The new preemtible kmap_local() implementation:

   - Consolidate all kmap_atomic() internals into a generic
     implementation which builds the base for the kmap_local() API and
     make the kmap_atomic() interface wrappers which handle the
     disabling/enabling of preemption and pagefaults.

   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.

   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a
     mapping is established. It has to disable migration instead to
     guarantee that the virtual address of the mapped slot is the same
     across preemption.

   - Provide better debug facilities: guard pages and enforced
     utilization of the mapping mechanics on 64bit systems when the
     architecture allows it.

   - Provide the new kmap_local() API which can now be used to cleanup
     the kmap_atomic() usage sites all over the place. Most of the usage
     sites do not require the implicit disabling of preemption and
     pagefaults so the penalty on 64bit and 32bit non-highmem systems is
     removed and quite some of the code can be simplified. A wholesale
     conversion is not possible because some usage depends on the
     implicit side effects and some need to be cleaned up because they
     work around these side effects.

     The migrate disable side effect is only effective on highmem
     systems and when enforced debugging is enabled. On 64bit and 32bit
     non-highmem systems the overhead is completely avoided"

* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  ARM: highmem: Fix cache_is_vivt() reference
  x86/crashdump/32: Simplify copy_oldmem_page()
  io-mapping: Provide iomap_local variant
  mm/highmem: Provide kmap_local*
  sched: highmem: Store local kmaps in task struct
  x86: Support kmap_local() forced debugging
  mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
  mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
  microblaze/mm/highmem: Add dropped #ifdef back
  xtensa/mm/highmem: Make generic kmap_atomic() work correctly
  mm/highmem: Take kmap_high_get() properly into account
  highmem: High implementation details and document API
  Documentation/io-mapping: Remove outdated blurb
  io-mapping: Cleanup atomic iomap
  mm/highmem: Remove the old kmap_atomic cruft
  highmem: Get rid of kmap_types.h
  xtensa/mm/highmem: Switch to generic kmap atomic
  sparc/mm/highmem: Switch to generic kmap atomic
  powerpc/mm/highmem: Switch to generic kmap atomic
  nds32/mm/highmem: Switch to generic kmap atomic
  ...
2020-12-14 18:35:53 -08:00
Linus Torvalds
adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Linus Torvalds
8a8ca83ec3 Perf updates:
Core:
 
    - Better handling of page table leaves on archictectures which have
      architectures have non-pagetable aligned huge/large pages.  For such
      architectures a leaf can actually be part of a larger entry.
 
    - Prevent a deadlock vs. exec_update_mutex
 
  Architectures:
 
    - The related updates for page size calculation of leaf entries
 
    - The usual churn to support new CPUs
 
    - Small fixes and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XvgATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUrdEACatdr93wv75vnm5tCZM4EsFvB2PzVJ
 ck4K4+hHiMVV4802qf+kW5plF+rckAU4TAai/L7wkTntKHvjD/0/o1epoIStb+dS
 SCpVkQMCLT/8xT242iHPOfgsQpVpJnIiBwVRjn8HXu82nXdgMJhKnBjTe634UfxW
 o2OCFiyJzpRi5l86gVp67ueqgvl34NPI2JaSLc0g80QfZ8akzdePPpED35CzYjZh
 41k+7ssvt6qch3vMUySHAhkX4gQl0nc80YAaF/XZbCfvdyY7D03PtfBjfvphTSK0
 l54z9aWh0ciK9P1aPfvkHDXBJUR2VtUAx2GiURK+XU3jNk3KMrz9CcBl1D/exIAg
 07IsiYVoB38YAUOZoR9K8p+p+5EuwYRRUMAgfQfBALCuaLQV477Cne82b2KmNCus
 1izUQvcDDf0s74OyYTHWFXRGla95COJvNLzkrZ1oU3mX4HgdKdOAUbf/2XTLWeKO
 3HOIS+jsg5cp82tRe4X5r51h73pONYlo9lLo/CjQXz25vMcXKtE/MZGq2gkRff4p
 N4k88eQ5LOsRqUaU46GcHozXRCfcpW7SPI9AaN5I/fKGIZvHP7uMdMb+g5DV8yHI
 dNZ8u5uLPHwdg80C3fJ3Pnp7VsVNHliPXMwv0vib7BCp7aUVZWeFnOntw3PdYFRk
 XKEbfl36IuAadg==
 =rZ99
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Thomas Gleixner:
 "Core:

   - Better handling of page table leaves on archictectures which have
     architectures have non-pagetable aligned huge/large pages. For such
     architectures a leaf can actually be part of a larger entry.

   - Prevent a deadlock vs exec_update_mutex

  Architectures:

   - The related updates for page size calculation of leaf entries

   - The usual churn to support new CPUs

   - Small fixes and improvements all over the place"

* tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  perf/x86/intel: Add Tremont Topdown support
  uprobes/x86: Fix fall-through warnings for Clang
  perf/x86: Fix fall-through warnings for Clang
  kprobes/x86: Fix fall-through warnings for Clang
  perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
  perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
  x86/kprobes: Restore BTF if the single-stepping is cancelled
  perf: Break deadlock involving exec_update_mutex
  sparc64/mm: Implement pXX_leaf_size() support
  powerpc/8xx: Implement pXX_leaf_size() support
  arm64/mm: Implement pXX_leaf_size() support
  perf/core: Fix arch_perf_get_page_size()
  mm: Introduce pXX_leaf_size()
  mm/gup: Provide gup_get_pte() more generic
  perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
  perf/x86/intel/uncore: Add Rocket Lake support
  perf/x86/msr: Add Rocket Lake CPU support
  perf/x86/cstate: Add Rocket Lake CPU support
  perf/x86/intel: Add Rocket Lake CPU support
  perf,mm: Handle non-page-table-aligned hugetlbfs
  ...
2020-12-14 17:34:12 -08:00
Linus Torvalds
8c1dccc803 RCU, LKMM and KCSAN updates collected by Paul McKenney:
RCU:
 
     - Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs.
 
     - Lockdep-RCU updates reducing the need for __maybe_unused.
 
     - Tasks-RCU updates.
 
     - Miscellaneous fixes.
 
     - Documentation updates.
 
     - Torture-test updates.
 
   KCSAN:
 
     - updates for selftests, avoiding setting watchpoints on NULL pointers
 
     - fix to watchpoint encoding
 
   LKMM:
 
     - updates for documentation along with some updates to example-code
       litmus tests
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xon4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobXUD/92LJTI/TMgK6Z6EEQBiJZO/2mNKjK8
 FEKc6AqTNMlZNsWCfQ5UgqtHpn+MkBZsX1x4u22gehE1qaCB8gnQ5wXgbXon8tQm
 exxVk6vvQZjseeqCMqrsUYQlD7dNgHnf1qAmWXJvji4sA/1Opo6n2M74tqfE2ueV
 S5hpQwSuK/6Zu2Hrr62HD8+Fx0in6ZuKRZxHGp1392l++DGbniJM3dzntRXB+JbZ
 w3PDHFCQuGzTytyeKuQV48ot9IK+2YzmjIp/+4tHL6mvU38xeSu6gcYtqKPcfYWw
 D6HXvDa965h5IrFdSA2JWSzjJ+VYgZVElk2HyXDNIae0fM/8GidgoIDQipT1WAur
 sxW/Ke4U6Jm5MMqXqV8iMNduktkGD1/h6G/iB1Yis29xFdthorNpbHVAP+8cKXgf
 1cR6RorOuBYv6XpyzygHtE7qfLY5ST352pJ4+UqNzboujOcuEnGaygttt0F/F8sA
 ZH8NT5dyUfbGeqepdZWkbj116Hjeg3fyV3CZeyBhDeqpjf1Nn3nbJ1xRksPLfa3i
 IKvN7HSzEg+vKnsJNnQeFlAmQ/W3n2bedzRqfaCg77pNhKI6jPuavY5f2YGFUj0y
 yx0UzOYoI1Cln0keBMmynbyUKgJ7zstLkrt/JenjhtD3B+0df5BmYjkL+nqkP6ax
 +XTCu7Xg+B061g==
 =N/iO
 -----END PGP SIGNATURE-----

Merge tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Thomas Gleixner:
 "RCU, LKMM and KCSAN updates collected by Paul McKenney.

  RCU:
   - Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs

   - Lockdep-RCU updates reducing the need for __maybe_unused

   - Tasks-RCU updates

   - Miscellaneous fixes

   - Documentation updates

   - Torture-test updates

  KCSAN:
   - updates for selftests, avoiding setting watchpoints on NULL pointers

   - fix to watchpoint encoding

  LKMM:
   - updates for documentation along with some updates to example-code
     litmus tests"

* tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  srcu: Take early exit on memory-allocation failure
  rcu/tree: Defer kvfree_rcu() allocation to a clean context
  rcu: Do not report strict GPs for outgoing CPUs
  rcu: Fix a typo in rcu_blocking_is_gp() header comment
  rcu: Prevent lockdep-RCU splats on lock acquisition/release
  rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
  rcu,ftrace: Fix ftrace recursion
  rcu/tree: Make struct kernel_param_ops definitions const
  rcu/tree: Add a warning if CPU being onlined did not report QS already
  rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
  rcu: Fix single-CPU check in rcu_blocking_is_gp()
  rcu: Implement rcu_segcblist_is_offloaded() config dependent
  list.h: Update comment to explicitly note circular lists
  rcu: Panic after fixed number of stalls
  x86/smpboot:  Move rcu_cpu_starting() earlier
  rcu: Allow rcu_irq_enter_check_tick() from NMI
  tools/memory-model: Label MP tests' producers and consumers
  tools/memory-model: Use "buf" and "flag" for message-passing tests
  tools/memory-model: Add types to litmus tests
  tools/memory-model: Add a glossary of LKMM terms
  ...
2020-12-14 17:21:16 -08:00
Linus Torvalds
1ac0884d54 A set of updates for entry/exit handling:
- More generalization of entry/exit functionality
 
  - The consolidation work to reclaim TIF flags on x86 and also for non-x86
    specific TIF flags which are solely relevant for syscall related work
    and have been moved into their own storage space. The x86 specific part
    had to be merged in to avoid a major conflict.
 
  - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
    delivery mode of task work and results in an impressive performance
    improvement for io_uring. The non-x86 consolidation of this is going to
    come seperate via Jens.
 
  - The selective syscall redirection facility which provides a clean and
    efficient way to support the non-Linux syscalls of WINE by catching them
    at syscall entry and redirecting them to the user space emulation. This
    can be utilized for other purposes as well and has been designed
    carefully to avoid overhead for the regular fastpath. This includes the
    core changes and the x86 support code.
 
  - Simplification of the context tracking entry/exit handling for the users
    of the generic entry code which guarantee the proper ordering and
    protection.
 
  - Preparatory changes to make the generic entry code accomodate S390
    specific requirements which are mostly related to their syscall restart
    mechanism.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XoPoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoe0tD/4jSKHIogVM9kVpiYfwjDGS1NluaBXn
 71ZoASbX9GZebyGandMyF2QP1iJ24ZO0RztBwHEVH6fyomKB2iFNedssCpO9yfWV
 3eFRpOvMpbszY2W2bd0QG3GrqaTttjVfB4ahkGLzqeSbchdob6hZpNDYtBZnujA6
 GSnrrurfJkCGoQny+yJQYdQJXQU+BIX90B2a2Q+jW123Luy/iHXC1f/krZSA1m14
 fC9xYLSUjPphTzh2ZOW+C3DgdjOL5PfAm/6F+DArt4GtLgrEGD7R74aLSFhvetky
 dn5QtG+yAsz1i0cc5Wu/JBcT9tOkY92rPYSyLI9bYQUSQ/bMyuprz6oYKj3dubsu
 ZSsKPdkNFPIniL4fLdCMWZcIXX5xgnrxKjdgXZXW3gtrcxSns8w8uED3Sh7dgE08
 pgIeq67E5g/OB8kJXH1VxdewmeQb9cOmnzzHwNO7TrrGbBKjDTYHNdYOKf1dUTTK
 ZX1UjLfGwxTkMYAbQD1k0JGZ2OLRshzSaH5BW/ZKa3bvJW6yYOq+/YT8B8hbJ8U3
 vThlO75/55IJxS5r5Y3vZd/IHdsYbPuETD+TA8tNYtPqNZasW8nnk4TYctWqzDuO
 /Ka1wvWYid3c6ySznQn4zSyRjr968AfHeZ9YTUMhWufy5waXVmdBMG41u3IKfsVt
 osyzNc4EK19/Mg==
 =hsjV
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry/exit updates from Thomas Gleixner:
 "A set of updates for entry/exit handling:

   - More generalization of entry/exit functionality

   - The consolidation work to reclaim TIF flags on x86 and also for
     non-x86 specific TIF flags which are solely relevant for syscall
     related work and have been moved into their own storage space. The
     x86 specific part had to be merged in to avoid a major conflict.

   - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
     delivery mode of task work and results in an impressive performance
     improvement for io_uring. The non-x86 consolidation of this is
     going to come seperate via Jens.

   - The selective syscall redirection facility which provides a clean
     and efficient way to support the non-Linux syscalls of WINE by
     catching them at syscall entry and redirecting them to the user
     space emulation. This can be utilized for other purposes as well
     and has been designed carefully to avoid overhead for the regular
     fastpath. This includes the core changes and the x86 support code.

   - Simplification of the context tracking entry/exit handling for the
     users of the generic entry code which guarantee the proper ordering
     and protection.

   - Preparatory changes to make the generic entry code accomodate S390
     specific requirements which are mostly related to their syscall
     restart mechanism"

* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  entry: Add syscall_exit_to_user_mode_work()
  entry: Add exit_to_user_mode() wrapper
  entry_Add_enter_from_user_mode_wrapper
  entry: Rename exit_to_user_mode()
  entry: Rename enter_from_user_mode()
  docs: Document Syscall User Dispatch
  selftests: Add benchmark for syscall user dispatch
  selftests: Add kselftest for syscall user dispatch
  entry: Support Syscall User Dispatch on common syscall entry
  kernel: Implement selective syscall userspace redirection
  signal: Expose SYS_USER_DISPATCH si_code type
  x86: vdso: Expose sigreturn address on vdso to the kernel
  MAINTAINERS: Add entry for common entry code
  entry: Fix boot for !CONFIG_GENERIC_ENTRY
  x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
  sched: Detect call to schedule from critical entry code
  context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
  x86: Reclaim unused x86 TI flags
  ...
2020-12-14 17:13:53 -08:00
Linus Torvalds
0ca2ce81eb arm64 updates for 5.11:
- Expose tag address bits in siginfo. The original arm64 ABI did not
   expose any of the bits 63:56 of a tagged address in siginfo. In the
   presence of user ASAN or MTE, this information may be useful. The
   implementation is generic to other architectures supporting tags (like
   SPARC ADI, subject to wiring up the arch code). The user will have to
   opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra bits, if
   available, become visible in si_addr.
 
 - Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
   lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
   detriment of other platforms. With these changes, the kernel scans the
   Device Tree dma-ranges and the ACPI IORT information before deciding
   on a smaller ZONE_DMA.
 
 - Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
   with LTO, there is an increased risk of the compiler converting an
   address dependency headed by a READ_ONCE() invocation into a control
   dependency and consequently allowing for harmful reordering by the
   CPU.
 
 - Add CPPC FFH support using arm64 AMU counters.
 
 - set_fs() removal on arm64. This renders the User Access Override (UAO)
   ARMv8 feature unnecessary.
 
 - Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
   identifier file for SMMUv3, stop event counters support for i.MX8MP,
   enable the perf events-based hard lockup detector.
 
 - Reorganise the kernel VA space slightly so that 52-bit VA
   configurations can use more virtual address space.
 
 - Improve the robustness of the arm64 memory offline event notifier.
 
 - Pad the Image header to 64K following the EFI header definition
   updated recently to increase the section alignment to 64K.
 
 - Support CONFIG_CMDLINE_EXTEND on arm64.
 
 - Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
   bits for PtrAuth.
 
 - Switch to vmapped shadow call stacks.
 
 - Miscellaneous clean-ups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl/XcSgACgkQa9axLQDI
 XvGkwg//SLknimELD/cphf2UzZm5RFuCU0x1UnIXs9XYo5BrOpgVLLA//+XkCrKN
 0GLAdtBDfw1axWJudzgMBiHrv6wSGh4p3YWjLIW06u/PJu3m3U8oiiolvvF8d7Yq
 UKDseKGQnQkrl97J0SyA+Da/u8D11GEzp52SWL5iRxzt6vInEC27iTOp9n1yoaoP
 f3y7qdp9kv831ryUM3rXFYpc8YuMWXk+JpBSNaxqmjlvjMzipA5PhzBLmNzfc657
 XcrRX5qsgjEeJW8UUnWUVNB42j7tVzN77yraoUpoVVCzZZeWOQxqq5EscKPfIhRt
 AjtSIQNOs95ZVE0SFCTjXnUUb823coUs4dMCdftqlE62JNRwdR+3bkfa+QjPTg1F
 O9ohW1AzX0/JB19QBxMaOgbheB8GFXh3DVJ6pizTgxJgyPvQQtFuEhT1kq8Cst0U
 Pe+pEWsg9t41bUXNz+/l9tUWKWpeCfFNMTrBXLmXrNlTLeOvDh/0UiF0+2lYJYgf
 YAboibQ5eOv2wGCcSDEbNMJ6B2/6GtubDJxH4du680F6Emb6pCSw0ntPwB7mSGLG
 5dXz+9FJxDLjmxw7BXxQgc5MoYIrt5JQtaOQ6UxU8dPy53/+py4Ck6tXNkz0+Ap7
 gPPaGGy1GqobQFu3qlHtOK1VleQi/sWcrpmPHrpiiFUf6N7EmcY=
 =zXFk
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Expose tag address bits in siginfo. The original arm64 ABI did not
   expose any of the bits 63:56 of a tagged address in siginfo. In the
   presence of user ASAN or MTE, this information may be useful. The
   implementation is generic to other architectures supporting tags
   (like SPARC ADI, subject to wiring up the arch code). The user will
   have to opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra
   bits, if available, become visible in si_addr.

 - Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
   lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
   detriment of other platforms. With these changes, the kernel scans
   the Device Tree dma-ranges and the ACPI IORT information before
   deciding on a smaller ZONE_DMA.

 - Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
   with LTO, there is an increased risk of the compiler converting an
   address dependency headed by a READ_ONCE() invocation into a control
   dependency and consequently allowing for harmful reordering by the
   CPU.

 - Add CPPC FFH support using arm64 AMU counters.

 - set_fs() removal on arm64. This renders the User Access Override
   (UAO) ARMv8 feature unnecessary.

 - Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
   identifier file for SMMUv3, stop event counters support for i.MX8MP,
   enable the perf events-based hard lockup detector.

 - Reorganise the kernel VA space slightly so that 52-bit VA
   configurations can use more virtual address space.

 - Improve the robustness of the arm64 memory offline event notifier.

 - Pad the Image header to 64K following the EFI header definition
   updated recently to increase the section alignment to 64K.

 - Support CONFIG_CMDLINE_EXTEND on arm64.

 - Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
   bits for PtrAuth.

 - Switch to vmapped shadow call stacks.

 - Miscellaneous clean-ups.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
  perf/imx_ddr: Add system PMU identifier for userspace
  bindings: perf: imx-ddr: add compatible string
  arm64: Fix build failure when HARDLOCKUP_DETECTOR_PERF is enabled
  arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
  arm64: mark __system_matches_cap as __maybe_unused
  arm64: uaccess: remove vestigal UAO support
  arm64: uaccess: remove redundant PAN toggling
  arm64: uaccess: remove addr_limit_user_check()
  arm64: uaccess: remove set_fs()
  arm64: uaccess cleanup macro naming
  arm64: uaccess: split user/kernel routines
  arm64: uaccess: refactor __{get,put}_user
  arm64: uaccess: simplify __copy_user_flushcache()
  arm64: uaccess: rename privileged uaccess routines
  arm64: sdei: explicitly simulate PAN/UAO entry
  arm64: sdei: move uaccess logic to arch/arm64/
  arm64: head.S: always initialize PSTATE
  arm64: head.S: cleanup SCTLR_ELx initialization
  arm64: head.S: rename el2_setup -> init_kernel_el
  arm64: add C wrappers for SET_PSTATE_*()
  ...
2020-12-14 16:24:30 -08:00
Linus Torvalds
84292fffc2 - Fix the vmlinux size check on 64-bit along with adding useful clarifications on the topic
(Arvind Sankar)
 
 - Remove -m16 workaround now that the GCC versions that need it are unsupported
 (Nick Desaulniers)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XhwYACgkQEsHwGGHe
 VUqYPQ/9ECDO6Ms8rkvZf4f7+LfKO8XsRYf0h411h5zuVssgPybIbZCvuiQ5Jj9r
 IzTe0pO0AoVoX5/JE+mxbvroBb5hRokNehJayPmdVjZSrzsYWZR3RVW6iSRo/TQ0
 bjzY1cHC4EPmtyYA4FDr+b0JAEEbyPuPOAcqSzBBadsx5FURvwH7s1hUILCZ1nmy
 i8DADZDo7Puk1W8VQkGkjUeo3QVY8KG2VzZegU0iUR/M7Hoek3WBPObYFg0bsKN4
 6hvRWPMt2lFXOSNZOFhG+LjtWXTJ22g7ENU4SHXhHBdxmM8H/OZvh/3UZTes7rdo
 u/LZJ580GfOa3TYHOToGQq9iBxoi9Se3vLXUtt2I9Z7sbBVUGTKx2Llezvt9/QRh
 kJkpvMc2r2b9n7Cu52gHKQ89OrqFzs2scNyjQ2KRU61bPwLjWoGiobugnl8T9q8Y
 Li8LKS2tCD6yo7L1eYt0NDriWVF3/Tv1rebWMNIf3sGSHFkxgHxFviTx3L3SWEBS
 8JMuKLPh20WC9ombXP/tDN2rtTmmL9hEoqZ9uSTG1tRNLOLLNv790ucUh+lp6RFp
 7y8dzl+L3hXsFdcCYooOIGO1phmdThwSLMqZHmbpRWpKNr+8P0wCqZP16twnoXNC
 kO7OiNd6XHVpJg4WnN3C3X70n7TPXSZ7dkpDNFBJaH2QilNEHa8=
 =Qk/S
 -----END PGP SIGNATURE-----

Merge tag 'x86_build_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 build updates from Borislav Petkov:
 "Two x86 build fixes:

   - Fix the vmlinux size check on 64-bit along with adding useful
     clarifications on the topic (Arvind Sankar)

   - Remove -m16 workaround now that the GCC versions that need it are
     unsupported (Nick Desaulniers)"

* tag 'x86_build_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Remove -m16 workaround for unsupported versions of GCC
  x86/build: Fix vmlinux size check on 64-bit
2020-12-14 13:54:50 -08:00
Linus Torvalds
8ba27ae36b - Add logic to correct MBM total and local values fixing errata SKX99 and BDF102
(Fenghua Yu)
 
 - Cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XhTUACgkQEsHwGGHe
 VUpTkA/9FNOaJohBa9XO2bv3RjsqE4/2PqZS434HJL6yrbbRDFyscSp2sNCDlfh/
 6zj9R72HpRf6xLW8CTeszrfUQ0z3CFnfwz8EAolhv5DOiJnM3wSS5inmLhtyTMgw
 mJ8qVXXELdAFm1R8rAQLVmA3FE9aV6u19POstKeXMUykCyaxKDhNLJgn3CiXpegU
 AvG3QWmrR/1x4DjQguMNwXMQiYVDkRdnfJa5SOzCTwyUj0D0an+kPmSHEMvhaOL6
 tUYfjLkZYdB5THG8eUM833EJmgRe7X1VtTQIydcVyt/6tbL50FmVYUpWk+SXuSnJ
 /uBdzx0IXmfvYKgGSFw+FGOyGH8u02nR4621rECNLNf1h8YknzpN7Ri1hB83GjQe
 PMiW/gCwxM6gfRsBpJ17xnAbmR5Az2dOz71uj5k13L1GNL8VZD2DZz//a4TDarzE
 4R8i3PQ/oUMgmL+ARpVWwycc/dhB8o+1glmAjOwWHO/kM1k/hx9Ou5bbLlhffbL5
 zkk6ORrNfKzAMvE1flkjim5XgNHY8n/L4CwBKXJFQ3IYbXBGg0CKxBHWvTpIz1KO
 O4h52OnYUpM15sY22NSvpdRiHcsgEqNOB4lb6K2dby9iE5b5RHpNgfMNO6+eOtag
 dgCWopamLK2+MFeNe48+1v/3bDveskhvJon91GUZYmfkFzhZTN8=
 =9Lsd
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache resource control updates from Borislav Petkov:

 - add logic to correct MBM total and local values fixing errata SKX99
   and BDF102 (Fenghua Yu)

 - cleanups

* tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Clean up unused function parameter in rmdir path
  x86/resctrl: Constify kernfs_ops
  x86/resctrl: Correct MBM total and local values
  Documentation/x86: Rename resctrl_ui.rst and add two errata to the file
2020-12-14 13:53:17 -08:00
Linus Torvalds
405f868f13 - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end
(Gabriel Krisman Bertazi)
 
 - All kinds of minor cleanups all over the tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe
 VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS
 0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi
 r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz
 aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua
 Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf
 PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i
 CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b
 G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM
 eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ
 mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq
 0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao=
 =lLiH
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:
 "Another branch with a nicely negative diffstat, just the way I
  like 'em:

   - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
     the end (Gabriel Krisman Bertazi)

   - All kinds of minor cleanups all over the tree"

* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/ia32_signal: Propagate __user annotation properly
  x86/alternative: Update text_poke_bp() kernel-doc comment
  x86/PCI: Make a kernel-doc comment a normal one
  x86/asm: Drop unused RDPID macro
  x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
  x86/head64: Remove duplicate include
  x86/mm: Declare 'start' variable where it is used
  x86/head/64: Remove unused GET_CR2_INTO() macro
  x86/boot: Remove unused finalize_identity_maps()
  x86/uaccess: Document copy_from_user_nmi()
  x86/dumpstack: Make show_trace_log_lvl() static
  x86/mtrr: Fix a kernel-doc markup
  x86/setup: Remove unused MCA variables
  x86, libnvdimm/test: Remove COPY_MC_TEST
  x86: Reclaim TIF_IA32 and TIF_X32
  x86/mm: Convert mmu context ia32_compat into a proper flags field
  x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
  elf: Expose ELF header on arch_setup_additional_pages()
  x86/elf: Use e_machine to select start_thread for x32
  elf: Expose ELF header in compat_start_thread()
  ...
2020-12-14 13:45:26 -08:00
Linus Torvalds
9c70f04678 The main part of this branch is the ongoing fight against windmills in
an attempt to have userspace tools not poke at naked MSRs. This round
 deals with MSR_IA32_ENERGY_PERF_BIAS and removes direct poking into it
 by our in-tree tools in favor of the proper "energy_perf_bias" sysfs
 interface which we already have.
 
 In addition, the msr.ko write filtering's error message points to a new
 summary page which contains the info we collected from helpful reporters
 about which userspace tools write MSRs:
 
   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about
 
 along with the current status of their conversion.
 
 Rest is the usual small fixes and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVKYACgkQEsHwGGHe
 VUondg//fv3aQM3KtWE7sxv6BjpiUNozPBELRuKo+EskHSxHudRhBxzdSMM7WgKq
 2uojb2CQtzRzYhHuiXjXKfbB7Ci/Jo4EDCJW2otpiqit7/UgXu15Q5ypCUMIteiV
 u9A2w3oN3GPR5TuofLWCffaotVMpFok3u7jX7RxEQPWmZqJItTwZpqYLeyniHaKM
 c6taAxZVyV13iejRhxim2zkl/hMXpjA8I+8CqWIL25J7GYlYeWLWxWYmHIQTs0NM
 zSIyr47RD8RRXVeRdeJMxnQblKE1zrObIV1fUXXu1dSW47DkrrcOQwEMorNjPtPA
 FR5Xhi+TX8JrBasMpwCnV/CTj6Ua8UsMfwQcPOFnXALPj87HfFSypa5BpnBH5xTW
 PaiatRmiNJm3g79ncaTvXCksMbb4WANqOYK+gsGYvtKbfLR+caWT6vytjZA6sC6x
 laynstV9PFUyewdwjjAjilhArzV+y+5RsRudBK8xSjcawbyV4ZEorNKYS9qrhm+y
 7CAM9A8fCQiO6POr6W7HcfmkUOHC9PLhtyjdJH89tAmaf+sfvaczzx3awwSuKx7P
 0rJlDiJP1v7yEpOMWHbpGIqjMBaWK4y3mb4g3UwFpHpo8cTl+WXZQppOPIBn9GA9
 ASLYT/ze7zk1Ua2V88qoXiC5AEvqBnSq4fp2pmf06ROZgBnYT6o=
 =ISyk
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:
 "The main part of this branch is the ongoing fight against windmills in
  an attempt to have userspace tools not poke at naked MSRs.

  This round deals with MSR_IA32_ENERGY_PERF_BIAS and removes direct
  poking into it by our in-tree tools in favor of the proper
  "energy_perf_bias" sysfs interface which we already have.

  In addition, the msr.ko write filtering's error message points to a
  new summary page which contains the info we collected from helpful
  reporters about which userspace tools write MSRs:

      https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about

  along with the current status of their conversion.

  The rest is the usual small fixes and improvements"

* tag 'x86_misc_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msr: Add a pointer to an URL which contains further details
  x86/pci: Fix the function type for check_reserved_t
  selftests/x86: Add missing .note.GNU-stack sections
  selftests/x86/fsgsbase: Fix GS == 1, 2, and 3 tests
  x86/msr: Downgrade unrecognized MSR message
  x86/msr: Do not allow writes to MSR_IA32_ENERGY_PERF_BIAS
  tools/power/x86_energy_perf_policy: Read energy_perf_bias from sysfs
  tools/power/turbostat: Read energy_perf_bias from sysfs
  tools/power/cpupower: Read energy_perf_bias from sysfs
  MAINTAINERS: Cleanup SGI-related entries
2020-12-14 13:29:34 -08:00
Linus Torvalds
ae1c1a8fd9 - Add a new uv_sysfs driver and expose read-only information from UV BIOS
(Justin Ernst and Mike Travis)
 
 - The usual set of small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XWLkACgkQEsHwGGHe
 VUpgoQ//WnhLoyByrxyKScOLW2aCCX4sIQ2R0Mjqt4z6wHHL6OB1GIswtIWsFlpw
 UtFOhTmKV0A5q/yXIGYY1EOZPhA4wrrTesYHsOoVcJ3xdwrAarc3fzw02uKT/uQp
 sW0JxTtJ5iK0XZp+K81344sQy26iuTiLh0Lwg9lJ5mpUotW+1iSOp7LAgcN462r9
 3l8To6qWyWnnMWgJlV7jdcsswk0V2TkT/iipQITNqEGtwqRm/uFGRgK/669IbqP/
 P/qROhU1XseHxVFfPMXfFVeFjNTpkcgD11C8YeqWiHXrmBesCHIC9cd2pYfJepLJ
 F4sIh6zqt6v6z5UOD+yo8IjdmL0v9nPLJiCeiu9ES59pVIx7DsR+p005fBLTAXp8
 QpbiAhJn4ajpu78QAkkhieAvldXP2+1h42wK8HUTWKt2wPsnyGozvY71kp8UkMSd
 1Kh3mahrAN57TfvlewrMr/ynhGMMuWQHIr3ScYei0L6vF2c6f98HSiNzzZ1PiiY7
 6OwR+qISucV8ZeXMv6va5POI/T67UZjjuT5Zw0ls1LpqJ6CkGmcrCeq6qYbxK7Fy
 Vd8T/v51FknrTF0J42GMLb9ICIFmC0yAvzflTytFvWqgGEZpEEpRzmKfeC+a0G7B
 m3XEm/xE3+3TKnID+2tPSFOEU4mdwKC0+lguD7DV8aTYpdjsxXs=
 =VZBo
 -----END PGP SIGNATURE-----

Merge tag 'x86_platform_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 platform updates from Borislav Petkov:

 - add a new uv_sysfs driver and expose read-only information from UV
   BIOS (Justin Ernst and Mike Travis)

 - the usual set of small fixes

* tag 'x86_platform_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/uv: Update sysfs documentation
  x86/platform/uv: Add deprecated messages to /proc info leaves
  x86/platform/uv: Add sysfs hubless leaves
  x86/platform/uv: Add sysfs leaves to replace those in procfs
  x86/platform/uv: Add kernel interfaces for obtaining system info
  x86/platform/uv: Make uv_pcibus_kset and uv_hubs_kset static
  x86/platform/uv: Fix an error code in uv_hubs_init()
  x86/platform/uv: Update MAINTAINERS for uv_sysfs driver
  x86/platform/uv: Update ABI documentation of /sys/firmware/sgi_uv/
  x86/platform/uv: Add new uv_sysfs platform driver
  x86/platform/uv: Add and export uv_bios_* functions
  x86/platform/uv: Remove existing /sys/firmware/sgi_uv/ interface
2020-12-14 13:27:53 -08:00
Linus Torvalds
0d712978dc - Save the AMD's physical die ID into cpuinfo_x86.cpu_die_id and convert all
code to use it (Yazen Ghannam)
 
 - Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVlYACgkQEsHwGGHe
 VUpxTA/9F0KsgSyTh66uX+aX5qkQ3WTBVgxbXGFrn5qPvwcALXabU8qObDWTSdwS
 1YbiWDjKNBJX+dggWe/fcQgUZxu5DFkM4IKEW1V7MLJEcdfylcqCyc1YNpEI4ySn
 ebw2Sy4/5iXGAvhz802/WoUU/o3A2uZwe0RFyodHGxof5027HkZhRHeYB27Htw+l
 z0IsmiYOoPl/4mNuVgr/qieIFSw1SUE9kwjU8RvM6xVWmXWXpM68JHa9s+/51pFt
 6BaOz485OyzWUCtSx3/++GEkU2d53bWYOuQ1zTLEiuaBfYC5n5T/kAcT4WJNK6Tf
 tX7yrzmWm9ecykIxfkgMrhG57G38y2GMJcEg+dFQHeXC062fdHDg+oY6Ql2EkAm5
 t5RIQ/cyOmQCLns31rHI/kwQ3RMKc/lfnL/z8lrlfWsC5o755yFJKttbfLJugbTo
 3BO1fbs4xgQcgi0KoqXOUETrQtsOLtr9FJwvcArB94XXqcIPClE8Ir7n8T7FCuLr
 9litSXIdn46EHwD6hD5QIk7y+Rxwk/jxZFys3eh90jcWDDZTaG2lz3if33RbZ1go
 XBrS5X3HsMODGZlaMeUjrbFIz3e0Zyoo+RO/TX48w8nzivC6xSNxSNFgIZ1XTF5E
 SLMGa6lEQ9mLiqRfgFjynNwSYOSlGv3euMkZaVPS3hnNmn+vZbI=
 =RsCs
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:
 "Only AMD-specific changes this time:

   - Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and
     convert all code to use it (Yazen Ghannam)

   - Remove a dead and unused TSEG region remapping workaround on AMD
     (Arvind Sankar)"

* tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Remove dead code for TSEG region remapping
  x86/topology: Set cpu_die_id only if DIE_TYPE found
  EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
  x86/CPU/AMD: Remove amd_get_nb_id()
  x86/CPU/AMD: Save AMD NodeId as cpu_die_id
2020-12-14 13:21:33 -08:00
Linus Torvalds
5583ff677b "Intel SGX is new hardware functionality that can be used by
applications to populate protected regions of user code and data called
 enclaves. Once activated, the new hardware protects enclave code and
 data from outside access and modification.
 
 Enclaves provide a place to store secrets and process data with those
 secrets. SGX has been used, for example, to decrypt video without
 exposing the decryption keys to nosy debuggers that might be used to
 subvert DRM. Software has generally been rewritten specifically to
 run in enclaves, but there are also projects that try to run limited
 unmodified software in enclaves."
 
 Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
 except the addition of a new mprotect() hook to control enclave page
 permissions and support for vDSO exceptions fixup which will is used by
 SGX enclaves.
 
 All this work by Sean Christopherson, Jarkko Sakkinen and many others.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XTtMACgkQEsHwGGHe
 VUqxFw/+NZGf2b3CWPcrvwXCpkvSpIrqh1jQwyvkZyJ1gen7Vy8dkvf99h8+zQPI
 4wSArEyjhYJKAAmBNefLKi/Cs/bdkGzLlZyDGqtM641XRjf0xXIpQkOBb6UBa+Pv
 to8veQmVH2bBTM49qnd+H1wM6FzYvhTYCD8xr4HlLXtIfpP2CK2GvCb8s/4LifgD
 fTucZX9TFwLgVkWOHWHN0n8XMR2Fjb2YCrwjFMKyr/M2W+pPoOCTIt4PWDuXiOeG
 rFP7R4DT9jDg8ht5j2dHQT/Bo8TvTCB4Oj98MrX1TTgkSjLJySSMfyQg5EwNfSIa
 HC0lg/6qwAxnhWX7cCCBETNZ4aYDmz/dxcCSsLbomGP9nMaUgUy7qn5nNuNbJilb
 oCBsr8LDMzu1LJzmkduM8Uw6OINh+J8ICoVXaR5pS7gSZz/+vqIP/rK691AiqhJL
 QeMkI9gQ83jEXpr/AV7ABCjGCAeqELOkgravUyTDev24eEc0LyU0qENpgxqWSTca
 OvwSWSwNuhCKd2IyKZBnOmjXGwvncwX0gp1KxL9WuLkR6O8XldLAYmVCwVAOrIh7
 snRot8+3qNjELa65Nh5DapwLJrU24TRoKLHLgfWK8dlqrMejNtXKucQ574Np0feR
 p2hrNisOrtCwxAt7OAgWygw8agN6cJiY18onIsr4wSBm5H7Syb0=
 =k7tj
 -----END PGP SIGNATURE-----

Merge tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SGC support from Borislav Petkov:
 "Intel Software Guard eXtensions enablement. This has been long in the
  making, we were one revision number short of 42. :)

  Intel SGX is new hardware functionality that can be used by
  applications to populate protected regions of user code and data
  called enclaves. Once activated, the new hardware protects enclave
  code and data from outside access and modification.

  Enclaves provide a place to store secrets and process data with those
  secrets. SGX has been used, for example, to decrypt video without
  exposing the decryption keys to nosy debuggers that might be used to
  subvert DRM. Software has generally been rewritten specifically to run
  in enclaves, but there are also projects that try to run limited
  unmodified software in enclaves.

  Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
  except the addition of a new mprotect() hook to control enclave page
  permissions and support for vDSO exceptions fixup which will is used
  by SGX enclaves.

  All this work by Sean Christopherson, Jarkko Sakkinen and many others"

* tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()
  x86/sgx: Fix a typo in kernel-doc markup
  x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment
  x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()
  selftests/sgx: Use a statically generated 3072-bit RSA key
  x86/sgx: Clarify 'laundry_list' locking
  x86/sgx: Update MAINTAINERS
  Documentation/x86: Document SGX kernel architecture
  x86/sgx: Add ptrace() support for the SGX driver
  x86/sgx: Add a page reclaimer
  selftests/x86: Add a selftest for SGX
  x86/vdso: Implement a vDSO for Intel SGX enclave call
  x86/traps: Attempt to fixup exceptions in vDSO before signaling
  x86/fault: Add a helper function to sanitize error code
  x86/vdso: Add support for exception fixup in vDSO functions
  x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION
  x86/sgx: Add SGX_IOC_ENCLAVE_INIT
  x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
  x86/sgx: Add SGX_IOC_ENCLAVE_CREATE
  x86/sgx: Add an SGX misc driver interface
  ...
2020-12-14 13:14:57 -08:00
Linus Torvalds
85fe40cad2 - A single cleanup removing "break" after a return statement (Tom Rix)
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XSkwACgkQEsHwGGHe
 VUoYZBAAwGeZCV6KKQERAhMsTKAI+UvGS8rAOBOOTheDHyS/X4Rf63OuSLO/wC3c
 NJ9Z4M8GeMmW8K/oeEeUyEI6980FC2I0tQE20J29SPjK37jTowIHjJ26ofKGUP1+
 TiqSL2Eo90KqUSVg4GGdDKEuO6by5gS8tRoNE9Of0eGz6ahVOn9PFHt6QVk9cFug
 KOByzLjwmBd+XNE/gm56Tl4khOUBYy8+2JGZfewfJOsktln9t+s0WnuB1P7PU+2B
 9uI2AAs9JmaO5RzIVHi3VYFsPEzrL3GXBpBCR2XJucIk4h6Km94yNDZW/th5XP6T
 vKDNrhg3ErmeI0MgfIKHq+QXxGLobsra9/ZQazg8ZuPvIg47jzo/TaCiLtJGPips
 Kr6MLMK6lfQOnOsaDjnGUJPMpOvAmMxOU9o7MVn1XxpDiFr8Fzt1njBMJ/kZikXt
 YNOUj+Pws/WfvdFW/QkBBao6iOidEFxgTLJnADeocL0T0y5feVqvrOrWKFXGiL+H
 i8ehg4040DZA0PxqxxUYe+EVQgMs41LAHr0zvUxI+hxQBKPFFg6xkpUCOaHd/rSZ
 XTx0AsYuaE2RPY/ldDWkWrzGja9FfzTSKwzkDkhECrHXGYaIyiNSCpFZQ4Mqua3c
 Z0mT4pLx6blLFqJ+O+Q+juuMSixSVKWe5NwEoG45sWZfGAHIl0M=
 =IsGj
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode loader update from Borislav Petkov:
 "This one wins the award for most boring pull request ever. But that's
  a good thing - this is how I like 'em and the microcode loader
  *should* be boring. :-)

  A single cleanup removing "break" after a return statement (Tom Rix)"

* tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/amd: Remove unneeded break
2020-12-14 13:13:15 -08:00
Linus Torvalds
2b34233ce2 - Enable additional logging mode on older Xeons (Tony Luck)
- Pass error records logged by firmware through the MCE decoding chain
   to provide human-readable error descriptions instead of raw values
   (Smita Koralahalli)
 
 - Some #MC handler fixes (Gabriele Paoloni)
 
 - The usual small fixes and cleanups all over.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XSJcACgkQEsHwGGHe
 VUob6hAAgSJq1IcftZR4DSk/Mlrt0x4orDNCmoGhxAlOT7ryiidYhXuKV2tvWloA
 3v7X5E0r9CroS3PMghQtVOD7qJMjNAWKun5C6zPhLkIeV+CvcLAHhfShlcyWhJ76
 PQaHHSxRQGEh5M2Xcp26kdwqrARbhcl66ukBvFNpiNUkLH+robDmIazraI5a2bV/
 9azNoVUZBSXQoJYpPz/tBTxu2EToj/xrMVIZg1OPHR5cxtOwUSZCr8V69KGK4onm
 avYQY8TSCDxsG1VcywYzBNi2W6lKs2EFlhCVZBLCz0NIkZCYTXErI4OsuExMufLu
 t17sAfHjQg+SxEcL5pc+iQkr9i0LLnujKz+Cl0ShtRk6SER0U+9pc/yf0wQSGDhB
 AZz87z+a6+r4pxdTSclOkpQCAfRR+pWjNwA5dyi6/72Qbqi6lmwKWDPJnnyq0YS4
 UZI01zjs7ir93nS1zwJcekFOJCSTsb6XmhEgMVlpw+YoZHaOki1KJMCU0kIgZt8O
 YlEniP/DdXBS0mflOJQnoes7XrcIWVqWEubeRZdoWYnC07hNmdg4XJ0c3Skx8ZW+
 gL8kt4pDWlnKHlTlhtgocG3H5BkMazrYEmbograc/Oe8lkr9ESqIS7yS1l8lM7Z6
 i0HXATcdvDHV0AqW/uoNczXpck4x8xrahIzyPqAve2G15XEIgq4=
 =D9fV
 -----END PGP SIGNATURE-----

Merge tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RAS updates from Borislav Petkov:

 - Enable additional logging mode on older Xeons (Tony Luck)

 - Pass error records logged by firmware through the MCE decoding chain
   to provide human-readable error descriptions instead of raw values
   (Smita Koralahalli)

 - Some #MC handler fixes (Gabriele Paoloni)

 - The usual small fixes and cleanups all over.

* tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Rename kill_it to kill_current_task
  x86/mce: Remove redundant call to irq_work_queue()
  x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3
  x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right places
  x86/mce, cper: Pass x86 CPER through the MCA handling chain
  x86/mce: Use "safe" MSR functions when enabling additional error logging
  x86/mce: Correct the detection of invalid notifier priorities
  x86/mce: Assign boolean values to a bool variable
  x86/mce: Enable additional error logging on certain Intel CPUs
  x86/mce: Remove unneeded break
2020-12-14 13:00:10 -08:00
Tom Lendacky
0f60bde15e KVM: SVM: Add GHCB accessor functions for retrieving fields
Update the GHCB accessor functions to add functions for retrieve GHCB
fields by name. Update existing code to use the new accessor functions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <664172c53a5fb4959914e1a45d88e805649af0ad.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-14 11:09:32 -05:00
Tom Lendacky
69372cf012 x86/cpu: Add VM page flush MSR availablility as a CPUID feature
On systems that do not have hardware enforced cache coherency between
encrypted and unencrypted mappings of the same physical page, the
hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache
contents of an SEV guest page. When a small number of pages are being
flushed, this can be used in place of issuing a WBINVD across all CPUs.

CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is
available. Add a CPUID feature to indicate it is supported and define the
MSR.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-14 11:09:30 -05:00
Masami Hiramatsu
0d07c0ec43 x86/kprobes: Fix optprobe to detect INT3 padding correctly
Commit

  7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")

changed the padding bytes between functions from NOP to INT3. However,
when optprobe decodes a target function it finds INT3 and gives up the
jump optimization.

Instead of giving up any INT3 detection, check whether the rest of the
bytes to the end of the function are INT3. If all of them are INT3,
those come from the linker. In that case, continue the optprobe jump
optimization.

 [ bp: Massage commit message. ]

Fixes: 7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
Reported-by: Adam Zabrocki <pi3@pi3.com.pl>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160767025681.3880685.16021570341428835411.stgit@devnote2
2020-12-12 15:25:17 +01:00
Kyung Min Park
e1b35da5e6 x86: Enumerate AVX512 FP16 CPUID feature flag
Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.

A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.

The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.

Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-11 19:00:58 -05:00
Ashish Kalra
e998879d4f x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
SEV uses SWIOTLB to make this happen without requiring changes to device
drivers.  However, depending on the workload being run, the default 64MB
of it might not be enough and it may run out of buffers to use for DMA,
resulting in I/O errors and/or performance degradation for high
I/O workloads.

Adjust the default size of SWIOTLB for SEV guests using a
percentage of the total memory available to guest for the SWIOTLB buffers.

Adds a new sev_setup_arch() function which is invoked from setup_arch()
and it calls into a new swiotlb generic code function swiotlb_adjust_size()
to do the SWIOTLB buffer adjustment.

v5 fixed build errors and warnings as
Reported-by: kbuild test robot <lkp@intel.com>

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-12-11 15:43:41 -05:00
Giovanni Gherdovich
3149cd5530 x86: Print ratio freq_max/freq_base used in frequency invariance calculations
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.

 # dmesg | grep 'Estimated ratio of average max'
 [   14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
2020-12-11 10:30:23 +01:00
Giovanni Gherdovich
976df7e573 x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
Frequency invariant accounting calculations need the ratio
freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power
allocation between cores: AMD EPYC CPUs implement "Core Performance Boost".
Three candidates are considered to estimate this value:

- maximum non-boost frequency
- maximum boost frequency
- the mid point between the above two

Experimental data on an AMD EPYC Zen2 machine slightly favors the third
option, which is applied with this patch.

The analysis uses the ondemand cpufreq governor as baseline, and compares
it with schedutil in a number of configurations. Using the freq_max value
described above offers a moderate advantage in performance and efficiency:

sugov-max (freq_max=max_boost) performs the worst on tbench: less
throughput and reduced efficiency than the other invariant-schedutil
options (see "Data Overview" below). Consider that tbench is generally a
problematic case as no schedutil version currently is better than ondemand.

sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov's
can surpass ondemand with less filesystem latency and slightly increased
efficiency.

1. DATA OVERVIEW
2. DETAILED PERFORMANCE TABLES
3. POWER CONSUMPTION TABLE

1. DATA OVERVIEW
================

sugov-noinv : non-invariant schedutil governor
sugov-max   : invariant schedutil, freq_max=max_boost
sugov-mid   : invariant schedutil, freq_max=midpoint
sugov-P0    : invariant schedutil, freq_max=max_P
perfgov     : performance governor

driver      : acpi_cpufreq
machine     : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket,
              128 cores / 256 threads, SATA SSD storage, 250G of memory,
	      XFS filesystem

Benchmarks are described in the next section.
Tilde (~) means the value is the same as baseline.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0  better if
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                        PERFORMANCE RATIOS
tbench        1.00       1.44       0.90       0.87       0.93       0.93      higher
dbench        1.00       0.91       0.95       0.94       0.94       1.06      lower
kernbench     1.00       0.93       ~          ~          ~          0.97      lower
gitsource     1.00       0.66       0.97       0.96       ~          0.95      lower
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00       1.16       0.84       0.84       0.88       0.85      higher
dbench        1.00       1.03       1.02       1.02       1.02       0.93      higher
kernbench     1.00       1.05       ~          ~          ~          ~         higher
gitsource     1.00       1.46       1.04       1.04       ~          1.05      higher

2. DETAILED PERFORMANCE TABLES
==============================

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        427.19  +- 0.16% (        )     778.35  +- 0.10% (  82.20%)     346.92  +- 0.14% ( -18.79%)
Hmean  2        853.82  +- 0.09% (        )    1536.23  +- 0.03% (  79.93%)     694.36  +- 0.05% ( -18.68%)
Hmean  4       1657.54  +- 0.12% (        )    2938.18  +- 0.12% (  77.26%)    1362.81  +- 0.11% ( -17.78%)
Hmean  8       3301.87  +- 0.06% (        )    5679.10  +- 0.04% (  72.00%)    2693.35  +- 0.04% ( -18.43%)
Hmean  16      6139.65  +- 0.05% (        )    9498.81  +- 0.04% (  54.71%)    4889.97  +- 0.17% ( -20.35%)
Hmean  32     11170.28  +- 0.09% (        )   17393.25  +- 0.08% (  55.71%)    9104.55  +- 0.09% ( -18.49%)
Hmean  64     19322.97  +- 0.17% (        )   31573.91  +- 0.08% (  63.40%)   18552.52  +- 0.40% (  -3.99%)
Hmean  128    30383.71  +- 0.11% (        )   37416.91  +- 0.15% (  23.15%)   25938.70  +- 0.41% ( -14.63%)
Hmean  256    31143.96  +- 0.41% (        )   30908.76  +- 0.88% (  -0.76%)   29754.32  +- 0.24% (  -4.46%)
Hmean  512    30858.49  +- 0.26% (        )   38524.60  +- 1.19% (  24.84%)   42080.39  +- 0.56% (  36.37%)
Hmean  1024   39187.37  +- 0.19% (        )   36213.86  +- 0.26% (  -7.59%)   39555.98  +- 0.12% (   0.94%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        352.59  +- 1.03% ( -17.46%)     352.08  +- 0.75% ( -17.58%)     352.31  +- 1.48% ( -17.53%)
Hmean  2        697.32  +- 0.08% ( -18.33%)     700.16  +- 0.20% ( -18.00%)     696.79  +- 0.06% ( -18.39%)
Hmean  4       1369.88  +- 0.04% ( -17.35%)    1369.72  +- 0.07% ( -17.36%)    1365.91  +- 0.05% ( -17.59%)
Hmean  8       2696.79  +- 0.04% ( -18.33%)    2711.06  +- 0.04% ( -17.89%)    2715.10  +- 0.61% ( -17.77%)
Hmean  16      4725.03  +- 0.03% ( -23.04%)    4875.65  +- 0.02% ( -20.59%)    4953.05  +- 0.28% ( -19.33%)
Hmean  32      9231.65  +- 0.10% ( -17.36%)    8704.89  +- 0.27% ( -22.07%)   10562.02  +- 0.36% (  -5.45%)
Hmean  64     15364.27  +- 0.19% ( -20.49%)   17786.64  +- 0.15% (  -7.95%)   19665.40  +- 0.22% (   1.77%)
Hmean  128    42100.58  +- 0.13% (  38.56%)   34946.28  +- 0.13% (  15.02%)   38635.79  +- 0.06% (  27.16%)
Hmean  256    30660.23  +- 1.08% (  -1.55%)   32307.67  +- 0.54% (   3.74%)   31153.27  +- 0.12% (   0.03%)
Hmean  512    24604.32  +- 0.14% ( -20.27%)   40408.50  +- 1.10% (  30.95%)   38800.29  +- 1.23% (  25.74%)
Hmean  1024   35535.47  +- 0.28% (  -9.32%)   41070.38  +- 2.56% (   4.81%)   31308.29  +- 2.52% ( -20.11%)

Benchmark          : dbench (filesystem stressor)
Varying parameter  : number of clients
Unit               : seconds (lower is better)

NOTE-1: This dbench version measures the average latency of a set of filesystem
        operations, as we found the traditional dbench metric (throughput) to be
	misleading.
NOTE-2: Due to high variability, we partition the original dataset and apply
        statistical bootrapping (a resampling method). Accuracy is reported in the
	form of 95% confidence intervals.

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         98.79  +- 0.92 (        )      83.36  +- 0.82 (  15.62%)      84.82  +- 0.92 (  14.14%)
SubAmean  2        116.00  +- 0.89 (        )     102.12  +- 0.77 (  11.96%)     109.63  +- 0.89 (   5.49%)
SubAmean  4        149.90  +- 1.03 (        )     132.12  +- 0.91 (  11.86%)     143.90  +- 1.15 (   4.00%)
SubAmean  8        182.41  +- 1.13 (        )     159.86  +- 0.93 (  12.36%)     165.82  +- 1.03 (   9.10%)
SubAmean  16       237.83  +- 1.23 (        )     219.46  +- 1.14 (   7.72%)     229.28  +- 1.19 (   3.59%)
SubAmean  32       334.34  +- 1.49 (        )     309.94  +- 1.42 (   7.30%)     321.19  +- 1.36 (   3.93%)
SubAmean  64       576.61  +- 2.16 (        )     540.75  +- 2.00 (   6.22%)     551.27  +- 1.99 (   4.39%)
SubAmean  128     1350.07  +- 4.14 (        )    1205.47  +- 3.20 (  10.71%)    1280.26  +- 3.75 (   5.17%)
SubAmean  256     3444.42  +- 7.97 (        )    3698.00 +- 27.43 (  -7.36%)    3494.14  +- 7.81 (  -1.44%)
SubAmean  2048   39457.89 +- 29.01 (        )   34105.33 +- 41.85 (  13.57%)   39688.52 +- 36.26 (  -0.58%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         85.68  +- 1.04 (  13.27%)      84.16  +- 0.84 (  14.81%)      83.99  +- 0.90 (  14.99%)
SubAmean  2        108.42  +- 0.95 (   6.54%)     109.91  +- 1.39 (   5.24%)     112.06  +- 0.91 (   3.39%)
SubAmean  4        136.90  +- 1.04 (   8.67%)     137.59  +- 0.93 (   8.21%)     136.55  +- 0.95 (   8.91%)
SubAmean  8        163.15  +- 0.96 (  10.56%)     166.07  +- 1.02 (   8.96%)     165.81  +- 0.99 (   9.10%)
SubAmean  16       224.86  +- 1.12 (   5.45%)     223.83  +- 1.06 (   5.89%)     230.66  +- 1.19 (   3.01%)
SubAmean  32       320.51  +- 1.38 (   4.13%)     322.85  +- 1.49 (   3.44%)     321.96  +- 1.46 (   3.70%)
SubAmean  64       553.25  +- 1.93 (   4.05%)     554.19  +- 2.08 (   3.89%)     562.26  +- 2.22 (   2.49%)
SubAmean  128     1264.35  +- 3.72 (   6.35%)    1256.99  +- 3.46 (   6.89%)    2018.97 +- 18.79 ( -49.55%)
SubAmean  256     3466.25  +- 8.25 (  -0.63%)    3450.58  +- 8.44 (  -0.18%)    5032.12 +- 38.74 ( -46.09%)
SubAmean  2048   39133.10 +- 45.71 (   0.82%)   39905.95 +- 34.33 (  -1.14%)   53811.86 +-193.04 ( -36.38%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        471.71 +- 26.61% (        )     409.88 +- 16.99% (  13.11%)     430.63  +- 0.18% (   8.71%)
Amean  4        211.87  +- 0.58% (        )     194.03  +- 0.74% (   8.42%)     215.33  +- 0.64% (  -1.63%)
Amean  8        109.79  +- 1.27% (        )     101.43  +- 1.53% (   7.61%)     111.05  +- 1.95% (  -1.15%)
Amean  16        59.50  +- 1.28% (        )      55.61  +- 1.35% (   6.55%)      59.65  +- 1.78% (  -0.24%)
Amean  32        34.94  +- 1.22% (        )      32.36  +- 1.95% (   7.41%)      35.44  +- 0.63% (  -1.43%)
Amean  64        22.58  +- 0.38% (        )      20.97  +- 1.28% (   7.11%)      22.41  +- 1.73% (   0.74%)
Amean  128       17.72  +- 0.44% (        )      16.68  +- 0.32% (   5.88%)      17.65  +- 0.96% (   0.37%)
Amean  256       16.44  +- 0.53% (        )      15.76  +- 0.32% (   4.18%)      16.76  +- 0.60% (  -1.93%)
Amean  512       16.54  +- 0.21% (        )      15.62  +- 0.41% (   5.53%)      16.84  +- 0.85% (  -1.83%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        421.30  +- 0.24% (  10.69%)     419.26  +- 0.15% (  11.12%)     414.38  +- 0.33% (  12.15%)
Amean  4  	217.81  +- 5.53% (  -2.80%)     211.63  +- 0.99% (   0.12%)     208.43  +- 0.47% (   1.63%)
Amean  8  	108.80  +- 0.43% (   0.90%)     108.48  +- 1.44% (   1.19%)     108.59  +- 3.08% (   1.09%)
Amean  16 	 58.84  +- 0.74% (   1.12%)      58.37  +- 0.94% (   1.91%)      57.78  +- 0.78% (   2.90%)
Amean  32 	 34.04  +- 2.00% (   2.59%)      34.28  +- 1.18% (   1.91%)      33.98  +- 2.21% (   2.75%)
Amean  64 	 22.22  +- 1.69% (   1.60%)      22.27  +- 1.60% (   1.38%)      22.25  +- 1.41% (   1.47%)
Amean  128	 17.55  +- 0.24% (   0.97%)      17.53  +- 0.94% (   1.04%)      17.49  +- 0.43% (   1.30%)
Amean  256	 16.51  +- 0.46% (  -0.40%)      16.48  +- 0.48% (  -0.19%)      16.44  +- 1.21% (   0.00%)
Amean  512	 16.50  +- 0.35% (   0.19%)      16.35  +- 0.42% (   1.14%)      16.37  +- 0.33% (   0.99%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean          1035.76  +- 0.30% (        )     688.21  +- 0.04% (  33.56%)    1003.85  +- 0.14% (   3.08%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean           995.82  +- 0.08% (   3.86%)    1011.98  +- 0.03% (   2.30%)     986.87  +- 0.19% (   4.72%)

3. POWER CONSUMPTION TABLE
==========================

Average power consumption (watts).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
tbench4     227.25     281.83     244.17     236.76     241.50     247.99
dbench4     151.97     161.87     157.08     158.10     158.06     153.73
kernbench   162.78     167.22     162.90     164.19     164.65     164.72
gitsource   133.65     139.00     133.04     134.43     134.18     134.32

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
2020-12-11 10:29:55 +01:00
Nathan Fontenot
41ea667227 x86, sched: Calculate frequency invariance for AMD systems
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.

On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.

Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.

Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.

[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]

Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
2020-12-11 10:26:00 +01:00
Thomas Gleixner
058df195c2 x86/ioapic: Cleanup the timer_works() irqflags mess
Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:

        local_irq_save(flags);
	local_irq_enable();
        local_irq_restore(flags);

which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.

That code is a historical mess and none of this is needed. Make it
straightforward use local_irq_disable()/enable() as that's all what is
required. It is invoked from interrupt enabled code nowadays.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/87k0tpju47.fsf@nanos.tec.linutronix.de
2020-12-10 23:02:31 +01:00
Thomas Gleixner
190113b4c6 x86/apic/vector: Fix ordering in vector assignment
Prarit reported that depending on the affinity setting the

 ' irq $N: Affinity broken due to vector space exhaustion.'

message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.

Shung-Hsi provided traces and analysis which pinpoints the problem:

The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:

 1) Try the intersection of affinity mask and node mask
 2) Try the node mask
 3) Try the full affinity mask
 4) Try the full online mask

Obviously #2 and #3 are in the wrong order as the requested affinity
mask has to take precedence.

In the observed cases #1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.

Revert the order of #2 and #3 so the full affinity mask without the node
intersection is tried before actually affinity is broken.

If no node is assigned then only the full affinity mask and if that fails
the full online mask is tried.

Fixes: d6ffc6ac83 ("x86/vector: Respect affinity mask in irq descriptor")
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
2020-12-10 23:00:54 +01:00
Xiaochen Shen
06c5fe9b12 x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().

The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:

  rdtgroup_mondata_show()
    mon_event_read()
      mon_event_count()
        __mon_event_count()

In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.

When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, the incorrect local
bandwidth counter which calculated from incorrect m->chunks is shown to
the user.

Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.

Test steps:
  # Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
  git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
  && make
  ./tools/membw/membw -c 0 -b 10000 --read

  # Enable MBA software controller
  mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

  # Create control group c1
  mkdir /sys/fs/resctrl/c1

  # Set MB throttle to 6 GB/s
  echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata

  # Write PID of the workload to tasks file
  echo `pidof membw` > /sys/fs/resctrl/c1/tasks

  # Read local bytes counters twice with 1s interval, the calculated
  # local bandwidth is not as expected (approaching to 6 GB/s):
  local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
  sleep 1
  local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
  echo "local b/w (bytes/s):" `expr $local_2 - $local_1`

Before fix:
  local b/w (bytes/s): 11076796416

After fix:
  local b/w (bytes/s): 5465014272

Fixes: ba0f26d852 (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
2020-12-10 17:52:37 +01:00
Gustavo A. R. Silva
bd11952b40 uprobes/x86: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
2020-12-09 17:08:59 +01:00
Gustavo A. R. Silva
e689b300c9 kprobes/x86: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
2020-12-09 17:08:58 +01:00
Masami Hiramatsu
78ff2733ff x86/kprobes: Restore BTF if the single-stepping is cancelled
Fix to restore BTF if single-stepping causes a page fault and
it is cancelled.

Usually the BTF flag was restored when the single stepping is done
(in resume_execution()). However, if a page fault happens on the
single stepping instruction, the fault handler is invoked and
the single stepping is cancelled. Thus, the BTF flag is not
restored.

Fixes: 1ecc798c67 ("x86: debugctlmsr kprobes")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2
2020-12-09 17:08:57 +01:00
Arvind Sankar
262bd5724a x86/cpu/amd: Remove dead code for TSEG region remapping
Commit

  26bfa5f894 ("x86, amd: Cleanup init_amd")

moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().

However, bsp_init_amd() is executed well before the direct mapping is
actually created:

  setup_arch()
    -> early_cpu_init()
      -> early_identify_cpu()
        -> this_cpu->c_bsp_init()
	  -> bsp_init_amd()
    ...
    -> init_mem_mapping()

So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.

It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.

Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.

Committer notes:

This workaround is incomplete anyway:

1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to
check whether the TSeg address range is enabled.

2. The code must check whether the range is not 2M aligned - if it is,
there's nothing to work around.

3. In all the BIOSes tested, the TSeg range is in a e820 reserved area
and those are not mapped anymore, after

  66520ebc2d ("x86, mm: Only direct map addresses that are marked as E820_RAM")

which means, there's nothing to be worked around either.

So let's rip it out.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nivedita@alum.mit.edu
2020-12-08 18:45:21 +01:00
Borislav Petkov
f77f420d34 x86/msr: Add a pointer to an URL which contains further details
After having collected the majority of reports about MSRs being written
by userspace tools and what tools those are, and all newer reports
mostly repeating, add an URL where detailed information is gathered and
kept up-to-date.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201205002825.19107-1-bp@alien8.de
2020-12-08 10:17:08 +01:00
Mike Travis
148c277165 x86/platform/uv: Add deprecated messages to /proc info leaves
Add "deprecated" message to any access to old /proc/sgi_uv/* leaves.

 [ bp: Do not have a trailing function opening brace and the arguments
   continuing on the next line and align them on the opening brace. ]

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-5-mike.travis@hpe.com
2020-12-07 20:03:09 +01:00
Mike Travis
a67fffb017 x86/platform/uv: Add kernel interfaces for obtaining system info
Add kernel interfaces used to obtain info for the uv_sysfs driver
to display.

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-2-mike.travis@hpe.com
2020-12-07 19:44:54 +01:00
Qiujun Huang
72ebb5ff80 x86/alternative: Update text_poke_bp() kernel-doc comment
Update kernel-doc parameter name after

  c3d6324f84 ("x86/alternatives: Teach text_poke_bp() to emulate instructions")

changed the last parameter from @handler to @emulate.

 [ bp: Make commit message more precise. ]

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201203145020.2441-1-hqjagain@gmail.com
2020-12-07 12:44:22 +01:00
Linus Torvalds
8100a58044 A set of fixes for x86:
- Make the AMD L3 QoS code and data priorization enable/disable mechanism
    work correctly. The control bit was only set/cleared on one of the CPUs
    in a L3 domain, but it has to be modified on all CPUs in the domain. The
    initial documentation was not clear about this, but the updated one from
    Oct 2020 spells it out.
 
  - Fix an off by one in the UV platform detection code which causes the UV
    hubs to be identified wrongly. The chip revisions start at 1 not at 0.
 
  - Fix a long standing bug in the evaluation of prefixes in the uprobes
    code which fails to handle repeated prefixes properly. The aggregate
    size of the prefixes can be larger than the bytes array but the code
    blindly iterated over the aggregate size beyond the array boundary.
    Add a macro to handle this case properly and use it at the affected
    places.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M2GoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUGeD/0TxQyVIZTJjY/ExDYvQZeSWgvFVon9
 A/QcwkgtkWzYKI3YThgxBtSNKhHPP8eQ9QK8ErcaQKEHU5TdXVEOwHgTm1A6OGat
 0+M9/EWEe7tTu+cLpu7esQ+1VbSvEcFXZbljbhCKrShlBlsIjFG4eCPAprSw9yjI
 RSgfXKZu4NmHVS6nfJTwmIIaTeLQ6U3b7b1D5s/66slBFScqnLbRNhABVbHbos4F
 pl/lxDCFOddy2YbEojHjjGqMA7oxPav7c0nYFOM/zG+wAqfEjbqOxReT31bGQPi2
 XT9K4JEqDqILo0KnhV4GsYoWAhes3BtmsJ9IoZ7IijsMriYl80mD9URAORidJ4PX
 28Ckk9V/DlE8uDrAnBDcWDSoKlg78mhVV7V9L6v43teg/gJfSZNROtNDBmqRmwG4
 Op2NJfzJITtaxVQuSZRkSs8rzGv+QUfaM1sBUQ+Oz4KYeIjjA7G2MAOECrzIAWKB
 GWc5toYRVS6oGT+RbZhSxZYoh8ASoGJ2MrL8K4OV4RqEqHHcXcih0WmmljtsDIFI
 td4FHHH6fghIb9S6iYKiApd6k2qKa33mwJwa/xZOoIrv0w5xT0WDJnxT60gu/Mec
 YDkqhmA009CNSD2G4oNRNF5MH7gp34UII+25jOGatbVh+5DDPYs+5Jnh/DR7jssR
 PryAG9ER7UUb6w==
 =rNBq
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for x86:

   - Make the AMD L3 QoS code and data priorization enable/disable
     mechanism work correctly.

     The control bit was only set/cleared on one of the CPUs in a L3
     domain, but it has to be modified on all CPUs in the domain. The
     initial documentation was not clear about this, but the updated one
     from Oct 2020 spells it out.

   - Fix an off by one in the UV platform detection code which causes
     the UV hubs to be identified wrongly.

     The chip revisions start at 1 not at 0.

   - Fix a long standing bug in the evaluation of prefixes in the
     uprobes code which fails to handle repeated prefixes properly.

     The aggregate size of the prefixes can be larger than the bytes
     array but the code blindly iterated over the aggregate size beyond
     the array boundary. Add a macro to handle this case properly and
     use it at the affected places"

* tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
  x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
  x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
  x86/platform/uv: Fix UV4 hub revision adjustment
  x86/resctrl: Fix AMD L3 QOS CDP enable/disable
2020-12-06 11:22:39 -08:00
Masami Hiramatsu
4e9a5ae8df x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper check must
be

  insn.prefixes.bytes[i] != 0 and i < 4

instead of using insn.prefixes.nbytes.

Introduce a for_each_insn_prefix() macro for this purpose. Debugged by
Kees Cook <keescook@chromium.org>.

 [ bp: Massage commit message, sync with the respective header in tools/
   and drop "we". ]

Fixes: 2b14449835 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2
2020-12-06 09:58:13 +01:00
Jarkko Sakkinen
a4b9c48b96 x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()
The sgx_enclave_add_pages.length field is documented as

 * @length:     length of the data (multiple of the page size)

Fail with -EINVAL, when the caller gives a zero length buffer of data
to be added as pages to an enclave. Right now 'ret' is returned as
uninitialized in that case.

 [ bp: Flesh out commit message. ]

Fixes: c6d26d3707 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-sgx/X8ehQssnslm194ld@mwanda/
Link: https://lkml.kernel.org/r/20201203183527.139317-1-jarkko@kernel.org
2020-12-03 19:54:40 +01:00
Mike Travis
8dcc0e19df x86/platform/uv: Fix UV4 hub revision adjustment
Currently, UV4 is incorrectly identified as UV4A and UV4A as UV5. Hub
chip starts with revision 1, fix it.

 [ bp: Massage commit message. ]

Fixes: 647128f153 ("x86/platform/uv: Update UV MMRs for UV5")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Link: https://lkml.kernel.org/r/20201203152252.371199-1-mike.travis@hpe.com
2020-12-03 18:09:18 +01:00
Gabriel Krisman Bertazi
1d7637d89c signal: Expose SYS_USER_DISPATCH si_code type
SYS_USER_DISPATCH will be triggered when a syscall is sent to userspace
by the Syscall User Dispatch mechanism.  This adjusts eventual
BUILD_BUG_ON around the tree.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-3-krisman@collabora.com
2020-12-02 10:32:16 +01:00
Gabriele Paoloni
e1c06d2366 x86/mce: Rename kill_it to kill_current_task
Currently, if an MCE happens in user-mode or while the kernel is copying
data from user space, 'kill_it' is used to check if execution of the
interrupted task can be recovered or not; the flag name however is not
very meaningful, hence rename it to match its goal.

 [ bp: Massage commit message, rename the queue_task_work() arg too. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
2020-12-01 18:58:50 +01:00
Gabriele Paoloni
d5b38e3d0f x86/mce: Remove redundant call to irq_work_queue()
Currently, __mc_scan_banks() in do_machine_check() does the following
callchain:

  __mc_scan_banks()->mce_log()->irq_work_queue(&mce_irq_work).

Hence, the call to irq_work_queue() below after __mc_scan_banks()
seems redundant. Just remove it.

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-5-gabriele.paoloni@intel.com
2020-12-01 18:54:32 +01:00
Gabriele Paoloni
3a866b16fd x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3
Right now for LMCE, if no_way_out is set, mce_panic() is called
regardless of mca_cfg.tolerant. This is not correct as, if
mca_cfg.tolerant = 3, the code should never panic.

Add that check.

 [ bp: use local ptr 'cfg'. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-4-gabriele.paoloni@intel.com
2020-12-01 18:49:29 +01:00
Gabriele Paoloni
e273e6e12a x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right places
Right now, for local MCEs the machine calls panic(), if needed, right
after lmce is set. For MCE broadcasting, mce_reign() takes care of
calling mce_panic().

Hence:
- improve readability by moving the conditional evaluation of
tolerant up to when kill_it is set first;
- move the mce_panic() call up into the statement where mce_end()
fails.

 [ bp: Massage, remove comment in the mce_end() failure case because it
   is superfluous; use local ptr 'cfg' in both tests. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-3-gabriele.paoloni@intel.com
2020-12-01 18:45:56 +01:00
Borislav Petkov
15936ca13d Linux 5.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl/EM9oeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/3kH/RNkFyTlHlUkZpJx
 8Ks2yWgUln7YhZcmOaG/IcIyWnhCgo3l35kiaH7XxM+rPMZzidp51MHUllaTAQDc
 u+5EFHMJsmTWUfE8ocHPb1cPdYEDSoVr6QUsixbL9+uADpRz+VZVtWMb89EiyMrC
 wvLIzpnqY5UNriWWBxD0hrmSsT4g9XCsauer4k2KB+zvebwg6vFOMCFLFc2qz7fb
 ABsrPFqLZOMp+16chGxyHP7LJ6ygI/Hwf7tPW8ppv4c+hes4HZg7yqJxXhV02QbJ
 s10s6BTcEWMqKg/T6L/VoScsMHWUcNdvrr3uuPQhgup240XdmB1XO8rOKddw27e7
 VIjrjNw=
 =4ZaP
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc6' into ras/core

Merge the -rc6 tag to pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2020-12-01 18:22:55 +01:00
Xiaochen Shen
19eb86a72d x86/resctrl: Clean up unused function parameter in rmdir path
Commit

  fd8d9db355 ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")

removed superfluous kernfs_get() calls in rdtgroup_ctrl_remove() and
rdtgroup_rmdir_ctrl(). That change resulted in an unused function
parameter to these two functions.

Clean up the unused function parameter in rdtgroup_ctrl_remove(),
rdtgroup_rmdir_mon() and their callers rdtgroup_rmdir_ctrl() and
rdtgroup_rmdir().

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1606759618-13181-1-git-send-email-xiaochen.shen@intel.com
2020-12-01 18:06:35 +01:00
Borislav Petkov
87314fb181 Linux 5.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl/EM9oeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/3kH/RNkFyTlHlUkZpJx
 8Ks2yWgUln7YhZcmOaG/IcIyWnhCgo3l35kiaH7XxM+rPMZzidp51MHUllaTAQDc
 u+5EFHMJsmTWUfE8ocHPb1cPdYEDSoVr6QUsixbL9+uADpRz+VZVtWMb89EiyMrC
 wvLIzpnqY5UNriWWBxD0hrmSsT4g9XCsauer4k2KB+zvebwg6vFOMCFLFc2qz7fb
 ABsrPFqLZOMp+16chGxyHP7LJ6ygI/Hwf7tPW8ppv4c+hes4HZg7yqJxXhV02QbJ
 s10s6BTcEWMqKg/T6L/VoScsMHWUcNdvrr3uuPQhgup240XdmB1XO8rOKddw27e7
 VIjrjNw=
 =4ZaP
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc6' into x86/cache

Merge -rc6 tag to pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2020-12-01 18:01:27 +01:00
Babu Moger
fae3a13d2a x86/resctrl: Fix AMD L3 QOS CDP enable/disable
When the AMD QoS feature CDP (code and data prioritization) is enabled
or disabled, the CDP bit in MSR 0000_0C81 is written on one of the CPUs
in an L3 domain (core complex). That is not correct - the CDP bit needs
to be updated on all the logical CPUs in the domain.

This was not spelled out clearly in the spec earlier. The specification
has been updated and the updated document, "AMD64 Technology Platform
Quality of Service Extensions Publication # 56375 Revision: 1.02 Issue
Date: October 2020" is available now. Refer the section: Code and Data
Prioritization.

Fix the issue by adding a new flag arch_has_per_cpu_cfg in rdt_cache
data structure.

The documentation can be obtained at:
https://developer.amd.com/wp-content/resources/56375.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

 [ bp: Massage commit message. ]

Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/160675180380.15628.3309402017215002347.stgit@bmoger-ubuntu
2020-12-01 17:53:31 +01:00
Linus Torvalds
f91a3aa6bc Yet two more places which invoke tracing from RCU disabled regions in the
idle path. Similar to the entry path the low level idle functions have to
 be non-instrumentable.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/DpAUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXSLD/9klc0YimnEnROW6Q5Svb2IcyIutmXF
 bOIY1bYYoKILOBj3wyvDUhmdMuq5zh7H9yG11hO8MaVVWVQcLcOMLdHTYm9dcdmF
 xQk33+xqjuhRShB+nEmC9ayYtWogtH6W6uZ6WDtF9ZltMKU85n5ddGJ/Fvo+HoCb
 NbOdHGJdJ3/3ZCeHnxOnxM+5/GwjkBuccTV/tXmb3yXrfU9DBySyQ4/UchcpF43w
 LcEb0kiQbpZsBTByKJOQV8+RR654S0sILlvRwVXpmj94vrgGwhlVk1/9rz7tkOhF
 ksoo1mTVu75LMt22G/hXxE63787yRvFdHjapf0+kCOAuhl992NK+xlGDH8o9DXcu
 9y73D4bI0HnDFs20w6vs20iLvxECJiYHJqlgR5ZwFUToceaNgtiYr8kzuD7Zbae1
 KG2E7BuNSwHWMtf97fGn44GZknPEOaKdDn4Wv6/bvKHxLm77qe11RKF70Stcz2AI
 am13KmQzzsHGF5qNWwpElRUxSdxfJMR66RnOdTQULGrRedaZTFol/y2pnVzTSe3k
 SZnlpL5kE7y92UYDogPb5wWA7b+YkJN0OdSkRFy1FH26ZG8E4M7ZJ2tql5Sw7pGM
 lsTjXpAUphnK5rz7QcYE8KAZWj//fIAcElIrvdklVcBnS3IqjfksYW27B64133vx
 cT1B/lA1PHXj6Q==
 =raED
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two more places which invoke tracing from RCU disabled regions in the
  idle path.

  Similar to the entry path the low level idle functions have to be
  non-instrumentable"

* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  intel_idle: Fix intel_idle() vs tracing
  sched/idle: Fix arch_cpu_idle() vs tracing
2020-11-29 11:19:26 -08:00
Linus Torvalds
7255a39d24 - Two resctrl fixes to prevent refcount leaks when manipulating the
resctrl fs (Xiaochen Shen)
 
 - Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)
 
 - A fix to not lose already seen MCE severity which determines whether
   the machine can recover (Gabriele Paoloni)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/DeEcACgkQEsHwGGHe
 VUrAlg/+KO2GMYiz5mNslI3tg0tUf0tDEw7bsfsibGL9XkkRRre0zn2pTN3GoutG
 ZeM9d+epFNfv1msyTYXjhhMeO5XLGnuBzyeCv4wQkIH0zdNJSencPlr6tKnaMrAo
 Alk/BxmvBmOOWHayBOhgNBVYWXBoACkIzIUFi0d3lRbgiGDhOqowYbYkLWIu/tvm
 pI8anovuAfkluhh+tD96TyBrSa+GfWWiOYItNbJbmKAbGjGbJU5IEVmlXeMEziGO
 7zmGNwl4Di2V3lU5oDJNcYvF2Ngok/L81QqOMKeVu7EYDHLBjHxp0rxwuE8QqlX8
 bk40EEHyI1Aiwx/7WxB/ChnDOfpIpXljWEvropZeNOCY1LqDMnMnAJFrygQAmYeo
 t9GxbhBYvtXSxTkkfTWzowt20Vmm2+r5j09kpq3tfMrZXwk7VrSF8TjQ/3iM/5iM
 eV/wPaigqX0xZwdzF1o3fuXeM+RXHhMrnX+cKlFemUX9uFZi0/ZXk3TlJwCG/jKL
 QWbpTBuxcDwXR3K1RNPjBJG1OA1cupZQ9ePK8h95H1844rEwglreypCOf3LwDJzU
 9XEZo94V/J+gxNzg5iakidtdBMSst+SYY+vd/SUlLHXzkwcykOzwPgvZwVA7U2HM
 74Eh7eQhIht3Lin35i9h1Ez/bxE/Ss6Fi9wnObp8ij5ImOaS5eU=
 =JrT9
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "A couple of urgent fixes which accumulated this last week:

   - Two resctrl fixes to prevent refcount leaks when manipulating the
     resctrl fs (Xiaochen Shen)

   - Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)

   - A fix to not lose already seen MCE severity which determines
     whether the machine can recover (Gabriele Paoloni)"

* tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Do not overwrite no_way_out if mce_end() fails
  x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
  x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
  x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
2020-11-29 10:08:17 -08:00
Linus Torvalds
6adf33a5e4 iommu fixes for -rc6
- Fix intel iommu driver when running on devices without VCCAP_REG
 
 - Fix swiotlb and "iommu=pt" interaction under TXT (tboot)
 
 - Fix missing return value check during device probe()
 
 - Fix probe ordering for Qualcomm SMMU implementation
 
 - Ensure page-sized mappings are used for AMD IOMMU buffers with SNP RMP
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl/A3msQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNAI6B/9/MLjurPmQrSusq9Y7dhnGR7aahtICLAwR
 UDZHebGTeVxNqtTklQVl/1qgcY7DxU40LAO1777MM7eHOW1FnlCAWrkTo6BBQsGx
 U1FKegOJ/0eVHFPtFDfM7IA2skeLwlZW+hywNLAksme5mtd6iZG9yQLlDFqAjxL7
 v8uXgfHFn6Z2MvMc2O+IeKTtflIwPek/6rYuaEf7UknA6ZYPAD3hnu9i1RTEuUAA
 h2PoVrrJ/KefEsCMUIq2jwMTsSvxohDH8ClGK9b6h74J2CLKKuhALSgABRAmsEL9
 7w5TVdtMQ85n3ccnXyT4RBQ+O/eVtmsKSdfeAbFI4nG9e9j7YE1L
 =BI1u
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull iommu fixes from Will Deacon:
 "Here's another round of IOMMU fixes for -rc6 consisting mainly of a
  bunch of independent driver fixes. Thomas agreed for me to take the
  x86 'tboot' fix here, as it fixes a regression introduced by a vt-d
  change.

   - Fix intel iommu driver when running on devices without VCCAP_REG

   - Fix swiotlb and "iommu=pt" interaction under TXT (tboot)

   - Fix missing return value check during device probe()

   - Fix probe ordering for Qualcomm SMMU implementation

   - Ensure page-sized mappings are used for AMD IOMMU buffers with SNP
     RMP"

* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  iommu/vt-d: Don't read VCCAP register unless it exists
  x86/tboot: Don't disable swiotlb when iommu is forced on
  iommu: Check return of __iommu_attach_device()
  arm-smmu-qcom: Ensure the qcom_scm driver has finished probing
  iommu/amd: Enforce 4k mapping for certain IOMMU data structures
2020-11-27 10:41:19 -08:00
Gabriele Paoloni
25bc65d8dd x86/mce: Do not overwrite no_way_out if mce_end() fails
Currently, if mce_end() fails, no_way_out - the variable denoting
whether the machine can recover from this MCE - is determined by whether
the worst severity that was found across the MCA banks associated with
the current CPU, is of panic severity.

However, at this point no_way_out could have been already set by
mca_start() after looking at all severities of all CPUs that entered the
MCE handler. If mce_end() fails, check first if no_way_out is already
set and, if so, stick to it, otherwise use the local worst value.

 [ bp: Massage. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
2020-11-27 17:38:36 +01:00
Ingo Molnar
a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Anand K Mistry
33fc379df7 x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command
line, IBPB is force-enabled and STIPB is conditionally-enabled (or not
available).

However, since

  21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")

the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP}
instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour.
Because the issuing of IBPB relies on the switch_mm_*_ibpb static
branches, the mitigations behave as expected.

Since

  1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")

this discrepency caused the misreporting of IB speculation via prctl().

On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb,
prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL |
PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIPB are
always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB
speculation mode, even though the flag is ignored.

Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should
also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not
available.

 [ bp: Massage commit message. ]

Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Fixes: 1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
2020-11-25 20:17:09 +01:00
Lu Baolu
e2be2a833a x86/tboot: Don't disable swiotlb when iommu is forced on
After commit 327d5b2fee ("iommu/vt-d: Allow 32bit devices to uses DMA
domain"), swiotlb could also be used for direct memory access if IOMMU
is enabled but a device is configured to pass through the DMA translation.
Keep swiotlb when IOMMU is forced on, otherwise, some devices won't work
if "iommu=pt" kernel parameter is used.

Fixes: 327d5b2fee ("iommu/vt-d: Allow 32bit devices to uses DMA domain")
Reported-and-tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20201125014124.4070776-1-baolu.lu@linux.intel.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210237
Signed-off-by: Will Deacon <will@kernel.org>
2020-11-25 12:07:32 +00:00
Peter Zijlstra
545b8c8df4 smp: Cleanup smp_call_function*()
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Peter Zijlstra
58c644ba51 sched/idle: Fix arch_cpu_idle() vs tracing
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.

(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)

Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
2020-11-24 16:47:35 +01:00
Thomas Gleixner
7e015a2798 x86/crashdump/32: Simplify copy_oldmem_page()
Replace kmap_atomic_pfn() with kmap_local_pfn() which is preemptible and
can take page faults.

Remove the indirection of the dump page and the related cruft which is not
longer required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.670851839@linutronix.de
2020-11-24 14:42:09 +01:00
Xiaochen Shen
7589992469 x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
On resource group creation via a mkdir an extra kernfs_node reference is
obtained by kernfs_get() to ensure that the rdtgroup structure remains
accessible for the rdtgroup_kn_unlock() calls where it is removed on
deletion. Currently the extra kernfs_node reference count is only
dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup
structure is removed in a few other locations that lack the matching
reference drop.

In call paths of rmdir and umount, when a control group is removed,
kernfs_remove() is called to remove the whole kernfs nodes tree of the
control group (including the kernfs nodes trees of all child monitoring
groups), and then rdtgroup structure is freed by kfree(). The rdtgroup
structures of all child monitoring groups under the control group are
freed by kfree() in free_all_child_rdtgrp().

Before calling kfree() to free the rdtgroup structures, the kernfs node
of the control group itself as well as the kernfs nodes of all child
monitoring groups still take the extra references which will never be
dropped to 0 and the kernfs nodes will never be freed. It leads to
reference count leak and kernfs_node_cache memory leak.

For example, reference count leak is observed in these two cases:
  (1) mount -t resctrl resctrl /sys/fs/resctrl
      mkdir /sys/fs/resctrl/c1
      mkdir /sys/fs/resctrl/c1/mon_groups/m1
      umount /sys/fs/resctrl

  (2) mkdir /sys/fs/resctrl/c1
      mkdir /sys/fs/resctrl/c1/mon_groups/m1
      rmdir /sys/fs/resctrl/c1

The same reference count leak issue also exists in the error exit paths
of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon().

Fix this issue by following changes to make sure the extra kernfs_node
reference on rdtgroup is dropped before freeing the rdtgroup structure.
  (1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up
  kernfs_put() and kfree().

  (2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup
  structure is about to be freed by kfree().

  (3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error
  exit paths of mkdir where an extra reference is taken by kernfs_get().

Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
2020-11-24 12:13:37 +01:00
Xiaochen Shen
fd8d9db355 x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
Willem reported growing of kernfs_node_cache entries in slabtop when
repeatedly creating and removing resctrl subdirectories as well as when
repeatedly mounting and unmounting the resctrl filesystem.

On resource group (control as well as monitoring) creation via a mkdir
an extra kernfs_node reference is obtained to ensure that the rdtgroup
structure remains accessible for the rdtgroup_kn_unlock() calls where it
is removed on deletion. The kernfs_node reference count is dropped by
kernfs_put() in rdtgroup_kn_unlock().

With the above explaining the need for one kernfs_get()/kernfs_put()
pair in resctrl there are more places where a kernfs_node reference is
obtained without a corresponding release. The excessive amount of
reference count on kernfs nodes will never be dropped to 0 and the
kernfs nodes will never be freed in the call paths of rmdir and umount.
It leads to reference count leak and kernfs_node_cache memory leak.

Remove the superfluous kernfs_get() calls and expand the existing
comments surrounding the remaining kernfs_get()/kernfs_put() pair that
remains in use.

Superfluous kernfs_get() calls are removed from two areas:

  (1) In call paths of mount and mkdir, when kernfs nodes for "info",
  "mon_groups" and "mon_data" directories and sub-directories are
  created, the reference count of newly created kernfs node is set to 1.
  But after kernfs_create_dir() returns, superfluous kernfs_get() are
  called to take an additional reference.

  (2) kernfs_get() calls in rmdir call paths.

Fixes: 17eafd0762 ("x86/intel_rdt: Split resource group removal in two")
Fixes: 4af4a88e0c ("x86/intel_rdt/cqm: Add mount,umount support")
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Fixes: 5dc1d5c6ba ("x86/intel_rdt: Simplify info and base file lists")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Fixes: 4e978d06de ("x86/intel_rdt: Add "info" files to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
2020-11-24 12:03:04 +01:00
Borislav Petkov
afe76eca86 x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment
Fix

  ./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Function parameter or member \
	  'encl' not described in 'sgx_ioc_enclave_provision'
  ./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Excess function parameter \
	  'enclave' description in 'sgx_ioc_enclave_provision'

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201123181922.0c009406@canb.auug.org.au
2020-11-24 10:46:01 +01:00
Peter Collingbourne
23acdc76f1 signal: clear non-uapi flag bits when passing/returning sa_flags
Previously we were not clearing non-uapi flag bits in
sigaction.sa_flags when storing the userspace-provided sa_flags or
when returning them via oldact. Start doing so.

This allows userspace to detect missing support for flag bits and
allows the kernel to use non-uapi bits internally, as we are already
doing in arch/x86 for two flag bits. Now that this change is in
place, we no longer need the code in arch/x86 that was hiding these
bits from userspace, so remove it.

This is technically a userspace-visible behavior change for sigaction, as
the unknown bits returned via oldact.sa_flags are no longer set. However,
we are free to define the behavior for unknown bits exactly because
their behavior is currently undefined, so for now we can define the
meaning of each of them to be "clear the bit in oldact.sa_flags unless
the bit becomes known in the future". Furthermore, this behavior is
consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and
NetBSD [5] fail the syscall if unknown bits are set). So there is some
precedent for this behavior in other kernels, and in particular in XNU,
which is probably the most popular kernel among those that I looked at,
which means that this change is less likely to be a compatibility issue.

Link: [1] f634a6a4b5/sys/kern/kern_sig.c (L278)
Link: [2] 76f19f5fdc/usr/src/uts/common/syscall/sigaction.c (L86)
Link: [3] a449c6a3b8/bsd/kern/kern_sig.c (L480)
Link: [4] eded70c370/sys/kern/kern_sig.c (L699)
Link: [5] 3365779bec/sys/kern/sys_sig.c (L473)
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86
Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-11-23 10:31:05 -06:00
Linus Torvalds
7d53be55c9 * An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
same because the proper one is going through the IOMMU tree. (Thomas Gleixner)
 
 * An Intel microcode loader fix to save the correct microcode patch to
   apply during resume. (Chen Yu)
 
 * A fix to not access user memory of other processes when dumping opcode
   bytes. (Thomas Gleixner)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+6QnYACgkQEsHwGGHe
 VUpMgg//cE4GvK7dFihq5iNYS16yKKgyVBLwHiHoCxQkOu4CPSf0EejTMLKPSKpA
 hS7pJm6dMYLJ7PhIKRwlQD1NYet3SHCOi1h0Y2/3rVay82Vz/As2nZRd7paEkP9k
 kiNd/Ib2uAQ9ZwCKhmLanOWDoXXFo5Fwmpw+O/zhn31Y3kBx05mc2ojXnEwCERK/
 j6GwOBkfOq+q6D7HzsfEehlX0iy2rlPSq6OUN1H/Z4w+W9zfpubLX9bhhr9RvDmS
 6Hsmwos8fM/cQXZC2oxoKUP6zeYk5lDF/Qf7hOFt3WgdCpTZP2z6uLwUL6+NXu/C
 Ms+77Cx+Loe8UaDkyxXXruiv7mZIZCnl0phZulE4ibOjzL7W7w5cEf0LHdDS7uDL
 JzZwTJlbK9RjrSp4pXR+rn+U/PdluZJYpjh/Gj8+V23QszvbDsVd55SV5+CLSTM/
 qwrNp5ffLEnIFHmW98/OQKMvNTTzBFdlltnJm7mMW2JucgfCZMx4zwY0+dZ+ocDq
 /Kt/HNCzez1jHWiS5jdZCM15KuIhPTNuTpq0ipU+UAXxZOaNtTo1st1lhKoPvTz/
 k4y7S/EibX6PK4lxBQMM3fBG/T1KK7ZMgNEN0jbhWegqEmdWtgwc57p1gVE70XUU
 f6elyUupD65TMwgio2I8sQuiNBCKfF2Wj144dxXGJnGnOoZKey8=
 =z/Gq
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
   same because the proper one is going through the IOMMU tree (Thomas
   Gleixner)

 - An Intel microcode loader fix to save the correct microcode patch to
   apply during resume (Chen Yu)

 - A fix to not access user memory of other processes when dumping
   opcode bytes (Thomas Gleixner)

* tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account"
  x86/dumpstack: Do not try to access user space code of other tasks
  x86/microcode/intel: Check patch signature before saving microcode for early loading
  iommu/vt-d: Take CONFIG_PCI_ATS into account
2020-11-22 12:55:50 -08:00
Smita Koralahalli
4a24d80b8c x86/mce, cper: Pass x86 CPER through the MCA handling chain
The kernel uses ACPI Boot Error Record Table (BERT) to report fatal
errors that occurred in a previous boot. The MCA errors in the BERT are
reported using the x86 Processor Error Common Platform Error Record
(CPER) format. Currently, the record prints out the raw MSR values and
AMD relies on the raw record to provide MCA information.

Extract the raw MSR values of MCA registers from the BERT and feed them
into mce_log() to decode them properly.

The implementation is SMCA-specific as the raw MCA register values are
given in the register offset order of the SMCA address space.

 [ bp: Massage. ]

[ Fix a build breakage in patch v1. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Punit Agrawal <punit1.agrawal@toshiba.co.jp>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20201119182938.151155-1-Smita.KoralahalliChannabasappa@amd.com
2020-11-21 12:05:41 +01:00
Linus Torvalds
fc8299f9f3 iommu fixes for -rc5
- Fix boot when intel iommu initialisation fails under TXT (tboot)
 
 - Fix intel iommu compilation error when DMAR is enabled without ATS
 
 - Temporarily update IOMMU MAINTAINERs entry
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+3p6gQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNPowCADBKq5PBRwqIM1tmntRu5rePf/WuB1BotlJ
 cUUR8dIxFDvKMeGvN4mDbn/5jjL1+TlD7sQqFh+gcWJ9tImpAMzp1ENbg71isl5o
 Ej21X9YhjfiIsKpTrNGxcetCkYaR2tp8Z4/WzSQaO+IGB578dQ9VwiwSnDyFdWUb
 1Qcj712ainYvKjbq4NyL80lPm9v43OV8QZ73WeelzBjE2UcDFAV0Xl4BIX8uuGtv
 I1x03JfgzmzhVBmpj5HUGPsSopMDdo5K4vlcqscqSqvBY90iHD5fgRF4ZatTjR1t
 PVM/1+c9vdJIWSxSt1hsB8DPtqXVOvGjzrLW/h63rxyWsISwkAq0
 =DNX/
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull iommu fixes from Will Deacon:
 "Two straightforward vt-d fixes:

   - Fix boot when intel iommu initialisation fails under TXT (tboot)

   - Fix intel iommu compilation error when DMAR is enabled without ATS

  and temporarily update IOMMU MAINTAINERs entry"

* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  MAINTAINERS: Temporarily add myself to the IOMMU entry
  iommu/vt-d: Fix compile error with CONFIG_PCI_ATS not set
  iommu/vt-d: Avoid panic if iommu init fails in tboot system
2020-11-20 10:20:16 -08:00
Wang Qing
61b39ad9a7 x86/head64: Remove duplicate include
Remove duplicate header include.

Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1604893542-20961-1-git-send-email-wangqing@vivo.com
2020-11-20 17:43:15 +01:00
Paul E. McKenney
7fc91fc845 Merge branches 'cpuinfo.2020.11.06a', 'doc.2020.11.06a', 'fixes.2020.11.19b', 'lockdep.2020.11.02a', 'tasks.2020.11.06a' and 'torture.2020.11.06a' into HEAD
cpuinfo.2020.11.06a: Speedups for /proc/cpuinfo.
doc.2020.11.06a: Documentation updates.
fixes.2020.11.19b: Miscellaneous fixes.
lockdep.2020.11.02a: Lockdep-RCU updates to avoid "unused variable".
tasks.2020.11.06a: Tasks-RCU updates.
torture.2020.11.06a': Torture-test updates.
2020-11-19 19:37:47 -08:00
Paul E. McKenney
29368e0939 x86/smpboot: Move rcu_cpu_starting() earlier
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:

=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x77/0x97
 __is_insn_slot_addr+0x15d/0x170
 kernel_text_address+0xba/0xe0
 ? get_stack_info+0x22/0xa0
 __kernel_text_address+0x9/0x30
 show_trace_log_lvl+0x17d/0x380
 ? dump_stack+0x77/0x97
 dump_stack+0x77/0x97
 __lock_acquire+0xdf7/0x1bf0
 lock_acquire+0x258/0x3d0
 ? vprintk_emit+0x6d/0x2c0
 _raw_spin_lock+0x27/0x40
 ? vprintk_emit+0x6d/0x2c0
 vprintk_emit+0x6d/0x2c0
 printk+0x4d/0x69
 start_secondary+0x1c/0x100
 secondary_startup_64_no_verify+0xb8/0xbb

This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function.  Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Rikard Falkeborn
2002d29513 x86/resctrl: Constify kernfs_ops
The only usage of the kf_ops field in the rftype struct is to pass
it as argument to __kernfs_create_file(), which accepts a pointer to
const. Make it a pointer to const. This makes it possible to make
rdtgroup_kf_single_ops and kf_mondata_ops const, which allows the
compiler to put them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20201110230228.801785-1-rikard.falkeborn@gmail.com
2020-11-19 18:23:45 +01:00
Yazen Ghannam
cb09a37972 x86/topology: Set cpu_die_id only if DIE_TYPE found
CPUID Leaf 0x1F defines a DIE_TYPE level (nb: ECX[8:15] level type == 0x5),
but CPUID Leaf 0xB does not. However, detect_extended_topology() will
set struct cpuinfo_x86.cpu_die_id regardless of whether a valid Die ID
was found.

Only set cpu_die_id if a DIE_TYPE level is found. CPU topology code may
use another value for cpu_die_id, e.g. the AMD NodeId on AMD-based
systems. Code ordering should be maintained so that the CPUID Leaf 0x1F
Die ID value will take precedence on systems that may use another value.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-5-Yazen.Ghannam@amd.com
2020-11-19 11:43:25 +01:00
Yazen Ghannam
db970bd231 x86/CPU/AMD: Remove amd_get_nb_id()
The Last Level Cache ID is returned by amd_get_nb_id(). In practice,
this value is the same as the AMD NodeId for callers of this function.
The NodeId is saved in struct cpuinfo_x86.cpu_die_id.

Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and
remove the function.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
2020-11-19 11:43:17 +01:00
Yazen Ghannam
028c221ed1 x86/CPU/AMD: Save AMD NodeId as cpu_die_id
AMD systems provide a "NodeId" value that represents a global ID
indicating to which "Node" a logical CPU belongs. The "Node" is a
physical structure equivalent to a Die, and it should not be confused
with logical structures like NUMA nodes. Logical nodes can be adjusted
based on firmware or other settings whereas the physical nodes/dies are
fixed based on hardware topology.

The NodeId value can be used when a physical ID is needed by software.

Save the AMD NodeId to struct cpuinfo_x86.cpu_die_id. Use the value
from CPUID or MSR as appropriate. Default to phys_proc_id otherwise.
Do so for both AMD and Hygon systems.

Drop the node_id parameter from cacheinfo_*_init_llc_id() as it is no
longer needed.

Update the x86 topology documentation.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-2-Yazen.Ghannam@amd.com
2020-11-19 11:43:13 +01:00
Jarkko Sakkinen
14132a5b80 x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()
Return -ERESTARTSYS instead of -EINTR in sgx_ioc_enclave_add_pages()
when interrupted before any pages have been processed. At this point
ioctl can be obviously safely restarted.

Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201118213932.63341-1-jarkko@kernel.org
2020-11-19 10:51:24 +01:00
Will Deacon
388255ce95 A small set of fixes for x86:
- Cure the fallout from the MSI irqdomain overhaul which missed that the
    Intel IOMMU does not register virtual function devices and therefore
    never reaches the point where the MSI interrupt domain is assigned. This
    makes the VF devices use the non-remapped MSI domain which is trapped by
    the IOMMU/remap unit.
 
  - Remove an extra space in the SGI_UV architecture type procfs output for
    UV5.
 
  - Remove a unused function which was missed when removing the UV BAU TLB
    shootdown handler.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVWxD/9Tq4W6Kniln7mtoEWHRvHRceiiGcS3
 MocvqurhoJwirH4F2gkvCegTBy0r3FdUORy3OMmChVs6nb8XpPpso84SANCRePWp
 JZezpVwLSNC4O1/ZCg1Kjj4eUpzLB/UjUUQV9RsjL5wyQEhfCZgb1D40yLM/2dj5
 SkVm/EAqWuQNtYe/jqAOwTX/7mV+k2QEmKCNOigM13R9EWgu6a4J8ta1gtNSbwvN
 jWMW+M1KjZ76pfRK+y4OpbuFixteSzhSWYPITSGwQz4IpQ+Ty2Rv0zzjidmDnAR+
 Q73cup0dretdVnVDRpMwDc06dBCmt/rbN50w4yGU0YFRFDgjGc8sIbQzuIP81nEQ
 XY4l4rcBgyVufFsLrRpQxu1iYPFrcgU38W1kRkkJ3Kl/rY1a2ZU7sLE4kt4Oh55W
 A9KCmsfqP1PCYppjAQ0QT4NOp4YtecPvAU4UcBOb722DDBd8TfhLWWGw2yG57Q/d
 Wnu8xCJGy7BaLHLGGGseAft+D4aNnCjKC3jgMyvNtRDXaV2cK2Kdd6ehMlWVUapD
 xfLlKXE+igXMyoWJIWjTXQJs4dpKu6QpJCPiorwEZ8rmNaRfxsWEJVbeYwEkmUke
 bMoBBSCbZT86WVOYhI8WtrIemraY0mMYrrcE03M96HU3eYB8BV92KrIzZWThupcQ
 ZqkZbqCZm3vfHA==
 =X/P+
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-next/iommu/fixes

Pull in x86 fixes from Thomas, as they include a change to the Intel DMAR
code on which we depend:

* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iommu/vt-d: Cure VF irqdomain hickup
  x86/platform/uv: Fix copied UV5 output archtype
  x86/platform/uv: Drop last traces of uv_flush_tlb_others
2020-11-19 09:50:06 +00:00
Borislav Petkov
b023fd5f74 x86/msr: Downgrade unrecognized MSR message
It is a warning and not an error so use pr_warn().

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201118123806.19672-1-bp@alien8.de
2020-11-19 10:46:54 +01:00
Dave Hansen
67655b57f8 x86/sgx: Clarify 'laundry_list' locking
Short Version:

The SGX section->laundry_list structure is effectively thread-local, but
declared next to some shared structures. Its semantics are clear as mud.
Fix that. No functional changes. Compile tested only.

Long Version:

The SGX hardware keeps per-page metadata. This can provide things like
permissions, integrity and replay protection. It also prevents things
like having an enclave page mapped multiple times or shared between
enclaves.

But, that presents a problem for kexec()'d kernels (or any other kernel
that does not run immediately after a hardware reset). This is because
the last kernel may have been rude and forgotten to reset pages, which
would trigger the "shared page" sanity check.

To fix this, the SGX code "launders" the pages by running the EREMOVE
instruction on all pages at boot. This is slow and can take a long
time, so it is performed off in the SGX-specific ksgxd instead of being
synchronous at boot. The init code hands the list of pages to launder in
a per-SGX-section list: ->laundry_list. The only code to touch this list
is the init code and ksgxd. This means that no locking is necessary for
->laundry_list.

However, a lock is required for section->page_list, which is accessed
while creating enclaves and by ksgxd. This lock (section->lock) is
acquired by ksgxd while also processing ->laundry_list. It is easy to
confuse the purpose of the locking as being for ->laundry_list and
->page_list.

Rename ->laundry_list to ->init_laundry_list to make it clear that this
is not normally used at runtime. Also add some comments clarifying the
locking, and reorganize 'sgx_epc_section' to put 'lock' near the things
it protects.

Note: init_laundry_list is 128 bytes of wasted space at runtime. It
could theoretically be dynamically allocated and then freed after
the laundering process. But it would take nearly 128 bytes of extra
instructions to do that.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201116222531.4834-1-dave.hansen@intel.com
2020-11-18 18:16:28 +01:00
Arvind Sankar
31d8546033 x86/head/64: Remove unused GET_CR2_INTO() macro
Commit

  4b47cdbda6 ("x86/head/64: Move early exception dispatch to C code")

removed the usage of GET_CR2_INTO().

Drop the definition as well, and related definitions in paravirt.h and
asm-offsets.h

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005151208.2212886-3-nivedita@alum.mit.edu
2020-11-18 18:09:38 +01:00
Jarkko Sakkinen
947c6e11fa x86/sgx: Add ptrace() support for the SGX driver
Enclave memory is normally inaccessible from outside the enclave. This
makes enclaves hard to debug. However, enclaves can be put in a debug
mode when they are being built. In that mode, enclave data *can* be read
and/or written by using the ENCLS[EDBGRD] and ENCLS[EDBGWR] functions.

This is obviously only for debugging and destroys all the protections
present with normal enclaves. But, enclaves know their own debug status
and can adjust their behavior appropriately.

Add a vm_ops->access() implementation which can be used to read and write
memory inside debug enclaves.  This is typically used via ptrace() APIs.

 [ bp: Massage. ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-23-jarkko@kernel.org
2020-11-18 18:04:11 +01:00
Jarkko Sakkinen
1728ab54b4 x86/sgx: Add a page reclaimer
Just like normal RAM, there is a limited amount of enclave memory available
and overcommitting it is a very valuable tool to reduce resource use.
Introduce a simple reclaim mechanism for enclave pages.

In contrast to normal page reclaim, the kernel cannot directly access
enclave memory.  To get around this, the SGX architecture provides a set of
functions to help.  Among other things, these functions copy enclave memory
to and from normal memory, encrypting it and protecting its integrity in
the process.

Implement a page reclaimer by using these functions. Picks victim pages in
LRU fashion from all the enclaves running in the system.  A new kernel
thread (ksgxswapd) reclaims pages in the background based on watermarks,
similar to normal kswapd.

All enclave pages can be reclaimed, architecturally.  But, there are some
limits to this, such as the special SECS metadata page which must be
reclaimed last.  The page version array (used to mitigate replaying old
reclaimed pages) is also architecturally reclaimable, but not yet
implemented.  The end result is that the vast majority of enclave pages are
currently reclaimable.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-22-jarkko@kernel.org
2020-11-18 18:04:11 +01:00
Sean Christopherson
334872a091 x86/traps: Attempt to fixup exceptions in vDSO before signaling
vDSO functions can now leverage an exception fixup mechanism similar to
kernel exception fixup.  For vDSO exception fixup, the initial user is
Intel's Software Guard Extensions (SGX), which will wrap the low-level
transitions to/from the enclave, i.e. EENTER and ERESUME instructions,
in a vDSO function and leverage fixup to intercept exceptions that would
otherwise generate a signal.  This allows the vDSO wrapper to return the
fault information directly to its caller, obviating the need for SGX
applications and libraries to juggle signal handlers.

Attempt to fixup vDSO exceptions immediately prior to populating and
sending signal information.  Except for the delivery mechanism, an
exception in a vDSO function should be treated like any other exception
in userspace, e.g. any fault that is successfully handled by the kernel
should not be directly visible to userspace.

Although it's debatable whether or not all exceptions are of interest to
enclaves, defer to the vDSO fixup to decide whether to do fixup or
generate a signal.  Future users of vDSO fixup, if there ever are any,
will undoubtedly have different requirements than SGX enclaves, e.g. the
fixup vs. signal logic can be made function specific if/when necessary.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-19-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Jarkko Sakkinen
c82c618650 x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION
The whole point of SGX is to create a hardware protected place to do
“stuff”. But, before someone is willing to hand over the keys to
the castle , an enclave must often prove that it is running on an
SGX-protected processor. Provisioning enclaves play a key role in
providing proof.

There are actually three different enclaves in play in order to make this
happen:

1. The application enclave.  The familiar one we know and love that runs
   the actual code that’s doing real work.  There can be many of these on
   a single system, or even in a single application.
2. The quoting enclave  (QE).  The QE is mentioned in lots of silly
   whitepapers, but, for the purposes of kernel enabling, just pretend they
   do not exist.
3. The provisioning enclave.  There is typically only one of these
   enclaves per system.  Provisioning enclaves have access to a special
   hardware key.

   They can use this key to help to generate certificates which serve as
   proof that enclaves are running on trusted SGX hardware.  These
   certificates can be passed around without revealing the special key.

Any user who can create a provisioning enclave can access the
processor-unique Provisioning Certificate Key which has privacy and
fingerprinting implications. Even if a user is permitted to create
normal application enclaves (via /dev/sgx_enclave), they should not be
able to create provisioning enclaves. That means a separate permissions
scheme is needed to control provisioning enclave privileges.

Implement a separate device file (/dev/sgx_provision) which allows
creating provisioning enclaves. This device will typically have more
strict permissions than the plain enclave device.

The actual device “driver” is an empty stub.  Open file descriptors for
this device will represent a token which allows provisioning enclave duty.
This file descriptor can be passed around and ultimately given as an
argument to the /dev/sgx_enclave driver ioctl().

 [ bp: Touchups. ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-security-module@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112220135.165028-16-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Jarkko Sakkinen
9d0c151b41 x86/sgx: Add SGX_IOC_ENCLAVE_INIT
Enclaves have two basic states. They are either being built and are
malleable and can be modified by doing things like adding pages. Or,
they are locked down and not accepting changes. They can only be run
after they have been locked down. The ENCLS[EINIT] function induces the
transition from being malleable to locked-down.

Add an ioctl() that performs ENCLS[EINIT]. After this, new pages can
no longer be added with ENCLS[EADD]. This is also the time where the
enclave can be measured to verify its integrity.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-15-jarkko@kernel.org
2020-11-18 18:02:49 +01:00
Jarkko Sakkinen
c6d26d3707 x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
SGX enclave pages are inaccessible to normal software. They must be
populated with data by copying from normal memory with the help of the
EADD and EEXTEND functions of the ENCLS instruction.

Add an ioctl() which performs EADD that adds new data to an enclave, and
optionally EEXTEND functions that hash the page contents and use the
hash as part of enclave “measurement” to ensure enclave integrity.

The enclave author gets to decide which pages will be included in the
enclave measurement with EEXTEND. Measurement is very slow and has
sometimes has very little value. For instance, an enclave _could_
measure every page of data and code, but would be slow to initialize.
Or, it might just measure its code and then trust that code to
initialize the bulk of its data after it starts running.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-14-jarkko@kernel.org
2020-11-18 18:02:49 +01:00
Jarkko Sakkinen
888d249117 x86/sgx: Add SGX_IOC_ENCLAVE_CREATE
Add an ioctl() that performs the ECREATE function of the ENCLS
instruction, which creates an SGX Enclave Control Structure (SECS).

Although the SECS is an in-memory data structure, it is present in
enclave memory and is not directly accessible by software.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-13-jarkko@kernel.org
2020-11-18 18:02:49 +01:00
Jarkko Sakkinen
3fe0778eda x86/sgx: Add an SGX misc driver interface
Intel(R) SGX is a new hardware functionality that can be used by
applications to set aside private regions of code and data called
enclaves. New hardware protects enclave code and data from outside
access and modification.

Add a driver that presents a device file and ioctl API to build and
manage enclaves.

 [ bp: Small touchups, remove unused encl variable in sgx_encl_find() as
   Reported-by: kernel test robot <lkp@intel.com> ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-12-jarkko@kernel.org
2020-11-18 18:01:16 +01:00
Zhenzhong Duan
4d213e76a3 iommu/vt-d: Avoid panic if iommu init fails in tboot system
"intel_iommu=off" command line is used to disable iommu but iommu is force
enabled in a tboot system for security reason.

However for better performance on high speed network device, a new option
"intel_iommu=tboot_noforce" is introduced to disable the force on.

By default kernel should panic if iommu init fail in tboot for security
reason, but it's unnecessory if we use "intel_iommu=tboot_noforce,off".

Fix the code setting force_on and move intel_iommu_tboot_noforce
from tboot code to intel iommu code.

Fixes: 7304e8f28b ("iommu/vt-d: Correctly disable Intel IOMMU force on")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com>
Tested-by: Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20201110071908.3133-1-zhenzhong.duan@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-11-18 13:09:07 +00:00
Thomas Gleixner
860aaabac8 x86/dumpstack: Do not try to access user space code of other tasks
sysrq-t ends up invoking show_opcodes() for each task which tries to access
the user space code of other processes, which is obviously bogus.

It either manages to dump where the foreign task's regs->ip points to in a
valid mapping of the current task or triggers a pagefault and prints "Code:
Bad RIP value.". Both is just wrong.

Add a safeguard in copy_code() and check whether the @regs pointer matches
currents pt_regs. If not, do not even try to access it.

While at it, add commentary why using copy_from_user_nmi() is safe in
copy_code() even if the function name suggests otherwise.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201117202753.667274723@linutronix.de
2020-11-18 12:56:29 +01:00
Hui Su
09a217c105 x86/dumpstack: Make show_trace_log_lvl() static
show_trace_log_lvl() is not used by other compilation units so make it
static and remove the declaration from the header file.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201113133943.GA136221@rlk
2020-11-17 19:05:32 +01:00
Jarkko Sakkinen
d2285493be x86/sgx: Add SGX page allocator functions
Add functions for runtime allocation and free.

This allocator and its algorithms are as simple as it gets.  They do a
linear search across all EPC sections and find the first free page.  They
are not NUMA-aware and only hand out individual pages.  The SGX hardware
does not support large pages, so something more complicated like a buddy
allocator is unwarranted.

The free function (sgx_free_epc_page()) implicitly calls ENCLS[EREMOVE],
which returns the page to the uninitialized state.  This ensures that the
page is ready for use at the next allocation.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-10-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Jarkko Sakkinen
38853a3039 x86/cpu/intel: Add a nosgx kernel parameter
Add a kernel parameter to disable SGX kernel support and document it.

 [ bp: Massage. ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-9-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
224ab3527f x86/cpu/intel: Detect SGX support
Kernel support for SGX is ultimately decided by the state of the launch
control bits in the feature control MSR (MSR_IA32_FEAT_CTL).  If the
hardware supports SGX, but neglects to support flexible launch control, the
kernel will not enable SGX.

Enable SGX at feature control MSR initialization and update the associated
X86_FEATURE flags accordingly.  Disable X86_FEATURE_SGX (and all
derivatives) if the kernel is not able to establish itself as the authority
over SGX Launch Control.

All checks are performed for each logical CPU (not just boot CPU) in order
to verify that MSR_IA32_FEATURE_CONTROL is correctly configured on all
CPUs. All SGX code in this series expects the same configuration from all
CPUs.

This differs from VMX where X86_FEATURE_VMX is intentionally cleared only
for the current CPU so that KVM can provide additional information if KVM
fails to load like which CPU doesn't support VMX.  There’s not much the
kernel or an administrator can do to fix the situation, so SGX neglects to
convey additional details about these kinds of failures if they occur.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-8-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
e7e0545299 x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections
Although carved out of normal DRAM, enclave memory is marked in the
system memory map as reserved and is not managed by the core mm.  There
may be several regions spread across the system.  Each contiguous region
is called an Enclave Page Cache (EPC) section.  EPC sections are
enumerated via CPUID

Enclave pages can only be accessed when they are mapped as part of an
enclave, by a hardware thread running inside the enclave.

Parse CPUID data, create metadata for EPC pages and populate a simple
EPC page allocator.  Although much smaller, ‘struct sgx_epc_page’
metadata is the SGX analog of the core mm ‘struct page’.

Similar to how the core mm’s page->flags encode zone and NUMA
information, embed the EPC section index to the first eight bits of
sgx_epc_page->desc.  This allows a quick reverse lookup from EPC page to
EPC section.  Existing client hardware supports only a single section,
while upcoming server hardware will support at most eight sections.
Thus, eight bits should be enough for long term needs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Serge Ayoun <serge.ayoun@intel.com>
Signed-off-by: Serge Ayoun <serge.ayoun@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-6-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Jarkko Sakkinen
2c273671d0 x86/sgx: Add wrappers for ENCLS functions
ENCLS is the userspace instruction which wraps virtually all
unprivileged SGX functionality for managing enclaves.  It is essentially
the ioctl() of instructions with each function implementing different
SGX-related functionality.

Add macros to wrap the ENCLS functionality. There are two main groups,
one for functions which do not return error codes and a “ret_” set for
those that do.

ENCLS functions are documented in Intel SDM section 36.6.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-3-jarkko@kernel.org
2020-11-17 14:36:12 +01:00
Jarkko Sakkinen
70d3b8ddcd x86/sgx: Add SGX architectural data structures
Define the SGX architectural data structures used by various SGX
functions. This is not an exhaustive representation of all SGX data
structures but only those needed by the kernel.

The goal is to sequester hardware structures in "sgx/arch.h" and keep
them separate from kernel-internal or uapi structures.

The data structures are described in Intel SDM section 37.6.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-2-jarkko@kernel.org
2020-11-17 14:36:12 +01:00
Chen Yu
1a371e67dc x86/microcode/intel: Check patch signature before saving microcode for early loading
Currently, scan_microcode() leverages microcode_matches() to check
if the microcode matches the CPU by comparing the family and model.
However, the processor stepping and flags of the microcode signature
should also be considered when saving a microcode patch for early
update.

Use find_matching_signature() in scan_microcode() and get rid of the
now-unused microcode_matches() which is a good cleanup in itself.

Complete the verification of the patch being saved for early loading in
save_microcode_patch() directly. This needs to be done there too because
save_mc_for_early() will call save_microcode_patch() too.

The second reason why this needs to be done is because the loader still
tries to support, at least hypothetically, mixed-steppings systems and
thus adds all patches to the cache that belong to the same CPU model
albeit with different steppings.

For example:

  microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6
  microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, date = 2020-04-23
  microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
  microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, date = 2020-04-23
  microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
  microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, date = 2020-04-23

The patch which is being saved for early loading, however, can only be
the one which fits the CPU this runs on so do the signature verification
before saving.

 [ bp: Do signature verification in save_microcode_patch()
       and rewrite commit message. ]

Fixes: ec400ddeff ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208535
Link: https://lkml.kernel.org/r/20201113015923.13960-1-yu.c.chen@intel.com
2020-11-17 10:33:18 +01:00
Thomas Gleixner
4cffe21d4a Merge branch 'x86/entry' into core/entry
Prepare for the merging of the syscall_work series which conflicts with the
TIF bits overhaul in X86.
2020-11-16 20:51:59 +01:00
Alex Shi
789f52c071 x86/kvm: remove unused macro HV_CLOCK_SIZE
This macro is useless, and could cause gcc warning:
arch/x86/kernel/kvmclock.c:47:0: warning: macro "HV_CLOCK_SIZE" is not
used [-Wunused-macros]
Let's remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <1604651963-10067-1-git-send-email-alex.shi@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-16 13:14:21 -05:00
Borislav Petkov
18741a5251 x86/msr: Do not allow writes to MSR_IA32_ENERGY_PERF_BIAS
Now that all in-kernel-tree users are converted to using the sysfs file,
remove the MSR from the "allowlist".

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20201029190259.3476-5-bp@alien8.de
2020-11-16 17:44:04 +01:00
Tony Luck
098416e698 x86/mce: Use "safe" MSR functions when enabling additional error logging
Booting as a guest under KVM results in error messages about
unchecked MSR access:

  unchecked MSR access error: RDMSR from 0x17f at rIP: 0xffffffff84483f16 (mce_intel_feature_init+0x156/0x270)

because KVM doesn't provide emulation for random model specific
registers.

Switch to using rdmsrl_safe()/wrmsrl_safe() to avoid the message.

Fixes: 68299a42f8 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201111003954.GA11878@agluck-desk2.amr.corp.intel.com
2020-11-16 17:34:08 +01:00
Linus Torvalds
326fd6db61 A small set of fixes for x86:
- Cure the fallout from the MSI irqdomain overhaul which missed that the
    Intel IOMMU does not register virtual function devices and therefore
    never reaches the point where the MSI interrupt domain is assigned. This
    makes the VF devices use the non-remapped MSI domain which is trapped by
    the IOMMU/remap unit.
 
  - Remove an extra space in the SGI_UV architecture type procfs output for
    UV5.
 
  - Remove a unused function which was missed when removing the UV BAU TLB
    shootdown handler.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVWxD/9Tq4W6Kniln7mtoEWHRvHRceiiGcS3
 MocvqurhoJwirH4F2gkvCegTBy0r3FdUORy3OMmChVs6nb8XpPpso84SANCRePWp
 JZezpVwLSNC4O1/ZCg1Kjj4eUpzLB/UjUUQV9RsjL5wyQEhfCZgb1D40yLM/2dj5
 SkVm/EAqWuQNtYe/jqAOwTX/7mV+k2QEmKCNOigM13R9EWgu6a4J8ta1gtNSbwvN
 jWMW+M1KjZ76pfRK+y4OpbuFixteSzhSWYPITSGwQz4IpQ+Ty2Rv0zzjidmDnAR+
 Q73cup0dretdVnVDRpMwDc06dBCmt/rbN50w4yGU0YFRFDgjGc8sIbQzuIP81nEQ
 XY4l4rcBgyVufFsLrRpQxu1iYPFrcgU38W1kRkkJ3Kl/rY1a2ZU7sLE4kt4Oh55W
 A9KCmsfqP1PCYppjAQ0QT4NOp4YtecPvAU4UcBOb722DDBd8TfhLWWGw2yG57Q/d
 Wnu8xCJGy7BaLHLGGGseAft+D4aNnCjKC3jgMyvNtRDXaV2cK2Kdd6ehMlWVUapD
 xfLlKXE+igXMyoWJIWjTXQJs4dpKu6QpJCPiorwEZ8rmNaRfxsWEJVbeYwEkmUke
 bMoBBSCbZT86WVOYhI8WtrIemraY0mMYrrcE03M96HU3eYB8BV92KrIzZWThupcQ
 ZqkZbqCZm3vfHA==
 =X/P+
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A small set of fixes for x86:

   - Cure the fallout from the MSI irqdomain overhaul which missed that
     the Intel IOMMU does not register virtual function devices and
     therefore never reaches the point where the MSI interrupt domain is
     assigned. This made the VF devices use the non-remapped MSI domain
     which is trapped by the IOMMU/remap unit

   - Remove an extra space in the SGI_UV architecture type procfs output
     for UV5

   - Remove a unused function which was missed when removing the UV BAU
     TLB shootdown handler"

* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iommu/vt-d: Cure VF irqdomain hickup
  x86/platform/uv: Fix copied UV5 output archtype
  x86/platform/uv: Drop last traces of uv_flush_tlb_others
2020-11-15 09:49:56 -08:00
Linus Torvalds
64b609d6a6 A set of fixes for perf:
- A set of commits which reduce the stack usage of various perf event
    handling functions which allocated large data structs on stack causing
    stack overflows in the worst case.
 
  - Use the proper mechanism for detecting soft interrupts in the recursion
    protection.
 
  - Make the resursion protection simpler and more robust.
 
  - Simplify the scheduling of event groups to make the code more robust and
    prepare for fixing the issues vs. scheduling of exclusive event groups.
 
  - Prevent event multiplexing and rotation for exclusive event groups
 
  - Correct the perf event attribute exclusive semantics to take pinned
    events, e.g. the PMU watchdog, into account
 
  - Make the anythread filtering conditional for Intel's generic PMU
    counters as it is not longer guaranteed to be supported on newer
    CPUs. Check the corresponding CPUID leaf to make sure.
 
  - Fixup a duplicate initialization in an array which was probably cause by
    the usual copy & paste - forgot to edit mishap.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xIi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofixD/4+4gc8DhOmAkMrN0Z9tiW8ebgMKmb9
 wZRkMr5Osi0GzLJOPZ6SdY6jd0A3rMN/sW6P1DT6pDtcty4bKFoW5VZBuUDIAhel
 BC4C93L3y1En/GEZu1GTy3LvsBwLBQTOoY4goDjbdAbk60S/0RTHOGyQsRsOQFe6
 fVs3iXozAFuaR6I6N3dlxuJAE51zvr8MyBWaUoByNDB//1+lLNW+JfClaAOG1oXx
 qZIg/niatBVGzSGgKNRUyh3g8G1HJtabsA/NZ4PH8ZHuYABfmj4lmmUPR77ICLfV
 wMITEBG7eaktB8EqM9hvaoOZLA5kpXHO2JbCFSs4c4x11mlC8g7QMV3poCw33YoN
 a5TmT1A3muri1riy1/Ee9lXACOq7/tf2+Xfn9o6dvDdBwd6s5pzlhLGR8gILp2lF
 2bcg3IwYvHT/Kiurb/WGNpbCqQIPJpcUcfs3tNBCCtKegahUQNnGjxN3NVo9RCit
 zfL6xIJ8eZiYnsxXx4NKm744AukWiql3aRNgRkOdBP5WC68xt6VLcxG1YZKUoDhy
 jRSOCD/DuPSMSvAAgN7S8OWlPsKWBxVxxWYV+K8FpwhgzbQ3WbS3UDiYkhgjeOxu
 OlM692oWpllKvQWlvYthr2Be6oPCRRi1vvADNNbTKzgHk5i61bwympsGl1EZx3Pz
 2ROp7NJFRESnqw==
 =FzCf
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of fixes for perf:

    - A set of commits which reduce the stack usage of various perf
      event handling functions which allocated large data structs on
      stack causing stack overflows in the worst case

    - Use the proper mechanism for detecting soft interrupts in the
      recursion protection

    - Make the resursion protection simpler and more robust

    - Simplify the scheduling of event groups to make the code more
      robust and prepare for fixing the issues vs. scheduling of
      exclusive event groups

    - Prevent event multiplexing and rotation for exclusive event groups

    - Correct the perf event attribute exclusive semantics to take
      pinned events, e.g. the PMU watchdog, into account

    - Make the anythread filtering conditional for Intel's generic PMU
      counters as it is not longer guaranteed to be supported on newer
      CPUs. Check the corresponding CPUID leaf to make sure

    - Fixup a duplicate initialization in an array which was probably
      caused by the usual 'copy & paste - forgot to edit' mishap"

* tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix Add BW copypasta
  perf/x86/intel: Make anythread filter support conditional
  perf: Tweak perf_event_attr::exclusive semantics
  perf: Fix event multiplexing for exclusive groups
  perf: Simplify group_sched_in()
  perf: Simplify group_sched_out()
  perf/x86: Make dummy_iregs static
  perf/arch: Remove perf_sample_data::regs_user_copy
  perf: Optimize get_recursion_context()
  perf: Fix get_recursion_context()
  perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
  perf: Reduce stack usage of perf_output_begin()
2020-11-15 09:46:36 -08:00
Steven Rostedt (VMware)
2860cd8a23 livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, the ftrace call
will be able to set the ip of the calling function. This will improve the
performance of live kernel patching where it does not need all the regs to
be stored just to change the instruction pointer.

If all archs that support live kernel patching also support
HAVE_DYNAMIC_FTRACE_WITH_ARGS, then the architecture specific function
klp_arch_set_pc() could be made generic.

It is possible that an arch can support HAVE_DYNAMIC_FTRACE_WITH_ARGS but
not HAVE_DYNAMIC_FTRACE_WITH_REGS and then have access to live patching.

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: live-patching@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:15:28 -05:00
Steven Rostedt (VMware)
02a474ca26 ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
Currently, the only way to get access to the registers of a function via a
ftrace callback is to set the "FL_SAVE_REGS" bit in the ftrace_ops. But as this
saves all regs as if a breakpoint were to trigger (for use with kprobes), it
is expensive.

The regs are already saved on the stack for the default ftrace callbacks, as
that is required otherwise a function being traced will get the wrong
arguments and possibly crash. And on x86, the arguments are already stored
where they would be on a pt_regs structure to use that code for both the
regs version of a callback, it makes sense to pass that information always
to all functions.

If an architecture does this (as x86_64 now does), it is to set
HAVE_DYNAMIC_FTRACE_WITH_ARGS, and this will let the generic code that it
could have access to arguments without having to set the flags.

This also includes having the stack pointer being saved, which could be used
for accessing arguments on the stack, as well as having the function graph
tracer not require its own trampoline!

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:14:59 -05:00
Steven Rostedt (VMware)
d19ad0775d ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
In preparation to have arguments of a function passed to callbacks attached
to functions as default, change the default callback prototype to receive a
struct ftrace_regs as the forth parameter instead of a pt_regs.

For callbacks that set the FL_SAVE_REGS flag in their ftrace_ops flags, they
will now need to get the pt_regs via a ftrace_get_regs() helper call. If
this is called by a callback that their ftrace_ops did not have a
FL_SAVE_REGS flag set, it that helper function will return NULL.

This will allow the ftrace_regs to hold enough just to get the parameters
and stack pointer, but without the worry that callbacks may have a pt_regs
that is not completely filled.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:14:55 -05:00
Mike Travis
77c7e1bc06 x86/platform/uv: Fix copied UV5 output archtype
A test shows that the output contains a space:
    # cat /proc/sgi_uv/archtype
    NSGI4 U/UVX

Remove that embedded space by copying the "trimmed" buffer instead of the
untrimmed input character list.  Use sizeof to remove size dependency on
copy out length.  Increase output buffer size by one character just in case
BIOS sends an 8 character string for archtype.

Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20201111010418.82133-1-mike.travis@hpe.com
2020-11-13 00:00:31 +01:00
Thomas Gleixner
aec8da04e4 x86/ioapic: Correct the PCI/ISA trigger type selection
PCI's default trigger type is level and ISA's is edge. The recent
refactoring made it the other way round, which went unnoticed as it seems
only to cause havoc on some AMD systems.

Make the comment and code do the right thing again.

Fixes: a27dca645d ("x86/io_apic: Cleanup trigger/polarity helpers")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/87d00lgu13.fsf@nanos.tec.linutronix.de
2020-11-10 18:43:22 +01:00
Peter Zijlstra
76a4efa809 perf/arch: Remove perf_sample_data::regs_user_copy
struct perf_sample_data lives on-stack, we should be careful about it's
size. Furthermore, the pt_regs copy in there is only because x86_64 is a
trainwreck, solve it differently.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org
2020-11-09 18:12:34 +01:00
Linus Torvalds
40be821d62 A set of x86 fixes:
- Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a
    combination of .weak and SYM_FUNC_START_LOCAL which makes LLVMs
    integrated assembler upset.
 
  - Correct the mitigation selection logic which prevented the related prctl
    to work correctly.
 
  - Make the UV5 hubless system work correctly by fixing up the malformed
    table entries and adding the missing ones.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oDNYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaN0EACPWY15k1YuAEIjiQxRBhq22J8Y6wNX
 Ui/rF2AZcAnNEJDTIyvjP6COnT9mjX/tuuluMaI6i/XY/9Xp5LpKvivkL2PXNN3X
 onW01ouIc1iYxXwQEVZvhYHsOyhkR9Z8yNG/q9I7xYAXNSZcAHwXVar4VlPBT7Ay
 iP75i8pGmb/NCc4oHNXuBp/dV/0/dCoLTndb5p5pX8oS60AAt9ZuK3IRc3ucayhI
 M4rTTEya1oY+ZNbtP4A4Jp7Qc/NGYDo6q04za+jcxZ5Gqacs+fk/PNuWgL1fZZtW
 sn1D+SMWEb55Xcsdy976b29FFU/DcOcf7TRASzyKgyPW5jg1dP6BZ6U0wpVV3KZw
 S2h5/pt48JZI7olrDsLQ0tzjALlk2CcFNrnRtOMDduHdw9wyz+Sg58lZYuvH3sXK
 5ZblWRJ3JiBNsNO0sA3kd4sp7xWQB3ey6mkYD8Vqb7zRIt8aXT9jqBxhDrP+Vqs/
 /UKv+BJfD6WxC0nQ4x6MS3g4sDvI+1SLfHSZ/UjWJ6NfYJW5/w429pFCaF73xCTd
 cqxja1dZYixn7ioFZjolMUdvuDiC5B2+5+RzEV87kaDzO9QZQyvsl7G74MSfwx6G
 DAydvuyJoxP2qVASobOBcVOzLQO7DsLzFZzJTttZcnkK2iprcz4qrsFLMxF9SxTD
 Amb8qck60dLfqA==
 =JdPk
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of x86 fixes:

   - Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a
     combination of .weak and SYM_FUNC_START_LOCAL which makes LLVMs
     integrated assembler upset

   - Correct the mitigation selection logic which prevented the related
     prctl to work correctly

   - Make the UV5 hubless system work correctly by fixing up the
     malformed table entries and adding the missing ones"

* tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/uv: Recognize UV5 hubless system identifier
  x86/platform/uv: Remove spaces from OEM IDs
  x86/platform/uv: Fix missing OEM_TABLE_ID
  x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
  x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
2020-11-08 10:09:36 -08:00
Mike Travis
801284f973 x86/platform/uv: Recognize UV5 hubless system identifier
Testing shows a problem in that UV5 hubless systems were not being
recognized.  Add them to the list of OEM IDs checked.

Fixes: 6c7794423a ("Add UV5 direct references")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-4-mike.travis@hpe.com
2020-11-07 11:17:39 +01:00
Mike Travis
1aee505e01 x86/platform/uv: Remove spaces from OEM IDs
Testing shows that trailing spaces caused problems with the OEM_ID and
the OEM_TABLE_ID.  One being that the OEM_ID would not string compare
correctly.  Another the OEM_ID and OEM_TABLE_ID would be concatenated
in the printout.  Remove any trailing spaces.

Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-3-mike.travis@hpe.com
2020-11-07 11:17:39 +01:00
Mike Travis
1aec69ae56 x86/platform/uv: Fix missing OEM_TABLE_ID
Testing shows a problem in that the OEM_TABLE_ID was missing for
hubless systems.  This is used to determine the APIC type (legacy or
extended).  Add the OEM_TABLE_ID to the early hubless processing.

Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-2-mike.travis@hpe.com
2020-11-07 11:17:39 +01:00
Paul E. McKenney
3fcd6a230f x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs
Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to
learn their clock frequency.  Which is a bit strange, given that waking
them from idle likely significantly changes their clock frequency.
This commit therefore avoids sending /proc/cpuinfo-induced IPIs to
idle CPUs.

[ paulmck: Also check for idle in arch_freq_prepare_all(). ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
2020-11-06 16:59:11 -08:00
Paul E. McKenney
f4deaf9021 x86/cpu: Avoid cpuinfo-induced IPI pileups
The aperfmperf_snapshot_cpu() function is invoked upon access to
/proc/cpuinfo, and it does do an early exit if the specified CPU has
recently done a snapshot.  Unfortunately, the indication that a snapshot
has been completed is set in an IPI handler, and the execution of this
handler can be delayed by any number of unfortunate events.  This means
that a system that starts a number of applications, each of which
parses /proc/cpuinfo, can suffer from an smp_call_function_single()
storm, especially given that each access to /proc/cpuinfo invokes
smp_call_function_single() for all CPUs.  Please note that this is not
theoretical speculation.  Note also that one CPU's pending IPI serves
all requests, so there is no point in ever having more than one IPI
pending to a given CPU.

This commit therefore suppresses duplicate IPIs to a given CPU via a
new ->scfpending field in the aperfmperf_sample structure.  This field
is set to the value one if an IPI is pending to the corresponding CPU
and to zero otherwise.

The aperfmperf_snapshot_cpu() function uses atomic_xchg() to set this
field to the value one and sample the old value.  If this function's
"wait" parameter is zero, smp_call_function_single() is called only if
the old value of the ->scfpending field was zero.  The IPI handler uses
atomic_set_release() to set this new field to zero just before returning,
so that the prior stores into the aperfmperf_sample structure are seen
by future requests that get to the atomic_xchg().  Future requests that
pass the elapsed-time check are ordered by the fact that on x86 loads act
as acquire loads, just as was the case prior to this change.  The return
value is based off of the age of the prior snapshot, just as before.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
[ paulmck: Allow /proc/cpuinfo to take advantage of arch_freq_get_on_cpu(). ]
[ paulmck: Add comment on memory barrier. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
2020-11-06 16:58:40 -08:00
Zhen Lei
15af36596a x86/mce: Correct the detection of invalid notifier priorities
Commit

  c9c6d216ed ("x86/mce: Rename "first" function as "early"")

changed the enumeration of MCE notifier priorities. Correct the check
for notifier priorities to cover the new range.

 [ bp: Rewrite commit message, remove superfluous brackets in
   conditional. ]

Fixes: c9c6d216ed ("x86/mce: Rename "first" function as "early"")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201106141216.2062-2-thunder.leizhen@huawei.com
2020-11-06 19:02:48 +01:00
Steven Rostedt (VMware)
773c167050 ftrace: Add recording of functions that caused recursion
This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.

Link: https://lkml.kernel.org/r/20201106023548.102375687@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-06 08:42:26 -05:00
Steven Rostedt (VMware)
c536aa1c5b kprobes/ftrace: Add recursion protection to the ftrace callback
If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless its ftrace_ops states
otherwise.

Link: https://lkml.kernel.org/r/20201028115613.140212174@goodmis.org
Link: https://lkml.kernel.org/r/20201106023546.944907560@goodmis.org

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh  Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-06 08:35:44 -05:00
Kaixu Xia
77080929d5 x86/mce: Assign boolean values to a bool variable
Fix the following coccinelle warnings:

  ./arch/x86/kernel/cpu/mce/core.c:1765:3-20: WARNING: Assignment of 0/1 to bool variable
  ./arch/x86/kernel/cpu/mce/core.c:1584:2-9: WARNING: Assignment of 0/1 to bool variable

 [ bp: Massage commit message. ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1604654363-1463-1-git-send-email-kaixuxia@tencent.com
2020-11-06 11:51:04 +01:00
Chester Lin
25519d6834 ima: generalize x86/EFI arch glue for other EFI architectures
Move the x86 IMA arch code into security/integrity/ima/ima_efi.c,
so that we will be able to wire it up for arm64 in a future patch.

Co-developed-by: Chester Lin <clin@suse.com>
Signed-off-by: Chester Lin <clin@suse.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-06 07:40:42 +01:00
Anand K Mistry
1978b3a53a x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
On AMD CPUs which have the feature X86_FEATURE_AMD_STIBP_ALWAYS_ON,
STIBP is set to on and

  spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED

At the same time, IBPB can be set to conditional.

However, this leads to the case where it's impossible to turn on IBPB
for a process because in the PR_SPEC_DISABLE case in ib_prctl_set() the

  spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED

condition leads to a return before the task flag is set. Similarly,
ib_prctl_get() will return PR_SPEC_DISABLE even though IBPB is set to
conditional.

More generally, the following cases are possible:

1. STIBP = conditional && IBPB = on for spectre_v2_user=seccomp,ibpb
2. STIBP = on && IBPB = conditional for AMD CPUs with
   X86_FEATURE_AMD_STIBP_ALWAYS_ON

The first case functions correctly today, but only because
spectre_v2_user_ibpb isn't updated to reflect the IBPB mode.

At a high level, this change does one thing. If either STIBP or IBPB
is set to conditional, allow the prctl to change the task flag.
Also, reflect that capability when querying the state. This isn't
perfect since it doesn't take into account if only STIBP or IBPB is
unconditionally on. But it allows the conditional feature to work as
expected, without affecting the unconditional one.

 [ bp: Massage commit message and comment; space out statements for
   better readability. ]

Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201105163246.v2.1.Ifd7243cd3e2c2206a893ad0a5b9a4f19549e22c6@changeid
2020-11-05 21:43:34 +01:00
Thomas Gleixner
b6be002bcd x86/entry: Move nmi entry/exit into common code
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
  between the IRQ and NMI exceptions, and add kernel documentation for
  struct irqentry_state_t ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
2020-11-04 22:55:36 +01:00
Thomas Gleixner
01be83eea0 Merge branch 'core/urgent' into core/entry
Pick up the entry fix before further modifications.
2020-11-04 18:14:52 +01:00
David Woodhouse
f36a74b934 x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
In commit b643128b91 ("x86/ioapic: Use irq_find_matching_fwspec() to
find remapping irqdomain") the I/O-APIC code was changed to find its
parent irqdomain using irq_find_matching_fwspec(), but the key used
for the lookup was wrong. It shouldn't use 'ioapic' which is the index
into its own ioapics[] array. It should use the actual arbitration
ID of the I/O-APIC in question, which is mpc_ioapic_id(ioapic).

Fixes: b643128b91 ("x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain")
Reported-by: lkp <oliver.sang@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/57adf2c305cd0c5e9d860b2f3007a7e676fd0f9f.camel@infradead.org
2020-11-04 11:11:35 +01:00
Dexuan Cui
d981059e13 x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
the CPUs can't be the destination of IOAPIC interrupts, because the
IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
only routed to CPUs that don't have >255 APIC IDs. However, there is
an issue with kdump, because the kdump kernel can run on any CPU, and
hence IOAPIC interrupts can't work if the kdump kernel run on a CPU
with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which is introduced
recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
support of the underlying hypervisor. The latest Hyper-V has added the
support recently: with this commit, on such a Hyper-V host, Linux VM
does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
ID from David is used, meaning that Linux VM is able to support up to
32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
changes with this commit: Linux VM is still able to bring up the CPUs with
> 255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
can not go to such CPUs, and the kdump kernel still can not work properly
on such CPUs.

[ tglx: Updated comment as suggested by David ]

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20201103011136.59108-1-decui@microsoft.com
2020-11-04 11:10:52 +01:00
Linus Torvalds
43c834186c A couple of changes to the SEV-ES code to perform more stringent
hypervisor checks before enabling encryption. (Joerg Roedel)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+hKN8ACgkQEsHwGGHe
 VUrkZQ/+LWjbDrbkLCQpWuzLagAocZMKKvr4+2ujU+krj0QU5FFJbfuzhkktQD+H
 cbfOW7+E8lqTDoj/dwoJPj2Xs8HvW4Ua6sbxF5lCPhlEr3NIetRfQ7SPj3qFvQG+
 FKP/55RSnjKIx7aZXKN9YAw2FF3EC1BisjszCBKid5S8HbGqjLMb2Ue0i/nssksY
 CvLwaxtDOGuSzJ8FwL+vmI70NkeLZ0ulTxbuxXAqfMTvJX3e1QA9dgeZMgfU1hng
 eA1Pjlm0X7FOsnwihYP2EZ6NzRrTkYeGl1Iagz1apqlDlQ+bcaxvs2btIyb7MKt5
 6PPDGg0P0WVMNfOEUYTZob31QcLnakA/p8kG8sYE6h2PlqO9Tf5cpmOJ6pv+DYFz
 hfcjAZfamStUbWdWpr33RVCXN5pwZRu+UytD3JYykzgwmKXQxLHqrbjHXLO3zJ7k
 +L0JE+N2vmi/7M5Ghsv3yKwy5fR5rMT5V6qEHSd1qrr9VpKBceNMJgPA8wh4882F
 SD5sD2b6L/Cf9L4FAFqICHb/p4rxPRf5VnUoybo70U7EiwfbZQik5g3X5cO4KO2N
 0z8nMk7dIZncQF0LYJNElIvKonrU8sIa+TbHjYyBWdQlOPgK4IlCvZeyjVUvUG24
 kYx2WbANhCxGFd86rsl5P7xNXvBiSALf1afbQPvU0VTbZ43vSnQ=
 =Pvgr
 -----END PGP SIGNATURE-----

Merge tag 'x86_seves_for_v5.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV-ES fixes from Borislav Petkov:
 "A couple of changes to the SEV-ES code to perform more stringent
  hypervisor checks before enabling encryption (Joerg Roedel)"

* tag 'x86_seves_for_v5.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Do not support MMIO to/from encrypted memory
  x86/head/64: Check SEV encryption before switching to kernel page-table
  x86/boot/compressed/64: Check SEV encryption in 64-bit boot-path
  x86/boot/compressed/64: Sanity-check CPUID results in the early #VC handler
  x86/boot/compressed/64: Introduce sev_status
2020-11-03 09:55:09 -08:00
Mauro Carvalho Chehab
4a2d2ed9ba x86/mtrr: Fix a kernel-doc markup
Kernel-doc markup should use this format:
	identifier - description

Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/2217cd4ae9e561da2825485eb97de77c65741489.1603469755.git.mchehab+huawei@kernel.org
2020-11-02 19:58:53 +01:00
Tony Luck
68299a42f8 x86/mce: Enable additional error logging on certain Intel CPUs
The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an
optional additional error logging mode which is enabled by an MSR.

Previously, this mode was enabled from the mcelog(8) tool via /dev/cpu,
but userspace should not be poking at MSRs. So move the enabling into
the kernel.

 [ bp: Correct the explanation why this is done. ]

Suggested-by: Boris Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201030190807.GA13884@agluck-desk2.amr.corp.intel.com
2020-11-02 11:15:59 +01:00
Arvind Sankar
ea3186b957 x86/build: Fix vmlinux size check on 64-bit
Commit

  b4e0409a36 ("x86: check vmlinux limits, 64-bit")

added a check that the size of the 64-bit kernel is less than
KERNEL_IMAGE_SIZE.

The check uses (_end - _text), but this is not enough. The initial
PMD used in startup_64() (level2_kernel_pgt) can only map upto
KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text, and the
modules area (MODULES_VADDR) starts at KERNEL_IMAGE_SIZE.

The correct check is what is currently done for 32-bit, since
LOAD_OFFSET is defined appropriately for the two architectures. Just
check (_end - LOAD_OFFSET) against KERNEL_IMAGE_SIZE unconditionally.

Note that on 32-bit, the limit is not strict: KERNEL_IMAGE_SIZE is not
really used by the main kernel. The higher the kernel is located, the
less the space available for the vmalloc area. However, it is used by
KASLR in the compressed stub to limit the maximum address of the kernel
to a safe value.

Clean up various comments to clarify that despite the name,
KERNEL_IMAGE_SIZE is not a limit on the size of the kernel image, but a
limit on the maximum virtual address that the image can occupy.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201029161903.2553528-1-nivedita@alum.mit.edu
2020-10-29 21:54:35 +01:00
Joerg Roedel
2411cd8211 x86/sev-es: Do not support MMIO to/from encrypted memory
MMIO memory is usually not mapped encrypted, so there is no reason to
support emulated MMIO when it is mapped encrypted.

Prevent a possible hypervisor attack where a RAM page is mapped as
an MMIO page in the nested page-table, so that any guest access to it
will trigger a #VC exception and leak the data on that page to the
hypervisor via the GHCB (like with valid MMIO). On the read side this
attack would allow the HV to inject data into the guest.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-6-joro@8bytes.org
2020-10-29 19:27:42 +01:00
Joerg Roedel
c9f09539e1 x86/head/64: Check SEV encryption before switching to kernel page-table
When SEV is enabled, the kernel requests the C-bit position again from
the hypervisor to build its own page-table. Since the hypervisor is an
untrusted source, the C-bit position needs to be verified before the
kernel page-table is used.

Call sev_verify_cbit() before writing the CR3.

 [ bp: Massage. ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-5-joro@8bytes.org
2020-10-29 18:09:59 +01:00
Joerg Roedel
86ce43f7dd x86/boot/compressed/64: Check SEV encryption in 64-bit boot-path
Check whether the hypervisor reported the correct C-bit when running as
an SEV guest. Using a wrong C-bit position could be used to leak
sensitive data from the guest to the hypervisor.

The check function is in a separate file:

  arch/x86/kernel/sev_verify_cbit.S

so that it can be re-used in the running kernel image.

 [ bp: Massage. ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-4-joro@8bytes.org
2020-10-29 18:06:52 +01:00
Joerg Roedel
ed7b895f3e x86/boot/compressed/64: Sanity-check CPUID results in the early #VC handler
The early #VC handler which doesn't have a GHCB can only handle CPUID
exit codes. It is needed by the early boot code to handle #VC exceptions
raised in verify_cpu() and to get the position of the C-bit.

But the CPUID information comes from the hypervisor which is untrusted
and might return results which trick the guest into the no-SEV boot path
with no C-bit set in the page-tables. All data written to memory would
then be unencrypted and could leak sensitive data to the hypervisor.

Add sanity checks to the early #VC handler to make sure the hypervisor
can not pretend that SEV is disabled.

 [ bp: Massage a bit. ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-3-joro@8bytes.org
2020-10-29 13:48:49 +01:00
Jens Axboe
12db8b6900 entry: Add support for TIF_NOTIFY_SIGNAL
Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set,
will return true if signal_pending() is used in a wait loop. That causes an
exit of the loop so that notify_signal tracehooks can be run. If the wait
loop is currently inside a system call, the system call is restarted once
task_work has been processed.

In preparation for only having arch_do_signal() handle syscall restarts if
_TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart().  Pass
in a boolean that tells the architecture specific signal handler if it
should attempt to get a signal, or just process a potential syscall
restart.

For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to
get_signal(). This is done to minimize the needed architecture changes to
support this feature.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk
2020-10-29 09:37:36 +01:00
David Woodhouse
2e008ffe42 x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
This allows the host to indicate that MSI emulation supports 15-bit
destination IDs, allowing up to 32768 CPUs without interrupt remapping.

cf. https://patchwork.kernel.org/patch/11816693/ for qemu

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20201024213535.443185-36-dwmw2@infradead.org
2020-10-28 20:26:33 +01:00
David Woodhouse
ab0f59c6f1 x86/apic: Support 15 bits of APIC ID in MSI where available
Some hypervisors can allow the guest to use the Extended Destination ID
field in the MSI address to address up to 32768 CPUs.

This applies to all downstream devices which generate MSI cycles,
including HPET, I/O-APIC and PCI MSI.

HPET and PCI MSI use the same __irq_msi_compose_msg() function, while
I/O-APIC generates its own and had support for the extended bits added in
a previous commit.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-33-dwmw2@infradead.org
2020-10-28 20:26:29 +01:00
David Woodhouse
51130d2188 x86/ioapic: Handle Extended Destination ID field in RTE
Bits 63-48 of the I/OAPIC Redirection Table Entry map directly to bits 19-4
of the address used in the resulting MSI cycle.

Historically, the x86 MSI format only used the top 8 of those 16 bits as
the destination APIC ID, and the "Extended Destination ID" in the lower 8
bits was unused.

With interrupt remapping, the lowest bit of the Extended Destination ID
(bit 48 of RTE, bit 4 of MSI address) is now used to indicate a remappable
format MSI.

A hypervisor can use the other 7 bits of the Extended Destination ID to
permit guests to address up to 15 bits of APIC IDs, thus allowing 32768
vCPUs before having to expose a vIOMMU and interrupt remapping to the
guest.

No behavioural change in this patch, since nothing yet permits APIC IDs
above 255 to be used with the non-IR I/OAPIC domain.

[ tglx: Converted it to the cleaned up entry/msi_msg format and added
  	commentry ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-32-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
b643128b91 x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
All possible parent domains have a select method now. Make use of it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-29-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
c2a5881c28 x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
All possible parent domains have a select method now. Make use of it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-28-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
6452ea2a32 x86/apic: Add select() method on vector irqdomain
This will be used to select the irqdomain for I/O-APIC and HPET.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-24-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
David Woodhouse
5d5a971338 x86/ioapic: Generate RTE directly from parent irqchip's MSI message
The I/O-APIC generates an MSI cycle with address/data bits taken from its
Redirection Table Entry in some combination which used to make sense, but
now is just a bunch of bits which get passed through in some seemingly
arbitrary order.

Instead of making IRQ remapping drivers directly frob the I/OA-PIC RTE, let
them just do their job and generate an MSI message. The bit swizzling to
turn that MSI message into the I/O-APIC's RTE is the same in all cases,
since it's a function of the I/O-APIC hardware. The IRQ remappers have no
real need to get involved with that.

The only slight caveat is that the I/OAPIC is interpreting some of those
fields too, and it does want the 'vector' field to be unique to make EOI
work. The AMD IOMMU happens to put its IRTE index in the bits that the
I/O-APIC thinks are the vector field, and accommodates this requirement by
reserving the first 32 indices for the I/O-APIC.  The Intel IOMMU doesn't
actually use the bits that the I/O-APIC thinks are the vector field, so it
fills in the 'pin' value there instead.

[ tglx: Replaced the unreadably macro maze with the cleaned up RTE/msi_msg
  	bitfields and added commentry to explain the mapping magic ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-22-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
Thomas Gleixner
341b4a7211 x86/ioapic: Cleanup IO/APIC route entry structs
Having two seperate structs for the I/O-APIC RTE entries (non-remapped and
DMAR remapped) requires type casts and makes it hard to map.

Combine them in IO_APIC_routing_entry by defining a union of two 64bit
bitfields. Use naming which reflects which bits are shared and which bits
are actually different for the operating modes.

[dwmw2: Fix it up and finish the job, pulling the 32-bit w1,w2 words for
        register access into the same union and eliminating a few more
        places where bits were accessed through masks and shifts.]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-21-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
Thomas Gleixner
a27dca645d x86/io_apic: Cleanup trigger/polarity helpers
'trigger' and 'polarity' are used throughout the I/O-APIC code for handling
the trigger type (edge/level) and the active low/high configuration. While
there are defines for initializing these variables and struct members, they
are not used consequently and the meaning of 'trigger' and 'polarity' is
opaque and confusing at best.

Rename them to 'is_level' and 'active_low' and make them boolean in various
structs so it's entirely clear what the meaning is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-20-dwmw2@infradead.org
2020-10-28 20:26:26 +01:00
Thomas Gleixner
6285aa5073 x86/msi: Provide msi message shadow structs
Create shadow structs with named bitfields for msi_msg data, address_lo and
address_hi and use them in the MSI message composer.

Provide a function to retrieve the destination ID. This could be inline,
but that'd create a circular header dependency.

[dwmw2: fix bitfields not all to be a union]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-13-dwmw2@infradead.org
2020-10-28 20:26:25 +01:00
David Woodhouse
3d7295eb30 x86/hpet: Move MSI support into hpet.c
This isn't really dependent on PCI MSI; it's just generic MSI which is now
supported by the generic x86_vector_domain. Move the HPET MSI support back
into hpet.c with the rest of the HPET support.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-11-dwmw2@infradead.org
2020-10-28 20:26:25 +01:00
David Woodhouse
f598181acf x86/apic: Always provide irq_compose_msi_msg() method for vector domain
This shouldn't be dependent on PCI_MSI. HPET and I/O-APIC can deliver
interrupts through MSI without having any PCI in the system at all.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-10-dwmw2@infradead.org
2020-10-28 20:26:25 +01:00
Thomas Gleixner
8c44963b60 x86/apic: Cleanup destination mode
apic::irq_dest_mode is actually a boolean, but defined as u32 and named in
a way which does not explain what it means.

Make it a boolean and rename it to 'dest_mode_logical'

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-9-dwmw2@infradead.org
2020-10-28 20:26:25 +01:00
Thomas Gleixner
e57d04e5fa x86/apic: Get rid of apic:: Dest_logical
struct apic has two members which store information about the destination
mode: dest_logical and irq_dest_mode.

dest_logical contains a mask which was historically used to set the
destination mode in IPI messages. Over time the usage was reduced and the
logical/physical functions were seperated.

There are only a few places which still use 'dest_logical' but they can
use 'irq_dest_mode' instead.

irq_dest_mode is actually a boolean where 0 means physical destination mode
and 1 means logical destination mode. Of course the name does not reflect
the functionality. This will be cleaned up in a subsequent change.

Remove apic::dest_logical and fixup the remaining users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-8-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
Thomas Gleixner
22e0db4209 x86/apic: Replace pointless apic:: Dest_logical usage
All these functions are only used for logical destination mode. So reading
the destination mode mask from the apic structure is a pointless
exercise. Just hand in the proper constant: APIC_DEST_LOGICAL.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-7-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
Thomas Gleixner
721612994f x86/apic: Cleanup delivery mode defines
The enum ioapic_irq_destination_types and the enumerated constants starting
with 'dest_' are gross misnomers because they describe the delivery mode.

Rename then enum and the constants so they actually make sense.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-6-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
Thomas Gleixner
2e730cb56b x86/devicetree: Fix the ioapic interrupt type table
The ioapic interrupt type table is wrong as it assumes that polarity in
IO/APIC context means active high when set. But the IO/APIC polarity is
working the other way round. This works because the ordering of the entries
is consistent with the device tree and the type information is not used by
the IO/APIC interrupt chip.

The whole trigger and polarity business of IO/APIC is misleading and the
corresponding constants which are defined as 0/1 are not used consistently
and are going to be removed.

Rename the type table members to 'is_level' and 'active_low' and adjust the
type information for consistency sake.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-5-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
Thomas Gleixner
93b7a3d6a1 x86/apic/uv: Fix inconsistent destination mode
The UV x2apic is strictly using physical destination mode, but
apic::dest_logical is initialized with APIC_DEST_LOGICAL.

This does not matter much because UV does not use any of the generic
functions which use apic::dest_logical, but is still inconsistent.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-4-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
David Woodhouse
47bea873cf x86/msi: Only use high bits of MSI address for DMAR unit
The Intel IOMMU has an MSI-like configuration for its interrupt, but it
isn't really MSI. So it gets to abuse the high 32 bits of the address, and
puts the high 24 bits of the extended APIC ID there.

This isn't something that can be used in the general case for real MSIs,
since external devices using the high bits of the address would be
performing writes to actual memory space above 4GiB, not targeted at the
APIC.

Factor the hack out and allow it only to be used when appropriate, adding a
WARN_ON_ONCE() if other MSIs are targeted at an unreachable APIC ID. That
should never happen since the compatibility MSI messages are not used when
Interrupt Remapping is enabled.

The x2apic_enabled() check isn't needed because Linux won't bring up CPUs
with higher APIC IDs unless IR and x2apic are enabled anyway.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-3-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
David Woodhouse
26573a9774 x86/apic: Fix x2apic enablement without interrupt remapping
Currently, Linux as a hypervisor guest will enable x2apic only if there are
no CPUs present at boot time with an APIC ID above 255.

Hotplugging a CPU later with a higher APIC ID would result in a CPU which
cannot be targeted by external interrupts.

Add a filter in x2apic_apic_id_valid() which can be used to prevent such
CPUs from coming online, and allow x2apic to be enabled even if they are
present at boot time.

Fixes: ce69a78450 ("x86/apic: Enable x2APIC without interrupt remapping under KVM")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-2-dwmw2@infradead.org
2020-10-28 20:26:23 +01:00
Borislav Petkov
0d847ce7c1 x86/setup: Remove unused MCA variables
Commit

  bb8187d35f ("MCA: delete all remaining traces of microchannel bus support.")

removed the remaining traces of Micro Channel Architecture support but
one trace remained - three variables in setup.c which have been unused
since 2012 at least.

Drop them finally.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201021165614.23023-1-bp@alien8.de
2020-10-28 14:58:51 +01:00
Peter Zijlstra
cb05143bdf x86/debug: Fix DR_STEP vs ptrace_get_debugreg(6)
Commit d53d9bc0cf ("x86/debug: Change thread.debugreg6 to
thread.virtual_dr6") changed the semantics of the variable from random
collection of bits, to exactly only those bits that ptrace() needs.

Unfortunately this lost DR_STEP for PTRACE_{BLOCK,SINGLE}STEP.

Furthermore, it turns out that userspace expects DR_STEP to be
unconditionally available, even for manual TF usage outside of
PTRACE_{BLOCK,SINGLE}_STEP.

Fixes: d53d9bc0cf ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
Reported-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com> 
Link: https://lore.kernel.org/r/20201027183330.GM2628@hirez.programming.kicks-ass.net
2020-10-27 23:15:24 +01:00
Peter Zijlstra
a195f3d452 x86/debug: Only clear/set ->virtual_dr6 for userspace #DB
The ->virtual_dr6 is the value used by ptrace_{get,set}_debugreg(6). A
kernel #DB clearing it could mean spurious malfunction of ptrace()
expectations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com> 
Link: https://lore.kernel.org/r/20201027093608.028952500@infradead.org
2020-10-27 23:15:23 +01:00
Peter Zijlstra
2a9baf5ad4 x86/debug: Fix BTF handling
The SDM states that #DB clears DEBUGCTLMSR_BTF, this means that when the
bit is set for userspace (TIF_BLOCKSTEP) and a kernel #DB happens first,
the BTF bit meant for userspace execution is lost.

Have the kernel #DB handler restore the BTF bit when it was requested
for userspace.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com> 
Link: https://lore.kernel.org/r/20201027093607.956147736@infradead.org
2020-10-27 23:15:23 +01:00
Linus Torvalds
ed8780e3f2 A couple of x86 fixes which missed rc1 due to my stupidity:
- Drop lazy TLB mode before switching to the temporary address space for
     text patching. text_poke() switches to the temporary mm which clears
     the lazy mode and restores the original mm afterwards. Due to clearing
     lazy mode this might restore a already dead mm if exit_mmap() runs in
     parallel on another CPU.
 
   - Document the x32 syscall design fail vs. syscall numbers 512-547
     properly.
 
   - Fix the ORC unwinder to handle the inactive task frame correctly. This
     was unearthed due to the slightly different code generation of GCC10.
 
   - Use an up to date screen_info for the boot params of kexec instead of
     the possibly stale and invalid version which happened to be valid when
     the kexec kernel was loaded.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+Yf5YTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoY6HD/4vlFNTVR19JhICQM64XINoaWOOjdIq
 M3wWyh+lmW5+JqNYCYY3M5LX2ZLwYOlNgabE1W6KJgnJsN26GRztBN3z037Vllka
 lS1pONg2a3StpVUEJ3AGDnFgaYrKRSyHBhi/0TazXmvOlscjwPIPxI53oLohyc23
 vSd9ivIFl9jD894OsLjJtWt1rKK6k9p4FqR8bv+u/GwtYGQk9HXlk/XW/uOeH3oU
 ozQhlHCnqN9VnHGHS/nRz3BwIiPJRCYl7h4PdC4MqT+WL1e4pIKEJqyN9uPWeC6L
 b7DzX5KVO0Zcvgvl5OtuR6radXzrMvBwcY6BSOxylfoM+7SIE24PlRFW24EQGKv2
 WHtOKSGsvooU8KWVw4FvHUkSFAgNWUTjZ9x1kzEw1oUANceJUuM74n4rIFUXv3Kf
 gxhcPm2flrB3WrHKuXtQ3QxD9SyGuqk4QUraeNMYyS3DqnnBycgUkd72KiY9H0g8
 9XBvHEFs5G9apA8MSdumHKgrluHVcvdpe3YGy0/vugJvolSvDWkx3EbxpWbhilYS
 WyboQGOwSH1vgEGHHnoiksY/Ofhv+rxBknDUJOiJazVZFbOwFvdKIPDNTQTjrzw1
 NENSBtMkCLG8XvuZ1E1l57wd7BN7fJENYLnG2k9gUsnouWV0pK6x8w9GPn9DW4Do
 0IB3hScRgIIuvQ==
 =e60h
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A couple of x86 fixes which missed rc1 due to my stupidity:

   - Drop lazy TLB mode before switching to the temporary address space
     for text patching.

     text_poke() switches to the temporary mm which clears the lazy mode
     and restores the original mm afterwards. Due to clearing lazy mode
     this might restore a already dead mm if exit_mmap() runs in
     parallel on another CPU.

   - Document the x32 syscall design fail vs. syscall numbers 512-547
     properly.

   - Fix the ORC unwinder to handle the inactive task frame correctly.

     This was unearthed due to the slightly different code generation of
     gcc-10.

   - Use an up to date screen_info for the boot params of kexec instead
     of the possibly stale and invalid version which happened to be
     valid when the kexec kernel was loaded"

* tag 'x86-urgent-2020-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternative: Don't call text_poke() in lazy TLB mode
  x86/syscalls: Document the fact that syscalls 512-547 are a legacy mistake
  x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels
  hyperv_fb: Update screen_info after removing old framebuffer
  x86/kexec: Use up-to-dated screen_info copy to fill boot params
2020-10-27 14:39:29 -07:00
Fenghua Yu
4868a61d49 x86/resctrl: Correct MBM total and local values
Intel Memory Bandwidth Monitoring (MBM) counters may report system
memory bandwidth incorrectly on some Intel processors. The errata SKX99
for Skylake server, BDF102 for Broadwell server, and the correction
factor table are documented in Documentation/x86/resctrl.rst.

Intel MBM counters track metrics according to the assigned Resource
Monitor ID (RMID) for that logical core. The IA32_QM_CTR register
(MSR 0xC8E) used to report these metrics, may report incorrect system
bandwidth for certain RMID values.

Due to the errata, system memory bandwidth may not match what is
reported.

To work around the errata, correct MBM total and local readings using a
correction factor table. If rmid > rmid threshold, MBM total and local
values should be multiplied by the correction factor.

 [ bp: Mark mbm_cf_table[] __initdata. ]

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201014004927.1839452-3-fenghua.yu@intel.com
2020-10-27 18:57:22 +01:00
Gabriel Krisman Bertazi
8d71d2bf6e x86: Reclaim TIF_IA32 and TIF_X32
Now that these flags are no longer used, reclaim those TIF bits.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-11-krisman@collabora.com
2020-10-26 13:46:47 +01:00
Gabriel Krisman Bertazi
ff170cd059 x86/mm: Convert mmu context ia32_compat into a proper flags field
The ia32_compat attribute is a weird thing.  It mirrors TIF_IA32 and
TIF_X32 and is used only in two very unrelated places: (1) to decide if
the vsyscall page is accessible (2) for uprobes to find whether the
patched instruction is 32 or 64 bit.

In preparation to remove the TIF flags, a new mechanism is required for
ia32_compat, but given its odd semantics, adding a real flags field which
configures these specific behaviours is the best option.

So, set_personality_x64() can ask for the vsyscall page, which is not
available in x32/ia32 and set_personality_ia32() can configure the uprobe
code as needed.

uprobe cannot rely on other methods like user_64bit_mode() to decide how
to patch, so it needs some specific flag like this.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski<luto@kernel.org>
Link: https://lore.kernel.org/r/20201004032536.1229030-10-krisman@collabora.com
2020-10-26 13:46:47 +01:00
Gabriel Krisman Bertazi
2424b14605 x86/elf: Use e_machine to select start_thread for x32
Since TIF_X32 is going away, avoid using it to find the ELF type in
compat_start_thread.

According to SysV AMD64 ABI Draft, an AMD64 ELF object using ILP32 must
have ELFCLASS32 with (E_MACHINE == EM_X86_64), so use that ELF field to
differentiate a x32 object from a IA32 object when executing start_thread()
in compat mode.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201004032536.1229030-7-krisman@collabora.com
2020-10-26 13:46:46 +01:00
Gabriel Krisman Bertazi
375d4bfda5 perf/x86: Avoid TIF_IA32 when checking 64bit mode
In preparation to remove TIF_IA32, stop using it in perf events code.

Tested by running perf on 32-bit, 64-bit and x32 applications.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201004032536.1229030-2-krisman@collabora.com
2020-10-26 13:46:46 +01:00