Convert #NMI to IDTENTRY_NMI:
- Implement the C entry point with DEFINE_IDTENTRY_NMI
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.609932306@linutronix.de
mce_check_crashing_cpu() is called right at the entry of the MCE
handler. It uses mce_rdmsr() and mce_wrmsr() which are wrappers around
rdmsr() and wrmsr() to handle the MCE error injection mechanism, which is
pointless in this context, i.e. when the MCE hits an offline CPU or the
system is already marked crashing.
The MSR access can also be traced, so use the untraceable variants. This
is also safe vs. XEN paravirt as these MSRs are not affected by XEN PV
modifications.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.426347351@linutronix.de
Convert #MC to IDTENTRY_MCE:
- Implement the C entry points with DEFINE_IDTENTRY_MCE
- Emit the ASM stub with DECLARE_IDTENTRY_MCE
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the error code from *machine_check_vector() as
it is always 0 and not used by any of the functions
it can point to. Fixup all the functions as well.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.334980426@linutronix.de
There is no reason to have nmi_enter/exit() in the actual MCE
handlers. Move it to the entry point. This also covers the until now
uncovered initial handler which only prints.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.243936614@linutronix.de
For code simplicity split up the int3 handler into a kernel and user part
which makes the code flow simpler to understand.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505135314.045220765@linutronix.de
Convert #BP to IDTENTRY_RAW:
- Implement the C entry point with DEFINE_IDTENTRY_RAW
- Invoke idtentry_enter/exit() from the function body
- Emit the ASM stub with DECLARE_IDTENTRY_RAW
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
This could be a plain IDTENTRY, but as Peter pointed out INT3 is broken
vs. the static key in the context tracking code as this static key might be
in the state of being patched and has an int3 which would recurse forever.
IDTENTRY_RAW is therefore chosen to allow addressing this issue without
lots of code churn.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.938474960@linutronix.de
Avoid calling out to bsearch() by inlining it, for normal kernel configs
this was the last external call and poke_int3_handler() is now fully self
sufficient -- no calls to external code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.731774429@linutronix.de
Use arch_atomic_*() and __READ_ONCE() to ensure nothing untoward
creeps in and ruins things.
That is; this is the INT3 text poke handler, strictly limit the code
that runs in it, lest it inadvertenly hits yet another INT3.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.517429268@linutronix.de
In order to ensure poke_int3_handler() is completely self contained -- this
is called while modifying other text, imagine the fun of hitting another
INT3 -- ensure that everything it uses is not traced.
The primary means here is to force inlining; bsearch() is notrace because
all of lib/ is.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.410702173@linutronix.de
Convert the IRET exception handler to IDTENTRY_SW. This is slightly
different than the conversions of hardware exceptions as the IRET exception
is invoked via an exception table when IRET faults. So it just uses the
IDTENTRY_SW mechanism for consistency. It does not emit ASM code as it does
not fit the other idtentry exceptions.
- Implement the C entry point with DEFINE_IDTENTRY_SW() which maps to
DEFINE_IDTENTRY()
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134906.128769226@linutronix.de
Convert #XF to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Handle INVD_BUG in C
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134906.021552202@linutronix.de
Convert #AC to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134905.928967113@linutronix.de
Convert #MF to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134905.838823510@linutronix.de
Convert #SPURIOUS to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134905.728077036@linutronix.de
Convert #GP to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134905.637269946@linutronix.de
Convert #SS to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134905.539867572@linutronix.de
Convert #NP to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134905.443591450@linutronix.de
Convert #TS to IDTENTRY_ERRORCODE:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134905.350676449@linutronix.de
Convert #OLD_MF to IDTENTRY:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134905.838823510@linutronix.de
Convert #NM to IDTENTRY:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134905.056243863@linutronix.de
Convert #UD to IDTENTRY:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Fixup the FOOF bug call in fault.c
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.955511913@linutronix.de
Convert #BR to IDTENTRY:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the RCU warning as the new entry macro ensures correctness
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.863001309@linutronix.de
Convert #OF to IDTENTRY:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.771457898@linutronix.de
Convert #DE to IDTENTRY:
- Implement the C entry point with DEFINE_IDTENTRY
- Emit the ASM stub with DECLARE_IDTENTRY
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134904.663914713@linutronix.de
Prepare for using IDTENTRY to define the C exception/trap entry points. It
would be possible to glue this into the existing macro maze, but it's
simpler and better to read at the end to just make them distinct.
Provide a trivial inline helper to read the trap address and add a comment
explaining the logic behind it.
The existing macros will be removed once all instances are converted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.556327833@linutronix.de
Traps enable interrupts conditionally but rely on the ASM return code to
disable them again. That results in redundant interrupt disable and trace
calls.
Make the trap handlers disable interrupts before returning to avoid that,
which allows simplification of the ASM entry code in follow up changes.
Originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134903.622702796@linutronix.de
Replace the notrace and NOKPROBE annotations with noinstr.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134903.439765290@linutronix.de
This is called from deep entry ASM in a situation where instrumentation
will cause more harm than providing useful information.
Switch from memmove() to memcpy() because memmove() can't be called
from noinstr code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134903.346741553@linutronix.de
All ASM code which is not part of the entry functionality can move out into
the .text section. No reason to keep it in the non-instrumentable entry
section.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.227579223@linutronix.de
Use of memmove() in #DF is problematic considered tracing and other
instrumentation.
Remove the memmove() call and simply write out what needs doing; this
even clarifies the code, win-win! The code copies from the espfix64
stack to the normal task stack, there is no possible way for that to
overlap.
Survives selftests/x86, specifically sigreturn_64.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134058.863038566@linutronix.de
A data breakpoint near the top of an IST stack will cause unrecoverable
recursion. A data breakpoint on the GDT, IDT, or TSS is terrifying.
Prevent either of these from happening.
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134058.272448010@linutronix.de
With commit dc20b2d526 ("x86/idt: Move interrupt gate initialization to
IDT code") non assigned system vectors are also marked as used in
'used_vectors' (now 'system_vectors') bitmap. This makes checks in
arch_show_interrupts() whether a particular system vector is allocated to
always pass and e.g. 'Hyper-V reenlightenment interrupts' entry always
shows up in /proc/interrupts.
Another side effect of having all unassigned system vectors marked as used
is that irq_matrix_debug_show() will wrongly count them among 'System'
vectors.
As it is now ensured that alloc_intr_gate() is not called after init, it is
possible to leave unused entries in 'system_vectors' unset to fix these
issues.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200428093824.1451532-4-vkuznets@redhat.com
There seems to be no reason to allocate interrupt gates after init. Mark
alloc_intr_gate() as __init and add WARN_ON() checks making sure it is
only used before idt_setup_apic_and_irq_gates() finalizes IDT setup and
maps all un-allocated entries to spurious entries.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200428093824.1451532-3-vkuznets@redhat.com
machine_check is function address, the address operator on it is nop for
compiler.
Make it consistent with the other function addresses in the same file.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200419144049.1906-3-laijs@linux.alibaba.com
Pull misc uaccess updates from Al Viro:
"Assorted uaccess patches for this cycle - the stuff that didn't fit
into thematic series"
* 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()
user_regset_copyout_zero(): use clear_user()
TEST_ACCESS_OK _never_ had been checked anywhere
x86: switch cp_stat64() to unsafe_put_user()
binfmt_flat: don't use __put_user()
binfmt_elf_fdpic: don't use __... uaccess primitives
binfmt_elf: don't bother with __{put,copy_to}_user()
pselect6() and friends: take handling the combined 6th/7th args into helper
Merge even more updates from Andrew Morton:
- a kernel-wide sweep of show_stack()
- pagetable cleanups
- abstract out accesses to mmap_sem - prep for mmap_sem scalability work
- hch's user acess work
Subsystems affected by this patch series: debug, mm/pagemap, mm/maccess,
mm/documentation.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (93 commits)
include/linux/cache.h: expand documentation over __read_mostly
maccess: return -ERANGE when probe_kernel_read() fails
x86: use non-set_fs based maccess routines
maccess: allow architectures to provide kernel probing directly
maccess: move user access routines together
maccess: always use strict semantics for probe_kernel_read
maccess: remove strncpy_from_unsafe
tracing/kprobes: handle mixed kernel/userspace probes better
bpf: rework the compat kernel probe handling
bpf:bpf_seq_printf(): handle potentially unsafe format string better
bpf: handle the compat string in bpf_trace_copy_string better
bpf: factor out a bpf_trace_copy_string helper
maccess: unify the probe kernel arch hooks
maccess: remove probe_read_common and probe_write_common
maccess: rename strnlen_unsafe_user to strnlen_user_nofault
maccess: rename strncpy_from_unsafe_strict to strncpy_from_kernel_nofault
maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
maccess: update the top of file comment
maccess: clarify kerneldoc comments
maccess: remove duplicate kerneldoc comments
...
Define a new initializer for the mmap locking api. Initially this just
evaluates to __RWSEM_INITIALIZER as the API is defined as wrappers around
rwsem.
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-9-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes. Fix this up with the aid of
the below script and manual adjustments here and there.
import sys
import re
if len(sys.argv) is not 3:
print "USAGE: %s <file> <header>" % (sys.argv[0])
sys.exit(1)
hdr_to_move="#include <linux/%s>" % sys.argv[2]
moved = False
in_hdrs = False
with open(sys.argv[1], "r") as f:
lines = f.readlines()
for _line in lines:
line = _line.rstrip('
')
if line == hdr_to_move:
continue
if line.startswith("#include <linux/"):
in_hdrs = True
elif not moved and in_hdrs:
moved = True
print hdr_to_move
print line
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.
Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now the last users of show_stack() got converted to use an explicit log
level, show_stack_loglvl() can drop it's redundant suffix and become once
again well known show_stack().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's under CONFIG_IOMMU_LEAK option which is enabled by debug config.
Likely the backtrace is worth to be seen - so aligning with log level of
error message in iommu_full().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200418201944.482088-46-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Introduce show_stack_loglvl(), that eventually will substitute
show_stack().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200418201944.482088-42-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Keep log_lvl const show_trace_log_lvl() and printk_stack_address() as the
new generic show_stack_loglvl() wants to have a proper const qualifier.
And gcc rightfully produces warnings in case it's not keept:
arch/x86/kernel/dumpstack.c: In function `show_stack':
arch/x86/kernel/dumpstack.c:294:37: warning: passing argument 4 of `show_trace_log_lv ' discards `const' qualifier from pointer target type [-Wdiscarded-qualifiers]
294 | show_trace_log_lvl(task, NULL, sp, loglvl);
| ^~~~~~
arch/x86/kernel/dumpstack.c:163:32: note: expected `char *' but argument is of type `const char *'
163 | unsigned long *stack, char *log_lvl)
| ~~~~~~^~~~~~~
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200418201944.482088-41-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 srbds fixes from Thomas Gleixner:
"The 9th episode of the dime novel "The performance killer" with the
subtitle "Slow Randomizing Boosts Denial of Service".
SRBDS is an MDS-like speculative side channel that can leak bits from
the random number generator (RNG) across cores and threads. New
microcode serializes the processor access during the execution of
RDRAND and RDSEED. This ensures that the shared buffer is overwritten
before it is released for reuse. This is equivalent to a full bus
lock, which means that many threads running the RNG instructions in
parallel have the same effect as the same amount of threads issuing a
locked instruction targeting an address which requires locking of two
cachelines at once.
The mitigation support comes with the usual pile of unpleasant
ingredients:
- command line options
- sysfs file
- microcode checks
- a list of vulnerable CPUs identified by model and stepping this
time which requires stepping match support for the cpu match logic.
- the inevitable slowdown of affected CPUs"
* branch 'x86/srbds' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add Ivy Bridge to affected list
x86/speculation: Add SRBDS vulnerability and mitigation documentation
x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation
x86/cpu: Add 'table' argument to cpu_matches()
'jiffies' and 'jiffies_64' are meant to alias (two different symbols that
share the same address). Most architectures make the symbols alias to the
same address via a linker script assignment in their
arch/<arch>/kernel/vmlinux.lds.S:
jiffies = jiffies_64;
which is effectively a definition of jiffies.
jiffies and jiffies_64 are both forward declared for all architectures in
include/linux/jiffies.h. jiffies_64 is defined in kernel/time/timer.c.
x86_64 was peculiar in that it wasn't doing the above linker script
assignment, but rather was:
1. defining jiffies in arch/x86/kernel/time.c instead via the linker script.
2. overriding the symbol jiffies_64 from kernel/time/timer.c in
arch/x86/kernel/vmlinux.lds.s via 'jiffies_64 = jiffies;'.
As Fangrui notes:
In LLD, symbol assignments in linker scripts override definitions in
object files. GNU ld appears to have the same behavior. It would
probably make sense for LLD to error "duplicate symbol" but GNU ld
is unlikely to adopt for compatibility reasons.
This results in an ODR violation (UB), which seems to have survived
thus far. Where it becomes harmful is when;
1. -fno-semantic-interposition is used:
As Fangrui notes:
Clang after LLVM commit 5b22bcc2b70d
("[X86][ELF] Prefer to lower MC_GlobalAddress operands to .Lfoo$local")
defaults to -fno-semantic-interposition similar semantics which help
-fpic/-fPIC code avoid GOT/PLT when the referenced symbol is defined
within the same translation unit. Unlike GCC
-fno-semantic-interposition, Clang emits such relocations referencing
local symbols for non-pic code as well.
This causes references to jiffies to refer to '.Ljiffies$local' when
jiffies is defined in the same translation unit. Likewise, references to
jiffies_64 become references to '.Ljiffies_64$local' in translation units
that define jiffies_64. Because these differ from the names used in the
linker script, they will not be rewritten to alias one another.
2. Full LTO
Full LTO effectively treats all source files as one translation
unit, causing these local references to be produced everywhere. When
the linker processes the linker script, there are no longer any
references to jiffies_64' anywhere to replace with 'jiffies'. And
thus '.Ljiffies$local' and '.Ljiffies_64$local' no longer alias
at all.
In the process of porting patches enabling Full LTO from arm64 to x86_64,
spooky bugs have been observed where the kernel appeared to boot, but init
doesn't get scheduled.
Avoid the ODR violation by matching other architectures and define jiffies
only by linker script. For -fno-semantic-interposition + Full LTO, there
is no longer a global definition of jiffies for the compiler to produce a
local symbol which the linker script won't ensure aliases to jiffies_64.
Fixes: 40747ffa5a ("asmlinkage: Make jiffies visible")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Alistair Delva <adelva@google.com>
Debugged-by: Nick Desaulniers <ndesaulniers@google.com>
Debugged-by: Sami Tolvanen <samitolvanen@google.com>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Bob Haarman <inglorion@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # build+boot on
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/852
Link: https://lkml.kernel.org/r/20200602193100.229287-1-inglorion@google.com
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes: 9137bb27e6 ("x86/speculation: Add prctl() control for indirect branch speculation")
Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
On context switch the change of TIF_SSBD and TIF_SPEC_IB are evaluated
to adjust the mitigations accordingly. This is optimized to avoid the
expensive MSR write if not needed.
This optimization is buggy and allows an attacker to shutdown the SSBD
protection of a victim process.
The update logic reads the cached base value for the speculation control
MSR which has neither the SSBD nor the STIBP bit set. It then OR's the
SSBD bit only when TIF_SSBD is different and requests the MSR update.
That means if TIF_SSBD of the previous and next task are the same, then
the base value is not updated, even if TIF_SSBD is set. The MSR write is
not requested.
Subsequently if the TIF_STIBP bit differs then the STIBP bit is updated
in the base value and the MSR is written with a wrong SSBD value.
This was introduced when the per task/process conditional STIPB
switching was added on top of the existing SSBD switching.
It is exploitable if the attacker creates a process which enforces SSBD
and has the contrary value of STIBP than the victim process (i.e. if the
victim process enforces STIBP, the attacker process must not enforce it;
if the victim process does not enforce STIBP, the attacker process must
enforce it) and schedule it on the same core as the victim process. If
the victim runs after the attacker the victim becomes vulnerable to
Spectre V4.
To fix this, update the MSR value independent of the TIF_SSBD difference
and dependent on the SSBD mitigation method available. This ensures that
a subsequent STIPB initiated MSR write has the correct state of SSBD.
[ tglx: Handle X86_FEATURE_VIRT_SSBD & X86_FEATURE_VIRT_SSBD correctly
and massaged changelog ]
Fixes: 5bfbe3ad58 ("x86/speculation: Prepare for per task indirect branch speculation control")
Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes: 7cc765a67d ("x86/speculation: Enable prctl mode for spectre_v2_user")
Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
- Unexport various PAT primitives
- Unexport per-CPU tlbstate
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7Z+3cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jgyxAAjPoXEzi9rqGHY6Eus37DNbzHtdQj4fqN
68h8T2tSnOMzETe3L/c4puxI50YFpMA0sFbzm8BfjCtucs0K7Tj4Sv8Aoap2b99A
/bP+ySgHh2BMoI/tu9TiD8et+vttAGGwkXQhIOgeakZcYzpAY7oUNwc+CogkytbQ
DaC8s9FL7RjCXCL91fvZ33C0ksg5J9ynFbRozEHOacHPrE3CbrqUwu+75PmS7nJC
13vatOxjdqNPQhVMg7waN1nHv7K06kph1wxWxYHoD0QwAPy1ecE84wLvg9gv5AqK
BfUBmB34qRW21qbB5tQrMlGDS9tuV0vUB1fxUV7/iOKXQUH6viEG/7J7jm+YwXji
U9S54UPj/TOp8fvYdS18sp6vI1gS3HKjd3LO3pPHWsyZVMJBoGuMConZRs3C31Cp
WuwBU1gY+mFB5l4prt8WU8ocPvEnZkP00cCYNyzPk21tblfUwFbrmu3wcZxOkx3s
ZhRO4KrhxtL7l/wDLuNtWShBL2c6Rz2tts58tr/fj/M+UscJK2MPKxPLCAb20QYZ
qSkMa36+r8LkuMCyjpegEEmo4sw9yC6aLXFKfYu2ABki5o9AR4tavk+lwO+dad6T
k0DJjGXLsG9sReR6hrfaNTk5h7ImiRFDVntnWAhgKhARRoloJJS4/RkzW+ylPbac
mTuNNJDChUQ=
=RXKK
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2020-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
"Misc changes:
- Unexport various PAT primitives
- Unexport per-CPU tlbstate and uninline TLB helpers"
* tag 'x86-mm-2020-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/tlb/uv: Add a forward declaration for struct flush_tlb_info
x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m
x86/tlb: Restrict access to tlbstate
xen/privcmd: Remove unneeded asm/tlb.h include
x86/tlb: Move PCID helpers where they are used
x86/tlb: Uninline nmi_uaccess_okay()
x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site
x86/tlb: Move paravirt_tlb_remove_table() to the usage site
x86/tlb: Move __flush_tlb_all() out of line
x86/tlb: Move flush_tlb_others() out of line
x86/tlb: Move __flush_tlb_one_kernel() out of line
x86/tlb: Move __flush_tlb_one_user() out of line
x86/tlb: Move __flush_tlb_global() out of line
x86/tlb: Move __flush_tlb() out of line
x86/alternatives: Move temporary_mm helpers into C
x86/cr4: Sanitize CR4.PCE update
x86/cpu: Uninline CR4 accessors
x86/tlb: Uninline __get_current_cr3_fast()
x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k()
x86/mm: Unexport __cachemode2pte_tbl
...
Pull livepatching updates from Jiri Kosina:
- simplifications and improvements for issues Peter Ziljstra found
during his previous work on W^X cleanups.
This allows us to remove livepatch arch-specific .klp.arch sections
and add proper support for jump labels in patched code.
Also, this patchset removes the last module_disable_ro() usage in the
tree.
Patches from Josh Poimboeuf and Peter Zijlstra
- a few other minor cleanups
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
MAINTAINERS: add lib/livepatch to LIVE PATCHING
livepatch: add arch-specific headers to MAINTAINERS
livepatch: Make klp_apply_object_relocs static
MAINTAINERS: adjust to livepatch .klp.arch removal
module: Make module_enable_ro() static again
x86/module: Use text_mutex in apply_relocate_add()
module: Remove module_disable_ro()
livepatch: Remove module_disable_ro() usage
x86/module: Use text_poke() for late relocations
s390/module: Use s390_kernel_write() for late relocations
s390: Change s390_kernel_write() return type to match memcpy()
livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols
livepatch: Remove .klp.arch
livepatch: Apply vmlinux-specific KLP relocations early
livepatch: Disallow vmlinux.ko
Remove KVM_DEBUG_FS, which can easily be misconstrued as controlling
KVM-as-a-host. The sole user of CONFIG_KVM_DEBUG_FS was removed by
commit cfd8983f03 ("x86, locking/spinlocks: Remove ticket (spin)lock
implementation").
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200528031121.28904-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested virtualization
- Nested AMD event injection facelift, building on the rework of generic code
and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page fault
work, will come next week.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl7VJcYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPf6QgAq4wU5wdd1lTGz/i3DIhNVJNJgJlp
ozLzRdMaJbdbn5RpAK6PEBd9+pt3+UlojpFB3gpJh2Nazv2OzV4yLQgXXXyyMEx1
5Hg7b4UCJYDrbkCiegNRv7f/4FWDkQ9dx++RZITIbxeskBBCEI+I7GnmZhGWzuC4
7kj4ytuKAySF2OEJu0VQF6u0CvrNYfYbQIRKBXjtOwuRK4Q6L63FGMJpYo159MBQ
asg3B1jB5TcuGZ9zrjL5LkuzaP4qZZHIRs+4kZsH9I6MODHGUxKonrkablfKxyKy
CFK+iaHCuEXXty5K0VmWM3nrTfvpEjVjbMc7e1QGBQ5oXsDM0pqn84syRg==
=v7Wn
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested
virtualization
- Nested AMD event injection facelift, building on the rework of
generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch
with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host
side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page
fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
KVM: check userspace_addr for all memslots
KVM: selftests: update hyperv_cpuid with SynDBG tests
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
x86/kvm/hyper-v: Add support for synthetic debugger interface
x86/hyper-v: Add synthetic debugger definitions
KVM: selftests: VMX preemption timer migration test
KVM: nVMX: Fix VMX preemption timer migration
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
KVM: x86/pmu: Support full width counting
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
KVM: x86: acknowledgment mechanism for async pf page ready notifications
KVM: x86: interrupt based APF 'page ready' event delivery
KVM: introduce kvm_read_guest_offset_cached()
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
KVM: VMX: Replace zero-length array with flexible-array
...
- Add TPAUSE based delay which allows the CPU to enter an optimized power
state while waiting for the delay to pass. The delay is based on TSC
cycles.
- Add tsc_early_khz command line parameter to workaround the problem that
overclocked CPUs can report the wrong frequency via CPUID.16h which
causes the refined calibration to fail because the delta to the initial
frequency value is too big. With the parameter users can provide an
halfways accurate initial value.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7XvMITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQ59EACWOU2E+S/b+AqKoZRAJWbTASmu2jEU
4AukhjO3A0y+G3EqnCtvQbUbKkthScSmrDJs2Dt8CTO6q3Fqv/f5JgoubgSx9Hbj
pF1hvueOvRBpinzGEJbDbv+HbkoCYr10DZ5dZ8uz120pSnlfSNNpgZ6hJkOFaUHu
nwVEJpkg2x3ZsiJrgyOfdorwbxO5dCNY9YVL3jyVXUi5QfP3lYrr3/Nz6daIRtRn
Q9tj48N4Bk4ASgmg4rSdXd6OKeZ3Oz1nerol5vFvBeaOc8PVcKSu5sSqMIHHUV2M
RJq8T4nW5Y4pkYjpdYP7Pr/3HYbSNW6eU+MycfnJOzYYTIQfFWkG2wHDNuOg/v+A
GC/grS6wNBj/+tZlvWTwLPf44h7V+sowzYPHBWounT/5drFZ+xsm8+Je4s2NtNih
rbG/4oOQ2jn05PNBCCOyLuP33efQ3ub2UHPCoUxckMiX2eqI+iWpdllZLSiSADZY
jlbXgTQ/Fa3nGKVYVDi1GYbx1rBr/HbsbgvGV4D802s7inmev0azrbgc/CECrnvO
rEa501Y1xzxZ7Zet0QvLK/7aKP532pCmgZiBSmcnS73FBbnssvNJiHlAeq4NHtN2
TsaGYLy0iPSj7siXEaeysUKRjUTNNrgRvtWfo35GDjWahgXhixIvVVpwxnWws5cj
aNR5FwxnI03V2A==
=199V
-----END PGP SIGNATURE-----
Merge tag 'x86-timers-2020-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 timer updates from Thomas Gleixner:
"X86 timer specific updates:
- Add TPAUSE based delay which allows the CPU to enter an optimized
power state while waiting for the delay to pass. The delay is based
on TSC cycles.
- Add tsc_early_khz command line parameter to workaround the problem
that overclocked CPUs can report the wrong frequency via CPUID.16h
which causes the refined calibration to fail because the delta to
the initial frequency value is too big. With the parameter users
can provide an halfways accurate initial value"
* tag 'x86-timers-2020-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Add tsc_early_khz command line parameter
x86/delay: Introduce TPAUSE delay
x86/delay: Refactor delay_mwaitx() for TPAUSE support
x86/delay: Preparatory code cleanup
Merge updates from Andrew Morton:
"A few little subsystems and a start of a lot of MM patches.
Subsystems affected by this patch series: squashfs, ocfs2, parisc,
vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
swap, memcg, pagemap, memory-failure, vmalloc, kasan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
kasan: move kasan_report() into report.c
mm/mm_init.c: report kasan-tag information stored in page->flags
ubsan: entirely disable alignment checks under UBSAN_TRAP
kasan: fix clang compilation warning due to stack protector
x86/mm: remove vmalloc faulting
mm: remove vmalloc_sync_(un)mappings()
x86/mm/32: implement arch_sync_kernel_mappings()
x86/mm/64: implement arch_sync_kernel_mappings()
mm/ioremap: track which page-table levels were modified
mm/vmalloc: track which page-table levels were modified
mm: add functions to track page directory modifications
s390: use __vmalloc_node in stack_alloc
powerpc: use __vmalloc_node in alloc_vm_stack
arm64: use __vmalloc_node in arch_alloc_vmap_stack
mm: remove vmalloc_user_node_flags
mm: switch the test_vmalloc module to use __vmalloc_node
mm: remove __vmalloc_node_flags_caller
mm: remove both instances of __vmalloc_node_flags
mm: remove the prot argument to __vmalloc_node
mm: remove the pgprot argument to __vmalloc
...
Remove fault handling on vmalloc areas, as the vmalloc code now takes
care of synchronizing changes to all page-tables in the system.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200515140023.25469-8-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull uaccess/coredump updates from Al Viro:
"set_fs() removal in coredump-related area - mostly Christoph's
stuff..."
* 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
binfmt_elf: remove the set_fs in fill_siginfo_note
signal: refactor copy_siginfo_to_user32
powerpc/spufs: simplify spufs core dumping
powerpc/spufs: stop using access_ok
powerpc/spufs: fix copy_to_user while atomic
- Branch Target Identification (BTI)
* Support for ARMv8.5-BTI in both user- and kernel-space. This
allows branch targets to limit the types of branch from which
they can be called and additionally prevents branching to
arbitrary code, although kernel support requires a very recent
toolchain.
* Function annotation via SYM_FUNC_START() so that assembly
functions are wrapped with the relevant "landing pad"
instructions.
* BPF and vDSO updates to use the new instructions.
* Addition of a new HWCAP and exposure of BTI capability to
userspace via ID register emulation, along with ELF loader
support for the BTI feature in .note.gnu.property.
* Non-critical fixes to CFI unwind annotations in the sigreturn
trampoline.
- Shadow Call Stack (SCS)
* Support for Clang's Shadow Call Stack feature, which reserves
platform register x18 to point at a separate stack for each
task that holds only return addresses. This protects function
return control flow from buffer overruns on the main stack.
* Save/restore of x18 across problematic boundaries (user-mode,
hypervisor, EFI, suspend, etc).
* Core support for SCS, should other architectures want to use it
too.
* SCS overflow checking on context-switch as part of the existing
stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
- CPU feature detection
* Removed numerous "SANITY CHECK" errors when running on a system
with mismatched AArch32 support at EL1. This is primarily a
concern for KVM, which disabled support for 32-bit guests on
such a system.
* Addition of new ID registers and fields as the architecture has
been extended.
- Perf and PMU drivers
* Minor fixes and cleanups to system PMU drivers.
- Hardware errata
* Unify KVM workarounds for VHE and nVHE configurations.
* Sort vendor errata entries in Kconfig.
- Secure Monitor Call Calling Convention (SMCCC)
* Update to the latest specification from Arm (v1.2).
* Allow PSCI code to query the SMCCC version.
- Software Delegated Exception Interface (SDEI)
* Unexport a bunch of unused symbols.
* Minor fixes to handling of firmware data.
- Pointer authentication
* Add support for dumping the kernel PAC mask in vmcoreinfo so
that the stack can be unwound by tools such as kdump.
* Simplification of key initialisation during CPU bringup.
- BPF backend
* Improve immediate generation for logical and add/sub
instructions.
- vDSO
- Minor fixes to the linker flags for consistency with other
architectures and support for LLVM's unwinder.
- Clean up logic to initialise and map the vDSO into userspace.
- ACPI
- Work around for an ambiguity in the IORT specification relating
to the "num_ids" field.
- Support _DMA method for all named components rather than only
PCIe root complexes.
- Minor other IORT-related fixes.
- Miscellaneous
* Initialise debug traps early for KGDB and fix KDB cacheflushing
deadlock.
* Minor tweaks to early boot state (documentation update, set
TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
* Refactoring and cleanup
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
=w3qi
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"A sizeable pile of arm64 updates for 5.8.
Summary below, but the big two features are support for Branch Target
Identification and Clang's Shadow Call stack. The latter is currently
arm64-only, but the high-level parts are all in core code so it could
easily be adopted by other architectures pending toolchain support
Branch Target Identification (BTI):
- Support for ARMv8.5-BTI in both user- and kernel-space. This allows
branch targets to limit the types of branch from which they can be
called and additionally prevents branching to arbitrary code,
although kernel support requires a very recent toolchain.
- Function annotation via SYM_FUNC_START() so that assembly functions
are wrapped with the relevant "landing pad" instructions.
- BPF and vDSO updates to use the new instructions.
- Addition of a new HWCAP and exposure of BTI capability to userspace
via ID register emulation, along with ELF loader support for the
BTI feature in .note.gnu.property.
- Non-critical fixes to CFI unwind annotations in the sigreturn
trampoline.
Shadow Call Stack (SCS):
- Support for Clang's Shadow Call Stack feature, which reserves
platform register x18 to point at a separate stack for each task
that holds only return addresses. This protects function return
control flow from buffer overruns on the main stack.
- Save/restore of x18 across problematic boundaries (user-mode,
hypervisor, EFI, suspend, etc).
- Core support for SCS, should other architectures want to use it
too.
- SCS overflow checking on context-switch as part of the existing
stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
CPU feature detection:
- Removed numerous "SANITY CHECK" errors when running on a system
with mismatched AArch32 support at EL1. This is primarily a concern
for KVM, which disabled support for 32-bit guests on such a system.
- Addition of new ID registers and fields as the architecture has
been extended.
Perf and PMU drivers:
- Minor fixes and cleanups to system PMU drivers.
Hardware errata:
- Unify KVM workarounds for VHE and nVHE configurations.
- Sort vendor errata entries in Kconfig.
Secure Monitor Call Calling Convention (SMCCC):
- Update to the latest specification from Arm (v1.2).
- Allow PSCI code to query the SMCCC version.
Software Delegated Exception Interface (SDEI):
- Unexport a bunch of unused symbols.
- Minor fixes to handling of firmware data.
Pointer authentication:
- Add support for dumping the kernel PAC mask in vmcoreinfo so that
the stack can be unwound by tools such as kdump.
- Simplification of key initialisation during CPU bringup.
BPF backend:
- Improve immediate generation for logical and add/sub instructions.
vDSO:
- Minor fixes to the linker flags for consistency with other
architectures and support for LLVM's unwinder.
- Clean up logic to initialise and map the vDSO into userspace.
ACPI:
- Work around for an ambiguity in the IORT specification relating to
the "num_ids" field.
- Support _DMA method for all named components rather than only PCIe
root complexes.
- Minor other IORT-related fixes.
Miscellaneous:
- Initialise debug traps early for KGDB and fix KDB cacheflushing
deadlock.
- Minor tweaks to early boot state (documentation update, set
TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
- Refactoring and cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
KVM: arm64: Check advertised Stage-2 page size capability
arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
ACPI/IORT: Remove the unused __get_pci_rid()
arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
arm64/cpufeature: Introduce ID_MMFR5 CPU register
arm64/cpufeature: Introduce ID_DFR1 CPU register
arm64/cpufeature: Introduce ID_PFR2 CPU register
arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
arm64: mm: Add asid_gen_match() helper
firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
arm64: vdso: Fix CFI directives in sigreturn trampoline
arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
...
it removes unnecessary functions and cleans up the rest.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VNO4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i61RAAogLRVi4ga4vmTk5SqUqtR4pupbHJv5IM
IjkQN0HZ3+Oi6kRxwuOQ9xOOzQWm8GntkZeyN5FA73H7x+bYdU12MIKKTEDcW3xp
Mg9FtzfeL0V4YmNkmlnIXycyYA3nBdSxnI/OL/58J9CLT15qXYkWjyvkbI2aJ3qL
U8xM5cTTvhoARjd43o0eAfekTg0XdUAsgvO0vOM5+I1HrQP8SR3ZIFaMSR+MfAQx
Nbz/UVUSDJ8BNzmS/CfFLFm0F2dkphlLC0r6eAOFZAYSIax0bRVklxV9qdScEQMK
bkVKXGanCzVTBVM1HXDycLJaILlqcS18tK+VqNIAR5x2BXmaSG8jqwCW4NM0tcaN
c5zemNsqnAH/VzxeFjE2BcDQnA1nkgj75Vm9O81HMQfyqR16M5pBzRXY/qBqslya
vX5wLoD962BiVtbELqW6v+Ot29xMYlCLLlTbLHaWQraJS3TjuAvL0/sOFdWgs63F
N7a+BLvikfYoKCS8IxW87BFBysy9nhv/4UwdaX5RpIQ1wgx/EJLDaowrM+L6Dzmw
bhQ3AgRZGNZCBDm3uGU/LigTTxN93h5KqKnuUKv3H+tNvEKxPKEAqPyvL8fO8G0U
BTJiM/XRIzQrkrmwCKqON1iKRjKB2fKklxiq4REIoSqKfeiQ3SVBIHENFnH96Ekf
G9qptwFEZYI=
=L8Vy
-----END PGP SIGNATURE-----
Merge tag 'x86-platform-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
"This tree cleans up various aspects of the UV platform support code,
it removes unnecessary functions and cleans up the rest"
* tag 'x86-platform-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/uv: Remove code for unused distributed GRU mode
x86/platform/uv: Remove the unused _uv_cpu_blade_processor_id() macro
x86/platform/uv: Unexport uv_apicid_hibits
x86/platform/uv: Remove _uv_hub_info_check()
x86/platform/uv: Simplify uv_send_IPI_one()
x86/platform/uv: Mark uv_min_hub_revision_id static
x86/platform/uv: Mark is_uv_hubless() static
x86/platform/uv: Remove the UV*_HUB_IS_SUPPORTED macros
x86/platform/uv: Unexport symbols only used by x2apic_uv_x.c
x86/platform/uv: Unexport sn_coherency_id
x86/platform/uv: Remove the uv_partition_coherence_id() macro
x86/platform/uv: Mark uv_bios_call() and uv_bios_call_irqsave() static
which is a feature that allows kernel-only data to be automatically
saved/restored by the FPU context switching code.
CPU features that can be supported this way are Intel PT, 'PASID' and
CET features.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VMZgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jmAQ/7BJpyAHUjFJdChtkvUmLcBgI2qnxP7rc8
Eh/tSo4PKh484Uqb4WY6XAHIAPBzEt3rHJG3fdaavzlUl98YJCdD9tstfwMPcCQ4
L4c2Ru+h+mPQCMOZUctOphPjDzGWPzR4IhceH6gqhoS4vg9EqgN4o158x4jW6KFN
Jlocp9CMfIaGSmaMlRrIUZ4Dj3mgboqqHsuCaibtaKAMK6LqZQDViTEal4mNbESX
KQPOFpKrhoq6Jtzzer7fLPY2qb6kkLrL03X5IUGFP5UxigSejnfrI9SZpAuPP9S0
kdN04Jo0T2aBIAikBTVhDWdLMJk19qeu7YXBrFEVbyhZHl1HdDqOhMdWPOp1GH9W
CtGUalbIvz/5FbXuUImiiNh/bw2FxYjHsrDguW96IvMVFteucrFg9QyL+taYb1cV
WqWdpIC0VoMuQxQI5FBWu4Bb/cLNV9VCxWAZjZQ806kwmyDxldsw5mucMGmH3+bO
LD6bwRShSMRzI9bzcJSG+Z3y7Fe8b5IGNjCjzgPb88ezffBEFHzIEKdCL6QTNlRF
6UgSGbRs41SqXwNw5tdQQNwPpDO73p+KVRGoEzyMJvojLKRGTcOHHUDriGZ30MNX
3oHvLf5+dNrLC/frbOqUmQ7doBQOplR5VxlZVwwqkdpPw13Jf5zn4ewzriTOmKCq
mEHMQmbkyi4=
=M+BC
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
"Most of the changes here related to 'XSAVES supervisor state' support,
which is a feature that allows kernel-only data to be automatically
saved/restored by the FPU context switching code.
CPU features that can be supported this way are Intel PT, 'PASID' and
CET features"
* tag 'x86-fpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Restore supervisor states for signal return
x86/fpu/xstate: Preserve supervisor states for the slow path in __fpu__restore_sig()
x86/fpu: Introduce copy_supervisor_to_kernel()
x86/fpu/xstate: Update copy_kernel_to_xregs_err() for supervisor states
x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates
x86/fpu/xstate: Define new functions for clearing fpregs and xstates
x86/fpu/xstate: Introduce XSAVES supervisor states
x86/fpu/xstate: Separate user and supervisor xfeatures mask
x86/fpu/xstate: Define new macros for supervisor and user xstates
x86/fpu/xstate: Rename validate_xstate_header() to validate_user_xstate_header()
- Extend the x86 family/model macros with a steppings dimension,
because x86 life isn't complex enough and Intel uses steppings to
differentiate between different CPUs. :-/
- Convert the TSC deadline timer quirks to the steppings macros.
- Clean up asm mnemonics.
- Fix the handling of an AMD erratum, or in other words, fix a kernel erratum.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VL2wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hxYQ//dic//qQ+4GOL1wP1Qj8EiGOzaWdynoia
oDi7Si1I9vy58iCCRgkQKmxEwfnHcM5+NC6/091S4BD2IE6o+iD1YhPsGZK8DT4Y
FmeD8pgtx5LMJFMBe6KRyek1s0JblP6v0Q0BwUk7YtV6k0oSP+f/2n5BGj2+P7YH
3Iw438M5JhIrzVp3PnCgJoZkSm9iRnZqbBtR8nd2SO+vx8M75cX27LL6fdaCypRj
wH9w6+J2NhAZStmEv54LKOdO5RAPJjvatbTZFMEFdceAGFEbHPJIees7paoC+DTP
3BuhzF/9ghDNKly6Zz3PtyNNDP1vglZ1W9dJkCfTXUWlZKbQV94Yk+JbP5mndxqn
+f3eD/dInofHiCeAh1Sfj3BCGdOSjgFMBB57CKkCy4LehXwJ9C2eBcbxd4XMfEkd
h0EywZrp1L10AxDHtq5x82xf1fwfTDyvlYmJrBshXfiitaySn+mPVJMuj3wvqJSP
WKbJS4HfkekIaf9WoUA+Ay6FJdY7nNirViRrQEZVmDPTV0EDfcaNM5p6Ttkja3Ph
VoVa8Ms8FRqTfh6xCfckYR+vI44U+AFNLM6YFyetGYc0yVXNzg3vLy2DbqLRolWy
t1upDdNf1TMJg4BaMrBzZgDg/uI2BM3jeOj69U0cboO2JhJjxjl3qPeiYDKD50MK
Z1Nho933894=
=QKjn
-----END PGP SIGNATURE-----
Merge tag 'x86-cpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Ingo Molnar:
"Misc updates:
- Extend the x86 family/model macros with a steppings dimension,
because x86 life isn't complex enough and Intel uses steppings to
differentiate between different CPUs. :-/
- Convert the TSC deadline timer quirks to the steppings macros.
- Clean up asm mnemonics.
- Fix the handling of an AMD erratum, or in other words, fix a kernel
erratum"
* tag 'x86-cpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Use RDRAND and RDSEED mnemonics in archrandom.h
x86/cpu: Use INVPCID mnemonic in invpcid.h
x86/cpu/amd: Make erratum #1054 a legacy erratum
x86/apic: Convert the TSC deadline timer matching to steppings macro
x86/cpu: Add a X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro
x86/cpu: Add a steppings field to struct x86_cpu_id
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
KiYX7Dk5uhQ=
=0ZQd
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, with an emphasis on removing obsolete/dead code"
* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/spinlock: Remove obsolete ticket spinlock macros and types
x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
x86/apb_timer: Drop unused declaration and macro
x86/apb_timer: Drop unused TSC calibration
x86/io_apic: Remove unused function mp_init_irq_at_boot()
x86/mm: Stop printing BRK addresses
x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
x86/nmi: Remove edac.h include leftover
mm: Remove MPX leftovers
x86/mm/mmap: Fix -Wmissing-prototypes warnings
x86/early_printk: Remove unused includes
crash_dump: Remove no longer used saved_max_pfn
x86/smpboot: Remove the last ICPU() macro
- Add the initrdmem= boot option to specify an initrd embedded in RAM (flash most likely)
- Sanitize the CS value earlier during boot, which also fixes SEV-ES.
- Various fixes and smaller cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VKk8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i25w/6A8okusHJMXyXMYddRHNiL57x3DcTRsTO
09Wz7e0YrL53HqQEyaqtSam/0VqgSaHDQb/gRb2Ci0G+XzZ3BFYvICVWTW6NcvnA
VSUoHC8Mr83Aq3UfAEcJZZ0bHNuoKymO256v2tZPGCSGgZoxQdoe4/6W1uMxxjLr
NFpeyAm93zTe1+MmA/ZcFxH+xOZPYVPhl7+KgO3muMH/hGoS3Dt+RCuB9VHTgMvf
4mN6IxN3cVHDogt7usdtWjgrYnhY0SjiWo858+MDWsrW5oXifsXLJ5jJr1Ea1nGx
qqVyaCqAVNobOkpsBLHg1DiD/rr9A4sfS/etmAjWsPO6kAx9Mq9+B2DG5fTU/gB+
zd76M3Jl3wyjdy6hPMyiZGlFFM9l3efyp/iYPhFWgPqVlkkOvbO+9FWVDbFtErQw
WpEG2d8KHN4+ph8D04ExeKJKCKaYnAaHKk13fZnjjeQhatyGGAYn6hx+rT/x+onM
2CeRG/+KcnlzKgXqYX6/YT++XlaCKgMntO/FdLT99/4CD92rqQdhwJ6JNH1U8nXO
LWjrV5ZH6R3n5Hr5+J/Kcd9/kIfAqWG3t/eiTEPEjJIUWXEdhBoQWErSce4on5a7
6eBfkKEQxIYAdC1iO2uoKEtEpMDvFWoIIVjdlVTFiJ8Np9uvv7lPByr/0TJ+N5b7
fgOrzglWuxo=
=U/uh
-----END PGP SIGNATURE-----
Merge tag 'x86-boot-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
"Misc updates:
- Add the initrdmem= boot option to specify an initrd embedded in RAM
(flash most likely)
- Sanitize the CS value earlier during boot, which also fixes SEV-ES
- Various fixes and smaller cleanups"
* tag 'x86-boot-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Correct relocation destination on old linkers
x86/boot/compressed/64: Switch to __KERNEL_CS after GDT is loaded
x86/boot: Fix -Wint-to-pointer-cast build warning
x86/boot: Add kstrtoul() from lib/
x86/tboot: Mark tboot static
x86/setup: Add an initrdmem= option to specify initrd physical address
- Add AMD Fam17h RAPL support
- Introduce CAP_PERFMON to kernel and user space
- Add Zhaoxin CPU support
- Misc fixes and cleanups
Tooling changes:
perf record:
- Introduce --switch-output-event to use arbitrary events to be setup
and read from a side band thread and, when they take place a signal
be sent to the main 'perf record' thread, reusing the --switch-output
code to take perf.data snapshots from the --overwrite ring buffer, e.g.:
# perf record --overwrite -e sched:* \
--switch-output-event syscalls:*connect* \
workload
will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
connect syscalls.
- Add --num-synthesize-threads option to control degree of parallelism of the
synthesize_mmap() code which is scanning /proc/PID/task/PID/maps and can be
time consuming. This mimics pre-existing behaviour in 'perf top'.
perf bench:
- Add a multi-threaded synthesize benchmark.
- Add kallsyms parsing benchmark.
Intel PT support:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
- Allow using Intel PT to synthesize callchains for regular events.
- Add support for synthesizing branch stacks for regular events (cycles,
instructions, etc) from Intel PT data.
Misc changes:
- Updated perf vendor events for power9 and Coresight.
- Add flamegraph.py script via 'perf flamegraph'
- Misc other changes, fixes and cleanups - see the Git log for details.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VJAcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hAYw/8DFtzGkMaaWkrDSj62LXtWQiqr1l01ZFt
9GzV4aN4/go+K4BQtsQN8cUjOkRHFnOryLuD9LfSBfqsdjuiyTynV/cJkeUGQBck
TT/GgWf3XKJzTUBRQRk367Gbqs9UKwBP8CdFhOXcNzGEQpjhbwwIDPmem94U4L1N
XLsysgC45ejWL1kMTZKmk6hDIidlFeDg9j70WDPX1nNfCeisk25rxwTpdgvjsjcj
3RzPRt2EGS+IkuF4QSCT5leYSGaCpVDHCQrVpHj57UoADfWAyC71uopTLG4OgYSx
PVd9gvloMeeqWmroirIxM67rMd/TBTfVekNolhnQDjqp60Huxm+gGUYmhsyjNqdx
Pb8HRZCBAudei9Ue4jNMfhCRK2Ug1oL5wNvN1xcSteAqrwMlwBMGHWns6l12x0ks
BxYhyLvfREvnKijXc1o8D5paRgqohJgfnHlrUZeacyaw5hQCbiVRpwg0T1mWAF53
u9hfWLY0Oy+Qs2C7EInNsWSYXRw8oPQNTFVx2I968GZqsEn4DC6Pt3ovWrDKIDnz
ugoZJQkJ3/O8stYSMiyENehdWlo575NkapCTDwhLWnYztrw4skqqHE8ighU/e8ug
o/Kx7ANWN9OjjjQpq2GVUeT0jCaFO+OMiGMNEkKoniYgYjogt3Gw5PeedBMtY07p
OcWTiQZamjU=
=i27M
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Add AMD Fam17h RAPL support
- Introduce CAP_PERFMON to kernel and user space
- Add Zhaoxin CPU support
- Misc fixes and cleanups
Tooling changes:
- perf record:
Introduce '--switch-output-event' to use arbitrary events to be
setup and read from a side band thread and, when they take place a
signal be sent to the main 'perf record' thread, reusing the core
for '--switch-output' to take perf.data snapshots from the ring
buffer used for '--overwrite', e.g.:
# perf record --overwrite -e sched:* \
--switch-output-event syscalls:*connect* \
workload
will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
connect syscalls.
Add '--num-synthesize-threads' option to control degree of
parallelism of the synthesize_mmap() code which is scanning
/proc/PID/task/PID/maps and can be time consuming. This mimics
pre-existing behaviour in 'perf top'.
- perf bench:
Add a multi-threaded synthesize benchmark and kallsyms parsing
benchmark.
- Intel PT support:
Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
Allow using Intel PT to synthesize callchains for regular events.
Add support for synthesizing branch stacks for regular events
(cycles, instructions, etc) from Intel PT data.
Misc changes:
- Updated perf vendor events for power9 and Coresight.
- Add flamegraph.py script via 'perf flamegraph'
- Misc other changes, fixes and cleanups - see the Git log for details
Also, since over the last couple of years perf tooling has matured and
decoupled from the kernel perf changes to a large degree, going
forward Arnaldo is going to send perf tooling changes via direct pull
requests"
* tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (163 commits)
perf/x86/rapl: Add AMD Fam17h RAPL support
perf/x86/rapl: Make perf_probe_msr() more robust and flexible
perf/x86/rapl: Flip logic on default events visibility
perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUs
perf/x86/rapl: Move RAPL support to common x86 code
perf/core: Replace zero-length array with flexible-array
perf/x86: Replace zero-length array with flexible-array
perf/x86/intel: Add more available bits for OFFCORE_RESPONSE of Intel Tremont
perf/x86/rapl: Add Ice Lake RAPL support
perf flamegraph: Use /bin/bash for report and record scripts
perf cs-etm: Move definition of 'traceid_list' global variable from header file
libsymbols kallsyms: Move hex2u64 out of header
libsymbols kallsyms: Parse using io api
perf bench: Add kallsyms parsing
perf: cs-etm: Update to build with latest opencsd version.
perf symbol: Fix kernel symbol address display
perf inject: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf annotate: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf trace: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf script: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
...
- Speed up objtool significantly, especially when there are large number of sections
- Improve objtool's understanding of special instructions such as IRET,
to reduce the number of annotations required
- Implement 'noinstr' validation
- Do baby steps for non-x86 objtool use
- Simplify/fix retpoline decoding
- Add vmlinux validation
- Improve documentation
- Fix various bugs and apply smaller cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VHvcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gEfBAAhvPWljUmfQsetYq4q9BdbuC4xPSQN9ra
e+2zu1MQaohkjAdFM1boNVhCCGKFUvlTEEw3GJR141Us6Y/ZRS8VIo70tmVSku6I
OwuR5i8SgEKwurr1SwLxrI05rovYWRLSaDIRTHn2CViPEjgriyFGRV8QKam3AYmI
dx47la3ELwuQR68nIdIMzDRt49oZVy+ZKW8Pjgjklzrd5KMYsPy7HPtraHUMeDg+
GdoC7RresIt5AFiDiIJzKTT/jROI7KuHFnM6blluKHoKenWhYBFCz3sd6IvCdQWX
JGy+KKY6H+YDMSpgc4FRP56M3GI0hX14oCd7L72epSLfOuzPr9Tmf6wfyQ8f50Je
LGLD47tyltIcQR9H85YdR8UQspkjSW6xcql4ByCPTEqp0UzSGTsVntvsHzwsgz6A
Csh3s+DVdv0rk5ZjMCu8STA2oErpehJm7fmugt2oLx+nsCNCBUI25lilw5JGeq5c
+cO0IRxRxHPeRvMWvItTjbixVAHOHYlB00ilDbvsm+GnTJgu/5cMqpXdLvfXI2Rr
nl360bSS3t3J4w5rX0mXw4x24vjQmVrA69jU+oo8RSHje2X8Y4Q7sFHNjmN0YAI3
Re8aP6HSLQjioJxGz9aISlrxmPOXe0CMp8JE586SREVgmS/olXtidMgi7l12uZ2B
cRdtNYcn31U=
=dbCU
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"There are a lot of objtool changes in this cycle, all across the map:
- Speed up objtool significantly, especially when there are large
number of sections
- Improve objtool's understanding of special instructions such as
IRET, to reduce the number of annotations required
- Implement 'noinstr' validation
- Do baby steps for non-x86 objtool use
- Simplify/fix retpoline decoding
- Add vmlinux validation
- Improve documentation
- Fix various bugs and apply smaller cleanups"
* tag 'objtool-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
objtool: Enable compilation of objtool for all architectures
objtool: Move struct objtool_file into arch-independent header
objtool: Exit successfully when requesting help
objtool: Add check_kcov_mode() to the uaccess safelist
samples/ftrace: Fix asm function ELF annotations
objtool: optimize add_dead_ends for split sections
objtool: use gelf_getsymshndx to handle >64k sections
objtool: Allow no-op CFI ops in alternatives
x86/retpoline: Fix retpoline unwind
x86: Change {JMP,CALL}_NOSPEC argument
x86: Simplify retpoline declaration
x86/speculation: Change FILL_RETURN_BUFFER to work with objtool
objtool: Add support for intra-function calls
objtool: Move the IRET hack into the arch decoder
objtool: Remove INSN_STACK
objtool: Make handle_insn_ops() unconditional
objtool: Rework allocating stack_ops on decode
objtool: UNWIND_HINT_RET_OFFSET should not check registers
objtool: is_fentry_call() crashes if call has no destination
x86,smap: Fix smap_{save,restore}() alternatives
...
- RCU-tasks update, including addition of RCU Tasks Trace for
BPF use and TASKS_RUDE_RCU
- kfree_rcu() updates.
- Remove scheduler locking restriction
- RCU CPU stall warning updates.
- Torture-test updates.
- Miscellaneous fixes and other updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7U/r0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hSNxAAirKhPGBoLI9DW1qde4OFhZg+BlIpS+LD
IE/0eGB8hGwhb1793RGbzIJfSnRQpSOPxWbWc6DJZ4Zpi5/ZbVkiPKsuXpM1xGxs
kuBCTOhWy1/p3iCZ1JH/JCrCAdWGZkIzEoaV7ipnHtV/+UrRbCWH5PB7R0fYvcbI
q5bUcWJyEp/bYMxQn8DhAih6SLPHx+F9qaGAqqloLSHstTYG2HkBhBGKnqcd/Jex
twkLK53poCkeP/c08V1dyagU2IRWj2jGB1NjYh/Ocm+Sn/vru15CVGspjVjqO5FF
oq07lad357ddMsZmKoM2F5DhXbOh95A+EqF9VDvIzCvfGMUgqYI1oxWF4eycsGhg
/aYJgYuN23YeEe2DkDzJB67GvBOwl4WgdoFaxKRzOiCSfrhkM8KqM4G9Fz1JIepG
abRJCF85iGcLslU9DkrShQiDsd/CRPzu/jz6ybK0I2II2pICo6QRf76T7TdOvKnK
yXwC6OdL7/dwOht20uT6XfnDXMCWI4MutiUrb8/C1DbaihwEaI2denr3YYL+IwrB
B38CdP6sfKZ5UFxKh0xb+sOzWrw0KA+ThSAXeJhz3tKdxdyB6nkaw3J9lFg8oi20
XGeAujjtjMZG5cxt2H+wO9kZY0RRau/nTqNtmmRrCobd5yJjHHPHH8trEd0twZ9A
X5Wjh11lv3E=
=Yisx
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"The RCU updates for this cycle were:
- RCU-tasks update, including addition of RCU Tasks Trace for BPF use
and TASKS_RUDE_RCU
- kfree_rcu() updates.
- Remove scheduler locking restriction
- RCU CPU stall warning updates.
- Torture-test updates.
- Miscellaneous fixes and other updates"
* tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
rcu: Allow for smp_call_function() running callbacks from idle
rcu: Provide rcu_irq_exit_check_preempt()
rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
rcu: Provide __rcu_is_watching()
rcu: Provide rcu_irq_exit_preempt()
rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
rcu/tree: Mark the idle relevant functions noinstr
x86: Replace ist_enter() with nmi_enter()
x86/mce: Send #MC singal from task work
x86/entry: Get rid of ist_begin/end_non_atomic()
sched,rcu,tracing: Avoid tracing before in_nmi() is correct
sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
lockdep: Always inline lockdep_{off,on}()
hardirq/nmi: Allow nested nmi_enter()
arm64: Prepare arch_nmi_enter() for recursion
printk: Disallow instrumenting print_nmi_enter()
printk: Prepare for nested printk_nmi_enter()
rcutorture: Convert ULONG_CMP_LT() to time_before()
torture: Add a --kasan argument
torture: Save a few lines by using config_override_param initially
...
their width from CPUID. As a prerequsite, streamline and unify the CPUID
detection of the respective resource control attributes. By Reinette
Chatre.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7VNQYACgkQEsHwGGHe
VUqFgw//Qr6Ot21wtZAgVSCOPJ7rWH4aR6Bbq9YN0/kUGk2PbAjfFK3aMekHd4u6
+cTc1eroOiWc5mXurbJ6gfmDgTYHj09DII/qXIi9wxWHPyAZ3cBltGXG9A86lIHZ
hyA6l3Z+RzhnIKoATbXzaEy1nEoGMgCsezuGmN2d1KM2Fl8dmYHT+88iMDSazNb3
D51+HHvUF03+h41A11mw7Omknycpk3wOmNX0A+t55EExW8xrYjjenY+suWvhDoGj
+uqDVhYjxvVziI0lsAfcDfN3R3MuPMBXJXtwf9qaZODGs4ttTg/lyHRcTqGBVwrf
xGuDw0qRLvDKy9UC09H5F7ArScx9X8bOu9NFjNA2AZyLLdhVpTa5AuX1T7VroRkD
rcDx++EEjVEX36wBoGbwrfZTcaL3er8sPVZiXG1dDc5/GqWfP+abx4G1JPG9xRYH
V4oghEsBAJXRfGt07PTNSTqDWu/dQWmMUIVfGWX97l+5ED+HlL24MVoThkkH6f2M
7uAj8IRsbgzn207nZD8DNorARcQVsK2VvzgZzbqa0VsArt57Xk8DSZpfsqw8B7OJ
OwuA/0S6nFQX9g6C0eLR68WPwv6YrLjKMEO7KJukNbaN/pcVGKXaJ/dSTwpsNQbC
JZeAuc7DSYJ3Iv+xh64DvpL3hvb46J1UMRCWoVFw3ZQcnGV6t0E=
=JeAm
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Borislav Petkov:
"Add support for wider Memory Bandwidth Monitoring counters by querying
their width from CPUID.
As a prerequsite for that, streamline and unify the CPUID detection of
the respective resource control attributes.
By Reinette Chatre"
* tag 'x86_cache_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Support wider MBM counters
x86/resctrl: Support CPUID enumeration of MBM counter width
x86/resctrl: Maintain MBM counter width per resource
x86/resctrl: Query LLC monitoring properties once during boot
x86/resctrl: Remove unnecessary RMID checks
x86/cpu: Move resctrl CPUID code to resctrl/
x86/resctrl: Rename asm/resctrl_sched.h to asm/resctrl.h
value from stop_machine(), from Mihai Carabas.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7UxoEACgkQEsHwGGHe
VUrQdxAArdz2s/EtyhFeacKGp6nFG8wHyMVFbwJ/6ZzKcaX6zVc4PI/5O1837ls7
rFMZt5r21kxfLB3/wMAOZGI+ZP7i6IXzcwBI5/BmS+YK+t3PqWeT+iTNo8hr9tI/
d8Xly4sE/CIrPZduZPnNVsrRdzqKDs/KMnnPTxZWVNDWMVOKHJZtJ2Ty8eHZsgwl
b4yBL1JiZHELSb9SrMhZfortogB2eSUaFABWYJMhGJ8XHQ6AZ+A3EB+he/9Zu3Wu
Giz4LvnhCGJyhTLaDHRUhMfLHo1knl6LNS6QNqVSP82TKRlX3AVeDnHST968BeTr
ronLTvOVkkZcpvk5ukeSqcBFhxiio9R1rUbkfZlYPt63m/6uWCiMzzOGXB+JTtYc
5of95CXehYj41XlQVkQtJJmoysYdt7JJw0w5+Cr3Uuov/RKOEiCdrgemOxOmIcM+
YJ8m+lTn95+8PXFjg/kvweZA7rXr1HcPhfmd9tCMha2k6b1MbdaMT3xb+m1vGXD/
BRojkuqf7OK19T/Owcum6A/oBmjuNjPZPL5HapQ9ZbMz6AZ3InRmaU+8EvQLIer7
iimQYWzTTdlZsresJh2+itPMf1EVyHVRnzFlx/N1BMhAxpR2aYXwGA5WKxk10p7U
80iejJntiNwXJmCHXXiQ55Dyii0vZykJv2FbGjLF4xUUGB/zxW8=
=g4rq
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode update from Borislav Petkov:
"A single fix for late microcode loading to handle the correct return
value from stop_machine(), from Mihai Carabas"
* tag 'x86_microcode_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Fix return value for microcode late loading
Currently, APF mechanism relies on the #PF abuse where the token is being
passed through CR2. If we switch to using interrupts to deliver page-ready
notifications we need a different way to pass the data. Extent the existing
'struct kvm_vcpu_pv_apf_data' with token information for page-ready
notifications.
While on it, rename 'reason' to 'flags'. This doesn't change the semantics
as we only have reasons '1' and '2' and these can be treated as bit flags
but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery
making 'reason' name misleading.
The newly introduced apf_put_user_ready() temporary puts both flags and
token information, this will be changed to put token only when we switch
to interrupt based notifications.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
In the copy_process() routine called by _do_fork(), failure to allocate
a PID (or further along in the function) will trigger an invocation to
exit_thread(). This is done to clean up from an earlier call to
copy_thread_tls(). Naturally, the child task is passed into exit_thread(),
however during the process, io_bitmap_exit() nullifies the parent's
io_bitmap rather than the child's.
As copy_thread_tls() has been called ahead of the failure, the reference
count on the calling thread's io_bitmap is incremented as we would expect.
However, io_bitmap_exit() doesn't accept any arguments, and thus assumes
it should trash the current thread's io_bitmap reference rather than the
child's. This is pretty sneaky in practice, because in all instances but
this one, exit_thread() is called with respect to the current task and
everything works out.
A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to
get a bitmap allocated, and force a clone3() syscall to fail by passing
in a zeroed clone_args structure. The kernel handles the erroneous struct
and the buggy code path is followed, and even though the parent's reference
to the io_bitmap is trashed, the child still holds a reference and thus
the structure will never be freed.
Fix this by tweaking io_bitmap_exit() and its subroutines to accept a
task_struct argument which to operate on.
Fixes: ea5f1cd7ab ("x86/ioperm: Remove bitmap if all permissions dropped")
Signed-off-by: Jay Lang <jaytlang@mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable#@vger.kernel.org
Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
Icelake microserver CPU supports split lock detection while it doesn't
have the split lock enumeration bit in IA32_CORE_CAPABILITIES. Tigerlake
CPUs do enumerate the MSR.
[ bp: Merge the two model-adding patches into one. ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1588290395-2677-1-git-send-email-fenghua.yu@intel.com
copy the corresponding pieces of init_fpstate into the gaps instead.
Cc: stable@kernel.org
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Drop the APB-timer TSC calibration, which hasn't been used since the
removal of Moorestown support by commit
1a8359e411 ("x86/mid: Remove Intel Moorestown").
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200513100944.9171-1-johan@kernel.org
The commit
c84cb3735f ("x86/apic: Move TSC deadline timer debug printk")
removed the message which said that the deadline timer was enabled.
It added a pr_debug() message which is issued when deadline timer
validation succeeds.
Well, issued only when CONFIG_DYNAMIC_DEBUG is enabled - otherwise
pr_debug() calls get optimized away if DEBUG is not defined in the
compilation unit.
Therefore, make the above message pr_info() so that it is visible in
dmesg.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200525104218.27018-1-bp@alien8.de
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Distributed GRU mode appeared in only one generation of UV hardware,
and no version of the BIOS has shipped with this feature enabled, and
we have no plans to ever change that. The gru.s3.mode check has
always been and will continue to be false. So remove this dead code.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dimitri Sivanich <sivanich@hpe.com>
Link: https://lkml.kernel.org/r/20200513221123.GJ3240@raspberrypi
Normally, show_trace_log_lvl() scans the stack, looking for text
addresses to print. In parallel, it unwinds the stack with
unwind_next_frame(). If the stack address matches the pointer returned
by unwind_get_return_address_ptr() for the current frame, the text
address is printed normally without a question mark. Otherwise it's
considered a breadcrumb (potentially from a previous call path) and it's
printed with a question mark to indicate that the address is unreliable
and typically can be ignored.
Since the following commit:
f1d9a2abff ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
... for inactive tasks, show_trace_log_lvl() prints *only* unreliable
addresses (prepended with '?').
That happens because, for the first frame of an inactive task,
unwind_get_return_address_ptr() returns the wrong return address
pointer: one word *below* the task stack pointer. show_trace_log_lvl()
starts scanning at the stack pointer itself, so it never finds the first
'reliable' address, causing only guesses to being printed.
The first frame of an inactive task isn't a normal stack frame. It's
actually just an instance of 'struct inactive_task_frame' which is left
behind by __switch_to_asm(). Now that this inactive frame is actually
exposed to callers, fix unwind_get_return_address_ptr() to interpret it
properly.
Fixes: f1d9a2abff ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200522135435.vbxs7umku5pyrdbk@treble
Add PCI IDs for AMD Renoir (4000-series Ryzen CPUs). This is necessary
to enable support for temperature sensors via the k10temp module.
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Yazen Ghannam <yazen.ghannam@amd.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lkml.kernel.org/r/20200510204842.2603-2-amonakov@ispras.ru
Changing base clock frequency directly impacts TSC Hz but not CPUID.16h
value. An overclocked CPU supporting CPUID.16h and with partial CPUID.15h
support will set TSC KHZ according to "best guess" given by CPUID.16h
relying on tsc_refine_calibration_work to give better numbers later.
tsc_refine_calibration_work will refuse to do its work when the outcome is
off the early TSC KHZ value by more than 1% which is certain to happen on
an overclocked system.
Fix this by adding a tsc_early_khz command line parameter that makes the
kernel skip early TSC calibration and use the given value instead.
This allows the user to provide the expected TSC frequency that is closer
to reality than the one reported by the hardware, enabling
tsc_refine_calibration_work to do meaningful error checking.
[ tglx: Made the variable __initdata as it's only used on init and
removed the error checking in the argument parser because
kstrto*() only stores to the variable if the string is valid ]
Signed-off-by: Krzysztof Piecuch <piecuch@protonmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/O2CpIOrqLZHgNRkfjRpz_LGqnc1ix_seNIiOCvHY4RHoulOVRo6kMXKuLOfBVTi0SMMevg6Go1uZ_cL9fLYtYdTRNH78ChaFaZyG3VAyYz8=@protonmail.com
RKL re-uses the same stolen memory registers as TGL and ICL.
Bspec: 52055
Bspec: 49589
Bspec: 49636
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Anusha Srivatsa <anusha.srivatsa@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200504225227.464666-3-matthew.d.roper@intel.com
The async page fault injection into kernel space creates more problems than
it solves. The host has absolutely no knowledge about the state of the
guest if the fault happens in CPL0. The only restriction for the host is
interrupt disabled state. If interrupts are enabled in the guest then the
exception can hit arbitrary code. The HALT based wait in non-preemotible
code is a hacky replacement for a proper hypercall.
For the ongoing work to restrict instrumentation and make the RCU idle
interaction well defined the required extra work for supporting async
pagefault in CPL0 is just not justified and creates complexity for a
dubious benefit.
The CPL3 injection is well defined and does not cause any issues as it is
more or less the same as a regular page fault from CPL3.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.369802541@linutronix.de
While working on the entry consolidation I stumbled over the KVM async page
fault handler and kvm_async_pf_task_wait() in particular. It took me a
while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
invocations are just cargo cult programming. Several patches "fixed" RCU
splats by curing the symptoms without noticing that the code is flawed
from a design perspective.
The main problem is that this async injection is not based on a proper
handshake mechanism and only respects the minimal requirement, i.e. the
guest is not in a state where it has interrupts disabled.
Aside of that the actual code is a convoluted one fits it all swiss army
knife. It is invoked from different places with different RCU constraints:
1) Host side:
vcpu_enter_guest()
kvm_x86_ops->handle_exit()
kvm_handle_page_fault()
kvm_async_pf_task_wait()
The invocation happens from fully preemptible context.
2) Guest side:
The async page fault interrupted:
a) user space
b) preemptible kernel code which is not in a RCU read side
critical section
c) non-preemtible kernel code or a RCU read side critical section
or kernel code with CONFIG_PREEMPTION=n which allows not to
differentiate between #2b and #2c.
RCU is watching for:
#1 The vCPU exited and current is definitely not the idle task
#2a The #PF entry code on the guest went through enter_from_user_mode()
which reactivates RCU
#2b There is no preemptible, interrupts enabled code in the kernel
which can run with RCU looking away. (The idle task is always
non preemptible).
I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
voodoo at all.
In #2c RCU is eventually not watching, but as that state cannot schedule
anyway there is no point to worry about it so it has to invoke
rcu_irq_enter() before running that code. This can be optimized, but this
will be done as an extra step in course of the entry code consolidation
work.
So the proper solution for this is to:
- Split kvm_async_pf_task_wait() into schedule and halt based waiting
interfaces which share the enqueueing code.
- Add comments (condensed form of this changelog) to spare others the
time waste and pain of reverse engineering all of this with the help of
uncomprehensible changelogs and code history.
- Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
user mode and schedulable kernel side async page faults (#1, #2a, #2b)
- Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
case (#2c).
For this case also remove the rcu_irq_exit()/enter() pair around the
halt as it is just a pointless exercise:
- vCPUs can VMEXIT at any random point and can be scheduled out for
an arbitrary amount of time by the host and this is not any
different except that it voluntary triggers the exit via halt.
- The interrupted context could have RCU watching already. So the
rcu_irq_exit() before the halt is not gaining anything aside of
confusing the reader. Claiming that this might prevent RCU stalls
is just an illusion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
KVM overloads #PF to indicate two types of not-actually-page-fault
events. Right now, the KVM guest code intercepts them by modifying
the IDT and hooking the #PF vector. This makes the already fragile
fault code even harder to understand, and it also pollutes call
traces with async_page_fault and do_async_page_fault for normal page
faults.
Clean it up by moving the logic into do_page_fault() using a static
branch. This gets rid of the platform trap_init override mechanism
completely.
[ tglx: Fixed up 32bit, removed error code from the async functions and
massaged coding style ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de
A few exceptions (like #DB and #BP) can happen at any location in the code,
this then means that tracers should treat events from these exceptions as
NMI-like. The interrupted context could be holding locks with interrupts
disabled for instance.
Similarly, #MC is an actual NMI-like exception.
All of them use ist_enter() which only concerns itself with RCU, but does
not do any of the other setup that NMIs need. This means things like:
printk()
raw_spin_lock_irq(&logbuf_lock);
<#DB/#BP/#MC>
printk()
raw_spin_lock_irq(&logbuf_lock);
are entirely possible (well, not really since printk tries hard to
play nice, but the concept stands).
So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter()
caller must be both notrace and NOKPROBE, or in the noinstr text section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de
Convert #MC over to using task_work_add(); it will run the same code
slightly later, on the return to user path of the same exception.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de
This is completely overengineered and definitely not an interface which
should be made available to anything else than this particular MCE case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7BzV8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg8EH/A2pXMTxtc96RI4S
sttEsUQqbakFS0Z/2tQPpMGr/qW2e5eHgsTX/a3SiUeZiIXk6f4lMFkMuctzBf7p
X77cNEDwGOEdbtCXTsMcmKSde7sP2zCXsPB8xTWLyE6rnaFRgikwwkeqgkIKhp1h
bvOQV0t9HNGvxGAM0iZeOvQAvFl4vd7nS123/MYbir9cugfQUSJRueQ4BiCiJqVE
6cNA7/vFzDJuFGszzIrJ7HXn/IdQMMWHkvTDjgBw0GZw1mDbGFbfbZwOeTz1ojCt
smUQ4tIFxBa/VA5zx7dOy2P2keHbSVf4VLkZRPcceT7OqVS65ETmFDp+qt5NdWM5
vZ8+7/0=
=CyYH
-----END PGP SIGNATURE-----
Merge tag 'v5.7-rc6' into objtool/core, to pick up fixes and resolve semantic conflict
Resolve structural conflict between:
59566b0b62: ("x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up")
which introduced a new reference to 'ftrace_epilogue', and:
0298739b79: ("x86,ftrace: Fix ftrace_regs_caller() unwind")
Which renamed it to 'ftrace_caller_end'. Rename the new usage site in the merge commit.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tells the unwinding code whether a stack trace can be trusted or not is
always set correctly. This was messed up by a couple of changes in the
recent past.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7BC+gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWFBEACR8MiO0VM2XXNsejd7rttgs/eoC4/M
IKM5K1hq4eRCTodwVnkWwLk6p0asAMKhzpWQ3MS5RJBNAYxLbbxnsYSGtd8zIsdV
wk6jbNYeT2MUZq2tYkjn3b9B6+91FFMZq6q+KDOfNPqcKZyP4n5o5QSewznBvQwt
dHvjGgegJDjrrtuhLSQKG/uvSSi2hN9S5ibSMCa004GnH6P+uk/eICpvUXwNCyjV
ygogYTmQQqAEqnlqVNdQxo+DFYbaxKCw12VSoBeOsEySljPdc136hP/j7Tzbf2em
rkqtyXwng1+yG0vozMCAkyP5l3uA+HUculQLdmO8/55eia5Dl/zgsp3SvW7/2ONS
0DRfGo0ghoZgId1oDu6DGPsX80wKKskerJpTN/tHWTXQWeUXCNXrX//lhrFiwd7P
mHiyuk+INw3LQBkTlf7XhAf28w/9/+gCm3prEGnUCmLaJOeZ8HtL0mwDzudgc9Ca
NW/b3tdt4JU3oXKyyqywr4XAYfxlfmyf3DrBMnuHdTgccaB9PAAzugjmDnFJOuzk
jQw/Qfd6w7ZgVcVoaNQjjeogMTryGthCOPe9DzPUgkr+jCDsMwXopCvxbhbWI9e5
L1/U5ilka/VC2ZP7qZUvwsltCgp6RamhDb3yLZbn/2PKf0sFKVoI/j/g1qMnLNZt
TBNjzYuWAC8Hlw==
=4kDr
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2020-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 stack unwinding fix from Thomas Gleixner:
"A single bugfix for the ORC unwinder to ensure that the error flag
which tells the unwinding code whether a stack trace can be trusted or
not is always set correctly.
This was messed up by a couple of changes in the recent past"
* tag 'objtool-urgent-2020-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix error handling in __unwind_start()
stack protector enabled.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7A+q4ACgkQEsHwGGHe
VUpvtA/+NNPKVGSKZPdDlUm64JEPy7XrbzFJ+zigWGQjUPtZsDkAT4U33eQIvV5f
ea7vB2u+e7iRZBExgTI1JfyjTenGpBffhubR/ueawtxeTgvZSopFajHQir/VGPlJ
KQdtqe2wZek3Wux8BsKl8vcbqhgNH/LKgQzoG2y5P1LuA77MpFkMVkAoxKqbTDbt
Nx7j147ffZBJHfmUHz2/nWD9r0Exu+abeSPJeO4T52ImhVkr+Pd1nFS8S+mRCHMj
uJjxL/nB/sZmDDX+EX/zA7Du3ibaVa2po9cuhMTwNIPZIpak8Yyopl64fVm/N7jH
w0DIc1CgEaA1IkG7lwyKSgB/T6Fsg4SQp8gM4V3BkcTgVDuhTH0J/kGrOk2+YFSc
akk3420XBS4Q54BQ547woOImabxgQXDBvqBq+DhJFwP1qSllUXbZX7rlwZ3VQ160
sfmItVM0c4J9bgaXqZuwqHxJdgakaIECkXWZwpksQAzVxaOKpZo7drLq6SDhX9HH
BZdm/5AhIJ5rIGaiMXsZj5cC+H341N5TlaXA+I2b0r/vVOLtbe3it1rbSsvMoZJQ
7WOesyqFSjSObDUpXZ0riLl1X+rdrCAfzHsm5IMwLAoxmv80973johZKNZIgqIoh
CbPdyvaJoNK8FK6gT7bw3HNJ1ILGqk53jpWH1Gr1MlfzSzErOdQ=
=5Xi5
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for early boot crashes of kernels built with gcc10 and
stack protector enabled"
* tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix early boot crash on gcc-10, third try
The signal return fast path directly restores user states from the user
buffer. Once that succeeds, restore supervisor states (but only when
they are not yet restored).
For the slow path, save supervisor states to preserve them across context
switches, and restore after the user states are restored.
The previous version has the overhead of an XSAVES in both the fast and the
slow paths. It is addressed as the following:
- In the fast path, only do an XRSTORS.
- In the slow path, do a supervisor-state-only XSAVES, and relocate the
buffer contents.
Some thoughts in the implementation:
- In the slow path, can any supervisor state become stale between
save/restore?
Answer: set_thread_flag(TIF_NEED_FPU_LOAD) protects the xstate buffer.
- In the slow path, can any code reference a stale supervisor state
register between save/restore?
Answer: In the current lazy-restore scheme, any reference to xstate
registers needs fpregs_lock()/fpregs_unlock() and __fpregs_load_activate().
- Are there other options?
One other option is eagerly restoring all supervisor states.
Currently, CET user-mode states and ENQCMD's PASID do not need to be
eagerly restored. The upcoming CET kernel-mode states (24 bytes) need
to be eagerly restored. To me, eagerly restoring all supervisor states
adds more overhead then benefit at this point.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-11-yu-cheng.yu@intel.com
The signal return code is responsible for taking an XSAVE buffer
present in user memory and loading it into the hardware registers. This
operation only affects user XSAVE state and never affects supervisor
state.
The fast path through this code simply points XRSTOR directly at the
user buffer. However, since user memory is not guaranteed to be always
mapped, this XRSTOR can fail. If it fails, the signal return code falls
back to a slow path which can tolerate page faults.
That slow path copies the xfeatures one by one out of the user buffer
into the task's fpu state area. However, by being in a context where it
can handle page faults, the code can also schedule.
The lazy-fpu-load code would think it has an up-to-date fpstate and
would fail to save the supervisor state when scheduling the task out.
When scheduling back in, it would likely restore stale supervisor state.
To fix that, preserve supervisor state before the slow path. Modify
copy_user_to_fpregs_zeroing() so that if it fails, fpregs are not zeroed,
and there is no need for fpregs_deactivate() and supervisor states are
preserved.
Move set_thread_flag(TIF_NEED_FPU_LOAD) to the slow path. Without doing
this, the fast path also needs supervisor states to be saved first.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200512145444.15483-10-yu-cheng.yu@intel.com
The XSAVES instruction takes a mask and saves only the features specified
in that mask. The kernel normally specifies that all features be saved.
XSAVES also unconditionally uses the "compacted format" which means that
all specified features are saved next to each other in memory. If a
feature is removed from the mask, all the features after it will "move
up" into earlier locations in the buffer.
Introduce copy_supervisor_to_kernel(), which saves only supervisor states
and then moves those states into the standard location where they are
normally found.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200512145444.15483-9-yu-cheng.yu@intel.com
Move the bpf verifier trace check into the new switch statement in
HEAD.
Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.
Signed-off-by: David S. Miller <davem@davemloft.net>
... or the odyssey of trying to disable the stack protector for the
function which generates the stack canary value.
The whole story started with Sergei reporting a boot crash with a kernel
built with gcc-10:
Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139
Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013
Call Trace:
dump_stack
panic
? start_secondary
__stack_chk_fail
start_secondary
secondary_startup_64
-—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary
This happens because gcc-10 tail-call optimizes the last function call
in start_secondary() - cpu_startup_entry() - and thus emits a stack
canary check which fails because the canary value changes after the
boot_init_stack_canary() call.
To fix that, the initial attempt was to mark the one function which
generates the stack canary with:
__attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
however, using the optimize attribute doesn't work cumulatively
as the attribute does not add to but rather replaces previously
supplied optimization options - roughly all -fxxx options.
The key one among them being -fno-omit-frame-pointer and thus leading to
not present frame pointer - frame pointer which the kernel needs.
The next attempt to prevent compilers from tail-call optimizing
the last function call cpu_startup_entry(), shy of carving out
start_secondary() into a separate compilation unit and building it with
-fno-stack-protector, was to add an empty asm("").
This current solution was short and sweet, and reportedly, is supported
by both compilers but we didn't get very far this time: future (LTO?)
optimization passes could potentially eliminate this, which leads us
to the third attempt: having an actual memory barrier there which the
compiler cannot ignore or move around etc.
That should hold for a long time, but hey we said that about the other
two solutions too so...
Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
The unwind_state 'error' field is used to inform the reliable unwinding
code that the stack trace can't be trusted. Set this field for all
errors in __unwind_start().
Also, move the zeroing out of the unwind_state struct to before the ORC
table initialization check, to prevent the caller from reading
uninitialized data if the ORC table is corrupted.
Fixes: af085d9084 ("stacktrace/x86: add function for detecting reliable stack traces")
Fixes: d3a0910401 ("x86/unwinder/orc: Dont bail on stack overflow")
Fixes: 98d0c8ebf7 ("x86/unwind/orc: Prevent unwinding before ORC initialization")
Reported-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/d6ac7215a84ca92b895fdd2e1aa546729417e6e6.1589487277.git.jpoimboe@redhat.com
- Fix a crash when having function tracing and function stack tracing on
the command line. The ftrace trampolines are created as executable and
read only. But the stack tracer tries to modify them with text_poke()
which expects all kernel text to still be writable at boot.
Keep the trampolines writable at boot, and convert them to read-only
with the rest of the kernel.
- A selftest was triggering in the ring buffer iterator code, that
is no longer valid with the update of keeping the ring buffer
writable while a iterator is reading. Just bail after three failed
attempts to get an event and remove the warning and disabling of the
ring buffer.
- While modifying the ring buffer code, decided to remove all the
unnecessary BUG() calls.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXr1CDhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsXcAQCoL229SBrtHsn4DUO7eAQRppUT3hNw
RuKzvQ56+1GccQEAh8VGCeg89uMSK6imrTujEl6VmOUdbgrD5R96yiKoGQw=
=vi+k
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing fixes from Steven Rostedt:
"Various tracing fixes:
- Fix a crash when having function tracing and function stack tracing
on the command line.
The ftrace trampolines are created as executable and read only. But
the stack tracer tries to modify them with text_poke() which
expects all kernel text to still be writable at boot. Keep the
trampolines writable at boot, and convert them to read-only with
the rest of the kernel.
- A selftest was triggering in the ring buffer iterator code, that is
no longer valid with the update of keeping the ring buffer writable
while a iterator is reading.
Just bail after three failed attempts to get an event and remove
the warning and disabling of the ring buffer.
- While modifying the ring buffer code, decided to remove all the
unnecessary BUG() calls"
* tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Remove all BUG() calls
ring-buffer: Don't deactivate the ring buffer on failed iterator reads
x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
The function sanitize_restored_xstate() sanitizes user xstates of an XSAVE
buffer by clearing bits not in the input 'xfeatures' from the buffer's
header->xfeatures, effectively resetting those features back to the init
state.
When supervisor xstates are introduced, it is necessary to make sure only
user xstates are sanitized. Ensure supervisor bits in header->xfeatures
stay set and supervisor states are not modified.
To make names clear, also:
- Rename the function to sanitize_restored_user_xstate().
- Rename input parameter 'xfeatures' to 'user_xfeatures'.
- In __fpu__restore_sig(), rename 'xfeatures' to 'user_xfeatures'.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-7-yu-cheng.yu@intel.com
Currently, fpu__clear() clears all fpregs and xstates. Once XSAVES
supervisor states are introduced, supervisor settings (e.g. CET xstates)
must remain active for signals; It is necessary to have separate functions:
- Create fpu__clear_user_states(): clear only user settings for signals;
- Create fpu__clear_all(): clear both user and supervisor settings in
flush_thread().
Also modify copy_init_fpstate_to_fpregs() to take a mask from above two
functions.
Remove obvious side-comment in fpu__clear(), while at it.
[ bp: Make the second argument of fpu__clear() bool after requesting it
a bunch of times during review.
- Add a comment about copy_init_fpstate_to_fpregs() locking needs. ]
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-6-yu-cheng.yu@intel.com
Enable XSAVES supervisor states by setting MSR_IA32_XSS bits according
to CPUID enumeration results. Also revise comments at various places.
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-5-yu-cheng.yu@intel.com
Before the introduction of XSAVES supervisor states, 'xfeatures_mask' is
used at various places to determine XSAVE buffer components and XCR0 bits.
It contains only user xstates. To support supervisor xstates, it is
necessary to separate user and supervisor xstates:
- First, change 'xfeatures_mask' to 'xfeatures_mask_all', which represents
the full set of bits that should ever be set in a kernel XSAVE buffer.
- Introduce xfeatures_mask_supervisor() and xfeatures_mask_user() to
extract relevant xfeatures from xfeatures_mask_all.
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-4-yu-cheng.yu@intel.com
Booting one of my machines, it triggered the following crash:
Kernel/User page tables isolation: enabled
ftrace: allocating 36577 entries in 143 pages
Starting tracer 'function'
BUG: unable to handle page fault for address: ffffffffa000005c
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
Oops: 0003 [#1] PREEMPT SMP PTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
RIP: 0010:text_poke_early+0x4a/0x58
Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff
0 41 57 49
RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
FS: 0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
Call Trace:
text_poke_bp+0x27/0x64
? mutex_lock+0x36/0x5d
arch_ftrace_update_trampoline+0x287/0x2d5
? ftrace_replace_code+0x14b/0x160
? ftrace_update_ftrace_func+0x65/0x6c
__register_ftrace_function+0x6d/0x81
ftrace_startup+0x23/0xc1
register_ftrace_function+0x20/0x37
func_set_flag+0x59/0x77
__set_tracer_option.isra.19+0x20/0x3e
trace_set_options+0xd6/0x13e
apply_trace_boot_options+0x44/0x6d
register_tracer+0x19e/0x1ac
early_trace_init+0x21b/0x2c9
start_kernel+0x241/0x518
? load_ucode_intel_bsp+0x21/0x52
secondary_startup_64+0xa4/0xb0
I was able to trigger it on other machines, when I added to the kernel
command line of both "ftrace=function" and "trace_options=func_stack_trace".
The cause is the "ftrace=function" would register the function tracer
and create a trampoline, and it will set it as executable and
read-only. Then the "trace_options=func_stack_trace" would then update
the same trampoline to include the stack tracer version of the function
tracer. But since the trampoline already exists, it updates it with
text_poke_bp(). The problem is that text_poke_bp() called while
system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not
the page mapping, as it would think that the text is still read-write.
But in this case it is not, and we take a fault and crash.
Instead, lets keep the ftrace trampolines read-write during boot up,
and then when the kernel executable text is set to read-only, the
ftrace trampolines get set to read-only as well.
Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable@vger.kernel.org
Fixes: 768ae4406a ("x86/ftrace: Use text_poke()")
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
XCNTXT_MASK is 'all supported xfeatures' before introducing supervisor
xstates. Rename it to XFEATURE_MASK_USER_SUPPORTED to make clear that
these are user xstates.
Replace XFEATURE_MASK_SUPERVISOR with the following:
- XFEATURE_MASK_SUPERVISOR_SUPPORTED: Currently nothing. ENQCMD and
Control-flow Enforcement Technology (CET) will be introduced in separate
series.
- XFEATURE_MASK_SUPERVISOR_UNSUPPORTED: Currently only Processor Trace.
- XFEATURE_MASK_SUPERVISOR_ALL: the combination of above.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-3-yu-cheng.yu@intel.com
The function validate_xstate_header() validates an xstate header coming
from userspace (PTRACE or sigreturn). To make it clear, rename it to
validate_user_xstate_header().
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200512145444.15483-2-yu-cheng.yu@intel.com
- Ensure that direct mapping alias is always flushed when changing page
attributes. The optimization for small ranges failed to do so when
the virtual address was in the vmalloc or module space.
- Unbreak the trace event registration for syscalls without arguments
caused by the refactoring of the SYSCALL_DEFINE0() macro.
- Move the printk in the TSC deadline timer code to a place where it is
guaranteed to only be called once during boot and cannot be rearmed by
clearing warn_once after boot. If it's invoked post boot then lockdep
rightfully complains about a potential deadlock as the calling context
is different.
- A series of fixes for objtool and the ORC unwinder addressing variety
of small issues:
Stack offset tracking for indirect CFAs in objtool ignored subsequent
pushs and pops
Repair the unwind hints in the register clearing entry ASM code
Make the unwinding in the low level exit to usermode code stop after
switching to the trampoline stack. The unwind hint is not longer valid
and the ORC unwinder emits a warning as it can't find the registers
anymore.
Fix the unwind hints in switch_to_asm() and rewind_stack_do_exit()
which caused objtool to generate bogus ORC data.
Prevent unwinder warnings when dumping the stack of a non-current
task as there is no way to be sure about the validity because the
dumped stack can be a moving target.
Make the ORC unwinder behave the same way as the frame pointer
unwinder when dumping an inactive tasks stack and do not skip the
first frame.
Prevent ORC unwinding before ORC data has been initialized
Immediately terminate unwinding when a unknown ORC entry type is
found.
Prevent premature stop of the unwinder caused by IRET frames.
Fix another infinite loop in objtool caused by a negative offset which
was not catched.
Address a few build warnings in the ORC unwinder and add missing
static/ro_after_init annotations
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6363QTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRJHD/4hWjzJLsUZ9xq2NrzhevoeJtxj+wVM
66x9NM3mlFQ30BN4Aye4EnNEhR0iIvNPWWdfEmaJYfPHPwnUjjcOa426HYxP/WXA
DWd5F20wGaaPOJ65LJpy/+pfcxAeQynt4I2cDEWHAplswfOWV/Hv8mSeKAKuq400
lCWaTMkWcO/toexSNn8PVyWi9rHlm+76E1bHkVwuoekGBGt1VloKGlK6OPyElzL2
w9VtrjSLlYQ0MdfCJKQeg44XQPMbf4hZRfc88x9SwDWB01q7aSvb0pWNl9AJKNXA
7fFu5T4F4PABPgRM7eJ5yNk0De9jM1y+6eCp66f9UXoNOeSr7Boz9Xc4xWqAraIi
9Dtx3WliO9CAxwUiD+Cj2iJO5o83AdRK/xhCth2VRnYMS6imfSidEqTC+LhEtkzw
Yplu7sbrWQDa5JTh8vk60clDvbkU+pfdxJisY+KClRguWfQfR6MJNuQnE0NHr7cH
H4VXFFHEE6tDdJneQ9RxA4iF20RTgSlJGK0YlsH6QsxPsRgoHVkGUao8fQhrNvRc
MIdpm9YasWStjJ7ZXbDeStmnLFN3DCj1RC8wmvJ4i/R1sPnBvPvRUt4Lm988a951
Vyr23VIcVrE7zykiqQZVH7bvIv6ULORqTJbIOF1rO/aIut4W8z0ojoVXC0Z7CiwF
S5SGj+hlWciIew==
=0rCi
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Ensure that direct mapping alias is always flushed when changing
page attributes. The optimization for small ranges failed to do so
when the virtual address was in the vmalloc or module space.
- Unbreak the trace event registration for syscalls without arguments
caused by the refactoring of the SYSCALL_DEFINE0() macro.
- Move the printk in the TSC deadline timer code to a place where it
is guaranteed to only be called once during boot and cannot be
rearmed by clearing warn_once after boot. If it's invoked post boot
then lockdep rightfully complains about a potential deadlock as the
calling context is different.
- A series of fixes for objtool and the ORC unwinder addressing
variety of small issues:
- Stack offset tracking for indirect CFAs in objtool ignored
subsequent pushs and pops
- Repair the unwind hints in the register clearing entry ASM code
- Make the unwinding in the low level exit to usermode code stop
after switching to the trampoline stack. The unwind hint is no
longer valid and the ORC unwinder emits a warning as it can't
find the registers anymore.
- Fix unwind hints in switch_to_asm() and rewind_stack_do_exit()
which caused objtool to generate bogus ORC data.
- Prevent unwinder warnings when dumping the stack of a
non-current task as there is no way to be sure about the
validity because the dumped stack can be a moving target.
- Make the ORC unwinder behave the same way as the frame pointer
unwinder when dumping an inactive tasks stack and do not skip
the first frame.
- Prevent ORC unwinding before ORC data has been initialized
- Immediately terminate unwinding when a unknown ORC entry type
is found.
- Prevent premature stop of the unwinder caused by IRET frames.
- Fix another infinite loop in objtool caused by a negative
offset which was not catched.
- Address a few build warnings in the ORC unwinder and add
missing static/ro_after_init annotations"
* tag 'x86-urgent-2020-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Move ORC sorting variables under !CONFIG_MODULES
x86/apic: Move TSC deadline timer debug printk
ftrace/x86: Fix trace event registration for syscalls without arguments
x86/mm/cpa: Flush direct map alias during cpa
objtool: Fix infinite loop in for_offset_range()
x86/unwind/orc: Fix premature unwind stoppage due to IRET frames
x86/unwind/orc: Fix error path for bad ORC entry type
x86/unwind/orc: Prevent unwinding before ORC initialization
x86/unwind/orc: Don't skip the first frame for inactive tasks
x86/unwind: Prevent false warnings for non-current tasks
x86/unwind/orc: Convert global variables to static
x86/entry/64: Fix unwind hints in rewind_stack_do_exit()
x86/entry/64: Fix unwind hints in __switch_to_asm()
x86/entry/64: Fix unwind hints in kernel exit path
x86/entry/64: Fix unwind hints in register clearing code
objtool: Fix stack offset tracking for indirect CFAs
Now that the livepatch code no longer needs the text_mutex for changing
module permissions, move its usage down to apply_relocate_add().
Note the s390 version of apply_relocate_add() doesn't need to use the
text_mutex because it already uses s390_kernel_write_lock, which
accomplishes the same task.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Because of late module patching, a livepatch module needs to be able to
apply some of its relocations well after it has been loaded. Instead of
playing games with module_{dis,en}able_ro(), use existing text poking
mechanisms to apply relocations after module loading.
So far only x86, s390 and Power have HAVE_LIVEPATCH but only the first
two also have STRICT_MODULE_RWX.
This will allow removal of the last module_disable_ro() usage in
livepatch. The ultimate goal is to completely disallow making
executable mappings writable.
[ jpoimboe: Split up patches. Use mod state to determine whether
memcpy() can be used. Implement text_poke() for UML. ]
Cc: x86@kernel.org
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
After the previous patch, vmlinux-specific KLP relocations are now
applied early during KLP module load. This means that .klp.arch
sections are no longer needed for *vmlinux-specific* KLP relocations.
One might think they're still needed for *module-specific* KLP
relocations. If a to-be-patched module is loaded *after* its
corresponding KLP module is loaded, any corresponding KLP relocations
will be delayed until the to-be-patched module is loaded. If any
special sections (.parainstructions, for example) rely on those
relocations, their initializations (apply_paravirt) need to be done
afterwards. Thus the apparent need for arch_klp_init_object_loaded()
and its corresponding .klp.arch sections -- it allows some of the
special section initializations to be done at a later time.
But... if you look closer, that dependency between the special sections
and the module-specific KLP relocations doesn't actually exist in
reality. Looking at the contents of the .altinstructions and
.parainstructions sections, there's not a realistic scenario in which a
KLP module's .altinstructions or .parainstructions section needs to
access a symbol in a to-be-patched module. It might need to access a
local symbol or even a vmlinux symbol; but not another module's symbol.
When a special section needs to reference a local or vmlinux symbol, a
normal rela can be used instead of a KLP rela.
Since the special section initializations don't actually have any real
dependency on module-specific KLP relocations, .klp.arch and
arch_klp_init_object_loaded() no longer have a reason to exist. So
remove them.
As Peter said much more succinctly:
So the reason for .klp.arch was that .klp.rela.* stuff would overwrite
paravirt instructions. If that happens you're doing it wrong. Those
RELAs are core kernel, not module, and thus should've happened in
.rela.* sections at patch-module loading time.
Reverting this removes the two apply_{paravirt,alternatives}() calls
from the late patching path, and means we don't have to worry about
them when removing module_disable_ro().
[ jpoimboe: Rewrote patch description. Tweaked klp_init_object_loaded()
error path. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Commit
21b5ee59ef ("x86/cpu/amd: Enable the fixed Instructions Retired
counter IRPERF")
mistakenly added erratum #1054 as an OS Visible Workaround (OSVW) ID 0.
Erratum #1054 is not OSVW ID 0 [1], so make it a legacy erratum.
There would never have been a false positive on older hardware that
has OSVW bit 0 set, since the IRPERF feature was not available.
However, save a couple of RDMSR executions per thread, on modern
system configurations that correctly set non-zero values in their
OSVW_ID_Length MSRs.
[1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors. The
revision guide is available from the bugzilla link below.
Fixes: 21b5ee59ef ("x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200417143356.26054-1-kim.phillips@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
TPAUSE instructs the processor to enter an implementation-dependent
optimized state. The instruction execution wakes up when the time-stamp
counter reaches or exceeds the implicit EDX:EAX 64-bit input value.
The instruction execution also wakes up due to the expiration of
the operating system time-limit or by an external interrupt
or exceptions such as a debug exception or a machine check exception.
TPAUSE offers a choice of two lower power states:
1. Light-weight power/performance optimized state C0.1
2. Improved power/performance optimized state C0.2
This way, it can save power with low wake-up latency in comparison to
spinloop based delay. The selection between the two is governed by the
input register.
TPAUSE is available on processors with X86_FEATURE_WAITPKG.
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1587757076-30337-4-git-send-email-kyung.min.park@intel.com
Neither this functions nor the helpers used to implement it are used
anywhere in the kernel tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Not-acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Russ Anderson <rja@hpe.com>
Link: https://lkml.kernel.org/r/20200504171527.2845224-10-hch@lst.de
Merge two helpers only used by uv_send_IPI_one() into the main function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Not-acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Russ Anderson <rja@hpe.com>
Link: https://lkml.kernel.org/r/20200504171527.2845224-9-hch@lst.de
This variable is only used inside x2apic_uv_x and not even declared
in a header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Not-acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Russ Anderson <rja@hpe.com>
Link: https://lkml.kernel.org/r/20200504171527.2845224-8-hch@lst.de
is_uv_hubless() is only used in x2apic_uv_x.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Not-acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Russ Anderson <rja@hpe.com>
Link: https://lkml.kernel.org/r/20200504171527.2845224-7-hch@lst.de
The single user could have called freeze_secondary_cpus() directly.
Since this function was a source of confusion, remove it as it's
just a pointless wrapper.
While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to
preserve the naming symmetry.
Done automatically via:
git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
... and get rid of the function pointers which would spit out the
microcode revision based on the CPU stepping.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mark Gross <mgross.linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506071516.25445-4-bp@alien8.de
The original Memory Bandwidth Monitoring (MBM) architectural
definition defines counters of up to 62 bits in the
IA32_QM_CTR MSR while the first-generation MBM implementation
uses statically defined 24 bit counters.
The MBM CPUID enumeration properties have been expanded to include
the MBM counter width, encoded as an offset from 24 bits.
While eight bits are available for the counter width offset IA32_QM_CTR
MSR only supports 62 bit counters. Add a sanity check, with warning
printed when encountered, to ensure counters cannot exceed the 62 bit
limit.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/69d52abd5b14794d3a0f05ba7c755ed1f4c0d5ed.1588715690.git.reinette.chatre@intel.com
The original Memory Bandwidth Monitoring (MBM) architectural
definition defines counters of up to 62 bits in the
IA32_QM_CTR MSR while the first-generation MBM implementation
uses statically defined 24 bit counters.
Expand the MBM CPUID enumeration properties to include the MBM
counter width. The previously undefined EAX output register contains,
in bits [7:0], the MBM counter width encoded as an offset from
24 bits. Enumerating this property is only specified for Intel
CPUs.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/afa3af2f753f6bc301fb743bc8944e749cb24afa.1588715690.git.reinette.chatre@intel.com
The original Memory Bandwidth Monitoring (MBM) architectural
definition defines counters of up to 62 bits in the IA32_QM_CTR MSR,
and the first-generation MBM implementation uses 24 bit counters.
Software is required to poll at 1 second or faster to ensure that
data is retrieved before a counter rollover occurs more than once
under worst conditions.
As system bandwidths scale the software requirement is maintained with
the introduction of a per-resource enumerable MBM counter width.
In preparation for supporting hardware with an enumerable MBM counter
width the current globally static MBM counter width is moved to a
per-resource MBM counter width. Currently initialized to 24 always
to result in no functional change.
In essence there is one function, mbm_overflow_count() that needs to
know the counter width to handle rollovers. The static value
used within mbm_overflow_count() will be replaced with a value
discovered from the hardware. Support for learning the MBM counter
width from hardware is added in the change that follows.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/e36743b9800f16ce600f86b89127391f61261f23.1588715690.git.reinette.chatre@intel.com
Cache and memory bandwidth monitoring are features that are part of
x86 CPU resource control that is supported by the resctrl subsystem.
The monitoring properties are obtained via CPUID from every CPU
and only used within the resctrl subsystem where the properties are
only read from boot_cpu_data.
Obtain the monitoring properties once, placed in boot_cpu_data, via the
->c_bsp_init() helpers of the vendors that support X86_FEATURE_CQM_LLC.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/6d74a6ac3e69f4b7a8b4115835f9455faf0f468d.1588715690.git.reinette.chatre@intel.com
The cache and memory bandwidth monitoring properties are read using
CPUID on every CPU. After the information is read from the system a
sanity check is run to
(1) ensure that the RMID data is initialized for the boot CPU in case
the information was not available on the boot CPU and
(2) the boot CPU's RMID is set to the minimum of RMID obtained
from all CPUs.
Every known platform that supports resctrl has the same maximum RMID
on all CPUs. Both sanity checks found in x86_init_cache_qos() can thus
safely be removed.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/c9a3b60d34091840c8b0bd1c6fab15e5ba92cb17.1588715690.git.reinette.chatre@intel.com
The function determining a platform's support and properties of cache
occupancy and memory bandwidth monitoring (properties of
X86_FEATURE_CQM_LLC) can be found among the common CPU code. After
the feature's properties is populated in the per-CPU data the resctrl
subsystem is the only consumer (via boot_cpu_data).
Move the function that obtains the CPU information used by resctrl to
the resctrl subsystem and rename it from init_cqm() to
resctrl_cpu_detect(). The function continues to be called from the
common CPU code. This move is done in preparation of the addition of some
vendor specific code.
No functional change.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/38433b99f9d16c8f4ee796f8cc42b871531fa203.1588715690.git.reinette.chatre@intel.com
asm/resctrl_sched.h is dedicated to the code used for configuration
of the CPU resource control state when a task is scheduled.
Rename resctrl_sched.h to resctrl.h in preparation of additions that
will no longer make this file dedicated to work done during scheduling.
No functional change.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/6914e0ef880b539a82a6d889f9423496d471ad1d.1588715690.git.reinette.chatre@intel.com
Factor out a copy_siginfo_to_external32 helper from
copy_siginfo_to_user32 that fills out the compat_siginfo, but does so
on a kernel space data structure. With that we can let architectures
override copy_siginfo_to_user32 with their own implementations using
copy_siginfo_to_external32. That allows moving the x32 SIGCHLD purely
to x86 architecture code.
As a nice side effect copy_siginfo_to_external32 also comes in handy
for avoiding a set_fs() call in the coredump code later on.
Contains improvements from Eric W. Biederman <ebiederm@xmission.com>
and Arnd Bergmann <arnd@arndb.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A 32-bit version of mcelog issuing ioctls on /dev/mcelog causes errors
like the following:
MCE_GET_RECORD_LEN: Inappropriate ioctl for device
This is due to a missing compat_ioctl callback.
Assign to it compat_ptr_ioctl() as a generic implementation of the
.compat_ioctl file operation to ioctl functions that either ignore the
argument or pass a pointer to a compatible data type.
[ bp: Massage commit message. ]
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1583303947-49858-1-git-send-email-zhe.he@windriver.com
Fix the following warnings seen with !CONFIG_MODULES:
arch/x86/kernel/unwind_orc.c:29:26: warning: 'cur_orc_table' defined but not used [-Wunused-variable]
29 | static struct orc_entry *cur_orc_table = __start_orc_unwind;
| ^~~~~~~~~~~~~
arch/x86/kernel/unwind_orc.c:28:13: warning: 'cur_orc_ip_table' defined but not used [-Wunused-variable]
28 | static int *cur_orc_ip_table = __start_orc_unwind_ip;
| ^~~~~~~~~~~~~~~~
Fixes: 153eb2223c ("x86/unwind/orc: Convert global variables to static")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linux Next Mailing List <linux-next@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200428071640.psn5m7eh3zt2in4v@treble
Leon reported that the printk_once() in __setup_APIC_LVTT() triggers a
lockdep splat due to a lock order violation between hrtimer_base::lock and
console_sem, when the 'once' condition is reset via
/sys/kernel/debug/clear_warn_once after boot.
The initial printk cannot trigger this because that happens during boot
when the local APIC timer is set up on the boot CPU.
Prevent it by moving the printk to a place which is guaranteed to be only
called once during boot.
Mark the deadline timer check related functions and data __init while at
it.
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87y2qhoshi.fsf@nanos.tec.linutronix.de
Zhaoxin CPU has provided facilities for monitoring performance
via PMU (Performance Monitor Unit), but the functionality is unused so far.
Therefore, add support for zhaoxin pmu to make performance related
hardware events available.
The PMU is mostly an Intel Architectural PerfMon-v2 with a novel
errata for the ZXC line. It supports the following events:
-----------------------------------------------------------------------------------------------------------------------------------
Event | Event | Umask | Description
| Select | |
-----------------------------------------------------------------------------------------------------------------------------------
cpu-cycles | 82h | 00h | unhalt core clock
instructions | 00h | 00h | number of instructions at retirement.
cache-references | 15h | 05h | number of fillq pushs at the current cycle.
cache-misses | 1ah | 05h | number of l2 miss pushed by fillq.
branch-instructions | 28h | 00h | counts the number of branch instructions retired.
branch-misses | 29h | 00h | mispredicted branch instructions at retirement.
bus-cycles | 83h | 00h | unhalt bus clock
stalled-cycles-frontend | 01h | 01h | Increments each cycle the # of Uops issued by the RAT to RS.
stalled-cycles-backend | 0fh | 04h | RS0/1/2/3/45 empty
L1-dcache-loads | 68h | 05h | number of retire/commit load.
L1-dcache-load-misses | 4bh | 05h | retired load uops whose data source followed an L1 miss.
L1-dcache-stores | 69h | 06h | number of retire/commit Store,no LEA
L1-dcache-store-misses | 62h | 05h | cache lines in M state evicted out of L1D due to Snoop HitM or dirty line replacement.
L1-icache-loads | 00h | 03h | number of l1i cache access for valid normal fetch,including un-cacheable access.
L1-icache-load-misses | 01h | 03h | number of l1i cache miss for valid normal fetch,including un-cacheable miss.
L1-icache-prefetches | 0ah | 03h | number of prefetch.
L1-icache-prefetch-misses | 0bh | 03h | number of prefetch miss.
dTLB-loads | 68h | 05h | number of retire/commit load
dTLB-load-misses | 2ch | 05h | number of load operations miss all level tlbs and cause a tablewalk.
dTLB-stores | 69h | 06h | number of retire/commit Store,no LEA
dTLB-store-misses | 30h | 05h | number of store operations miss all level tlbs and cause a tablewalk.
dTLB-prefetches | 64h | 05h | number of hardware pte prefetch requests dispatched out of the prefetch FIFO.
dTLB-prefetch-misses | 65h | 05h | number of hardware pte prefetch requests miss the l1d data cache.
iTLB-load | 00h | 00h | actually counter instructions.
iTLB-load-misses | 34h | 05h | number of code operations miss all level tlbs and cause a tablewalk.
-----------------------------------------------------------------------------------------------------------------------------------
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: CodyYao-oc <CodyYao-oc@zhaoxin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1586747669-4827-1-git-send-email-CodyYao-oc@zhaoxin.com
In order to change the {JMP,CALL}_NOSPEC macros to call out-of-line
versions of the retpoline magic, we need to remove the '%' from the
argument, such that we can paste it onto symbol names.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200428191700.151623523@infradead.org
As agreed with Boris, merge in the 'x86/asm' branch from -tip so that we
can select the new 'ARCH_USE_SYM_ANNOTATIONS' Kconfig symbol, which is
required by the BTI kernel patches.
* 'x86/asm' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Provide a Kconfig symbol for disabling old assembly annotations
x86/32: Remove CONFIG_DOUBLEFAULT
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
This structure is only really used in tboot.c. The only exception
is a single tboot_enabled check, but for that we don't need an inline
function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200428051703.1625952-1-hch@lst.de
Add the initrdmem option:
initrdmem=ss[KMG],nn[KMG]
which is used to specify the physical address of the initrd, almost
always an address in FLASH. Also add code for x86 to use the existing
phys_init_start and phys_init_size variables in the kernel.
This is useful in cases where a kernel and an initrd is placed in FLASH,
but there is no firmware file system structure in the FLASH.
One such situation occurs when unused FLASH space on UEFI systems has
been reclaimed by, e.g., taking it from the Management Engine. For
example, on many systems, the ME is given half the FLASH part; not only
is 2.75M of an 8M part unused; but 10.75M of a 16M part is unused. This
space can be used to contain an initrd, but need to tell Linux where it
is.
This space is "raw": due to, e.g., UEFI limitations: it can not be added
to UEFI firmware volumes without rebuilding UEFI from source or writing
a UEFI device driver. It can be referenced only as a physical address
and size.
At the same time, if a kernel can be "netbooted" or loaded from GRUB or
syslinux, the option of not using the physical address specification
should be available.
Then, it is easy to boot the kernel and provide an initrd; or boot the
the kernel and let it use the initrd in FLASH. In practice, this has
proven to be very helpful when integrating Linux into FLASH on x86.
Hence, the most flexible and convenient path is to enable the initrdmem
command line option in a way that it is the last choice tried.
For example, on the DigitalLoggers Atomic Pi, an image into FLASH can be
burnt in with a built-in command line which includes:
initrdmem=0xff968000,0x200000
which specifies a location and size.
[ bp: Massage commit message, make it passive. ]
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: http://lkml.kernel.org/r/CAP6exYLK11rhreX=6QPyDQmW7wPHsKNEFtXE47pjx41xS6O7-A@mail.gmail.com
Link: https://lkml.kernel.org/r/20200426011021.1cskg0AGd%akpm@linux-foundation.org
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Modules have no business poking into this but fixing this is for later.
[ bp: Carve out from an earlier patch. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200421092558.939985695@linutronix.de
cpu_tlbstate is exported because various TLB-related functions need access
to it, but cpu_tlbstate is sensitive information which should only be
accessed by well-contained kernel functions and not be directly exposed to
modules.
As a third step, move _flush_tlb_one_user() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.
Consolidate the name space while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.428213098@linutronix.de
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
As a second step, move __flush_tlb_global() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.
Consolidate the namespace while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.336916818@linutronix.de
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
As a first step, move __flush_tlb() out of line and hide the native
function. The latter can be static when CONFIG_PARAVIRT is disabled.
Consolidate the namespace while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.246130908@linutronix.de
The following execution path is possible:
fsnotify()
[ realign the stack and store previous SP in R10 ]
<IRQ>
[ only IRET regs saved ]
common_interrupt()
interrupt_entry()
<NMI>
[ full pt_regs saved ]
...
[ unwind stack ]
When the unwinder goes through the NMI and the IRQ on the stack, and
then sees fsnotify(), it doesn't have access to the value of R10,
because it only has the five IRET registers. So the unwind stops
prematurely.
However, because the interrupt_entry() code is careful not to clobber
R10 before saving the full regs, the unwinder should be able to read R10
from the previously saved full pt_regs associated with the NMI.
Handle this case properly. When encountering an IRET regs frame
immediately after a full pt_regs frame, use the pt_regs as a backup
which can be used to get the C register values.
Also, note that a call frame resets the 'prev_regs' value, because a
function is free to clobber the registers. For this fix to work, the
IRET and full regs frames must be adjacent, with no FUNC frames in
between. So replace the FUNC hint in interrupt_entry() with an
IRET_REGS hint.
Fixes: ee9f8fce99 ("x86/unwind: Add the ORC unwinder")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Jones <dsj@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lore.kernel.org/r/97a408167cc09f1cfa0de31a7b70dd88868d743f.1587808742.git.jpoimboe@redhat.com
If the ORC entry type is unknown, nothing else can be done other than
reporting an error. Exit the function instead of breaking out of the
switch statement.
Fixes: ee9f8fce99 ("x86/unwind: Add the ORC unwinder")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Jones <dsj@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lore.kernel.org/r/a7fa668ca6eabbe81ab18b2424f15adbbfdc810a.1587808742.git.jpoimboe@redhat.com
If the unwinder is called before the ORC data has been initialized,
orc_find() returns NULL, and it tries to fall back to using frame
pointers. This can cause some unexpected warnings during boot.
Move the 'orc_init' check from orc_find() to __unwind_init(), so that it
doesn't even try to unwind from an uninitialized state.
Fixes: ee9f8fce99 ("x86/unwind: Add the ORC unwinder")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Jones <dsj@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lore.kernel.org/r/069d1499ad606d85532eb32ce39b2441679667d5.1587808742.git.jpoimboe@redhat.com
When unwinding an inactive task, the ORC unwinder skips the first frame
by default. If both the 'regs' and 'first_frame' parameters of
unwind_start() are NULL, 'state->sp' and 'first_frame' are later
initialized to the same value for an inactive task. Given there is a
"less than or equal to" comparison used at the end of __unwind_start()
for skipping stack frames, the first frame is skipped.
Drop the equal part of the comparison and make the behavior equivalent
to the frame pointer unwinder.
Fixes: ee9f8fce99 ("x86/unwind: Add the ORC unwinder")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Jones <dsj@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lore.kernel.org/r/7f08db872ab59e807016910acdbe82f744de7065.1587808742.git.jpoimboe@redhat.com
There's some daring kernel code out there which dumps the stack of
another task without first making sure the task is inactive. If the
task happens to be running while the unwinder is reading the stack,
unusual unwinder warnings can result.
There's no race-free way for the unwinder to know whether such a warning
is legitimate, so just disable unwinder warnings for all non-current
tasks.
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Jones <dsj@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lore.kernel.org/r/ec424a2aea1d461eb30cab48a28c6433de2ab784.1587808742.git.jpoimboe@redhat.com
These variables aren't used outside of unwind_orc.c, make them static.
Also annotate some of them with '__ro_after_init', as applicable.
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Jones <dsj@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lore.kernel.org/r/43ae310bf7822b9862e571f36ae3474cfde8f301.1587808742.git.jpoimboe@redhat.com
The only user of these inlines is the text poke code and this must not be
exposed to the world.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.139069561@linutronix.de
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
The various CR4 accessors require cpu_tlbstate as the CR4 shadow cache
is located there.
In preparation for unexporting cpu_tlbstate, create a builtin function
for manipulating CR4 and rework the various helpers to use it.
No functional change.
[ bp: push the export of native_write_cr4() only when CONFIG_LKTDM=m to
the last patch in the series. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092558.939985695@linutronix.de
Improve readability of the function intel_set_max_freq_ratio() by moving
the check for KNL CPUs there, together with checks for GLM and SKX.
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-5-ggherdovich@suse.cz
The static key arch_scale_freq_key only needs to be enabled once (at
boot). This change fixes a bug by which the key was enabled every time cpu0
is started, even as a secondary CPU during cpu hotplug. Secondary CPUs are
started from the idle thread: setting a static key from there means
acquiring a lock and may result in sleeping in the idle task, causing CPU
lockup.
Another consequence of this change is that init_counter_refs() is now
called on each CPU correctly; previously the function on_each_cpu() was
used, but it was called at boot when the only online cpu is cpu0.
[ggherdovich@suse.cz: Tested and wrote changelog]
Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-4-ggherdovich@suse.cz
If a CPU has less than 4 physical cores, MSR_TURBO_RATIO_LIMIT will
rightfully report that the 4C turbo ratio is zero. In such cases, use the
1C turbo ratio instead for frequency invariance calculations.
Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Reported-by: Like Xu <like.xu@linux.intel.com>
Reported-by: Neil Rickert <nwr10cst-oslnx@yahoo.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://lkml.kernel.org/r/20200416054745.740-3-ggherdovich@suse.cz
Some hypervisors such as VMWare ESXi 5.5 advertise support for
X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
clock frequency of the CPU (highest non-turbo frequency), producing a
division by zero when computing the ratio turbo_freq/base_freq necessary
for frequency invariant accounting.
It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
couldn't be done in principle (not that it would make a lot of sense in a
VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
that feature.
Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-2-ggherdovich@suse.cz
The return value from stop_machine() might not be consistent.
stop_machine_cpuslocked() returns:
- zero if all functions have returned 0.
- a non-zero value if at least one of the functions returned
a non-zero value.
There is no way to know if it is negative or positive. So make
__reload_late() return 0 on success or negative otherwise.
[ bp: Unify ret val check and touch up. ]
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1587497318-4438-1-git-send-email-mihai.carabas@oracle.com
'Optimize' ftrace_regs_caller. Instead of comparing against an
immediate, the more natural way to test for zero on x86 is: 'test
%r,%r'.
48 83 f8 00 cmp $0x0,%rax
74 49 je 226 <ftrace_regs_call+0xa3>
48 85 c0 test %rax,%rax
74 49 je 225 <ftrace_regs_call+0xa2>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200416115118.867411350@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a convenient macro for 'SS+8' called FRAME_SIZE. Use it to
clarify things.
(entry/calling.h calls this SIZEOF_PTREGS but we're using
asm/ptrace-abi.h)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200416115118.808485515@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ftrace_regs_caller() trampoline does something 'funny' when there
is a direct-caller present. In that case it stuffs the 'direct-caller'
address on the return stack and then exits the function. This then
results in 'returning' to the direct-caller with the exact registers
we came in with -- an indirect tail-call without using a register.
This however (rightfully) confuses objtool because the function shares
a few instruction in order to have a single exit path, but the stack
layout is different for them, depending through which path we came
there.
This is currently cludged by forcing the stack state to the non-direct
case, but this generates actively wrong (ORC) unwind information for
the direct case, leading to potential broken unwinds.
Fix this issue by fully separating the exit paths. This results in
having to poke a second RET into the trampoline copy, see
ftrace_regs_caller_ret.
This brings us to a second objtool problem, in order for it to
perceive the 'jmp ftrace_epilogue' as a function exit, it needs to be
recognised as a tail call. In order to make that happen,
ftrace_epilogue needs to be the start of an STT_FUNC, so re-arrange
code to make this so.
Finally, a third issue is that objtool requires functions to exit with
the same stack layout they started with, which is obviously violated
in the direct case, employ the new HINT_RET_OFFSET to tell objtool
this is an expected exception.
Together, this results in generating correct ORC unwind information
for the ftrace_regs_caller() function and it's trampoline copies.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200416115118.749606694@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SRBDS is an MDS-like speculative side channel that can leak bits from the
random number generator (RNG) across cores and threads. New microcode
serializes the processor access during the execution of RDRAND and
RDSEED. This ensures that the shared buffer is overwritten before it is
released for reuse.
While it is present on all affected CPU models, the microcode mitigation
is not needed on models that enumerate ARCH_CAPABILITIES[MDS_NO] in the
cases where TSX is not supported or has been disabled with TSX_CTRL.
The mitigation is activated by default on affected processors and it
increases latency for RDRAND and RDSEED instructions. Among other
effects this will reduce throughput from /dev/urandom.
* Enable administrator to configure the mitigation off when desired using
either mitigations=off or srbds=off.
* Export vulnerability status via sysfs
* Rename file-scoped macros to apply for non-whitelist table initializations.
[ bp: Massage,
- s/VULNBL_INTEL_STEPPING/VULNBL_INTEL_STEPPINGS/g,
- do not read arch cap MSR a second time in tsx_fused_off() - just pass it in,
- flip check in cpu_set_bug_bits() to save an indentation level,
- reflow comments.
jpoimboe: s/Mitigated/Mitigation/ in user-visible strings
tglx: Dropped the fused off magic for now
]
Signed-off-by: Mark Gross <mgross@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Intel uses the same family/model for several CPUs. Sometimes the
stepping must be checked to tell them apart.
On x86 there can be at most 16 steppings. Add a steppings bitmask to
x86_cpu_id and a X86_MATCH_VENDOR_FAMILY_MODEL_STEPPING_FEATURE macro
and support for matching against family/model/stepping.
[ bp: Massage. ]
Signed-off-by: Mark Gross <mgross@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
To make cpu_matches() reusable for other matching tables, have it take a
pointer to a x86_cpu_id table as an argument.
[ bp: Flip arguments order. ]
Signed-off-by: Mark Gross <mgross@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
objtool:
- Ignore the double UD2 which is emitted in BUG() when CONFIG_UBSAN_TRAP
is enabled.
- Support clang non-section symbols in objtool ORC dump
- Fix switch table detection in .text.unlikely
- Make the BP scratch register warning more robust.
x86:
- Increase microcode maximum patch size for AMD to cope with new CPUs
which have a larger patch size.
- Fix a crash in the resource control filesystem when the removal of the
default resource group is attempted.
- Preserve Code and Data Prioritization enabled state accross CPU
hotplug.
- Update split lock cpu matching to use the new X86_MATCH macros.
- Change the split lock enumeration as Intel finaly decided that the
IA32_CORE_CAPABILITIES bits are not architectural contrary to what
the SDM claims. !@#%$^!
- Add Tremont CPU models to the split lock detection cpu match.
- Add a missing static attribute to make sparse happy.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cWGsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYod2jD/4kZqz+nEzAvx8RC/7zfLr1S6mDYcLb
kqWEblLRfPofFNO3W/1Ri7xUs2VCyBcOJeG9JIugI8YV/b/5LY9j2nW30unXi84y
8DHLWgM7OG+EiNDMvdQwgnjNb9Pdl4F1e9yTTD6IRg0bHOjvtHVyq9bNg7f3iaED
ZE4X5Hh5u4qFK/jmcsTF5HA/wIjELdmT32F4RxceAlmvpa5SUGlOfVVo1cSZpCbx
XkrvUvEzyZhbzY+Gy1q3SHTt+fvzx1++LsnJD0Dyfe5Q47PA1Iy6Zo2+Epn3FnCu
XuQKLaiDhidpkPzTGULZUsubavXbrSEu5/yhFJHyUqMy5WNOmvXBN8eVC4j1I9Ga
tnt43s3AS8noz4qIb7bpoVgETFtoCfWfqwhtZmALPzrfutwxe2Ujtsi9FUca6HtA
T5dKuNwc8G+Q5ZiNi+rPjcV/QGGncZFwtwwRwUl/YKgQ2VgrTgfsPc431tfSl3Q8
hVQIOhQNHCKqe3uGhiCsI29pNMDXVijZcI8w2SSmxnPyrMRXD7bTfLWnPav7SGFO
aSSi9HWtghkU/MsmRgRcZc9PI5bNs6w5IkfQqfXjd/lJwea2yQg1cn1KdmGi3Q33
BNj9FudNMe4K8ITaNWiLdt5rYCDIvWEzmbwawAhevstbKrjVtrAYgNAjvgJEnXAt
mZwTu+Hpd6d+JA==
=raUm
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 and objtool fixes from Thomas Gleixner:
"A set of fixes for x86 and objtool:
objtool:
- Ignore the double UD2 which is emitted in BUG() when
CONFIG_UBSAN_TRAP is enabled.
- Support clang non-section symbols in objtool ORC dump
- Fix switch table detection in .text.unlikely
- Make the BP scratch register warning more robust.
x86:
- Increase microcode maximum patch size for AMD to cope with new CPUs
which have a larger patch size.
- Fix a crash in the resource control filesystem when the removal of
the default resource group is attempted.
- Preserve Code and Data Prioritization enabled state accross CPU
hotplug.
- Update split lock cpu matching to use the new X86_MATCH macros.
- Change the split lock enumeration as Intel finaly decided that the
IA32_CORE_CAPABILITIES bits are not architectural contrary to what
the SDM claims. !@#%$^!
- Add Tremont CPU models to the split lock detection cpu match.
- Add a missing static attribute to make sparse happy"
* tag 'x86-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/split_lock: Add Tremont family CPU models
x86/split_lock: Bits in IA32_CORE_CAPABILITIES are not architectural
x86/resctrl: Preserve CDP enable over CPU hotplug
x86/resctrl: Fix invalid attempt at removing the default resource group
x86/split_lock: Update to use X86_MATCH_INTEL_FAM6_MODEL()
x86/umip: Make umip_insns static
x86/microcode/AMD: Increase microcode PATCH_MAX_SIZE
objtool: Make BP scratch register warning more robust
objtool: Fix switch table detection in .text.unlikely
objtool: Support Clang non-section symbols in ORC generation
objtool: Support Clang non-section symbols in ORC dump
objtool: Fix CONFIG_UBSAN_TRAP unreachable warnings
Tremont CPUs support IA32_CORE_CAPABILITIES bits to indicate whether
specific SKUs have support for split lock detection.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200416205754.21177-4-tony.luck@intel.com
The Intel Software Developers' Manual erroneously listed bit 5 of the
IA32_CORE_CAPABILITIES register as an architectural feature. It is not.
Features enumerated by IA32_CORE_CAPABILITIES are model specific and
implementation details may vary in different cpu models. Thus it is only
safe to trust features after checking the CPU model.
Icelake client and server models are known to implement the split lock
detect feature even though they don't enumerate IA32_CORE_CAPABILITIES
[ tglx: Use switch() for readability and massage comments ]
Fixes: 6650cdd9a8 ("x86/split_lock: Enable split lock detection by kernel")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200416205754.21177-3-tony.luck@intel.com
After
1bd187de53 ("x86, intel-mid: remove Intel MID specific serial support")
the Intel MID header is not needed anymore.
After
69c1f396f2 ("efi/x86: Convert x86 EFI earlyprintk into generic earlycon implementation")
the EFI headers are not needed anymore.
Remove the respective includes.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200326175415.8618-1-andriy.shevchenko@linux.intel.com
Resctrl assumes that all CPUs are online when the filesystem is mounted,
and that CPUs remember their CDP-enabled state over CPU hotplug.
This goes wrong when resctrl's CDP-enabled state changes while all the
CPUs in a domain are offline.
When a domain comes online, enable (or disable!) CDP to match resctrl's
current setting.
Fixes: 5ff193fbde ("x86/intel_rdt: Add basic resctrl filesystem support")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200221162105.154163-1-james.morse@arm.com
The default resource group ("rdtgroup_default") is associated with the
root of the resctrl filesystem and should never be removed. New resource
groups can be created as subdirectories of the resctrl filesystem and
they can be removed from user space.
There exists a safeguard in the directory removal code
(rdtgroup_rmdir()) that ensures that only subdirectories can be removed
by testing that the directory to be removed has to be a child of the
root directory.
A possible deadlock was recently fixed with
334b0f4e9b ("x86/resctrl: Fix a deadlock due to inaccurate reference").
This fix involved associating the private data of the "mon_groups"
and "mon_data" directories to the resource group to which they belong
instead of NULL as before. A consequence of this change was that
the original safeguard code preventing removal of "mon_groups" and
"mon_data" found in the root directory failed resulting in attempts to
remove the default resource group that ends in a BUG:
kernel BUG at mm/slub.c:3969!
invalid opcode: 0000 [#1] SMP PTI
Call Trace:
rdtgroup_rmdir+0x16b/0x2c0
kernfs_iop_rmdir+0x5c/0x90
vfs_rmdir+0x7a/0x160
do_rmdir+0x17d/0x1e0
do_syscall_64+0x55/0x1d0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by improving the directory removal safeguard to ensure that
subdirectories of the resctrl root directory can only be removed if they
are a child of the resctrl filesystem's root _and_ not associated with
the default resource group.
Fixes: 334b0f4e9b ("x86/resctrl: Fix a deadlock due to inaccurate reference")
Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/884cbe1773496b5dbec1b6bd11bb50cffa83603d.1584461853.git.reinette.chatre@intel.com
The SPLIT_LOCK_CPU() macro escaped the tree-wide sweep for old-style
initialization. Update to use X86_MATCH_INTEL_FAM6_MODEL().
Fixes: 6650cdd9a8 ("x86/split_lock: Enable split lock detection by kernel")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200416205754.21177-2-tony.luck@intel.com
saved_max_pfn was originally introduced in commit
92aa63a5a1 ("[PATCH] kdump: Retrieve saved max pfn")
It used to make sure that the user does not try to read the physical memory
beyond saved_max_pfn. But since commit
921d58c0e6 ("vmcore: remove saved_max_pfn check")
it's no longer used for the check. This variable doesn't have any users
anymore so just remove it.
[ bp: Drop the Calgary IOMMU reference from the commit message. ]
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/20200330181544.1595733-1-kasong@redhat.com
Fix the following sparse warning:
arch/x86/kernel/umip.c:84:12: warning: symbol 'umip_insns' was not declared.
Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20200413082213.22934-1-yanaijie@huawei.com
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAl6ViNsTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXuXIB/4nuYRCt4d/XaeHF6dCWU45ThG+tNs7
p/OnBPZmknI0SnZ4uR/XW5caHEFj7g9ndYh+M1afZ/zKdsc+syMSDT5XhuhC/GKV
fQRW0qO8N+IAqXbLzJxyBg6fH2anwfe3w2uy2cKDEZk6d4FD5atTWhRY6R4ISq0l
g7pUyvQN1q+G6KH2snmOaZL8mybFkbHrmwtAZzcjzdzqasdLFiQB8EEFkONG66t9
HeNTyUF0mnbGBIePQLSZSHLj5p4yHG/9pa3jgqO5dsmIdsBvoaVNqEi3pCm1s/5n
BH9FWn6fTwpcKvtF385yzBiFFlzBVgXbetxuSmxxOkWW4P+db5B/GL2Y
=fjSF
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- a series from Tianyu Lan to fix crash reporting on Hyper-V
- three miscellaneous cleanup patches
* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/Hyper-V: Report crash data in die() when panic_on_oops is set
x86/Hyper-V: Report crash register data when sysctl_record_panic_msg is not set
x86/Hyper-V: Report crash register data or kmsg before running crash kernel
x86/Hyper-V: Trigger crash enlightenment only once during system crash.
x86/Hyper-V: Free hv_panic_page when fail to register kmsg dump
x86/Hyper-V: Unload vmbus channel in hv panic callback
x86: hyperv: report value of misc_features
hv_debugfs: Make hv_debug_root static
hv: hyperv_vmbus.h: Replace zero-length array with flexible-array member
The severity grading code returns IN_KERNEL_RECOV error context for
errors which have happened in kernel space but from which the kernel can
recover. Whether the recovery can happen is determined by the exception
table entry having as handler ex_handler_fault() and which has been
declared at build time using _ASM_EXTABLE_FAULT().
IN_KERNEL_RECOV is used in mce_severity_intel() to lookup the
corresponding error severity in the severities table.
However, the mapping back from error severity to whether the error is
IN_KERNEL_RECOV is ambiguous and in the very paranoid case - which
might not be possible right now - but be better safe than sorry later,
an exception fixup could be attempted for another MCE whose address
is in the exception table and has the proper severity. Which would be
unfortunate, to say the least.
Therefore, mark such MCEs explicitly as MCE_IN_KERNEL_RECOV so that the
recovery attempt is done only for them.
Document the whole handling, while at it, as it is not trivial.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-10-bp@alien8.de
Sometimes, when logs are getting lost, it's nice to just
have everything dumped to the serial console.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-7-tony.luck@intel.com
Instead of keeping count of how many handlers are registered on the
MCE notifier chain and printing if below some magic value, look at
mce->kflags to see if anyone claims to have handled/logged this error.
[ bp: Do not print ->kflags in __print_mce(). ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-6-tony.luck@intel.com
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.
Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
The CEC code has its claws in a couple of routines in mce/core.c.
Convert it to just register itself on the normal MCE notifier chain.
[ bp: Make cec_add_elem() and cec_init() static. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-3-tony.luck@intel.com
It isn't going to be first on the notifier chain when the CEC is moved
to be a normal user of the notifier chain.
Fix the enum for the MCE_PRIO symbols to list them in reverse order so
that the compiler can give them numbers from low to high priority. Add
an entry for MCE_PRIO_CEC as the highest priority.
[ bp: Use passive voice, add comments. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-2-tony.luck@intel.com
... because no one should be interested in spurious MCEs anyway. Make
the filtering unconditional and move it to amd_filter_mce().
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
Handle the cases when the CPU goes offline before the bank
setting/reading happens.
[ bp: Write commit message. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200403161943.1458-8-bp@alien8.de
Pass in the bank pointer directly to the cleaning up functions,
obviating the need for per-CPU accesses. Make the clean up path
interrupt-safe by cleaning the bank pointer first so that the rest of
the teardown happens safe from the thresholding interrupt.
No functional changes.
[ bp: Write commit message and reverse bank->shared test to save an
indentation level in threshold_remove_bank(). ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200403161943.1458-7-bp@alien8.de
mce_threshold_create_device() hotplug callback runs on the plugged in
CPU so:
- use this_cpu_read() which is faster
- pass in struct threshold_bank **bp to threshold_create_bank() and
instead of doing per-CPU accesses
- Use rdmsr_safe() instead of rdmsr_safe_on_cpu() which avoids an IPI.
No functional changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200403161943.1458-6-bp@alien8.de
Drop the stupid threshold_init_device() initcall iterating over all
online CPUs in favor of properly setting up everything on the CPU
hotplug path, when each CPU's callback is invoked.
[ bp: Write commit message. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200403161943.1458-5-bp@alien8.de
Make sure the thresholding bank descriptor is fully initialized when the
thresholding interrupt fires after a hotplug event.
[ bp: Write commit message and document long-forgotten bank_map. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200403161943.1458-4-bp@alien8.de
Drop kobject reference counts properly on error in the banks and blocks
allocation functions.
[ bp: Write commit message. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200403161943.1458-2-bp@alien8.de
Make the doublefault exception handler unconditional on 32-bit. Yes,
it is important to be able to catch #DF exceptions instead of silent
reboots. Yes, the code size increase is worth every byte. And one less
CONFIG symbol is just the cherry on top.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200404083646.8897-1-bp@alien8.de
Now all is using the shiny new macros.
No code changed:
# arch/x86/kernel/smpboot.o:
text data bss dec hex filename
16432 2649 40 19121 4ab1 smpboot.o.before
16432 2649 40 19121 4ab1 smpboot.o.after
md5:
a58104003b72c1de533095bc5a4c30a9 smpboot.o.before.asm
a58104003b72c1de533095bc5a4c30a9 smpboot.o.after.asm
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200324185836.GI22931@zn.tnic
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl6TbaUeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhgkH/iWpiKvosA20HJjC
rBqYeJPxQsgZTuBieWJ+MeVxbpcF7RlM4c+glyvg3QJhHwIEG58dl6LBrQbAyBAR
aFHNojr1iAYOruVCGnU3pA008YZiwUIDv/ZQ4DF8fmIU2vI2mJ6qHBv3XDl4G2uR
Nwz8Eu9AgIwZM5coomVOSmoWyFy7Vxmb7W+3t5VmKsvOWx4ib9kyQtOIkvQDEl7j
XCbWfI0xDQr6LFOm4jnCi5R/LhJ2LIqqIvHHrunbpszM8IwK797jCXz4im+dmd5Y
+km46N7a8pDqri36xXz1gdBAU3eG7Pt1NyvfjwRVTdX4GquQ2MT0GoojxbLxUP3y
3pEsQuE=
=whbL
-----END PGP SIGNATURE-----
Merge tag 'v5.7-rc1' into locking/kcsan, to resolve conflicts and refresh
Resolve these conflicts:
arch/x86/Kconfig
arch/x86/kernel/Makefile
Do a minor "evil merge" to move the KCSAN entry up a bit by a few lines
in the Kconfig to reduce the probability of future conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to notify Hyper-V when a Linux guest VM crash occurs, so
there is a record of the crash even when kdump is enabled. But
crash_kexec_post_notifiers defaults to "false", so the kdump kernel
runs before the notifiers and Hyper-V never gets notified. Fix this by
always setting crash_kexec_post_notifiers to be true for Hyper-V VMs.
Fixes: 81b18bce48 ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic")
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20200406155331.2105-5-Tianyu.Lan@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Without at least minimal handling for split lock detection induced #AC,
VMX will just run into the same problem as the VMWare hypervisor, which
was reported by Kenneth.
It will inject the #AC blindly into the guest whether the guest is
prepared or not.
Provide a function for guest mode which acts depending on the host
SLD mode. If mode == sld_warn, treat it like user space, i.e. emit a
warning, disable SLD and mark the task accordingly. Otherwise force
SIGBUS.
[ bp: Add a !CPU_SUP_INTEL stub for handle_guest_split_lock(). ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lkml.kernel.org/r/20200410115516.978037132@linutronix.de
Link: https://lkml.kernel.org/r/20200402123258.895628824@linutronix.de
Merge yet more updates from Andrew Morton:
- Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
gup, hugetlb, pagemap, memremap)
- Various other things (hfs, ocfs2, kmod, misc, seqfile)
* akpm: (34 commits)
ipc/util.c: sysvipc_find_ipc() should increase position index
kernel/gcov/fs.c: gcov_seq_next() should increase position index
fs/seq_file.c: seq_read(): add info message about buggy .next functions
drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
change email address for Pali Rohár
selftests: kmod: test disabling module autoloading
selftests: kmod: fix handling test numbers above 9
docs: admin-guide: document the kernel.modprobe sysctl
fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
kmod: make request_module() return an error when autoloading is disabled
mm/memremap: set caching mode for PCI P2PDMA memory to WC
mm/memory_hotplug: add pgprot_t to mhp_params
powerpc/mm: thread pgprot_t through create_section_mapping()
x86/mm: introduce __set_memory_prot()
x86/mm: thread pgprot_t through init_memory_mapping()
mm/memory_hotplug: rename mhp_restrictions to mhp_params
mm/memory_hotplug: drop the flags field from struct mhp_restrictions
mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
mm/vma: introduce VM_ACCESS_FLAGS
mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
...
In preparation to support a pgprot_t argument for arch_add_memory().
It's required to move the prototype of init_memory_mapping() seeing the
original location came before the definition of pgprot_t.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Badger <ebadger@gigaio.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200306170846.9333-4-logang@deltatee.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 944d9fec8d ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages.
However it actually works only at early stages of the system loading,
when the majority of memory is free. After some time the memory gets
fragmented by non-movable pages, so the chances to find a contiguous 1GB
block are getting close to zero. Even dropping caches manually doesn't
help a lot.
At large scale rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex. At the same time keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.
The following solution can solve the problem:
1) On boot time a dedicated cma area* is reserved. The size is passed
as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
cma allocator and the dedicated cma area
In this case gigantic hugepages can be allocated successfully with a
high probability, however the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
etc.
* On a multi-node machine a per-node cma area is allocated on each node.
Following gigantic hugetlb allocation are using the first available
numa node if the mask isn't specified by a user.
Usage:
1) configure the kernel to allocate a cma area for hugetlb allocations:
pass hugetlb_cma=10G as a kernel argument
2) allocate hugetlb pages as usual, e.g.
echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.
x86 and arm-64 are covered by this patch, other architectures can be
trivially added later.
The patch contains clean-ups and fixes proposed and implemented by Aslan
Bakirov and Randy Dunlap. It also contains ideas and suggestions
proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Aslan Bakirov <aslan@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A few kernel features depend on ms_hyperv.misc_features, but unlike its
siblings ->features and ->hints, the value was never reported during boot.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/r/20200407172739.31371-1-olaf@aepfle.de
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Fix the following sparse warning:
arch/x86/kernel/acpi/boot.c:48:5: warning: symbol 'acpi_nobgrt' was not
declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Update the ACPICA code in the kernel to upstream revision 20200326
including:
* Fix for a typo in a comment field (Bob Moore).
* acpiExec namespace init file fixes (Bob Moore).
* Addition of NHLT to the known tables list (Cezary Rojewski).
* Conversion of PlatformCommChannel ASL keyword to PCC (Erik
Kaneda).
* acpiexec cleanup (Erik Kaneda).
* WSMT-related typo fix (Erik Kaneda).
* sprintf() utility function fix (John Levon).
* IVRS IVHD type 11h parsing implementation (Michał Żygowski).
* IVRS IVHD type 10h reserved field name fix (Michał Żygowski).
- Fix ACPI-related CPU hotplug deadlock on x86 (Qian Cai).
- Fix Intel Tiger Lake ACPI device IDs in several places (Gayatri
Kammela).
- Add ACPI backlight blacklist entry for Acer Aspire 5783z (Hans
de Goede).
- Fix documentation of the "acpi_backlight" kernel command line
switch (Randy Dunlap).
- Clean up the acpi_get_psd_map() CPPC library routine (Liguang
Zhang).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl6LQEASHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxqfoQAI/GPq7xhb8jOofmTfLxa4ahO2NDxK1E
Ye4Tcm8JLv78hro7iMUlbPsRXm15lyDxMldGRfxsiLFTF2xQtYhdTnPx+KZ439j+
QokMHUT6gFEMAV7OPFvXd2r58ShJJHezobbn241zTILx1c3ai66dCQrqyhYjlZ28
0hUCyY4ilgXWuYInlckGW3Rp/Qxc9IVOxzFUV90EW9pTb4vKzoqznjNm+dpY8rHm
QFNb2BkTJygOPmJiumi/yJX+74YSZrzW5fS1PDQS4Lr46j0imvWVVataMd1qbQ0+
fDhvhL7IimHiM/qZg67hKpsAt6AcQPhaZ6JyoEGUoafxpBQN0a7b5rMwmL0P/HWV
pL5mKM+jc7zh0HTb+xkpNotJxT+KBFo1jTRxGyVAnK8SThzlyFhKhetiOwaHCIDv
dNYao6bCNsuGLh3T/09xbAmEeCSt7k+ok892N4o9wzqNfoDg6fX/c0M5ZD1F+Awb
l9agU7XChziyDJwAqTbqndx71DK4ALrhZa1tNKA5PGTY8b5XrojoKsOyYk6PYA1x
CqU20muRV4VAzB0pvdiwBc2Yrtfiv32mv5jMNrqrrv3D6S6R8vBUNhHlWKu/75a9
9muIoEHWnK0/a9kmVJG8CUSXTTTPQpvOovesznruTxvGx3Mp9gw+d3/1tjuA/QNM
ZoOOru5AEyi+
=b3Dp
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI updates from Rafael Wysocki:
"Additional ACPI updates.
These update the ACPICA code in the kernel to the 20200326 upstream
revision, fix an ACPI-related CPU hotplug deadlock on x86, update
Intel Tiger Lake device IDs in some places, add a new ACPI backlight
blacklist entry, update the "acpi_backlight" kernel command line
switch documentation and clean up a CPPC library routine.
Specifics:
- Update the ACPICA code in the kernel to upstream revision 20200326
including:
* Fix for a typo in a comment field (Bob Moore)
* acpiExec namespace init file fixes (Bob Moore)
* Addition of NHLT to the known tables list (Cezary Rojewski)
* Conversion of PlatformCommChannel ASL keyword to PCC (Erik
Kaneda)
* acpiexec cleanup (Erik Kaneda)
* WSMT-related typo fix (Erik Kaneda)
* sprintf() utility function fix (John Levon)
* IVRS IVHD type 11h parsing implementation (Michał Żygowski)
* IVRS IVHD type 10h reserved field name fix (Michał Żygowski)
- Fix ACPI-related CPU hotplug deadlock on x86 (Qian Cai)
- Fix Intel Tiger Lake ACPI device IDs in several places (Gayatri
Kammela)
- Add ACPI backlight blacklist entry for Acer Aspire 5783z (Hans de
Goede)
- Fix documentation of the "acpi_backlight" kernel command line
switch (Randy Dunlap)
- Clean up the acpi_get_psd_map() CPPC library routine (Liguang
Zhang)"
* tag 'acpi-5.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
x86: ACPI: fix CPU hotplug deadlock
thermal: int340x_thermal: fix: Update Tiger Lake ACPI device IDs
platform/x86: intel-hid: fix: Update Tiger Lake ACPI device ID
ACPI: Update Tiger Lake ACPI device IDs
ACPI: video: Use native backlight on Acer Aspire 5783z
ACPI: video: Docs update for "acpi_backlight" kernel parameter options
ACPICA: Update version 20200326
ACPICA: Fixes for acpiExec namespace init file
ACPICA: Add NHLT table signature
ACPICA: WSMT: Fix typo, no functional change
ACPICA: utilities: fix sprintf()
ACPICA: acpiexec: remove redeclaration of acpi_gbl_db_opt_no_region_support
ACPICA: Change PlatformCommChannel ASL keyword to PCC
ACPICA: Fix IVRS IVHD type 10h reserved field name
ACPICA: Implement IVRS IVHD type 11h parsing
ACPICA: Fix a typo in a comment field
ACPI: CPPC: clean up acpi_get_psd_map()
Similar to commit 0266d81e9b ("acpi/processor: Prevent cpu hotplug
deadlock") except this is for acpi_processor_ffh_cstate_probe():
"The problem is that the work is scheduled on the current CPU from the
hotplug thread associated with that CPU.
It's not required to invoke these functions via the workqueue because
the hotplug thread runs on the target CPU already.
Check whether current is a per cpu thread pinned on the target CPU and
invoke the function directly to avoid the workqueue."
WARNING: possible circular locking dependency detected
------------------------------------------------------
cpuhp/1/15 is trying to acquire lock:
ffffc90003447a28 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x4c6/0x630
but task is already holding lock:
ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
cpus_read_lock+0x3e/0xc0
irq_calc_affinity_vectors+0x5f/0x91
__pci_enable_msix_range+0x10f/0x9a0
pci_alloc_irq_vectors_affinity+0x13e/0x1f0
pci_alloc_irq_vectors_affinity at drivers/pci/msi.c:1208
pqi_ctrl_init+0x72f/0x1618 [smartpqi]
pqi_pci_probe.cold.63+0x882/0x892 [smartpqi]
local_pci_probe+0x7a/0xc0
work_for_cpu_fn+0x2e/0x50
process_one_work+0x57e/0xb90
worker_thread+0x363/0x5b0
kthread+0x1f4/0x220
ret_from_fork+0x27/0x50
-> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__lock_acquire+0x2244/0x32a0
lock_acquire+0x1a2/0x680
__flush_work+0x4e6/0x630
work_on_cpu+0x114/0x160
acpi_processor_ffh_cstate_probe+0x129/0x250
acpi_processor_evaluate_cst+0x4c8/0x580
acpi_processor_get_power_info+0x86/0x740
acpi_processor_hotplug+0xc3/0x140
acpi_soft_cpu_online+0x102/0x1d0
cpuhp_invoke_callback+0x197/0x1120
cpuhp_thread_fun+0x252/0x2f0
smpboot_thread_fn+0x255/0x440
kthread+0x1f4/0x220
ret_from_fork+0x27/0x50
other info that might help us debug this:
Chain exists of:
(work_completion)(&wfc.work) --> cpuhp_state-up --> cpuidle_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpuidle_lock);
lock(cpuhp_state-up);
lock(cpuidle_lock);
lock((work_completion)(&wfc.work));
*** DEADLOCK ***
3 locks held by cpuhp/1/15:
#0: ffffffffaf51ab10 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0
#1: ffffffffaf51ad40 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0
#2: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20
Call Trace:
dump_stack+0xa0/0xea
print_circular_bug.cold.52+0x147/0x14c
check_noncircular+0x295/0x2d0
__lock_acquire+0x2244/0x32a0
lock_acquire+0x1a2/0x680
__flush_work+0x4e6/0x630
work_on_cpu+0x114/0x160
acpi_processor_ffh_cstate_probe+0x129/0x250
acpi_processor_evaluate_cst+0x4c8/0x580
acpi_processor_get_power_info+0x86/0x740
acpi_processor_hotplug+0xc3/0x140
acpi_soft_cpu_online+0x102/0x1d0
cpuhp_invoke_callback+0x197/0x1120
cpuhp_thread_fun+0x252/0x2f0
smpboot_thread_fn+0x255/0x440
kthread+0x1f4/0x220
ret_from_fork+0x27/0x50
Signed-off-by: Qian Cai <cai@lca.pw>
Tested-by: Borislav Petkov <bp@suse.de>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Here are 3 SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your current
tree, that should be trivial to handle (one file modified by two things,
one file deleted.)
All 3 of these have been in linux-next for a while, with no reported
issues other than the merge conflict.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodg5A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykySQCgy9YDrkz7nWq6v3Gohl6+lW/L+rMAnRM4uTZm
m5AuCzO3Azt9KBi7NL+L
=2Lm5
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX updates from Greg KH:
"Here are three SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your
current tree, that should be trivial to handle (one file modified by
two things, one file deleted.)
All three of these have been in linux-next for a while, with no
reported issues other than the merge conflict"
* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
ASoC: MT6660: make spdxcheck.py happy
.gitignore: add SPDX License Identifier
.gitignore: remove too obvious comments
Pull integrity updates from Mimi Zohar:
"Just a couple of updates for linux-5.7:
- A new Kconfig option to enable IMA architecture specific runtime
policy rules needed for secure and/or trusted boot, as requested.
- Some message cleanup (eg. pr_fmt, additional error messages)"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: add a new CONFIG for loading arch-specific policies
integrity: Remove duplicate pr_fmt definitions
IMA: Add log statements for failure conditions
IMA: Update KBUILD_MODNAME for IMA files to ima
Pull x86 vmware updates from Ingo Molnar:
"The main change in this tree is the addition of 'steal time clock
support' for VMware guests"
* 'x86-vmware-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmware: Use bool type for vmw_sched_clock
x86/vmware: Enable steal time accounting
x86/vmware: Add steal time clock support for VMware guests
x86/vmware: Remove vmware_sched_clock_setup()
x86/vmware: Make vmware_select_hypercall() __init
Pull x86 fpu updates from Ingo Molnar:
"Misc changes:
- add a pkey sanity check
- three commits to improve and future-proof xstate/xfeature handling
some more"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pkeys: Add check for pkey "overflow"
x86/fpu/xstate: Warn when checking alignment of disabled xfeatures
x86/fpu/xstate: Fix XSAVES offsets in setup_xstate_comp()
x86/fpu/xstate: Fix last_good_offset in setup_xstate_features()
Pull x86 cleanups from Ingo Molnar:
"This topic tree contains more commits than usual:
- most of it are uaccess cleanups/reorganization by Al
- there's a bunch of prototype declaration (--Wmissing-prototypes)
cleanups
- misc other cleanups all around the map"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/mm/set_memory: Fix -Wmissing-prototypes warnings
x86/efi: Add a prototype for efi_arch_mem_reserve()
x86/mm: Mark setup_emu2phys_nid() static
x86/jump_label: Move 'inline' keyword placement
x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt()
kill uaccess_try()
x86: unsafe_put-style macro for sigmask
x86: x32_setup_rt_frame(): consolidate uaccess areas
x86: __setup_rt_frame(): consolidate uaccess areas
x86: __setup_frame(): consolidate uaccess areas
x86: setup_sigcontext(): list user_access_{begin,end}() into callers
x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit)
x86: ia32_setup_rt_frame(): consolidate uaccess areas
x86: ia32_setup_frame(): consolidate uaccess areas
x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers
x86/alternatives: Mark text_poke_loc_init() static
x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl()
x86/mm: Drop pud_mknotpresent()
x86: Replace setup_irq() by request_irq()
x86/configs: Slightly reduce defconfigs
...
Pull x86 build updates from Ingo Molnar:
"A handful of updates: two linker script cleanups and a stock
defconfig+allmodconfig bootability fix"
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Discard .note.gnu.property sections in vDSO
x86, vmlinux.lds: Add RUNTIME_DISCARD_EXIT to generic DISCARDS
x86/Kconfig: Make CMDLINE_OVERRIDE depend on non-empty CMDLINE
Pull x86 boot updates from Ingo Molnar:
"Misc cleanups and small enhancements all around the map"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/compressed: Fix debug_puthex() parameter type
x86/setup: Fix static memory detection
x86/vmlinux: Drop unneeded linker script discard of .eh_frame
x86/*/Makefile: Use -fno-asynchronous-unwind-tables to suppress .eh_frame sections
x86/boot/compressed: Remove .eh_frame section from bzImage
x86/boot/compressed/64: Remove .bss/.pgtable from bzImage
x86/boot/compressed/64: Use 32-bit (zero-extended) MOV for z_output_len
x86/boot/compressed/64: Use LEA to initialize boot stack pointer
- A series of commits to make the MSR derived CPU and TSC frequency more
accurate.
It turned out that the frequency tables which have been taken from the
SDM are inaccurate because the SDM provides truncated and rounded
values, e.g. 83.3Mhz (83.3333...) or 116.7Mhz (116.6666...).
This causes time drift in the range of ~1 second per hour
(20-30 seconds per day). On some of these SoCs it's not possible to
recalibrate the TSC because there is no reference (PIT, HPET) available.
With some reverse engineering it was established that the possible
frequencies are derived from the base clock with fixed multiplier /
divider pairs.
For the CPU models which have a known crystal frequency the kernel now
uses multiplier / divider pairs which bring the frequencies closer to
reality and fix the observed time drift issues.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6CApYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRM4D/9/lgBQQQ+xilpYHLv4lk5ukmkrLEjt
NqL0dZKthd2v4VoAViCZqCYUSuxmo9uGPCxC0Ol7MMB2mUHXrwPn5q2wwcHSE830
KQv8Dk9tCVeJMMTMk2s5t4QBYEHD95+ueObKK1sofz0NkQW3ea+cpRCh4jt2lrnw
X7uT5rSHk87B1VYMPWzELsBEeqan9kUbvbe9se7My5utesOZumn4gj9rmO/5y9Vc
rNuwGEZX8RpQAZZmfEJ00r5iA+VTdWyQ4rhktlQeeIdb4y4axjxMsWQIuaggjdyn
oRA2vZnoc4+IqNUUBvj1q1D3RETwyf3WT+nxiYUdb3VuSh1o7he5MTzgznKzThnU
s+ViOPXbfzrfUUW8dlk6zd5yovmIuQNb0Xk05USqAB3gVQS1fYPnyy+pb9dFDnnB
0zEq3RAQVCb/bkyWQ0JemgHXda3WTABZRCR812L2e+WZD6KjlqySkdeJJ+kxzQwN
6FRNrdtl+8ULy6SlWIC8y0yuVdSIFfgNSm+5HZMrw8VbqJp1ZVpTvKQ+xczjOunn
z9y24IC1IlhtDsTMzIU0LHhgwhVGcohdTNbu3yX4hVQ7EgQOQPVE/XGM5RdjiXzq
bD5j+PCntjoE7hnnxsPnuhDs9ZqNptTo4UevMwTL6rKAgJaPMAc9PB5LqdUynAEx
hpkkMFGU7BSBDw==
=9eX1
-----END PGP SIGNATURE-----
Merge tag 'x86-timers-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 timer updates from Thomas Gleixner:
"A series of commits to make the MSR derived CPU and TSC frequency more
accurate.
It turned out that the frequency tables which have been taken from the
SDM are inaccurate because the SDM provides truncated and rounded
values, e.g. 83.3Mhz (83.3333...) or 116.7Mhz (116.6666...).
This causes time drift in the range of ~1 second per hour (20-30
seconds per day). On some of these SoCs it's not possible to
recalibrate the TSC because there is no reference (PIT, HPET)
available.
With some reverse engineering it was established that the possible
frequencies are derived from the base clock with fixed multiplier /
divider pairs.
For the CPU models which have a known crystal frequency the kernel now
uses multiplier / divider pairs which bring the frequencies closer to
reality and fix the observed time drift issues"
* tag 'x86-timers-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc_msr: Make MSR derived TSC frequency more accurate
x86/tsc_msr: Fix MSR_FSB_FREQ mask for Cherry Trail devices
x86/tsc_msr: Use named struct initializers
- Atomic operations (lock prefixed instructions) which span two cache
lines have to acquire the global bus lock. This is at least 1k cycles
slower than an atomic operation within a cache line and disrupts
performance on other cores. Aside of performance disruption this is
a unpriviledged form of DoS.
Some newer CPUs have the capability to raise an #AC trap when such an
operation is attempted. The detection is by default enabled in warning
mode which will warn once when a user space application is caught. A
command line option allows to disable the detection or to select fatal
mode which will terminate offending applications with SIGBUS.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B/uMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocsAD/9yqpw+XlPKNPsfbm9sbirBDfTrENcL
F44iwn4WnrjoW/gnnZCYmPxJFsTtGVPqxHdUf4eyGemg9r9ZEO0DQftmUHC5Z6KX
aa/b5JoeM61wp9HlpVlD4D1jVt4pWyQODQeZnUXE4DEzmRc3cD/5lSU+/VeaIwwz
lxwUemqmXK7ucH2KA7smOGsl2nU6ED84q3mdOB1b4Cw+gWYMUnPJnuS/ipriBRx4
BYbMItcxsFvtdO9Hx8PvGd5LUK0wW8JOWrYQICD2kLpZtHtGeaHpBzFzL0+nMU7d
1epyDqJQDmX+PAzvj+EYyn3HTfobZlckn+tbxMQkkS+oDk1ywOZd+BancClvn5/5
jMfPIQJF5bGASVnzGMWhzVdwthTZiMG4d1iKsUWOA/hN0ch0+rm1BqraToabsEFg
Sv7/rvl9KtSOtMJTeAmMhlZUMBj9m8BtPFjniDwp6nw/upGgJdST5mrKFNYZvqOj
JnXsEMr/nJVW6bnUvT6LF66xbHlzHdxtodkQWqF+IEsyRaOz1zAGpQamP98KxNLc
dq/XYoEe1KqIFbg4BkNP+GeDL3FQDxjFNwPQnnjQEzWRbjkHlfmq1uKCsR2r8mBO
fYNJ1X8lTyGV0kx/ERpWGazzabpzh+8Lr1yMhnoA3EWvlzUjmpN2PFI4oTpTrtzT
c/q16SCxim3NWA==
=D9x8
-----END PGP SIGNATURE-----
Merge tag 'x86-splitlock-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 splitlock updates from Thomas Gleixner:
"Support for 'split lock' detection:
Atomic operations (lock prefixed instructions) which span two cache
lines have to acquire the global bus lock. This is at least 1k cycles
slower than an atomic operation within a cache line and disrupts
performance on other cores. Aside of performance disruption this is a
unpriviledged form of DoS.
Some newer CPUs have the capability to raise an #AC trap when such an
operation is attempted. The detection is by default enabled in warning
mode which will warn once when a user space application is caught. A
command line option allows to disable the detection or to select fatal
mode which will terminate offending applications with SIGBUS"
* tag 'x86-splitlock-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/split_lock: Avoid runtime reads of the TEST_CTRL MSR
x86/split_lock: Rework the initialization flow of split lock detection
x86/split_lock: Enable split lock detection by kernel
- Convert the 32bit syscalls to be pt_regs based which removes the
requirement to push all 6 potential arguments onto the stack and
consolidates the interface with the 64bit variant
- The first small portion of the exception and syscall related entry
code consolidation which aims to address the recently discovered
issues vs. RCU, int3, NMI and some other exceptions which can
interrupt any context. The bulk of the changes is still work in
progress and aimed for 5.8.
- A few lockdep namespace cleanups which have been applied into this
branch to keep the prerequisites for the ongoing work confined.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B/TMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYA6EAC7r/bCMxBelljT3b7LkBbiJcocJ+zK
OSzWU9miJGTAvYqn4/ciLKg4dA424b/1rBFlF1hBTCQ0HL5Cv4lajxdKEZCO5WCC
WWTCz+MC60aWFaH3VNoywiLGb39H2IbqWbS9yNPd/wBkLHiMAD6NPQntOvcPaD4j
1lyrMtLzfrWlrHxvxdI3kt5ZpFLYNXr2xk61xQjTz0ROFQBhf2sDsuhHhiYVLPj7
JwYktpbBiPeaw2+I18NPymNPY+VfY8LCTgLl5M+rbKyCqebKaedZQJ7QXFhAEqKC
Y2f+gJsKWtTDzGP2mk/5kF0uP7cd0vJK35ZCXtLZ9BbcNtFZU6w+ADqRo4pJBHRY
QRzo/AWrdkuTJF0CrP6mcneNC7NwWLSdKrE1z77RQCHUPVvhHhRDZsgdLcZ/KKwx
y1ji22trwNB+7LmI2fUOU5RRHZBIuNvQT+mPt24febJuHpZKul62dd3cqTGeSTC+
MYVknYDSg/+jk+83DhuZnTyb9lWTbq/0Q1HRDu6l2LrMIH7YMPpY5Ea64ZFYzWXy
s0+iHEM4mUzltwNauHIntjbwXi3C0l2k1WQyG0gun2eS6SXfu0lb93V4msFj/N1+
oHavH2n2A4XrRr+Ob87fsl7nfXJibWP7R9xPblrWP2sNdqfjSyGd49rnsvpWqWMK
Fj0d7tQ78+/SwA==
=tWXS
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry code updates from Thomas Gleixner:
- Convert the 32bit syscalls to be pt_regs based which removes the
requirement to push all 6 potential arguments onto the stack and
consolidates the interface with the 64bit variant
- The first small portion of the exception and syscall related entry
code consolidation which aims to address the recently discovered
issues vs. RCU, int3, NMI and some other exceptions which can
interrupt any context. The bulk of the changes is still work in
progress and aimed for 5.8.
- A few lockdep namespace cleanups which have been applied into this
branch to keep the prerequisites for the ongoing work confined.
* tag 'x86-entry-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
x86/entry: Fix build error x86 with !CONFIG_POSIX_TIMERS
lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}()
lockdep: Rename trace_softirqs_{on,off}()
lockdep: Rename trace_hardirq_{enter,exit}()
x86/entry: Rename ___preempt_schedule
x86: Remove unneeded includes
x86/entry: Drop asmlinkage from syscalls
x86/entry/32: Enable pt_regs based syscalls
x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments
x86/entry/32: Rename 32-bit specific syscalls
x86/entry/32: Clean up syscall_32.tbl
x86/entry: Remove ABI prefixes from functions in syscall tables
x86/entry/64: Add __SYSCALL_COMMON()
x86/entry: Remove syscall qualifier support
x86/entry/64: Remove ptregs qualifier from syscall table
x86/entry: Move max syscall number calculation to syscallhdr.sh
x86/entry/64: Split X32 syscall table into its own file
x86/entry/64: Move sys_ni_syscall stub to common.c
x86/entry/64: Use syscall wrappers for x32_rt_sigreturn
x86/entry: Refactor SYS_NI macros
...
Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers. which contain only the minimaly required information which is
necessary to build the vDSO. These new headers are included from the
kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by PPC.
- Cleanup and consolidation of the exit related code in posix CPU timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B+QETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofJ5D/94s5fpaqiuNcaAsLq2D3DRIrTnqxx7
yEeAOPcbYV1bM1SgY/M83L5yGc2S8ny787e26abwRTCZhZV3eAmRTphIFFIZR0Xk
xS+i67odscbdJTRtztKj3uQ9rFxefszRuphyaa89pwSY9nnyMWLcahGSQOGs0LJK
hvmgwPjyM1drNfPxgPiaFg7vDr2XxNATpQr/FBt+BhelvVan8TlAfrkcNPiLr++Y
Axz925FP7jMaRRbZ1acji34gLiIAZk0jLCUdbix7YkPrqDB4GfO+v8Vez+fGClbJ
uDOYeR4r1+Be/BtSJtJ2tHqtsKCcAL6agtaE2+epZq5HbzaZFRvBFaxgFNF8WVcn
3FFibdEMdsRNfZTUVp5wwgOLN0UIqE/7LifE12oLEL2oFB5H2PiNEUw3E02XHO11
rL3zgHhB6Ke1sXKPCjSGdmIQLbxZmV5kOlQFy7XuSeo5fmRapVzKNffnKcftIliF
1HNtZbgdA+3tdxMFCqoo1QX+kotl9kgpslmdZ0qHAbaRb3xqLoSskbqEjFRMuSCC
8bjJrwboD9T5GPfwodSCgqs/58CaSDuqPFbIjCay+p90Fcg6wWAkZtyG04ZLdPRc
GgNNdN4gjTD9bnrRi8cH47z1g8OO4vt4K4SEbmjo8IlDW+9jYMxuwgR88CMeDXd7
hu7aKsr2I2q/WQ==
=5o9G
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping and timer updates from Thomas Gleixner:
"Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers. which contain only the minimaly required information which
is necessary to build the vDSO. These new headers are included from
the kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by
PPC.
- Cleanup and consolidation of the exit related code in posix CPU
timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the
place"
* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
vdso: Fix clocksource.h macro detection
um: Fix header inclusion
arm64: vdso32: Enable Clang Compilation
lib/vdso: Enable common headers
arm: vdso: Enable arm to use common headers
x86/vdso: Enable x86 to use common headers
mips: vdso: Enable mips to use common headers
arm64: vdso32: Include common headers in the vdso library
arm64: vdso: Include common headers in the vdso library
arm64: Introduce asm/vdso/processor.h
arm64: vdso32: Code clean up
linux/elfnote.h: Replace elf.h with UAPI equivalent
scripts: Fix the inclusion order in modpost
common: Introduce processor.h
linux/ktime.h: Extract common header for vDSO
linux/jiffies.h: Extract common header for vDSO
linux/time64.h: Extract common header for vDSO
linux/time32.h: Extract common header for vDSO
linux/time.h: Extract common header for vDSO
...
- Support for locked CSD objects in smp_call_function_single_async()
which allows to simplify callsites in the scheduler core and MIPS
- Treewide consolidation of CPU hotplug functions which ensures the
consistency between the sysfs interface and kernel state. The low level
functions cpu_up/down() are now confined to the core code and not
longer accessible from random code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B9VQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodCyD/0WFYAe7LkOfNjkbLa0IeuyLjF9rnCi
ilcSXMLpaVwwoQvm7MopwkXUDdmEIyeJ0B641j3mC3AKCRap4+O36H2IEg2byrj7
twOvQNCfxpVVmCCD11FTH9aQa74LEB6AikTgjevhrRWj6eHsal7c2Ak26AzCgrt+
0eEkOAOWJbLAlbIiPdHlCZ3TMldcs3gg+lRSYd5QCGQVkZFnwpXzyOvpyJEUGGbb
R/JuvwJoLhRMiYAJDILoQQQg/J07ODuivse/R8PWaH2djkn+2NyRGrD794PhyyOg
QoTU0ZrYD3Z48ACXv+N3jLM7wXMcFzjYtr1vW1E3O/YGA7GVIC6XHGbMQ7tEihY0
ajtwq8DcnpKtuouviYnf7NuKgqdmJXkaZjz3Gms6n8nLXqqSVwuQELWV2CXkxNe6
9kgnnKK+xXMOGI4TUhN8bejvkXqRCmKMeQJcWyf+7RA9UIhAJw5o7WGo8gXfQWUx
tazCqDy/inYjqGxckW615fhi2zHfemlYTbSzIGOuMB1TEPKFcrgYAii/VMsYHQVZ
5amkYUXGQ5brlCOzOn38lzp5OkALBnFzD7xgvOcQgWT3ynVpdqADfBytXiEEHh4J
KSkSgSSRcS58397nIxnDcJgJouHLvAWYyPZ4UC6mfynuQIic31qMHGVqwdbEKMY3
4M5dGgqIfOBgYw==
=jwCg
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core SMP updates from Thomas Gleixner:
"CPU (hotplug) updates:
- Support for locked CSD objects in smp_call_function_single_async()
which allows to simplify callsites in the scheduler core and MIPS
- Treewide consolidation of CPU hotplug functions which ensures the
consistency between the sysfs interface and kernel state. The low
level functions cpu_up/down() are now confined to the core code and
not longer accessible from random code"
* tag 'smp-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus()
cpu/hotplug: Hide cpu_up/down()
cpu/hotplug: Move bringup of secondary CPUs out of smp_init()
torture: Replace cpu_up/down() with add/remove_cpu()
firmware: psci: Replace cpu_up/down() with add/remove_cpu()
xen/cpuhotplug: Replace cpu_up/down() with device_online/offline()
parisc: Replace cpu_up/down() with add/remove_cpu()
sparc: Replace cpu_up/down() with add/remove_cpu()
powerpc: Replace cpu_up/down() with add/remove_cpu()
x86/smp: Replace cpu_up/down() with add/remove_cpu()
arm64: hibernate: Use bringup_hibernate_cpu()
cpu/hotplug: Provide bringup_hibernate_cpu()
arm64: Use reboot_cpu instead of hardconding it to 0
arm64: Don't use disable_nonboot_cpus()
ARM: Use reboot_cpu instead of hardcoding it to 0
ARM: Don't use disable_nonboot_cpus()
ia64: Replace cpu_down() with smp_shutdown_nonboot_cpus()
cpu/hotplug: Create a new function to shutdown nonboot cpus
cpu/hotplug: Add new {add,remove}_cpu() functions
sched/core: Remove rq.hrtick_csd_pending
...
Treewide:
- Cleanup of setup_irq() which is not longer required because the
memory allocator is available early. Most cleanup changes come
through the various maintainer trees, so the final removal of
setup_irq() is postponed towards the end of the merge window.
Core:
- Protection against unsafe invocation of interrupt handlers and unsafe
interrupt injection including a fixup of the offending PCI/AER error
injection mechanism.
Invoking interrupt handlers from arbitrary contexts, i.e. outside of
an actual interrupt, can cause inconsistent state on the fragile
x86 interrupt affinity changing hardware trainwreck.
Drivers:
- Second wave of support for the new ARM GICv4.1
- Multi-instance support for Xilinx and PLIC interrupt controllers
- CPU-Hotplug support for PLIC
- The obligatory new driver for X1000 TCU
- Enhancements, cleanups and fixes all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B888THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoeMJD/9v8GcI/DSY87Fmo7s4odLFVU0J8zZ6
7QlYjSPm4yWv4pqn1TEnEF2pKz5X9Euhoh8BmdMKtdXBqlS4Ix9N+pH8ModcxyQo
aX97zuRUxvqfeeVE+yQRwbbMREj9jj9RW8FRtA39+l5H3uC1GDcc+2aAMIaykQ7+
8lo/6wBd8ZrZ0gsNf4KjlBwMDYAlQSRWxrff38PQ2XRpGKowdp8JFYZuq5Vp0ljJ
r2cE75ldmFSfmtuhhVroBRY0GAqW4/8v8/syAN3Q9jOEII60qhA0dqR085B9veWa
DHSqgLmzyUFFXN7Ntzt/fDirJVsIM4BE9qGu3ftCYHMaPB8hG+xqjbZe9E3D2e/d
+0Pb3TG8EHVOIwzv1t9+6462qYGkBhmBXtbj6GptPYk2Ai4HZlNaSsa8jUNyHvGz
WDegdRjt7O5RjqDH/VwrQxW/AEp05f/1egweBXbq9aF6j9nqeOur75c/PdxZxAX5
WUMtouXP2WN+sMW8k1T5cmVMGWxLGBB0wwG4LC/mXzHnkDiN1+2wEUHmhS8Voi3q
3HXeYBJeukUYbVvMKRvWVAD330TxFjAyd6pPwCdoNY2ZngJnQWlDD9vbYYX2osoW
kP+KhIANNBVqdK7NqlLoqcr3SdHn01pQYuVHejNzxb7E6/mmpMlaYDJc/rMPi/eM
0/rzl8fAj/WyBQ==
=DZ/G
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Treewide:
- Cleanup of setup_irq() which is not longer required because the
memory allocator is available early.
Most cleanup changes come through the various maintainer trees, so
the final removal of setup_irq() is postponed towards the end of
the merge window.
Core:
- Protection against unsafe invocation of interrupt handlers and
unsafe interrupt injection including a fixup of the offending
PCI/AER error injection mechanism.
Invoking interrupt handlers from arbitrary contexts, i.e. outside
of an actual interrupt, can cause inconsistent state on the
fragile x86 interrupt affinity changing hardware trainwreck.
Drivers:
- Second wave of support for the new ARM GICv4.1
- Multi-instance support for Xilinx and PLIC interrupt controllers
- CPU-Hotplug support for PLIC
- The obligatory new driver for X1000 TCU
- Enhancements, cleanups and fixes all over the place"
* tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
unicore32: Replace setup_irq() by request_irq()
sh: Replace setup_irq() by request_irq()
hexagon: Replace setup_irq() by request_irq()
c6x: Replace setup_irq() by request_irq()
alpha: Replace setup_irq() by request_irq()
irqchip/gic-v4.1: Eagerly vmap vPEs
irqchip/gic-v4.1: Add VSGI property setup
irqchip/gic-v4.1: Add VSGI allocation/teardown
irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer
irqchip/gic-v4.1: Plumb set_vcpu_affinity SGI callbacks
irqchip/gic-v4.1: Plumb get/set_irqchip_state SGI callbacks
irqchip/gic-v4.1: Plumb mask/unmask SGI callbacks
irqchip/gic-v4.1: Add initial SGI configuration
irqchip/gic-v4.1: Plumb skeletal VSGI irqchip
irqchip/stm32: Retrigger both in eoi and unmask callbacks
irqchip/gic-v3: Move irq_domain_update_bus_token to after checking for NULL domain
irqchip/xilinx: Do not call irq_set_default_host()
irqchip/xilinx: Enable generic irq multi handler
irqchip/xilinx: Fill error code when irq domain registration fails
irqchip/xilinx: Add support for multiple instances
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- Various NUMA scheduling updates: harmonize the load-balancer and
NUMA placement logic to not work against each other. The intended
result is better locality, better utilization and fewer migrations.
- Introduce Thermal Pressure tracking and optimizations, to improve
task placement on thermally overloaded systems.
- Implement frequency invariant scheduler accounting on (some) x86
CPUs. This is done by observing and sampling the 'recent' CPU
frequency average at ~tick boundaries. The CPU provides this data
via the APERF/MPERF MSRs. This hopefully makes our capacity
estimates more precise and keeps tasks on the same CPU better even
if it might seem overloaded at a lower momentary frequency. (As
usual, turbo mode is a complication that we resolve by observing
the maximum frequency and renormalizing to it.)
- Add asymmetric CPU capacity wakeup scan to improve capacity
utilization on asymmetric topologies. (big.LITTLE systems)
- PSI fixes and optimizations.
- RT scheduling capacity awareness fixes & improvements.
- Optimize the CONFIG_RT_GROUP_SCHED constraints code.
- Misc fixes, cleanups and optimizations - see the changelog for
details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
threads: Update PID limit comment according to futex UAPI change
sched/fair: Fix condition of avg_load calculation
sched/rt: cpupri_find: Trigger a full search as fallback
kthread: Do not preempt current task if it is going to call schedule()
sched/fair: Improve spreading of utilization
sched: Avoid scale real weight down to zero
psi: Move PF_MEMSTALL out of task->flags
MAINTAINERS: Add maintenance information for psi
psi: Optimize switching tasks inside shared cgroups
psi: Fix cpu.pressure for cpu.max and competing cgroups
sched/core: Distribute tasks within affinity masks
sched/fair: Fix enqueue_task_fair warning
thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
sched/rt: Remove unnecessary push for unfit tasks
sched/rt: Allow pulling unfitting task
sched/rt: Optimize cpupri_find() on non-heterogenous systems
sched/rt: Re-instate old behavior in select_task_rq_rt()
sched/rt: cpupri_find: Implement fallback mechanism for !fit case
sched/fair: Fix reordering of enqueue/dequeue_task_fair()
sched/fair: Fix runnable_avg for throttled cfs
...
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:
Kernel side changes:
- A couple of x86/cpu cleanups and changes were grandfathered in due
to patch dependencies. These clean up the set of CPU model/family
matching macros with a consistent namespace and C99 initializer
style.
- A bunch of updates to various low level PMU drivers:
* AMD Family 19h L3 uncore PMU
* Intel Tiger Lake uncore support
* misc fixes to LBR TOS sampling
- optprobe fixes
- perf/cgroup: optimize cgroup event sched-in processing
- misc cleanups and fixes
Tooling side changes are to:
- perf {annotate,expr,record,report,stat,test}
- perl scripting
- libapi, libperf and libtraceevent
- vendor events on Intel and S390, ARM cs-etm
- Intel PT updates
- Documentation changes and updates to core facilities
- misc cleanups, fixes and other enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
cpufreq/intel_pstate: Fix wrong macro conversion
x86/cpu: Cleanup the now unused CPU match macros
hwrng: via_rng: Convert to new X86 CPU match macros
crypto: Convert to new CPU match macros
ASoC: Intel: Convert to new X86 CPU match macros
powercap/intel_rapl: Convert to new X86 CPU match macros
PCI: intel-mid: Convert to new X86 CPU match macros
mmc: sdhci-acpi: Convert to new X86 CPU match macros
intel_idle: Convert to new X86 CPU match macros
extcon: axp288: Convert to new X86 CPU match macros
thermal: Convert to new X86 CPU match macros
hwmon: Convert to new X86 CPU match macros
platform/x86: Convert to new CPU match macros
EDAC: Convert to new X86 CPU match macros
cpufreq: Convert to new X86 CPU match macros
ACPI: Convert to new X86 CPU match macros
x86/platform: Convert to new CPU match macros
x86/kernel: Convert to new CPU match macros
x86/kvm: Convert to new CPU match macros
x86/perf/events: Convert to new CPU match macros
...
Pull EFI updates from Ingo Molnar:
"The EFI changes in this cycle are much larger than usual, for two
(positive) reasons:
- The GRUB project is showing signs of life again, resulting in the
introduction of the generic Linux/UEFI boot protocol, instead of
x86 specific hacks which are increasingly difficult to maintain.
There's hope that all future extensions will now go through that
boot protocol.
- Preparatory work for RISC-V EFI support.
The main changes are:
- Boot time GDT handling changes
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file
I/O, memory allocation, etc.
- Introduce a generic initrd loading method based on calling back
into the firmware, instead of relying on the x86 EFI handover
protocol or device tree.
- Introduce a mixed mode boot method that does not rely on the x86
EFI handover protocol either, and could potentially be adopted by
other architectures (if another one ever surfaces where one
execution mode is a superset of another)
- Clean up the contents of 'struct efi', and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit
firmware implementations to return EFI_UNSUPPORTED from UEFI
runtime services at OS runtime, and expose a mask of which ones are
supported or unsupported via a configuration table.
- Partial fix for the lack of by-VA cache maintenance in the
decompressor on 32-bit ARM.
- Changes to load device firmware from EFI boot service memory
regions
- Various documentation updates and minor code cleanups and fixes"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
efi/libstub/arm: Fix spurious message that an initrd was loaded
efi/libstub/arm64: Avoid image_base value from efi_loaded_image
partitions/efi: Fix partition name parsing in GUID partition entry
efi/x86: Fix cast of image argument
efi/libstub/x86: Use ULONG_MAX as upper bound for all allocations
efi: Fix a mistype in comments mentioning efivar_entry_iter_begin()
efi/libstub: Avoid linking libstub/lib-ksyms.o into vmlinux
efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map()
efi/x86: Ignore the memory attributes table on i386
efi/x86: Don't relocate the kernel unless necessary
efi/x86: Remove extra headroom for setup block
efi/x86: Add kernel preferred address to PE header
efi/x86: Decompress at start of PE image load address
x86/boot/compressed/32: Save the output address instead of recalculating it
efi/libstub/x86: Deal with exit() boot service returning
x86/boot: Use unsigned comparison for addresses
efi/x86: Avoid using code32_start
efi/x86: Make efi32_pe_entry() more readable
efi/x86: Respect 32-bit ABI in efi32_pe_entry()
efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA
...
Pull objtool updates from Ingo Molnar:
"The biggest changes in this cycle were the vmlinux.o optimizations by
Peter Zijlstra, which are preparatory and optimization work to run
objtool against the much richer vmlinux.o object file, to perform
new, whole-program section based logic. That work exposed a handful
of problems with the existing code, which fixes and optimizations are
merged here. The complete 'vmlinux.o and noinstr' work is still work
in progress, targeted for v5.8.
There's also assorted fixes and enhancements from Josh Poimboeuf.
In particular I'd like to draw attention to commit 644592d328,
which turns fatal objtool errors into failed kernel builds. This
behavior is IMO now justified on multiple grounds (it's easy currently
to not notice an essentially corrupted kernel build), and the commit
has been in -next testing for several weeks, but there could still be
build failures with old or weird toolchains. Should that be widespread
or high profile enough then I'd suggest a quick revert, to not hold up
the merge window"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
objtool: Re-arrange validate_functions()
objtool: Optimize find_rela_by_dest_range()
objtool: Delete cleanup()
objtool: Optimize read_sections()
objtool: Optimize find_symbol_by_name()
objtool: Resize insn_hash
objtool: Rename find_containing_func()
objtool: Optimize find_symbol_*() and read_symbols()
objtool: Optimize find_section_by_name()
objtool: Optimize find_section_by_index()
objtool: Add a statistics mode
objtool: Optimize find_symbol_by_index()
x86/kexec: Make relocate_kernel_64.S objtool clean
x86/kexec: Use RIP relative addressing
objtool: Rename func_for_each_insn_all()
objtool: Rename func_for_each_insn()
objtool: Introduce validate_return()
objtool: Improve call destination function detection
objtool: Fix clang switch table edge case
objtool: Add relocation check for alternative sections
...
- Update the ACPICA code in the kernel to the 20200214 upstream
release including:
* Fix to re-enable the sleep button after wakeup (Anchal Agarwal).
* Fixes for mistakes in comments and typos (Bob Moore).
* ASL-ASL+ converter updates (Erik Kaneda).
* Type casting cleanups (Sven Barth).
- Clean up the intialization of the EC driver and eliminate some
dead code from it (Rafael Wysocki).
- Clean up the quirk tables in the AC and battery drivers (Hans de
Goede).
- Fix the global lock handling on x86 to ignore unspecified bit
positions in the global lock field (Jan Engelhardt).
- Add a new "tiny" driver for ACPI button devices exposed by VMs to
guest kernels to send signals directly to init (Josh Triplett).
- Add a kernel parameter to disable ACPI BGRT on x86 (Alex Hung).
- Make the ACPI PCI host bridge and fan drivers use scnprintf() to
avoid potential buffer overflows (Takashi Iwai).
- Clean up assorted pieces of code:
* Reorder "asmlinkage" to make g++ happy (Alexey Dobriyan).
* Drop unneeded variable initialization (Colin Ian King).
* Add missing __acquires/__releases annotations (Jules Irenge).
* Replace list_for_each_safe() with list_for_each_entry_safe()
(chenqiwu).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl6CCQASHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx15UQAISSZxFTq6huh9c3r0xEgddamhn7VOX+
phjRuTmPzRn2RqFt7Q/ypiy5qqRgBko7oR0UyMJeHc7YPYcJ2nrRx/6Ymg46nmac
mdIwTG3y1bH6cD/Fz8cM+9ZCtQl8iZRf36zvlY/8fNpk+Cj98et+x+wbUN8GMO9F
9anHpPKk7hHCwxSN/SnyrJGJpjKdW057sv9sYwgR65XnM35dGxExQNjqtQVFk/ih
N7TKVHUAlEE06liS0QYCeugsZsu5/GviU/1uy3qwg+Fxcxw7muHfG/impZwFhdjn
QrdnFOGz9lFXzY+ynQplW0tJtt1AvOLJzQtzGVOxurTJIgz1pEJnptvDXFWP2YBX
aESfuFt47bzi/NT1f31L3YQ3vuOJczwkS/QlDxv4TJh6rFdZFnQQNo+iIxBAlB6n
xSsADFbZ3OaAU2VcjVn6WSL7iD3znnIBZp/xQIybb+9BUoDhSXCTH7rNT7p025cR
g4KGAevlNDEVKIsZs3UHRQYpFQ+qHDM3WNiAiIEyF9cdenSXEMKrBnEYKSbV7DnI
rBYexFTvjAyVEb6qnuaQDwHHKhu5Xc0JebIXeTjByg993Y8SFLll7a5d40H71S6Z
/nG4mOa8+Qt6MqhwvkXLu/cxrXgNmnCG8W9RH0/2sQs25AMys9SESo1jsvEeCS2o
tC2xCpKl2TlU
=kQmH
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
- Update the ACPICA code in the kernel to the 20200214 upstream
release including:
* Fix to re-enable the sleep button after wakeup (Anchal
Agarwal).
* Fixes for mistakes in comments and typos (Bob Moore).
* ASL-ASL+ converter updates (Erik Kaneda).
* Type casting cleanups (Sven Barth).
- Clean up the intialization of the EC driver and eliminate some dead
code from it (Rafael Wysocki).
- Clean up the quirk tables in the AC and battery drivers (Hans de
Goede).
- Fix the global lock handling on x86 to ignore unspecified bit
positions in the global lock field (Jan Engelhardt).
- Add a new "tiny" driver for ACPI button devices exposed by VMs to
guest kernels to send signals directly to init (Josh Triplett).
- Add a kernel parameter to disable ACPI BGRT on x86 (Alex Hung).
- Make the ACPI PCI host bridge and fan drivers use scnprintf() to
avoid potential buffer overflows (Takashi Iwai).
- Clean up assorted pieces of code:
* Reorder "asmlinkage" to make g++ happy (Alexey Dobriyan).
* Drop unneeded variable initialization (Colin Ian King).
* Add missing __acquires/__releases annotations (Jules Irenge).
* Replace list_for_each_safe() with list_for_each_entry_safe()
(chenqiwu)"
* tag 'acpi-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (31 commits)
ACPICA: Update version to 20200214
ACPI: PCI: Use scnprintf() for avoiding potential buffer overflow
ACPI: fan: Use scnprintf() for avoiding potential buffer overflow
ACPI: EC: Eliminate EC_FLAGS_QUERY_HANDSHAKE
ACPI: EC: Do not clear boot_ec_is_ecdt in acpi_ec_add()
ACPI: EC: Simplify acpi_ec_ecdt_start() and acpi_ec_init()
ACPI: EC: Consolidate event handler installation code
acpi/x86: ignore unspecified bit positions in the ACPI global lock field
acpi/x86: add a kernel parameter to disable ACPI BGRT
x86/acpi: make "asmlinkage" part first thing in the function definition
ACPI: list_for_each_safe() -> list_for_each_entry_safe()
ACPI: video: remove redundant assignments to variable result
ACPI: OSL: Add missing __acquires/__releases annotations
ACPI / battery: Cleanup Lenovo Ideapad Miix 320 DMI table entry
ACPI / AC: Cleanup DMI quirk table
ACPI: EC: Use fast path in acpi_ec_add() for DSDT boot EC
ACPI: EC: Simplify acpi_ec_add()
ACPI: EC: Drop AE_NOT_FOUND special case from ec_install_handlers()
ACPI: EC: Avoid passing redundant argument to functions
ACPI: EC: Avoid printing confusing messages in acpi_ec_setup()
...
by Prarit Bhargava.
* Change dev-mcelog's hardcoded limit of 32 error records to a dynamic
one, controlled by the number of logical CPUs, by Tony Luck.
* Add support for the processor identification number (PPIN) on AMD, by
Wei Huang.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl6BseQACgkQEsHwGGHe
VUqIMg/+KtZsFOHRKZD1dc0Jyo8O0BTzqMIif5J7AzRWv6DPLzfEFBjGFmVY10gN
aovhRIF1TrUI8Em5as4FlczH8l328n1ZQhhy6YoCHcrT03LsKHXE46bcvm5msj9n
0s0uZyDei6ly4k6hnNn5NPMjlkpNKS4/A1dkT3Ir25zlS+3Agds4nj5iNzFfOE19
67bFSVw+KuEt4iihfX/uT0HtmcW5T5byDlwrxgMUC3s0EzMLIx4y+hqROzrJfIau
NI3edpD0olhfkT9vz5NyZI7hNVAUOoWfYhoxZEJlAxjC+0MRKwR2A539YGsqzgJ9
kFN5h6400xDmG5C5FUVULAEHG8O/AV+0AzMoH0c4xamalB64CJe6BehYJggFbyXB
bH9bSZKasesZUSTP+v92dOrMK2ZtJnvhU5hhEDYbtRL4ERyIb/q9/AsJfpb299HJ
JD1t4lMhURYr5qu/nck48yVnsHw0yqPju1qRDxqkbmRCkKNDi2t1ph7XUb7okSba
AekWUomTliTm83rsX/lH6OJQ1uCtM7QOp6YULr8Zjb4TJcSAfuEsbAcnulUSrxan
hreIKqC2A2RMpRVnX9IflKDHAGNWmT5Ag6tLpQ0/TfeaazxT2gdEw8YS4EU18cq6
mMiJyIKmH2nGT7Mf65A0Lg0uJXFPFrtnKfFoSlb0kDsGlx3PEic=
=3/4h
-----END PGP SIGNATURE-----
Merge tag 'ras_updates_for_5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Do not report spurious MCEs on some Intel platforms caused by errata;
by Prarit Bhargava.
- Change dev-mcelog's hardcoded limit of 32 error records to a dynamic
one, controlled by the number of logical CPUs, by Tony Luck.
- Add support for the processor identification number (PPIN) on AMD, by
Wei Huang.
* tag 'ras_updates_for_5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/amd: Add PPIN support for AMD MCE
x86/mce/dev-mcelog: Dynamically allocate space for machine check records
x86/mce: Do not log spurious corrected mce errors
In the x86 kernel, .exit.text and .exit.data sections are discarded at
runtime, not by the linker. Add RUNTIME_DISCARD_EXIT to generic DISCARDS
and define it in the x86 kernel linker script to keep them.
The sections are added before the DISCARD directive so document here
only the situation explicitly as this change doesn't have any effect on
the generated kernel. Also, other architectures like ARM64 will use it
too so generalize the approach with the RUNTIME_DISCARD_EXIT define.
[ bp: Massage and extend commit message. ]
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200326193021.255002-1-hjl.tools@gmail.com
In a context switch from a task that is detecting split locks to one that
is not (or vice versa) we need to update the TEST_CTRL MSR. Currently this
is done with the common sequence:
read the MSR
flip the bit
write the MSR
in order to avoid changing the value of any reserved bits in the MSR.
Cache unused and reserved bits of TEST_CTRL MSR with SPLIT_LOCK_DETECT bit
cleared during initialization, so we can avoid an expensive RDMSR
instruction during context switch.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Originally-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200325030924.132881-3-xiaoyao.li@intel.com
Current initialization flow of split lock detection has following issues:
1. It assumes the initial value of MSR_TEST_CTRL.SPLIT_LOCK_DETECT to be
zero. However, it's possible that BIOS/firmware has set it.
2. X86_FEATURE_SPLIT_LOCK_DETECT flag is unconditionally set even if
there is a virtualization flaw that FMS indicates the existence while
it's actually not supported.
Rework the initialization flow to solve above issues. In detail, explicitly
clear and set split_lock_detect bit to verify MSR_TEST_CTRL can be
accessed, and rdmsr after wrmsr to ensure bit is cleared/set successfully.
X86_FEATURE_SPLIT_LOCK_DETECT flag is set only when the feature does exist
and the feature is not disabled with kernel param "split_lock_detect=off"
On each processor, explicitly updating the SPLIT_LOCK_DETECT bit based on
sld_sate in split_lock_init() since BIOS/firmware may touch it.
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200325030924.132881-2-xiaoyao.li@intel.com
Fix gcc warning when -Wextra is used by moving the keyword:
arch/x86/kernel/jump_label.c:61:1: warning: ‘inline’ is not at \
beginning of declaration [-Wold-style-declaration]
static void inline __jump_label_transform(struct jump_entry *entry,
^~~~~~
Reported-by: Zzy Wysm <zzy@zzywysm.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/796d93d2-e73e-3447-44eb-4f89e1b636d9@infradead.org
Similar to ia32_setup_sigcontext() change several commits ago, make it
__always_inline. In cases when there is a user_access_{begin,end}()
section nearby, just move the call over there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Straightforward, except for save_altstack_ex() stuck in those.
Replace that thing with an analogue that would use unsafe_put_user()
instead of put_user_ex() (called compat_save_altstack()) and be done
with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Having fixed the biggest objtool issue in this file; fix up the rest
and remove the exception.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200324160924.202621656@infradead.org
Normally identity_mapped is not visible to objtool, due to:
arch/x86/kernel/Makefile:OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o := y
However, when we want to run objtool on vmlinux.o there is no hiding
it:
vmlinux.o: warning: objtool: .text+0x4c0f1: unsupported intra-function call
Replace the (i386 inspired) pattern:
call 1f
1: popq %r8
subq $(1b - relocate_kernel), %r8
With a x86_64 RIP-relative LEA:
leaq relocate_kernel(%rip), %r8
Suggested-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200324160924.143334345@infradead.org
The core device API performs extra housekeeping bits that are missing
from directly calling cpu_up/down().
See commit a6717c01dd ("powerpc/rtas: use device model APIs and
serialization during LPM") for an example description of what might go
wrong.
This also prepares to make cpu_up/down() a private interface of the CPU
subsystem.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323135110.30522-10-qais.yousef@arm.com
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.
Get rid the of the local macro wrappers for consistency.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20200320131509.250559388@linutronix.de
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.
The local wrappers have to stay as they are tailored to tame the hardware
vulnerability mess.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20200320131508.934926587@linutronix.de
Finding all places which build x86_cpu_id match tables is tedious and the
logic is hidden in lots of differently named macro wrappers.
Most of these initializer macros use plain C89 initializers which rely on
the ordering of the struct members. So new members could only be added at
the end of the struct, but that's ugly as hell and C99 initializers are
really the right thing to use.
Provide a set of macros which:
- Have a proper naming scheme, starting with X86_MATCH_
- Use C99 initializers
The set of provided macros are all subsets of the base macro
X86_MATCH_VENDOR_FAM_MODEL_FEATURE()
which allows to supply all possible selection criteria:
vendor, family, model, feature
The other macros shorten this to avoid typing all arguments when they are
not needed and would require one of the _ANY constants. They have been
created due to the requirements of the existing usage sites.
Also add a few model constants for Centaur CPUs and QUARK.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20200320131508.826011988@linutronix.de
Set paravirt_steal_rq_enabled if steal clock present.
paravirt_steal_rq_enabled is used in sched/core.c to adjust task
progress by offsetting stolen time. Use 'no-steal-acc' off switch (share
same name with KVM) to disable steal time accounting.
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323195707.31242-5-amakhalov@vmware.com
Steal time is the amount of CPU time needed by a guest virtual machine
that is not provided by the host. Steal time occurs when the host
allocates this CPU time elsewhere, for example, to another guest.
Steal time can be enabled by adding the VM configuration option
stealclock.enable = "TRUE". It is supported by VMs that run hardware
version 13 or newer.
Introduce the VMware steal time infrastructure. The high level code
(such as enabling, disabling and hot-plug routines) was derived from KVM.
[ Tomer: use READ_ONCE macros and 32bit guests support. ]
[ bp: Massage. ]
Co-developed-by: Tomer Zeltzer <tomerr90@gmail.com>
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Tomer Zeltzer <tomerr90@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323195707.31242-4-amakhalov@vmware.com
Move cyc2ns setup logic to separate function.
This separation will allow to use cyc2ns mult/shift pair
not only for the sched_clock but also for other clocks
such as steal_clock.
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323195707.31242-3-amakhalov@vmware.com
vmware_select_hypercall() is used only by the __init
functions, and should be annotated with __init as well.
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323195707.31242-2-amakhalov@vmware.com
Add a missing include in order to fix -Wmissing-prototypes warning:
arch/x86/kernel/cpu/feat_ctl.c:95:6: warning: no previous prototype for ‘init_ia32_feat_ctl’ [-Wmissing-prototypes]
95 | void init_ia32_feat_ctl(struct cpuinfo_x86 *c)
Signed-off-by: Benjamin Thiel <b.thiel@posteo.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200323105934.26597-1-b.thiel@posteo.de
Newer AMD CPUs support a feature called protected processor
identification number (PPIN). This feature can be detected via
CPUID_Fn80000008_EBX[23].
However, CPUID alone is not enough to read the processor identification
number - MSR_AMD_PPIN_CTL also needs to be configured properly. If, for
any reason, MSR_AMD_PPIN_CTL[PPIN_EN] can not be turned on, such as
disabled in BIOS, the CPU capability bit X86_FEATURE_AMD_PPIN needs to
be cleared.
When the X86_FEATURE_AMD_PPIN capability is available, the
identification number is issued together with the MCE error info in
order to keep track of the source of MCE errors.
[ bp: Massage. ]
Co-developed-by: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com>
Signed-off-by: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200321193800.3666964-1-wei.huang2@amd.com
For the 32-bit syscall interface, 64-bit arguments (loff_t) are passed via
a pair of 32-bit registers. These register pairs end up in consecutive stack
slots, which matches the C ABI for 64-bit arguments. But when accessing the
registers directly from pt_regs, the wrapper needs to manually reassemble the
64-bit value. These wrappers already exist for 32-bit compat, so make them
available to 32-bit native in preparation for enabling pt_regs-based syscalls.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-16-brgerst@gmail.com
Instead of using an array in asm-offsets to calculate the max syscall
number, calculate it when writing out the syscall headers.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313195144.164260-9-brgerst@gmail.com
request_irq() is preferred over setup_irq(). The early boot setup_irq()
invocations happen either via 'init_IRQ()' or 'time_init()', while
memory allocators are ready by 'mm_init()'.
setup_irq() was required in old kernels when allocators were not ready by
the time early interrupts were initialized.
Hence replace setup_irq() by request_irq().
[ tglx: Use a local variable and get rid of the line break. Tweak the
comment a bit ]
Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/17f85021f6877650a5b09e0212d88323e6a30fd0.1582471508.git.afzal.mohd.ma@gmail.com
While looking at an objtool UACCESS warning, it suddenly occurred to me
that it is entirely possible to have an OPTPROBE right in the middle of
an UACCESS region.
In this case we must of course clear FLAGS.AC while running the KPROBE.
Luckily the trampoline already saves/restores [ER]FLAGS, so all we need
to do is inject a CLAC. Unfortunately we cannot use ALTERNATIVE() in the
trampoline text, so we have to frob that manually.
Fixes: ca0bbc70f147 ("sched/x86_64: Don't save flags on context switch")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200305092130.GU2596@hirez.programming.kicks-ass.net
When booting x86 images in qemu, the following warning is seen randomly
if DEBUG_LOCKDEP is enabled.
WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:1119
lockdep_register_key+0xc0/0x100
static_obj() returns true if an address is between _stext and _end.
On x86, this includes the brk memory space. Problem is that this memory
block is not static on x86; its unused portions are released after init
and can be allocated. This results in the observed warning if a lockdep
object is allocated from this memory.
Solve the problem by implementing arch_is_kernel_initmem_freed() for
x86 and have it return true if an address is within the released memory
range.
The same problem was solved for s390 with commit
7a5da02de8 ("locking/lockdep: check for freed initmem in static_obj()"),
which introduced arch_is_kernel_initmem_freed().
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200131021159.9178-1-linux@roeck-us.net
Straightforward, except for compat_save_altstack_ex() stuck in those.
Replace that thing with an analogue that would use unsafe_put_user()
instead of put_user_ex() (called unsafe_compat_save_altstack()) and
be done with that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just do copyin into a local struct and be done with that - we are
on a shallow stack here.
[reworked by tglx, removing the macro horrors while we are touching that]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just do a copyin of what we want into a local variable and
be done with that. We are guaranteed to be on shallow stack
here...
Note that conditional expression for range passed to access_ok()
in mainline had been pointless all along - the only difference
between vm86plus_struct and vm86_struct is that the former has
one extra field in the end and when we get to copyin of that
field (conditional upon 'plus' argument), we use copy_from_user().
Moreover, all fields starting with ->int_revectored are copied
that way, so we only need that check (be it done by access_ok()
or by user_access_begin()) only on the beginning of the structure -
the fields that used to be covered by that get_user_try() block.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and consolidate the definition of sigframe_ia32->extramask - it's
always a 1-element array of 32bit unsigned.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... to find whether there are northbridges present on the
system. Convert the last forgotten user and therefore, unexport
amd_nb_misc_ids[] too.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Michal Kubecek <mkubecek@suse.cz>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20200316150725.925-1-bp@alien8.de
- Map EFI runtime service data as encrypted when SEV is enabled otherwise
e.g. SMBIOS data cannot be properly decoded by dmidecode.
- Remove the warning in the vector management code which triggered when a
managed interrupt affinity changed outside of a CPU hotplug
operation. The warning was correct until the recent core code change
that introduced a CPU isolation feature which needs to migrate managed
interrupts away from online CPUs under certain conditions to achieve the
isolation.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uRi8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSH9EACToDM3iADmLZnP4dookJpPWvxazCio
UclqaIUE7k2Wg/EPmE0oNTQCxqh42rTX6Ifo5WaiCJbxIFZKGMhe02BwmQffilaS
dOlxuEEeLQq3S4Ai10Mq7wcp5uVHCE/+IhaphwFrdPn/w99O0SZf/bpZMveh6xgR
Qw3vMLav9FXpWqvnDTw0Vcrcd9sEnZ/iaLrXVDFAnwZggrUqq26Ia4DqUlOaiHGC
DHESmYFlHcFqfzd6BOJXbsJqedL56Qav0n7zsIqz6B34cLyc8QOqnSn2HxzncP22
BLPVLvdLi7yqrWIoVgSefcAJq1wcE+Vl9V6mvjxMK4GieYZ91WdLKIbvqUPRZvhU
viDzZ7NCsg6TmQBD6ilvYrMNB9ds+GNl/1dZ9c854zuvnTcnKqRq9CE6djnlqaLw
AfHQQJ+kPjrnVyyPnyYBqrWgfsVJ3ueE8BEPtTfruL2CDQLrwiScwCNZ3qQmZ6Bx
r00wbx+QtATHiZ97pwR1FJr1gyuZE6q3tY3gnb5ORIY19DfkwzRprKpE+Z++3N1H
Z5Vc7A67CcQe6uCwyViJZuamNgBaXvFmbDDjt3d8N4KKnLK647WyW0XutabQppWa
Jueq9XJX2V752y81i2Gf2+/U7xGOK0C4QajRMbqiizRBHKiG1JXpi9yCrdqNldEP
ocz5HASe634nng==
=KeLM
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Map EFI runtime service data as encrypted when SEV is enabled.
Otherwise e.g. SMBIOS data cannot be properly decoded by dmidecode.
- Remove the warning in the vector management code which triggered
when a managed interrupt affinity changed outside of a CPU hotplug
operation.
The warning was correct until the recent core code change that
introduced a CPU isolation feature which needs to migrate managed
interrupts away from online CPUs under certain conditions to
achieve the isolation"
* tag 'x86-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vector: Remove warning on managed interrupt migration
x86/ioremap: Map EFI runtime services data as encrypted for SEV
- Shut down the per CPU thermal throttling poll work properly when a CPU
goes offline. The missing shutdown caused the poll work to be migrated
to a unbound worker which triggered warnings about the usage of
smp_processor_id() in preemptible context
- Fix the PPIN feature initialization which missed to enable the
functionality when PPIN_CTL was enabled but the MSR locked against
updates.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uRHUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVVGD/0WEjZoB8yhwez6u0YNFhUkjfP8JFC1
mGdWMoevyH3Tb+DQNX3cW95t2O7IxP0N6OUNnYYQ9Tlqwt6r0ptJpNnXO7CV2+Jh
5lxpw/Uv2kQv69BNDK9qPDhiIBPzZQCg/utDTVdIyG0y+XU0q/IZqXh+XedAJsVr
P3U7KC//NwTYnlpPWjDsG26GHSguV4kj+Lwi88nfh1DJ7eawb8AF4k965pLmOoF9
g13EFxv2FW1/uq+QJq5ophQIH/pPI/T67rhIyLWxFsCByBzVKjm4BBgXH4gb+QIn
OofVQcaWCpZCOq2ZTNfHWdPvJK2ziig9w+twbArb7Cb9aOgp3Oe1zbp2VD4nKu4+
0G5E2Vdv6qRrEIk5LUTqlyOIogd5xPSufaCGF/HC/qXqBxqwWD0tUvjtYyRwwy+Y
u90bo90zlMjUoDirgtZrjYe0bXuy3xJ+FxZ5OxovGRxLn4qqBqEJGrXYvB0LIlpd
3x+YeHB4T2pwC6Ya5Odi6RKhwMKpro24dDMJ9jIR1u/NwIgJ2elSO9bsw6SZ823e
/Mwns7CC/7xtjOCJXPlyj4Uw0TzwTbp1W9Kb0OqJo6q+ntvxbAhoMf32FDxg0OKC
h4trc3FZt+e2a0l8R8e3nNeAvnS0fM1P4vtg18EcX8SqlSoALJkS3XUO4WeCFLBh
F9jOt/LSf+4okQ==
=9W7B
-----END PGP SIGNATURE-----
Merge tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
"Two RAS related fixes:
- Shut down the per CPU thermal throttling poll work properly when a
CPU goes offline.
The missing shutdown caused the poll work to be migrated to a
unbound worker which triggered warnings about the usage of
smp_processor_id() in preemptible context
- Fix the PPIN feature initialization which missed to enable the
functionality when PPIN_CTL was enabled but the MSR locked against
updates"
* tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix logic and comments around MSR_PPIN_CTL
x86/mce/therm_throt: Undo thermal polling properly on CPU offline
The value in "new" is constructed from "old" such that all bits defined
as reserved by the ACPI spec[1] are left untouched. But if those bits
do not happen to be all zero, "new < 3" will not evaluate to true.
The firmware of the laptop(s) Medion MD63490 / Akoya P15648 comes with
garbage inside the "FACS" ACPI table. The starting value is
old=0x4944454d, therefore new=0x4944454e, which is >= 3. Mask off
the reserved bits.
[1] https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206553
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
BGRT is for displaying seamless OEM logo from booting to login screen;
however, this mechanism does not always work well on all configurations
and the OEM logo can be displayed multiple times. This looks worse than
without BGRT enabled.
This patch adds a kernel parameter to disable BGRT in boot time. This is
easier than re-compiling a kernel with CONFIG_ACPI_BGRT disabled.
Signed-off-by: Alex Hung <alex.hung@canonical.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
g++ insists that function declaration must start with extern "C"
(which asmlinkage expands to).
gcc doesn't care.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The vector management code assumes that managed interrupts cannot be
migrated away from an online CPU. free_moved_vector() has a WARN_ON_ONCE()
which triggers when a managed interrupt vector association on a online CPU
is cleared. The CPU offline code uses a different mechanism which cannot
trigger this.
This assumption is not longer correct because the new CPU isolation feature
which affects the placement of managed interrupts must be able to move a
managed interrupt away from an online CPU.
There are two reasons why this can happen:
1) When the interrupt is activated the affinity mask which was
established in irq_create_affinity_masks() is handed in to
the vector allocation code. This mask contains all CPUs to which
the interrupt can be made affine to, but this does not take the
CPU isolation 'managed_irq' mask into account.
When the interrupt is finally requested by the device driver then the
affinity is checked again and the CPU isolation 'managed_irq' mask is
taken into account, which moves the interrupt to a non-isolated CPU if
possible.
2) The interrupt can be affine to an isolated CPU because the
non-isolated CPUs in the calculated affinity mask are not online.
Once a non-isolated CPU which is in the mask comes online the
interrupt is migrated to this non-isolated CPU
In both cases the regular online migration mechanism is used which triggers
the WARN_ON_ONCE() in free_moved_vector().
Case #1 could have been addressed by taking the isolation mask into
account, but that would require a massive code change in the activation
logic and the eventual migration event was accepted as a reasonable
tradeoff when the isolation feature was developed. But even if #1 would be
addressed, #2 would still trigger it.
Of course the warning in free_moved_vector() was overlooked at that time
and the above two cases which have been discussed during patch review have
obviously never been tested before the final submission.
So keep it simple and remove the warning.
[ tglx: Rewrote changelog and added a comment to free_moved_vector() ]
Fixes: 11ea68f553 ("genirq, sched/isolation: Isolate from handling managed interrupts")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lkml.kernel.org/r/20200312205830.81796-1-peterx@redhat.com
Every time a new architecture defines the IMA architecture specific
functions - arch_ima_get_secureboot() and arch_ima_get_policy(), the IMA
include file needs to be updated. To avoid this "noise", this patch
defines a new IMA Kconfig IMA_SECURE_AND_OR_TRUSTED_BOOT option, allowing
the different architectures to select it.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Philipp Rudo <prudo@linux.ibm.com> (s390)
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Family 19h CPUs are Zen-based and still share most architectural
features with Family 17h CPUs, and therefore still need to call
init_amd_zn() e.g., to set the RECLAIM_DISTANCE override.
init_amd_zn() also sets X86_FEATURE_ZEN, which today is only used
in amd_set_core_ssb_state(), which isn't called on some late
model Family 17h CPUs, nor on any Family 19h CPUs:
X86_FEATURE_AMD_SSBD replaces X86_FEATURE_LS_CFG_SSBD on those
later model CPUs, where the SSBD mitigation is done via the
SPEC_CTRL MSR instead of the LS_CFG MSR.
Family 19h CPUs also don't have the erratum where the CPB feature
bit isn't set, but that code can stay unchanged and run safely
on Family 19h.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200311191451.13221-1-kim.phillips@amd.com
The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 4:
Model-Specific Registers" has the following table for the values from
freq_desc_byt:
000B: 083.3 MHz
001B: 100.0 MHz
010B: 133.3 MHz
011B: 116.7 MHz
100B: 080.0 MHz
Notice how for e.g the 83.3 MHz value there are 3 significant digits, which
translates to an accuracy of a 1000 ppm, where as a typical crystal
oscillator is 20 - 100 ppm, so the accuracy of the frequency format used in
the Software Developer’s Manual is not really helpful.
As far as we know Bay Trail SoCs use a 25 MHz crystal and Cherry Trail
uses a 19.2 MHz crystal, the crystal is the source clock for a root PLL
which outputs 1600 and 100 MHz. It is unclear if the root PLL outputs are
used directly by the CPU clock PLL or if there is another PLL in between.
This does not matter though, we can model the chain of PLLs as a single PLL
with a quotient equal to the quotients of all PLLs in the chain multiplied.
So we can create a simplified model of the CPU clock setup using a
reference clock of 100 MHz plus a quotient which gets us as close to the
frequency from the SDM as possible.
For the 83.3 MHz example from above this would give 100 MHz * 5 / 6 = 83
and 1/3 MHz, which matches exactly what has been measured on actual
hardware.
Use a simplified PLL model with a reference clock of 100 MHz for all Bay
and Cherry Trail models.
This has been tested on the following models:
CPU freq before: CPU freq after:
Intel N2840 2165.800 MHz 2166.667 MHz
Intel Z3736 1332.800 MHz 1333.333 MHz
Intel Z3775 1466.300 MHz 1466.667 MHz
Intel Z8350 1440.000 MHz 1440.000 MHz
Intel Z8750 1600.000 MHz 1600.000 MHz
This fixes the time drifting by about 1 second per hour (20 - 30 seconds
per day) on (some) devices which rely on the tsc_msr.c code to determine
the TSC frequency.
Reported-by: Vipul Kumar <vipulk0511@gmail.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200223140610.59612-3-hdegoede@redhat.com
According to the "Intel 64 and IA-32 Architectures Software Developer's
Manual Volume 4: Model-Specific Registers" on Cherry Trail (Airmont)
devices the 4 lowest bits of the MSR_FSB_FREQ mask indicate the bus freq
unlike on e.g. Bay Trail where only the lowest 3 bits are used.
This is also the reason why MAX_NUM_FREQS is defined as 9, since Cherry
Trail SoCs have 9 possible frequencies, so the lo value from the MSR needs
to be masked with 0x0f, not with 0x07 otherwise the 9th frequency will get
interpreted as the 1st.
Bump MAX_NUM_FREQS to 16 to avoid any possibility of addressing the array
out of bounds and makes the mask part of the cpufreq struct so it can be
set it per model.
While at it also log an error when the index points to an uninitialized
part of the freqs lookup-table.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200223140610.59612-2-hdegoede@redhat.com
Use named struct initializers for the freq_desc struct-s initialization
and change the "u8 msr_plat" to a "bool use_msr_plat" to make its meaning
more clear instead of relying on a comment to explain it.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200223140610.59612-1-hdegoede@redhat.com
We have had a hard coded limit of 32 machine check records since the
dawn of time. But as numbers of cores increase, it is possible for
more than 32 errors to be reported before a user process reads from
/dev/mcelog. In this case the additional errors are lost.
Keep 32 as the minimum. But tune the maximum value up based on the
number of processors.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200218184408.GA23048@agluck-desk2.amr.corp.intel.com
Sathyanarayanan reported that the PCI-E AER error injection mechanism
can result in a NULL pointer dereference in apic_ack_edge():
BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
RIP: 0010:apic_ack_edge+0x1e/0x40
Call Trace:
handle_edge_irq+0x7d/0x1e0
generic_handle_irq+0x27/0x30
aer_inject_write+0x53a/0x720
It crashes in irq_complete_move() which dereferences get_irq_regs() which
is obviously NULL when this is called from non interrupt context.
Of course the pointer could be checked, but that just papers over the real
issue. Invoking the low level interrupt handling mechanism from random code
can wreckage the fragile interrupt affinity mechanism of x86 as interrupts
can only be moved in interrupt context or with special care when a CPU goes
offline and the move has to be enforced.
In the best case this triggers the warning in the MSI affinity setter, but
if the call happens on the correct CPU it just corrupts state and might
prevent further interrupt delivery for the affected device.
Mark the APIC interrupts as unsuitable for being invoked in random contexts.
This prevents the AER injection from proliferating the wreckage, but that's
less broken than the current state of affairs and more correct than just
papering over the problem by sprinkling random checks all over the place
and silently corrupting state.
Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200306130623.684591280@linutronix.de
code32_start is meant for 16-bit real-mode bootloaders to inform the
kernel where the 32-bit protected mode code starts. Nothing in the
protected mode kernel except the EFI stub uses it.
efi_main() currently returns boot_params, with code32_start set inside it
to tell efi_stub_entry() where startup_32 is located. Since it was invoked
by efi_stub_entry() in the first place, boot_params is already known.
Return the address of startup_32 instead.
This will allow a 64-bit kernel to live above 4Gb, for example, and it's
cleaner as well.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200301230436.2246909-5-nivedita@alum.mit.edu
Link: https://lore.kernel.org/r/20200308080859.21568-13-ardb@kernel.org
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a pkeys fix for a bug that triggers with weird BIOS
settings, and two Xen PV fixes: a paravirt interface fix, and
pagetable dumping fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix dump_pagetables with Xen PV
x86/ioperm: Add new paravirt function update_io_bitmap()
x86/pkeys: Manually set X86_FEATURE_OSPKE to preserve existing changes
as too large frame sizes on some configurations. On the
ARM side, the compiler was messing up shadow stacks between
EL1 and EL2 code, which is easily fixed with __always_inline.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJeXAT4AAoJEL/70l94x66DWywH/1kv4MmeGo6PI0Nxk/yvA7X8
78iqIBchtxZX0v/9kqpTB7bYmHyTgmZHM+IkwtIUANDSaOvWqJwU+TLUfduOiuXF
NxBHcZDyuMoftX5CSQ+bJ5PwxKijAdJsIkCZ13CnsTCkwcfamSGypFUCK8LacPeq
WHvV5Ws5pFc51xrP3CH1DrRhLoulaBmt5xxqK9fxWtslrlsnm1uNza5vs8As8CzM
apnmdRIf5p4v91Zic3PFH7/GXES0m1tjIBKdtZ4YHb8yrXV/kBsEVhhTjqE9mrUq
qtRRl5waOFoP4yc9ey52PAbMm1x1Ho/pyunpM0xh40Yq8OPFwqXBPTnWfobSoiM=
=LNQc
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"More bugfixes, including a few remaining "make W=1" issues such as too
large frame sizes on some configurations.
On the ARM side, the compiler was messing up shadow stacks between EL1
and EL2 code, which is easily fixed with __always_inline"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: check descriptor table exits on instruction emulation
kvm: x86: Limit the number of "kvm: disabled by bios" messages
KVM: x86: avoid useless copy of cpufreq policy
KVM: allow disabling -Werror
KVM: x86: allow compiling as non-module with W=1
KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipis
KVM: Introduce pv check helpers
KVM: let declaration of kvm_get_running_vcpus match implementation
KVM: SVM: allocate AVIC data structures based on kvm_amd module parameter
arm64: Ask the compiler to __always_inline functions used by KVM at HYP
KVM: arm64: Define our own swab32() to avoid a uapi static inline
KVM: arm64: Ask the compiler to __always_inline functions used at HYP
kvm: arm/arm64: Fold VHE entry/exit work into kvm_vcpu_run_vhe()
KVM: arm/arm64: Fix up includes for trace.h
Commit 111e7b15cf ("x86/ioperm: Extend IOPL config to control ioperm()
as well") reworked the iopl syscall to use I/O bitmaps.
Unfortunately this broke Xen PV domains using that syscall as there is
currently no I/O bitmap support in PV domains.
Add I/O bitmap support via a new paravirt function update_io_bitmap which
Xen PV domains can use to update their I/O bitmaps via a hypercall.
Fixes: 111e7b15cf ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Cc: <stable@vger.kernel.org> # 5.5
Link: https://lkml.kernel.org/r/20200218154712.25490-1-jgross@suse.com
Nick Desaulniers Reported:
When building with:
$ make CC=clang arch/x86/ CFLAGS=-Wframe-larger-than=1000
The following warning is observed:
arch/x86/kernel/kvm.c:494:13: warning: stack frame size of 1064 bytes in
function 'kvm_send_ipi_mask_allbutself' [-Wframe-larger-than=]
static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int
vector)
^
Debugging with:
https://github.com/ClangBuiltLinux/frame-larger-than
via:
$ python3 frame_larger_than.py arch/x86/kernel/kvm.o \
kvm_send_ipi_mask_allbutself
points to the stack allocated `struct cpumask newmask` in
`kvm_send_ipi_mask_allbutself`. The size of a `struct cpumask` is
potentially large, as it's CONFIG_NR_CPUS divided by BITS_PER_LONG for
the target architecture. CONFIG_NR_CPUS for X86_64 can be as high as
8192, making a single instance of a `struct cpumask` 1024 B.
This patch fixes it by pre-allocate 1 cpumask variable per cpu and use it for
both pv tlb and pv ipis..
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce some pv check helpers for consistency.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two implemented bits in the PPIN_CTL MSR:
Bit 0: LockOut (R/WO)
Set 1 to prevent further writes to MSR_PPIN_CTL.
Bit 1: Enable_PPIN (R/W)
If 1, enables MSR_PPIN to be accessible using RDMSR.
If 0, an attempt to read MSR_PPIN will cause #GP.
So there are four defined values:
0: PPIN is disabled, PPIN_CTL may be updated
1: PPIN is disabled. PPIN_CTL is locked against updates
2: PPIN is enabled. PPIN_CTL may be updated
3: PPIN is enabled. PPIN_CTL is locked against updates
Code would only enable the X86_FEATURE_INTEL_PPIN feature for case "2".
When it should have done so for both case "2" and case "3".
Fix the final test to just check for the enable bit. Also fix some of
the other comments in this function.
Fixes: 3f5a7896a5 ("x86/mce: Include the PPIN in MCE records when available")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling
get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE.
Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap()
changes that were made between this_cpu->c_init() and setup_pku(), as
all non-synthetic feature words are reinitialized from the CPU's CPUID
values.
Blasting away capability updates manifests most visibility when running
on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that
VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using
clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report
which CPU is misconfigured (KVM needs to probe every CPU anyways).
Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled,
ultimately leading to an unexpected #GP when KVM attempts to do VMXON.
Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let
KVM figure out a different way to report the misconfigured CPU, but VMX
is not the only feature bit that is affected, i.e. there is precedent
that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init()
is expected to work. Most notably, x86_init_rdrand()'s clearing of
X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten.
Fixes: 0697694564 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU")
Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com
#BP is not longer using IST and using ist_enter() and ist_exit() makes it
harder to change ist_enter() and ist_exit()'s behavior. Instead open-code
the very small amount of required logic.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220217.150607679@linutronix.de
That function returns immediately after conditionally reenabling interrupts which
is more than pointless and requires the ASM code to disable interrupts again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20191023123117.871608831@linutronix.de
Link: https://lkml.kernel.org/r/20200225220216.518575042@linutronix.de
Remove the pointless difference between 32 and 64 bit to make further
unifications simpler.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.428188397@linutronix.de
do_machine_check() can be raised in almost any context including the most
fragile ones. Prevent kprobes and tracing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.315548935@linutronix.de
This time, the set of changes for the EFI subsystem is much larger than
usual. The main reasons are:
- Get things cleaned up before EFI support for RISC-V arrives, which will
increase the size of the validation matrix, and therefore the threshold to
making drastic changes,
- After years of defunct maintainership, the GRUB project has finally started
to consider changes from the distros regarding UEFI boot, some of which are
highly specific to the way x86 does UEFI secure boot and measured boot,
based on knowledge of both shim internals and the layout of bootparams and
the x86 setup header. Having this maintenance burden on other architectures
(which don't need shim in the first place) is hard to justify, so instead,
we are introducing a generic Linux/UEFI boot protocol.
Summary of changes:
- Boot time GDT handling changes (Arvind)
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file I/O,
memory allocation, etc.
- Introduce a generic initrd loading method based on calling back into
the firmware, instead of relying on the x86 EFI handover protocol or
device tree.
- Introduce a mixed mode boot method that does not rely on the x86 EFI
handover protocol either, and could potentially be adopted by other
architectures (if another one ever surfaces where one execution mode
is a superset of another)
- Clean up the contents of struct efi, and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit firmware
implementations to return EFI_UNSUPPORTED from UEFI runtime services at
OS runtime, and expose a mask of which ones are supported or unsupported
via a configuration table.
- Various documentation updates and minor code cleanups (Heinrich)
- Partial fix for the lack of by-VA cache maintenance in the decompressor
on 32-bit ARM. Note that these patches were deliberately put at the
beginning so they can be used as a stable branch that will be shared with
a PR containing the complete fix, which I will send to the ARM tree.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAl5S7WYACgkQwjcgfpV0
+n1jmQgAmwV3V8pbgB4mi4P2Mv8w5Zj5feUe6uXnTR2AFv5nygLcTzdxN+TU/6lc
OmZv2zzdsAscYlhuUdI/4t4cXIjHAZI39+UBoNRuMqKbkbvXCFscZANLxvJjHjZv
FFbgUk0DXkF0BCEDuSLNavidAv4b3gZsOmHbPfwuB8xdP05LbvbS2mf+2tWVAi2z
ULfua/0o9yiwl19jSS6iQEPCvvLBeBzTLW7x5Rcm/S0JnotzB59yMaeqD7jO8JYP
5PvI4WM/l5UfVHnZp2k1R76AOjReALw8dQgqAsT79Q7+EH3sNNuIjU6omdy+DFf4
tnpwYfeLOaZ1SztNNfU87Hsgnn2CGw==
=pDE3
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
Pull EFI updates for v5.7 from Ard Biesheuvel:
This time, the set of changes for the EFI subsystem is much larger than
usual. The main reasons are:
- Get things cleaned up before EFI support for RISC-V arrives, which will
increase the size of the validation matrix, and therefore the threshold to
making drastic changes,
- After years of defunct maintainership, the GRUB project has finally started
to consider changes from the distros regarding UEFI boot, some of which are
highly specific to the way x86 does UEFI secure boot and measured boot,
based on knowledge of both shim internals and the layout of bootparams and
the x86 setup header. Having this maintenance burden on other architectures
(which don't need shim in the first place) is hard to justify, so instead,
we are introducing a generic Linux/UEFI boot protocol.
Summary of changes:
- Boot time GDT handling changes (Arvind)
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file I/O,
memory allocation, etc.
- Introduce a generic initrd loading method based on calling back into
the firmware, instead of relying on the x86 EFI handover protocol or
device tree.
- Introduce a mixed mode boot method that does not rely on the x86 EFI
handover protocol either, and could potentially be adopted by other
architectures (if another one ever surfaces where one execution mode
is a superset of another)
- Clean up the contents of struct efi, and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit firmware
implementations to return EFI_UNSUPPORTED from UEFI runtime services at
OS runtime, and expose a mask of which ones are supported or unsupported
via a configuration table.
- Various documentation updates and minor code cleanups (Heinrich)
- Partial fix for the lack of by-VA cache maintenance in the decompressor
on 32-bit ARM. Note that these patches were deliberately put at the
beginning so they can be used as a stable branch that will be shared with
a PR containing the complete fix, which I will send to the ARM tree.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Chris Wilson reported splats from running the thermal throttling
workqueue callback on offlined CPUs. The problem is that that callback
should not even run on offlined CPUs but it happens nevertheless because
the offlining callback thermal_throttle_offline() does not symmetrically
undo the setup work done in its onlining counterpart. IOW,
1. The thermal interrupt vector should be masked out before ...
2. ... cancelling any pending work synchronously so that no new work is
enqueued anymore.
Do those things and fix the issue properly.
[ bp: Write commit message. ]
Fixes: f6656208f0 ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/158120068234.18291.7938335950259651295@skylake-alporthouse-com
Now that .eh_frame sections for the files in setup.elf and realmode.elf
are not generated anymore, the linker scripts don't need the special
output section name /DISCARD/ any more.
Remove the one in the main kernel linker script as well, since there are
no .eh_frame sections already, and fix up a comment referencing .eh_frame.
Update the comment in asm/dwarf2.h referring to .eh_frame so it continues
to make sense, as well as being more specific.
[ bp: Touch up commit message. ]
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://lkml.kernel.org/r/20200224232129.597160-3-nivedita@alum.mit.edu
Alex Shi reported the pkey macros above arch_set_user_pkey_access()
to be unused. They are unused, and even refer to a nonexistent
CONFIG option.
But, they might have served a good use, which was to ensure that
the code does not try to set values that would not fit in the
PKRU register. As it stands, a too-large 'pkey' value would
be likely to silently overflow the u32 new_pkru_bits.
Add a check to look for overflows. Also add a comment to remind
any future developer to closely examine the types used to store
pkey values if arch_max_pkey() ever changes.
This boots and passes the x86 pkey selftests.
Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200122165346.AD4DA150@viggo.jf.intel.com
The e820 table for the kexec kernel unconditionally marks setup_data as
reserved because the second kernel can reuse setup_data passed by the
1st kernel's boot loader, for example SETUP_PCI marked regions like PCI
BIOS, etc.
SETUP_EFI types, however, are used by kexec itself to enable EFI in the
2nd kernel. Thus, it is pointless to add this type of setup_data to the
kexec e820 table as reserved.
IOW, what happens is this:
- 1st physical boot: no SETUP_EFI.
- kexec loads a new kernel and prepares a SETUP_EFI setup_data blob, then
reboots the machine.
- 2nd kernel sees SETUP_EFI, reserves it both in the e820 and in the
kexec e820 table.
- If another kexec load is executed, it prepares a new SETUP_EFI blob and
then reboots the machine into the new kernel.
5. The 3rd kexec-ed kernel has two SETUP_EFI ranges reserved. And so on...
Thus skip SETUP_EFI while reserving setup_data in the e820_table_kexec
table because it is not needed.
[ bp: Heavily massage commit message, shorten line and improve comment. ]
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200212110424.GA2938@dhcp-128-65.nay.redhat.com
Replace the EFI runtime services check with one that tells us whether
EFI GetVariable() is implemented by the firmware.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Instead of going through the EFI system table each time, just copy the
runtime services table pointer into struct efi directly. This is the
last use of the system table pointer in struct efi, allowing us to
drop it in a future patch, along with a fair amount of quirky handling
of the translated address.
Note that usually, the runtime services pointer changes value during
the call to SetVirtualAddressMap(), so grab the updated value as soon
as that call returns. (Mixed mode uses a 1:1 mapping, and kexec boot
enters with the updated address in the system table, so in those cases,
we don't need to do anything here)
Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
There is some code that exposes physical addresses of certain parts of
the EFI firmware implementation via sysfs nodes. These nodes are only
used on x86, and are of dubious value to begin with, so let's move
their handling into the x86 arch code.
Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Since commit 33b85447fa ("efi/x86: Drop two near identical versions
of efi_runtime_init()"), we no longer map the EFI runtime services table
before calling SetVirtualAddressMap(), which means we don't need the 1:1
mapped physical address of this table, and so there is no point in passing
the address via EFI setup data on kexec boot.
Note that the kexec tools will still look for this address in sysfs, so
we still need to provide it.
Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
- Populate the per CPU MCA bank descriptor pointer only after it has been
completely set up to prevent a use-after-free in case that one of the
subsequent initialization step fails
- Implement a proper release function for the sysfs entries of MCA
threshold controls instead of freeing the memory right in the CPU
teardown code, which leads to another use-after-free when the
associated sysfs file is opened and accessed.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5RkwATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoY6ED/4oafN9DmeY18oUv1QpoMQQMa6iduz6
udemhpjqVXO7R1Ste4ccM/fIQ8sMjf58isuxUClcDwrX3fxv+lusK2iVPEw99Vpo
w1xSjdXNW0KSSiQko9oMHVu+xcXIt8vxpL4YyjEuR81rcoecFaq2c2KhLMAW4o0p
3mEv7/QYPfpKc4ydcbcHo2JF6U1sfUpsWpoe/SxpXRpxeoy64baCWZGbcsUXqjB6
3MRxxy+ypKKKPPUM1py4D/ViDXwkhhP+gMD4ljWXCprpul/KuXAMEgvW39MtVsBJ
uMF3PMXqjKx+WY492tpxtdZjWej+X13ID/cTc2w1EBHz30Qxmc6RieTKi6FzsJYB
PKsTWdGarzORioaBg51Riq27C3+fjHbe6WqkhIQzmenSIwiV1o6o4IyuOs5sdlxX
rjIk/ssNeAxRpCy308i6Vaq98PBZqAY1/iUZN50vAzldH3bwKxobowjn+AYStA0c
9BF5zw7/3oXB4WaByuBwJ3DzWjqiXM4EUPu7LYF9DVSvj+A2xOmhwN+uz3SK6hBk
vkxiFE50Lo2qoDaATJozY8+nxgUKRNiDdz+udhVsoQxNKWUMxirsH18TFu8yBl2r
HGKsfCBY4CnV64WRy5IKQsqt3EhAgAUUoD0jSy7P3xf4HwSKAn/9OZ1cWQAo1wzQ
xnXUtRDFc7ScHg==
=2f84
-----END PGP SIGNATURE-----
Merge tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
"Two fixes for the AMD MCE driver:
- Populate the per CPU MCA bank descriptor pointer only after it has
been completely set up to prevent a use-after-free in case that one
of the subsequent initialization step fails
- Implement a proper release function for the sysfs entries of MCA
threshold controls instead of freeing the memory right in the CPU
teardown code, which leads to another use-after-free when the
associated sysfs file is opened and accessed"
* tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/amd: Fix kobject lifetime
x86/mce/amd: Publish the bank pointer only after setup has succeeded
- Remove the __force_oder definiton from the kaslr boot code as it is
already defined in the page table code which makes GCC 10 builds fail
because it changed the default to -fno-common.
- Address the AMD erratum 1054 concerning the IRPERF capability and
enable the Instructions Retired fixed counter on machines which are not
affected by the erratum.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5RlXkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoV6vEACsB5d8TC+OYYn1UsRJZszQ4ItRoT2Q
t++G1RjY+hIiEVb4BufhWi3DBsS2XETwO7LIma8tj+Vt/hhAs+3PyBiumFIz3HEN
pJzPR7CszD6EiO0qRw5Mrj2n+EC8I1Ts/hKuzir0kQr0h+jxg3OAOWMnUNfXiqS2
mGh3baMeNYvLvI/MUDBcFP0ZDMcgsYPb3qt4Qodg9bS31+d7xlTPwK6Lua5R8eih
ZTaVOR2JMYXIYDQA5eAqB2P/GiFBDERQHrJUQ44mY9A14w3T7qjthfMiCAvWlVd7
+ibxYA3/xujQumhyCFXmdxYEzyVzp8kLSlF7ERGVCdDZ20ZV/FA/c6uyUiW+tmUi
NR915G8632qKF7TXRPITZaWl8rC0KEcm5W+K0uf8ThJKUdq5vigXURLV9t9udeKY
HqQtyuNtesmKycF9oXG6OfFeKuveZR6XSlhLK2fMs/mxa9yyvyRyXNmwwATgTSI4
RPwrpAB52snexARBR/kZ9p/kgB47FceVYOYuQMvcp/n+1KXNesmAIeT29vNSzYUK
vL0M5XVBsz9pvTkQlhxW36sO8uLZG6SPZ+e0ypDt9YDz+YTXbBM91buxpYE9xk36
2j0aPrexC6FwCEny9uEckHRuLUip2mpld4QOxH8j3itYme3LfPa1poajAoKBwFkg
gu42lzbWqVArZA==
=Pgxk
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Remove the __force_oder definiton from the kaslr boot code as it is
already defined in the page table code which makes GCC 10 builds
fail because it changed the default to -fno-common.
- Address the AMD erratum 1054 concerning the IRPERF capability and
enable the Instructions Retired fixed counter on machines which are
not affected by the erratum"
* tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF
x86/boot/compressed: Don't declare __force_order in kaslr_64.c
A split-lock occurs when an atomic instruction operates on data that spans
two cache lines. In order to maintain atomicity the core takes a global bus
lock.
This is typically >1000 cycles slower than an atomic operation within a
cache line. It also disrupts performance on other cores (which must wait
for the bus lock to be released before their memory operations can
complete). For real-time systems this may mean missing deadlines. For other
systems it may just be very annoying.
Some CPUs have the capability to raise an #AC trap when a split lock is
attempted.
Provide a command line option to give the user choices on how to handle
this:
split_lock_detect=
off - not enabled (no traps for split locks)
warn - warn once when an application does a
split lock, but allow it to continue
running.
fatal - Send SIGBUS to applications that cause split lock
On systems that support split lock detection the default is "warn". Note
that if the kernel hits a split lock in any mode other than "off" it will
OOPs.
One implementation wrinkle is that the MSR to control the split lock
detection is per-core, not per thread. This might result in some short
lived races on HT systems in "warn" mode if Linux tries to enable on one
thread while disabling on the other. Race analysis by Sean Christopherson:
- Toggling of split-lock is only done in "warn" mode. Worst case
scenario of a race is that a misbehaving task will generate multiple
#AC exceptions on the same instruction. And this race will only occur
if both siblings are running tasks that generate split-lock #ACs, e.g.
a race where sibling threads are writing different values will only
occur if CPUx is disabling split-lock after an #AC and CPUy is
re-enabling split-lock after *its* previous task generated an #AC.
- Transitioning between off/warn/fatal modes at runtime isn't supported
and disabling is tracked per task, so hardware will always reach a steady
state that matches the configured mode. I.e. split-lock is guaranteed to
be enabled in hardware once all _TIF_SLD threads have been scheduled out.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
Commit
aaf248848d ("perf/x86/msr: Add AMD IRPERF (Instructions Retired)
performance counter")
added support for access to the free-running counter via 'perf -e
msr/irperf/', but when exercised, it always returns a 0 count:
BEFORE:
$ perf stat -e instructions,msr/irperf/ true
Performance counter stats for 'true':
624,833 instructions
0 msr/irperf/
Simply set its enable bit - HWCR bit 30 - to make it start counting.
Enablement is restricted to all machines advertising IRPERF capability,
except those susceptible to an erratum that makes the IRPERF return
bad values.
That erratum occurs in Family 17h models 00-1fh [1], but not in F17h
models 20h and above [2].
AFTER (on a family 17h model 31h machine):
$ perf stat -e instructions,msr/irperf/ true
Performance counter stats for 'true':
621,690 instructions
622,490 msr/irperf/
[1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors
[2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors
The revision guides are available from the bugzilla Link below.
[ bp: Massage commit message. ]
Fixes: aaf248848d ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
A user has reported that they are seeing spurious corrected errors on
their hardware.
Intel Errata HSD131, HSM142, HSW131, and BDM48 report that "spurious
corrected errors may be logged in the IA32_MC0_STATUS register with
the valid field (bit 63) set, the uncorrected error field (bit 61) not
set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA
Error Code (bits [15:0]) of 0x0005." The Errata PDFs are linked in the
bugzilla below.
Block these spurious errors from the console and logs.
[ bp: Move the intel_filter_mce() header declarations into the already
existing CONFIG_X86_MCE_INTEL ifdeffery. ]
Co-developed-by: Alexander Krupp <centos@akr.yagii.de>
Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206587
Link: https://lkml.kernel.org/r/20200219131611.36816-1-prarit@redhat.com
All architectures which use the generic VDSO code have their own storage
for the VDSO clock mode. That's pointless and just requires duplicate code.
X86 abuses the function which retrieves the architecture specific clock
mode storage to mark the clocksource as used in the VDSO. That's silly
because this is invoked on every tick when the VDSO data is updated.
Move this functionality to the clocksource::enable() callback so it gets
invoked once when the clocksource is installed. This allows to make the
clock mode storage generic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> (Hyper-V parts)
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> (VDSO parts)
Acked-by: Juergen Gross <jgross@suse.com> (Xen parts)
Link: https://lkml.kernel.org/r/20200207124402.934519777@linutronix.de
Fix a couple of typos in code comments.
[ bp: While at it: s/IRQ's/IRQs/. ]
Signed-off-by: Martin Molnar <martin.molnar.programming@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/0819a044-c360-44a4-f0b6-3f5bafe2d35c@gmail.com
Accessing the MCA thresholding controls in sysfs concurrently with CPU
hotplug can lead to a couple of KASAN-reported issues:
BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
Read of size 8 at addr ffff888367578940 by task grep/4019
and
BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
Read of size 2 at addr ffff888368a05514 by task grep/4454
for example. Both result from the fact that the threshold block
creation/teardown code frees the descriptor memory itself instead of
defining proper ->release function and leaving it to the driver core to
take care of that, after all sysfs accesses have completed.
Do that and get rid of the custom freeing code, fixing the above UAFs in
the process.
[ bp: write commit message. ]
Fixes: 9526866439 ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
threshold_create_bank() creates a bank descriptor per MCA error
thresholding counter which can be controlled over sysfs. It publishes
the pointer to that bank in a per-CPU variable and then goes on to
create additional thresholding blocks if the bank has such.
However, that creation of additional blocks in
allocate_threshold_blocks() can fail, leading to a use-after-free
through the per-CPU pointer.
Therefore, publish that pointer only after all blocks have been setup
successfully.
Fixes: 019f34fccf ("x86, MCE, AMD: Move shared bank to node descriptor")
Reported-by: Saar Amar <Saar.Amar@microsoft.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
An XSAVES component's alignment/offset is meaningful only when the
feature is enabled. Return zero and WARN_ONCE on checking alignment of
disabled features.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200109211452.27369-4-yu-cheng.yu@intel.com
In setup_xstate_comp(), each XSAVES component offset starts from the
end of its preceding component plus alignment. A disabled feature does
not take space and its offset should be set to the end of its preceding
one with no alignment. However, in this case, alignment is incorrectly
added to the offset, which can cause the next component to have a wrong
offset.
This problem has not been visible because currently there is no xfeature
requiring alignment.
Fix it by tracking the next starting offset only from enabled
xfeatures. To make it clear, also change the function name to
setup_xstate_comp_offsets().
[ bp: Fix a typo in the comment above it, while at it. ]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200109211452.27369-3-yu-cheng.yu@intel.com
The IMA arch code attempts to inspect the "SetupMode" EFI variable
by populating a variable called efi_SetupMode_name with the string
"SecureBoot" and passing that to the EFI GetVariable service, which
obviously does not yield the expected result.
Given that the string is only referenced a single time, let's get
rid of the intermediate variable, and pass the correct string as
an immediate argument. While at it, do the same for "SecureBoot".
Fixes: 399574c64e ("x86/ima: retry detecting secure boot mode")
Fixes: 980ef4d22a ("x86/ima: check EFI SetupMode too")
Cc: Matthew Garrett <mjg59@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Cc: stable@vger.kernel.org # v5.3
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
The function setup_xstate_features() uses CPUID to find each xfeature's
standard-format offset and size. Since XSAVES always uses the compacted
format, supervisor xstates are *NEVER* in the standard-format and their
offsets are left as -1's. However, they are still being tracked as
last_good_offset.
Fix it by tracking only user xstate offsets.
[ bp: Use xfeature_is_supervisor() and save an indentation level. Drop
now unused xfeature_is_user(). ]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200109211452.27369-2-yu-cheng.yu@intel.com
- Ensure that the PIT is set up when the local APIC is disable or
configured in legacy mode. This is caused by an ordering issue
introduced in the recent changes which skip PIT initialization when the
TSC and APIC frequencies are already known.
- Handle malformed SRAT tables during early ACPI parsing which caused an
infinite loop anda boot hang.
- Fix a long standing race in the affinity setting code which affects PCI
devices with non-maskable MSI interrupts. The problem is caused by the
non-atomic writes of the MSI address (destination APIC id) and data
(vector) fields which the device uses to construct the MSI message. The
non-atomic writes are mandated by PCI.
If both fields change and the device raises an interrupt after writing
address and before writing data, then the MSI block constructs a
inconsistent message which causes interrupts to be lost and subsequent
malfunction of the device.
The fix is to redirect the interrupt to the new vector on the current
CPU first and then switch it over to the new target CPU. This allows to
observe an eventually raised interrupt in the transitional stage (old
CPU, new vector) to be observed in the APIC IRR and retriggered on the
new target CPU and the new vector. The potential spurious interrupts
caused by this are harmless and can in the worst case expose a buggy
driver (all handlers have to be able to deal with spurious interrupts as
they can and do happen for various reasons).
- Add the missing suspend/resume mechanism for the HYPERV hypercall page
which prevents resume hibernation on HYPERV guests. This change got
lost before the merge window.
- Mask the IOAPIC before disabling the local APIC to prevent potentially
stale IOAPIC remote IRR bits which cause stale interrupt lines after
resume.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5AEJwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWY2D/47ur9gsVQGryKzneVAr0SCsq4Un11e
uifX4ldu4gCEBRTYhpgcpiFKeLvY/QJ6uOD+gQUHyy/s+lCf6yzE6UhXEqSCtcT7
LkSxD8jAFf6KhMA6iqYBfyxUsPMXBetLjjHWsyc/kf15O/vbYm7qf05timmNZkDS
S7C+yr3KRqRjLR7G7t4twlgC9aLcNUQihUdsH2qyTvjnlkYHJLDa0/Js7bFYYKVx
9GdUDLvPFB1mZ76g012De4R3kJsWitiyLlQ38DP5VysKulnszUCdiXlgCEFrgxvQ
OQhLafQzOAzvxQmP+1alODR0dmJZA8k0zsDeeTB/vTpRvv6+Pe2qUswLSpauBzuq
TpDsrv8/5pwZh28+91f/Unk+tH8NaVNtGe/Uf+ePxIkn1nbqL84o4NHGplM6R97d
HAWdZQZ1cGRLf6YRRJ+57oM/5xE3vBbF1Wn0+QDTFwdsk2vcxuQ4eB3M/8E1V7Zk
upp8ty50bZ5+rxQ8XTq/eb8epSRnfLoBYpi4ux6MIOWRdmKDl40cDeZCzA2kNP7m
qY1haaRN3ksqvhzc0Yf6cL+CgvC4ur8gRHezfOqmBzVoaLyVEFIVjgjR/ojf0bq8
/v+L9D5+IdIv4jEZruRRs0gOXNDzoBbvf0qKGaO0tUTWiDsv7c5AGixp8aozniHS
HXsv1lIpRuC7WQ==
=WxKD
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for X86:
- Ensure that the PIT is set up when the local APIC is disable or
configured in legacy mode. This is caused by an ordering issue
introduced in the recent changes which skip PIT initialization when
the TSC and APIC frequencies are already known.
- Handle malformed SRAT tables during early ACPI parsing which caused
an infinite loop anda boot hang.
- Fix a long standing race in the affinity setting code which affects
PCI devices with non-maskable MSI interrupts. The problem is caused
by the non-atomic writes of the MSI address (destination APIC id)
and data (vector) fields which the device uses to construct the MSI
message. The non-atomic writes are mandated by PCI.
If both fields change and the device raises an interrupt after
writing address and before writing data, then the MSI block
constructs a inconsistent message which causes interrupts to be
lost and subsequent malfunction of the device.
The fix is to redirect the interrupt to the new vector on the
current CPU first and then switch it over to the new target CPU.
This allows to observe an eventually raised interrupt in the
transitional stage (old CPU, new vector) to be observed in the APIC
IRR and retriggered on the new target CPU and the new vector.
The potential spurious interrupts caused by this are harmless and
can in the worst case expose a buggy driver (all handlers have to
be able to deal with spurious interrupts as they can and do happen
for various reasons).
- Add the missing suspend/resume mechanism for the HYPERV hypercall
page which prevents resume hibernation on HYPERV guests. This
change got lost before the merge window.
- Mask the IOAPIC before disabling the local APIC to prevent
potentially stale IOAPIC remote IRR bits which cause stale
interrupt lines after resume"
* tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Mask IOAPIC entries when disabling the local APIC
x86/hyperv: Suspend/resume the hypercall page for hibernation
x86/apic/msi: Plug non-maskable MSI affinity race
x86/boot: Handle malformed SRAT tables during early ACPI parsing
x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.
New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.
And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.
Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"
* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
Unused now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When a system suspends, the local APIC is disabled in the suspend sequence,
but the IOAPIC is left in the current state. This means unmasked interrupt
lines stay unmasked. This is usually the case for IOAPIC pin 9 to which the
ACPI interrupt is connected.
That means that in suspended state the IOAPIC can respond to an external
interrupt, e.g. the wakeup via keyboard/RTC/ACPI, but the interrupt message
cannot be handled by the disabled local APIC. As a consequence the Remote
IRR bit is set, but the local APIC does not send an EOI to acknowledge
it. This causes the affected interrupt line to become stale and the stale
Remote IRR bit will cause a hang when __synchronize_hardirq() is invoked
for that interrupt line.
To prevent this, mask all IOAPIC entries before disabling the local
APIC. The resume code already has the unmask operation inside.
[ tglx: Massaged changelog ]
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1579076539-7267-1-git-send-email-TonyWWang-oc@zhaoxin.com
* fix register corruption
* ENOTSUPP/EOPNOTSUPP mixed
* reset cleanups/fixes
* selftests
x86:
* Bug fixes and cleanups
* AMD support for APIC virtualization even in combination with
in-kernel PIT or IOAPIC.
MIPS:
* Compilation fix.
Generic:
* Fix refcount overflow for zero page.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJeOuf7AAoJEL/70l94x66DOBQH/j1W9lUpbDgr9aWbrZT+O/yP
FWzUDrRlCZCjV1FQKbGPa4YLeDRTG5n+RIQTjmCGRqiMqeoELSJ1+iK99e97nG/u
L28zf/90Nf0R+wwHL4AOFeploTYfG4WP8EVnlr3CG2UCJrNjxN1KU7yRZoWmWa2d
ckLJ8ouwNvx6VZd233LVmT38EP4352d1LyqIL8/+oXDIyAcRJLFQu1gRCwagsh3w
1v1czowFpWnRQ/z9zU7YD+PA4v85/Ge8sVVHlpi1X5NgV/khk4U6B0crAw6M+la+
mTmpz9g56oAh9m9NUdtv4zDCz1EWGH0v8+ZkAajUKtrM0ftJMn57P6p8PH4VVlE=
=5+Wl
-----END PGP SIGNATURE-----
Merge tag 'kvm-5.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"s390:
- fix register corruption
- ENOTSUPP/EOPNOTSUPP mixed
- reset cleanups/fixes
- selftests
x86:
- Bug fixes and cleanups
- AMD support for APIC virtualization even in combination with
in-kernel PIT or IOAPIC.
MIPS:
- Compilation fix.
Generic:
- Fix refcount overflow for zero page"
* tag 'kvm-5.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
KVM: vmx: delete meaningless vmx_decache_cr0_guest_bits() declaration
KVM: x86: Mark CR4.UMIP as reserved based on associated CPUID bit
x86: vmxfeatures: rename features for consistency with KVM and manual
KVM: SVM: relax conditions for allowing MSR_IA32_SPEC_CTRL accesses
KVM: x86: Fix perfctr WRMSR for running counters
x86/kvm/hyper-v: don't allow to turn on unsupported VMX controls for nested guests
x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()
kvm: mmu: Separate generating and setting mmio ptes
kvm: mmu: Replace unsigned with unsigned int for PTE access
KVM: nVMX: Remove stale comment from nested_vmx_load_cr3()
KVM: MIPS: Fold comparecount_func() into comparecount_wakeup()
KVM: MIPS: Fix a build error due to referencing not-yet-defined function
x86/kvm: do not setup pv tlb flush when not paravirtualized
KVM: fix overflow of zero page refcount with ksm running
KVM: x86: Take a u64 when checking for a valid dr7 value
KVM: x86: use raw clock values consistently
KVM: x86: reorganize pvclock_gtod_data members
KVM: nVMX: delete meaningless nested_vmx_run() declaration
KVM: SVM: allow AVIC without split irqchip
kvm: ioapic: Lazy update IOAPIC EOI
...
kvm_setup_pv_tlb_flush will waste memory and print a misguiding message
when KVM paravirtualization is not available.
Intel SDM says that the when cpuid is used with EAX higher than the
maximum supported value for basic of extended function, the data for the
highest supported basic function will be returned.
So, in some systems, kvm_arch_para_features will return bogus data,
causing kvm_setup_pv_tlb_flush to detect support for pv tlb flush.
Testing for kvm_para_available will work as it checks for the hypervisor
signature.
Besides, when the "nopv" command line parameter is used, it should not
continue as well, as kvm_guest_init will no be called in that case.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.
- Write address low 32bits
- Write address high 32bits (If supported by device)
- Write data
When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then a MSI
message is sent built from half updated state.
On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.
Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:
If MSI is disabled the PCI device is routing an interrupt to the legacy
INTx mechanism. The INTx delivery can be disabled, but the disablement is
not working on all devices.
Some devices lose interrupts when both MSI and INTx delivery are disabled.
Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.
Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.
That makes it possible to solve it with a two step update:
1) Target the MSI msg to the new vector on the current target CPU
2) Target the MSI msg to the new vector on the new target CPU
In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.
After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.
This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.
This can cause spurious interrupts on both the local and the new target
CPU.
1) If the new vector is not in use on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then interrupt entry code will
ignore that spurious interrupt. The vector is marked so that the
'No irq handler for vector' warning is supressed once.
2) If the new vector is in use already on the local CPU then the IRR check
might see an pending interrupt from the device which is using this
vector. The IPI to the new target CPU will then invoke the handler of
the device, which got the affinity change, even if that device did not
issue an interrupt
3) If the new vector is in use already on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then the handler of the device which
uses that vector on the local CPU will be invoked.
expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.
Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- three fixes and a cleanup for the resctrl code
- a HyperV fix
- a fix to /proc/kcore contents in live debugging sessions
- a fix for the x86 decoder opcode map"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/decoder: Add TEST opcode to Group3-2
x86/resctrl: Clean up unused function parameter in mkdir path
x86/resctrl: Fix a deadlock due to inaccurate reference
x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
x86/resctrl: Fix use-after-free when deleting resource groups
x86/hyper-v: Add "polling" bit to hv_synic_sint
x86/crash: Define arch_crash_save_vmcoreinfo() if CONFIG_CRASH_CORE=y
Unfortunately, GCC 9.1 is expected to be be released without support for
MPX. This means that there was only a relatively small window where
folks could have ever used MPX. It failed to gain wide adoption in the
industry, and Linux was the only mainstream OS to ever support it widely.
Support for the feature may also disappear on future processors.
This set completes the process that we started during the 5.4 merge window.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJeK1/pAAoJEGg1lTBwyZKwgC8QAIiVn1d7A9Uj/WpnpgfCChCZ
9XiV6Ak999qD9fbAcrgNfPjieaD4mtokocSRVJuRgJu5iLnIJCINlozLPe4yVl7P
7zebnxkLq0CIA8d56bEUoFlC0J+oWYlDVQePZzNQsSk5KHVGXVLpF6U4vDVzZeQy
cprgvdeY+ehB7G6IIo0MWTg5ylKYAsOAyVvK8NIGpKY2k6/YqCnsptnsVE7bvlHy
TrEOiUWLv+hh0bMkZdP1PwKQKEuMO/IZly0HtviFbMN7T4TB1spfg7ELoBucEq3T
s4EVbYRe+nIE4tuEAveaX3CgxJek8cY5MlticskdaKSEACBwabdOF55qsZy0u+WA
PYC4iUIXfbOH8OgieKWtGX4IuSkRYdQ2nP4BOpe4ZX4+zvU7zOCIyVSKRrwkX8cc
ADtWI5FAtB36KCgUuWnHGHNZpOxPTbTLBuBataFY4Q2uBNJEBJpscZ5H9ObtyGFU
ZjlzqFnM0nFNDKEI1EEtv9jLzgZTU1RQ46s7EFeSeEQ2/s9wJ3+s5sBlVbljsmus
o658bLOEaRWC/aF15dgmEXW9GAO6uifNdmbzGnRn7oEMYyFQPTWbZvi1zGz58QaG
Y6WTtigVtsSrHS4wpYd+p+n1W06VnB6J3BpBM4G1VQv1Vm0dNd1tUOfkqOzPjg7c
33Itmsz2LaW1mb67GlgZ
=g4cC
-----END PGP SIGNATURE-----
Merge tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx
Pull x86 MPX removal from Dave Hansen:
"MPX requires recompiling applications, which requires compiler
support. Unfortunately, GCC 9.1 is expected to be be released without
support for MPX. This means that there was only a relatively small
window where folks could have ever used MPX. It failed to gain wide
adoption in the industry, and Linux was the only mainstream OS to ever
support it widely.
Support for the feature may also disappear on future processors.
This set completes the process that we started during the 5.4 merge
window when the MPX prctl()s were removed. XSAVE support is left in
place, which allows MPX-using KVM guests to continue to function"
* tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx:
x86/mpx: remove MPX from arch/x86
mm: remove arch_bprm_mm_init() hook
x86/mpx: remove bounds exception code
x86/mpx: remove build infrastructure
x86/alternatives: add missing insn.h include
Here are the big set of tty and serial driver updates for 5.6-rc1
Included in here are:
- dummy_con cleanups (touches lots of arch code)
- sysrq logic cleanups (touches lots of serial drivers)
- samsung driver fixes (wasn't really being built)
- conmakeshash move to tty subdir out of scripts
- lots of small tty/serial driver updates
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXjFRBg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yn2VACgkge7vTeUNeZFc+6F4NWphAQ5tCQAoK/MMbU6
0O8ef7PjFwCU4s227UTv
=6m40
-----END PGP SIGNATURE-----
Merge tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here are the big set of tty and serial driver updates for 5.6-rc1
Included in here are:
- dummy_con cleanups (touches lots of arch code)
- sysrq logic cleanups (touches lots of serial drivers)
- samsung driver fixes (wasn't really being built)
- conmakeshash move to tty subdir out of scripts
- lots of small tty/serial driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
tty: n_hdlc: Use flexible-array member and struct_size() helper
tty: baudrate: SPARC supports few more baud rates
tty: baudrate: Synchronise baud_table[] and baud_bits[]
tty: serial: meson_uart: Add support for kernel debugger
serial: imx: fix a race condition in receive path
serial: 8250_bcm2835aux: Document struct bcm2835aux_data
serial: 8250_bcm2835aux: Use generic remapping code
serial: 8250_bcm2835aux: Allocate uart_8250_port on stack
serial: 8250_bcm2835aux: Suppress register_port error on -EPROBE_DEFER
serial: 8250_bcm2835aux: Suppress clk_get error on -EPROBE_DEFER
serial: 8250_bcm2835aux: Fix line mismatch on driver unbind
serial_core: Remove unused member in uart_port
vt: Correct comment documenting do_take_over_console()
vt: Delete comment referencing non-existent unbind_con_driver()
arch/xtensa/setup: Drop dummy_con initialization
arch/x86/setup: Drop dummy_con initialization
arch/unicore32/setup: Drop dummy_con initialization
arch/sparc/setup: Drop dummy_con initialization
arch/sh/setup: Drop dummy_con initialization
arch/s390/setup: Drop dummy_con initialization
...
Tony reported a boot regression caused by the recent workaround for systems
which have a disabled (clock gate off) PIT.
On his machine the kernel fails to initialize the PIT because
apic_needs_pit() does not take into account whether the local APIC
interrupt delivery mode will actually allow to setup and use the local
APIC timer. This should be easy to reproduce with acpi=off on the
command line which also disables HPET.
Due to the way the PIT/HPET and APIC setup ordering works (APIC setup can
require working PIT/HPET) the information is not available at the point
where apic_needs_pit() makes this decision.
To address this, split out the interrupt mode selection from
apic_intr_mode_init(), invoke the selection before making the decision
whether PIT is required or not, and add the missing checks into
apic_needs_pit().
Fixes: c8c4076723 ("x86/timer: Skip PIT initialization on modern chipsets")
Reported-by: Anthony Buckley <tony.buckley000@gmail.com>
Tested-by: Anthony Buckley <tony.buckley000@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Daniel Drake <drake@endlessm.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206125
Link: https://lore.kernel.org/r/87sgk6tmk2.fsf@nanos.tec.linutronix.de
Pull x86 mtrr updates from Ingo Molnar:
"Two changes: restrict /proc/mtrr to CAP_SYS_ADMIN, plus a cleanup"
* 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mtrr: Require CAP_SYS_ADMIN for all access
x86/mtrr: Get rid of mtrr_seq_show() forward declaration
Pull x86 FPU updates from Ingo Molnar:
"Three changes: fix a race that can result in FPU corruption, plus two
cleanups"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Deactivate FPU state after failure during state load
x86/fpu/xstate: Make xfeature_is_supervisor()/xfeature_is_user() return bool
x86/fpu/xstate: Fix small issues
Pull x86 cpu-features updates from Ingo Molnar:
"The biggest change in this cycle was a large series from Sean
Christopherson to clean up the handling of VMX features. This both
fixes bugs/inconsistencies and makes the code more coherent and
future-proof.
There are also two cleanups and a minor TSX syslog messages
enhancement"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/cpu: Remove redundant cpu_detect_cache_sizes() call
x86/cpu: Print "VMX disabled" error message iff KVM is enabled
KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin CPUs
perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
KVM: VMX: Check for full VMX support when verifying CPU compatibility
KVM: VMX: Use VMX feature flag to query BIOS enabling
KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR
x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured
x86/cpu: Set synthetic VMX cpufeatures during init_ia32_feat_ctl()
x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
x86/vmx: Introduce VMX_FEATURES_*
x86/cpu: Clear VMX feature flag if VMX is not fully enabled
x86/zhaoxin: Use common IA32_FEAT_CTL MSR initialization
x86/centaur: Use common IA32_FEAT_CTL MSR initialization
x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked
x86/intel: Initialize IA32_FEAT_CTL MSR at boot
tools/x86: Sync msr-index.h from kernel sources
selftests, kvm: Replace manual MSR defs with common msr-index.h
...
On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC, and viceversa it enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.
The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability to come and go, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between freq_base and freq_max every time turbo goes off or on.
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
turbo ratio.
We intended to perform tests validating that this patch doesn't regress in
terms of energy efficiency, given that this is the primary concern on Atom
processors. Alas, we found out that turbostat doesn't support reading RAPL
interfaces on our test machine (Airmont), and we don't have external equipment
to measure power consumption; all we have is the performance results of the
benchmarks we ran.
Test machine:
Platform : Dell Wyse 3040 Thin Client[1]
CPU Model : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology : 1 socket, 4 cores / 4 threads
Memory : 2G
Storage : onboard flash, XFS filesystem
[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client
Base frequency and available turbo levels (MHz):
Min Operating Freq 266 |***
Low Freq Mode 800 |********
Base Freq 2400 |************************
4 Cores 2800 |****************************
3 Cores 2800 |****************************
2 Cores 3200 |********************************
1 Core 3200 |********************************
Tested kernels:
Baseline : v5.4-rc1, intel_pstate passive, schedutil
Comparison #1 : v5.4-rc1, intel_pstate active , powersave
Comparison #2 : v5.4-rc1, this patch, intel_pstate passive, schedutil
tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terrbily interesting anyway, one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but
we lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.
comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:
I_PSTATE FREQ-INV
----------------------------------------
dbench 0.90 ~
kernbench 0.98 0.97
gitsource 0.63 0.43
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-6-ggherdovich@suse.cz
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
reported by the CPU.
The encoding of turbo ratios for GOLDMONT* is identical to the one for
SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
conservative frequency selections (favoring power efficiency).
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-5-ggherdovich@suse.cz
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.
Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.
We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency or performance -- going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.
For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):
[0] -- Reserved
[7:1] -- Base value of number of active cores of bucket 1.
[15:8] -- Base value of freq ratio of bucket 1.
[20:16] -- +ve delta of number of active cores of bucket 2.
i.e. active cores of bucket 2 =
active cores of bucket 1 + delta
[23:21] -- Negative delta of freq ratio of bucket 2.
i.e. freq ratio of bucket 2 =
freq ratio of bucket 1 - delta
[28:24]-- +ve delta of number of active cores of bucket 3.
[31:29]-- -ve delta of freq ratio of bucket 3.
[36:32]-- +ve delta of number of active cores of bucket 4.
[39:37]-- -ve delta of freq ratio of bucket 4.
[44:40]-- +ve delta of number of active cores of bucket 5.
[47:45]-- -ve delta of freq ratio of bucket 5.
[52:48]-- +ve delta of number of active cores of bucket 6.
[55:53]-- -ve delta of freq ratio of bucket 6.
[60:56]-- +ve delta of number of active cores of bucket 7.
[63:61]-- -ve delta of freq ratio of bucket 7.
1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP
1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------
A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), were the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:
comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:
I_PSTATE FREQ-INV
----------------------------------------
tbench 1.04 1.05
performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
I_PSTATE FREQ-INV
----------------------------------------
tbench 1.03 1.04
which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave.
As the results above are averaged over the varying parameter, here the detailed
table.
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 49.06 +- 2.12% ( ) 51.66 +- 1.52% ( 5.30%) 52.87 +- 0.88% ( 7.76%)
Hmean 2 93.82 +- 0.45% ( ) 103.24 +- 0.70% ( 10.05%) 105.90 +- 0.70% ( 12.88%)
Hmean 4 192.46 +- 1.15% ( ) 215.95 +- 0.60% ( 12.21%) 215.78 +- 1.43% ( 12.12%)
Hmean 8 406.74 +- 2.58% ( ) 438.58 +- 0.36% ( 7.83%) 437.61 +- 0.97% ( 7.59%)
Hmean 16 857.70 +- 1.22% ( ) 890.26 +- 0.72% ( 3.80%) 889.11 +- 0.73% ( 3.66%)
Hmean 32 1760.10 +- 0.92% ( ) 1791.70 +- 0.44% ( 1.79%) 1787.95 +- 0.44% ( 1.58%)
Hmean 64 3183.50 +- 0.34% ( ) 3183.19 +- 0.36% ( -0.01%) 3187.53 +- 0.36% ( 0.13%)
Hmean 128 4830.96 +- 0.31% ( ) 4846.53 +- 0.30% ( 0.32%) 4855.86 +- 0.30% ( 0.52%)
Hmean 256 5467.98 +- 0.38% ( ) 5793.80 +- 0.28% ( 5.96%) 5821.94 +- 0.17% ( 6.47%)
Hmean 512 5398.10 +- 0.06% ( ) 5745.56 +- 0.08% ( 6.44%) 5503.68 +- 0.07% ( 1.96%)
Hmean 1024 5290.43 +- 0.63% ( ) 5221.07 +- 0.47% ( -1.31%) 5277.22 +- 0.80% ( -0.25%)
Hmean 1088 5139.71 +- 0.57% ( ) 5236.02 +- 0.71% ( 1.87%) 5190.57 +- 0.41% ( 0.99%)
2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------
* pgbench (both read/write and read-only)
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)
3. TEST SETUP
-------------
Test machine:
CPU Model : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology : 1 socket, 68 cores / 272 threads
Memory : 96G
Storage : rotary, XFS filesystem
Max EFFICiency, BASE frequency and available turbo levels (MHz):
EFFIC 1000 |**********
BASE 1100 |***********
68C 1100 |***********
30C 1200 |************
Tested kernels:
Baseline : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-4-ggherdovich@suse.cz
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.
The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.
Large performance gains are obtained when the machine is lightly loaded and
no regression are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.
1. FREQUENCY INVARIANCE: MOTIVATION
* Without it, a task looks larger if the CPU runs slower
2. PECULIARITIES OF X86
* freq invariance accounting requires knowing the ratio freq_curr/freq_max
2.1 CURRENT FREQUENCY
* Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
2.2 MAX FREQUENCY
* It varies with time (turbo). As an approximation, we set it to a
constant, i.e. 4-cores turbo frequency.
3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
* The invariant schedutil's formula has no feedback loop and reacts faster
to utilization changes
4. KNOWN LIMITATIONS
* In some cases tasks can't reach max util despite how hard they try
5. PERFORMANCE TESTING
5.1 MACHINES
* Skylake, Broadwell, Haswell
5.2 SETUP
* baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
active cores turbo w/ invariant schedutil, and intel_pstate/powersave
5.3 BENCHMARK RESULTS
5.3.1 NEUTRAL BENCHMARKS
* NAS Parallel Benchmark (HPC), hackbench
5.3.2 NON-NEUTRAL BENCHMARKS
* tbench (10-30% better), kernbench (10-15% better),
shell-intensive-scripts (30-50% better)
* no regressions
5.3.3 SELECTION OF DETAILED RESULTS
5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
* dbench (5% worse on one machine), kernbench (3% worse),
tbench (5-10% better), shell-intensive-scripts (10-40% better)
6. MICROARCH'ES ADDRESSED HERE
* Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
etc have different MSRs semantic for querying turbo levels)
7. REFERENCES
* MMTests performance testing framework, github.com/gormanm/mmtests
+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+
For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.
[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities its the best approximation
we can make.)
+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+
Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.
2.1 CURRENT FREQUENCY
====================
Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.
Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.
2.2 MAX FREQUENCY
=================
Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.
The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:
* 1-core (1C) turbo frequency (the fastest turbo state available)
* around base frequency (a.k.a. max P-state)
* something in between, such as 4C turbo
To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is), viceversa with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:
freq_max set to | effect on DVFS
--------------------+------------------
1C turbo | power efficiency (lower freq choices)
base freq | performance (higher util_avg, higher freq requests)
4C turbo | a bit of both
4C turbo proves to be a good compromise in a number of benchmarks (see below).
+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+
Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from
freq_next = 1.25 * freq_curr * util [non-invariant util signal]
to
freq_next = 1.25 * freq_max * util [invariant util signal]
where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.
Compare it to the update formula of intel_pstate/powersave:
freq_next = 1.25 * freq_max * Busy%
where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.
Testing shows improved performances due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.
+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+
It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).
If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those task
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.
While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweights the cost of this limitation.
[Lelli-2018]
https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/
+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+
5.1 MACHINES
============
We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.
* 8x-SKYLAKE-UMA:
Single socket E3-1240 v5, Skylake 4 cores/8 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 800 |********
BASE 3500 |***********************************
4C 3700 |*************************************
3C 3800 |**************************************
2C 3900 |***************************************
1C 3900 |***************************************
* 80x-BROADWELL-NUMA:
Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 1200 |************
BASE 2200 |**********************
8C 2900 |*****************************
7C 3000 |******************************
6C 3100 |*******************************
5C 3200 |********************************
4C 3300 |*********************************
3C 3400 |**********************************
2C 3600 |************************************
1C 3600 |************************************
* 48x-HASWELL-NUMA
Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 1200 |************
BASE 2300 |***********************
12C 2600 |**************************
11C 2600 |**************************
10C 2600 |**************************
9C 2600 |**************************
8C 2600 |**************************
7C 2600 |**************************
6C 2600 |**************************
5C 2700 |***************************
4C 2800 |****************************
3C 2900 |*****************************
2C 3100 |*******************************
1C 3100 |*******************************
5.2 SETUP
=========
* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
driver in passive mode.
* The rationale for choosing the various freq_max values to test have been to
try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
on all machines), plus one more value closer to base_freq but still in the
turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
with active intel_pstate on this machine use that.
This gives, in terms of combinations tested on each machine:
* 8x-SKYLAKE-UMA
* Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
* intel_pstate active + powersave + HWP
* invariant schedutil, freq_max = 1C turbo
* invariant schedutil, freq_max = 3C turbo
* invariant schedutil, freq_max = 4C turbo
* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
* [same as 8x-SKYLAKE-UMA, but no HWP capable]
* invariant schedutil, freq_max = 8C turbo
(which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")
5.3 BENCHMARK RESULTS
=====================
5.3.1 NEUTRAL BENCHMARKS
------------------------
Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:
* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)
5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------
What follow are summary tables where each benchmark result is given a score.
* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
means a score of 1.00.
* The results in the score ratio are the geometric means of results running
the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
... number of processes; for pgbench: varying the number of clients, and so
on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
operations/second), the subsequent three show lower-is-better kind of tests
(i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
entire unit tests suite of the Git SCM and measuring how long it takes. We
take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
columns show invariant schedutil for different values of freq_max. 4C turbo
is circled as it's the value we've chosen for the final implementation.
80x-BROADWELL-NUMA (comparison ratio; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 1.14 ~ ~ | 1.11 | 1.14
pgbench-rw ~ ~ ~ | ~ | ~
netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07
netperf-tcp ~ 1.03 ~ | 1.01 | 1.02
tbench4 1.57 1.18 1.22 | 1.30 | 1.56
+------+
8x-SKYLAKE-UMA (comparison ratio; higher is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro ~ ~ ~ | ~ |
pgbench-rw ~ ~ ~ | ~ |
netperf-udp ~ ~ ~ | ~ |
netperf-tcp ~ ~ ~ | ~ |
tbench4 1.30 1.14 1.14 | 1.16 |
+------+
48x-HASWELL-NUMA (comparison ratio; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 1.15 ~ ~ | 1.06 | 1.16
pgbench-rw ~ ~ ~ | ~ | ~
netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02
netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01
tbench4 1.50 1.05 1.13 | 1.13 | 1.25
+------+
In the table above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in which
it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.
If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.
80x-BROADWELL-NUMA (comparison ratio; lower is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
dbench4 1.23 0.95 0.95 | 0.95 | 0.95
kernbench 0.93 0.83 0.83 | 0.83 | 0.82
gitsource 0.98 0.49 0.49 | 0.49 | 0.48
+------+
8x-SKYLAKE-UMA (comparison ratio; lower is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
dbench4 ~ ~ ~ | ~ |
kernbench ~ ~ ~ | ~ |
gitsource 0.92 0.55 0.55 | 0.55 |
+------+
48x-HASWELL-NUMA (comparison ratio; lower is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
dbench4 ~ ~ ~ | ~ | ~
kernbench 0.94 0.90 0.89 | 0.90 | 0.90
gitsource 0.97 0.69 0.69 | 0.69 | 0.69
+------+
dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench a the time. We haven't checked closely and can only speculate
at this point.
On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).
The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.
5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------
Machine : 48x-HASWELL-NUMA
Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%)
Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%)
Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%)
Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%)
Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%)
Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%)
Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%)
Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%)
Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%)
5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%)
Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%)
Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%)
Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%)
Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%)
Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%)
Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%)
Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%)
Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%)
This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.
The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.
Machine : 80x-BROADWELL-NUMA
Benchmark : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit : seconds (lower is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%)
Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%)
Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%)
Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%)
Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%)
Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%)
Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%)
Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%)
5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%)
Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%)
Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%)
Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%)
Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%)
Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%)
Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%)
Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%)
The patch outperform active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.
Machine : 8x-SKYLAKE-UMA
Benchmark : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit : seconds (lower is better)
5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%)
5.2.0 3C-turbo 5.2.0 4C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%)
In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.
5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------
The following table shows average power consumption in watt for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.
turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. with don't
have detail on a per-parameter granularity, only on whole benchmarks).
80x-BROADWELL-NUMA (power consumption, watts)
+--------+
BASELINE I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84
pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54
dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94
netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95
netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20
tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07
kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10
gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70
+--------+
8x-SKYLAKE-UMA (power consumption, watts)
+--------+
BASELINE I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 |
pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 |
dbench4 27.28 27.37 27.49 27.41 | 27.38 |
netperf-udp 22.33 22.41 22.36 22.35 | 22.36 |
netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 |
tbench4 41.13 45.61 43.10 43.33 | 43.56 |
kernbench 42.56 42.63 43.01 43.01 | 43.01 |
gitsource 13.32 13.69 17.33 17.30 | 17.35 |
+--------+
48x-HASWELL-NUMA (power consumption, watts)
+--------+
BASELINE I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86
pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31
dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79
netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52
netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95
tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22
kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04
gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23
+--------+
A lower power consumption isn't necessarily better, it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).
80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08
pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97
dbench4 1.24 0.94 0.95 | 0.94 | 0.92
netperf-udp ~ 1.02 1.02 | ~ | 1.02
netperf-tcp ~ 1.02 ~ | ~ | 1.02
tbench4 1.26 1.10 1.06 | 1.12 | 1.26
kernbench 0.98 0.97 0.97 | 0.97 | 0.98
gitsource ~ 1.11 1.11 | 1.11 | 1.13
+------+
8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro ~ ~ ~ | ~ |
pgbench-rw 0.95 0.97 0.96 | 0.96 |
dbench4 ~ ~ ~ | ~ |
netperf-udp ~ ~ ~ | ~ |
netperf-tcp ~ ~ ~ | ~ |
tbench4 1.17 1.09 1.08 | 1.10 |
kernbench ~ ~ ~ | ~ |
gitsource 1.06 1.40 1.40 | 1.40 |
+------+
48x-HASWELL-NUMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11
pgbench-rw ~ 0.86 ~ | ~ | 0.86
dbench4 ~ 1.02 1.02 | 1.02 | ~
netperf-udp ~ 0.97 1.03 | 1.02 | ~
netperf-tcp 0.96 ~ ~ | ~ | ~
tbench4 1.24 ~ 1.06 | 1.05 | 1.11
kernbench 0.97 0.97 0.98 | 0.97 | 0.96
gitsource 1.03 1.33 1.32 | 1.32 | 1.33
+------+
These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.
+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+
The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.
Subsequent patches will address:
* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont
+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+
Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmark used are:
db-pgbench-timed-ro-small-xfs
db-pgbench-timed-rw-small-xfs
io-dbench4-async-xfs
network-netperf-unbound
network-tbench
scheduler-unbound
workload-kerndevel-xfs
workload-shellscripts-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-omp-full
All those benchmarks are generally available on the web:
pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
Pull misc x86 updates from Ingo Molnar:
"Misc changes:
- Enhance #GP fault printouts by distinguishing between canonical and
non-canonical address faults, and also add KASAN fault decoding.
- Fix/enhance the x86 NMI handler by putting the duration check into
a direct function call instead of an irq_work which we know to be
broken in some cases.
- Clean up do_general_protection() a bit"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Remove irq_work from the long duration NMI handler
x86/traps: Cleanup do_general_protection()
x86/kasan: Print original address on #GP
x86/dumpstack: Introduce die_addr() for die() with #GP fault address
x86/traps: Print address on #GP
x86/insn-eval: Add support for 64-bit kernel mode
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups all around the map"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Remove amd_get_topology_early()
x86/tsc: Remove redundant assignment
x86/crash: Use resource_size()
x86/cpu: Add a missing prototype for arch_smt_update()
x86/nospec: Remove unused RSB_FILL_LOOPS
x86/vdso: Provide missing include file
x86/Kconfig: Correct spelling and punctuation
Documentation/x86/boot: Fix typo
x86/boot: Fix a comment's incorrect file reference
x86/process: Remove set but not used variables prev and next
x86/Kconfig: Fix Kconfig indentation
Pull x86 resource control updates from Ingo Molnar:
"The main change in this tree is the extension of the resctrl procfs
ABI with a new file that helps tooling to navigate from tasks back to
resctrl groups: /proc/{pid}/cpu_resctrl_groups.
Also fix static key usage for certain feature combinations and
simplify the task exit resctrl case"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Add task resctrl information display
x86/resctrl: Check monitoring static key in the MBM overflow handler
x86/resctrl: Do not reconfigure exiting tasks
Pull x86 boot update from Ingo Molnar:
"Two minor changes: fix an atypical binutils combination build bug, and
also fix a VRAM size check for simplefb"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sysfb: Fix check for bad VRAM size
x86/boot: Discard .eh_frame sections
Pull x86 asm updates from Ingo Molnar:
"Misc updates:
- Remove last remaining calls to exception_enter/exception_exit() and
simplify the entry code some more.
- Remove force_iret()
- Add support for "Fast Short Rep Mov", which is available starting
with Ice Lake Intel CPUs - and make the x86 assembly version of
memmove() use REP MOV for all sizes when FSRM is available.
- Micro-optimize/simplify the 32-bit boot code a bit.
- Use a more future-proof SYSRET instruction mnemonic"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Simplify calculation of output address
x86/entry/64: Add instruction suffix to SYSRET
x86: Remove force_iret()
x86/cpufeatures: Add support for fast short REP; MOVSB
x86/context-tracking: Remove exception_enter/exit() from KVM_PV_REASON_PAGE_NOT_PRESENT async page fault
x86/context-tracking: Remove exception_enter/exit() from do_page_fault()
Pull x86 apic fix from Ingo Molnar:
"A single commit that simplifies the code and gets rid of a compiler
warning"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/uv: Avoid unused variable warning
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Ftrace is one of the last W^X violators (after this only KLP is
left). These patches move it over to the generic text_poke()
interface and thereby get rid of this oddity. This requires a
surprising amount of surgery, by Peter Zijlstra.
- x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
count certain types of events that have a special, quirky hw ABI
(by Kim Phillips)
- kprobes fixes by Masami Hiramatsu
Lots of tooling updates as well, the following subcommands were
updated: annotate/report/top, c2c, clang, record, report/top TUI,
sched timehist, tests; plus updates were done to the gtk ui, libperf,
headers and the parser"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
perf/x86/amd: Add support for Large Increment per Cycle Events
perf/x86/amd: Constrain Large Increment per Cycle events
perf/x86/intel/rapl: Add Comet Lake support
tracing: Initialize ret in syscall_enter_define_fields()
perf header: Use last modification time for timestamp
perf c2c: Fix return type for histogram sorting comparision functions
perf beauty sockaddr: Fix augmented syscall format warning
perf/ui/gtk: Fix gtk2 build
perf ui gtk: Add missing zalloc object
perf tools: Use %define api.pure full instead of %pure-parser
libperf: Setup initial evlist::all_cpus value
perf report: Fix no libunwind compiled warning break s390 issue
perf tools: Support --prefix/--prefix-strip
perf report: Clarify in help that --children is default
tools build: Fix test-clang.cpp with Clang 8+
perf clang: Fix build with Clang 9
kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
tools lib: Fix builds when glibc contains strlcpy()
perf report/top: Make 'e' visible in the help and make it toggle showing callchains
perf report/top: Do not offer annotation for symbols without samples
...
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Cleanup of the GOP [graphics output] handling code in the EFI stub
- Complete refactoring of the mixed mode handling in the x86 EFI stub
- Overhaul of the x86 EFI boot/runtime code
- Increase robustness for mixed mode code
- Add the ability to disable DMA at the root port level in the EFI
stub
- Get rid of RWX mappings in the EFI memory map and page tables,
where possible
- Move the support code for the old EFI memory mapping style into its
only user, the SGI UV1+ support code.
- plus misc fixes, updates, smaller cleanups.
... and due to interactions with the RWX changes, another round of PAT
cleanups make a guest appearance via the EFI tree - with no side
effects intended"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
efi/x86: Disable instrumentation in the EFI runtime handling code
efi/libstub/x86: Fix EFI server boot failure
efi/x86: Disallow efi=old_map in mixed mode
x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
efi/x86: avoid KASAN false positives when accessing the 1: 1 mapping
efi: Fix handling of multiple efi_fake_mem= entries
efi: Fix efi_memmap_alloc() leaks
efi: Add tracking for dynamically allocated memmaps
efi: Add a flags parameter to efi_memory_map
efi: Fix comment for efi_mem_type() wrt absent physical addresses
efi/arm: Defer probe of PCIe backed efifb on DT systems
efi/x86: Limit EFI old memory map to SGI UV machines
efi/x86: Avoid RWX mappings for all of DRAM
efi/x86: Don't map the entire kernel text RW for mixed mode
x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
efi/libstub/x86: Fix unused-variable warning
efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
efi/libstub/x86: Use const attribute for efi_is_64bit()
efi: Allow disabling PCI busmastering on bridges during boot
efi/x86: Allow translating 64-bit arguments for mixed mode calls
...
Pull objtool updates from Ingo Molnar:
"The main changes are to move the ORC unwind table sorting from early
init to build-time - this speeds up booting.
No change in functionality intended"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix !CONFIG_MODULES build warning
x86/unwind/orc: Remove boot-time ORC unwind tables sorting
scripts/sorttable: Implement build-time ORC unwind table sorting
scripts/sorttable: Rename 'sortextable' to 'sorttable'
scripts/sortextable: Refactor the do_func() function
scripts/sortextable: Remove dead code
scripts/sortextable: Clean up the code to meet the kernel coding style better
scripts/sortextable: Rewrite error/success handling
Pull header cleanup from Ingo Molnar:
"This is a treewide cleanup, mostly (but not exclusively) with x86
impact, which breaks implicit dependencies on the asm/realtime.h
header and finally removes it from asm/acpi.h"
* 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ACPI/sleep: Move acpi_get_wakeup_address() into sleep.c, remove <asm/realmode.h> from <asm/acpi.h>
ACPI/sleep: Convert acpi_wakeup_address into a function
x86/ACPI/sleep: Remove an unnecessary include of asm/realmode.h
ASoC: Intel: Skylake: Explicitly include linux/io.h for virt_to_phys()
vmw_balloon: Explicitly include linux/io.h for virt_to_phys()
virt: vbox: Explicitly include linux/io.h to pick up various defs
efi/capsule-loader: Explicitly include linux/io.h for page_to_phys()
perf/x86/intel: Explicitly include asm/io.h to use virt_to_phys()
x86/kprobes: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
x86/ftrace: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
x86/boot: Explicitly include realmode.h to handle RM reservations
x86/efi: Explicitly include realmode.h to handle RM trampoline quirk
x86/platform/intel/quark: Explicitly include linux/io.h for virt_to_phys()
x86/setup: Enhance the comments
x86/setup: Clean up the header portion of setup.c
and improvements:
- Update the cached HLE state when the TSX state is changed via the new
control register. This ensures feature bit consistency.
- Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS vulnerabilities.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vc5ITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRujEACeWTy4k+eCprV2tu4O/e15VaGaIdRp
Kdv02kZ1U/3upSubRYW8XsBqesCdbSzfO78fE7yQYZmTbOxczSE1t2XH5Cj7c3ig
3D4gzXaH2XKf+j9cXceba/3cpRsMU8fAhQYr+Immi/dG2ORs8i2C4G0cxKthqvbP
R6qdS+Pg9Mo1a2c2nLRGMDyNbNdaYXiJEiWSGNlc6Den/vog+jR4T2WV5a5IjeKL
yjjukoLoeScJ/3gYK0BQkaGNY3BeU1H5ECtbVM3SBFkp2a9OVGhr1bSxwO1U+T8C
fdzwJZU/p+JTb69kx91Tz8+wd2zLtV1wxbdZY7nyUUfGd80q0eU4sQZQyY13mONw
0LRvXrfHjIsRPSiRY3iyHYq9I0UZmjG7oInqRLimLrJmjrPtALhu/mO5TjZbxjzM
FSVx5fWbZYUgnrQpB+hjqiBcrc3sjjrDwU6DufbSq/AhTBKQ/o0mC+S63OhJ0H7C
Bzp1m0OniDuNRtYD0LZm0ktAkLpCTy5lgGBQ6ffNqs87ivFi0GTErqml5GrUEKQ5
mq7h56mf00jKPBYOVoYs9MuWJ18JVJuVNipdlM+P7027XDRpl46l9g87YuMhIol/
tQsDHB4Dv/1QTZY3K3MuNLFZaQJ4/DI9LA2B1br1PdHbZSDd81NflT6FKE11rqWC
T+DCpmj+smRqVw==
=dOgd
-----END PGP SIGNATURE-----
Merge tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
"The performance deterioration departement provides a few non-scary
fixes and improvements:
- Update the cached HLE state when the TSX state is changed via the
new control register. This ensures feature bit consistency.
- Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS
vulnerabilities"
* tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/swapgs: Exclude Zhaoxin CPUs from SWAPGS vulnerability
x86/speculation/spectre_v2: Exclude Zhaoxin CPUs from SPECTRE_V2
x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
- Time namespace support:
If a container migrates from one host to another then it expects that
clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime these
clocks can differ significantly on two hosts, in the worst case time
goes backwards which is a violation of the POSIX requirements.
The time namespace addresses this problem. It allows to set offsets for
clock MONOTONIC and BOOTTIME once after creation and before tasks are
associated with the namespace. These offsets are taken into account by
timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided by
this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric potential
use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (host time offsets = 0) is
in the noise and great effort was made to ensure that especially in the
VDSO. If time namespace is disabled in the kernel configuration the
code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature
and kept on for more than a year addressing review comments, finding
better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure that
the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbTcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXT2D/96iJ3G9Snn2khEQP3XS2rYmtDGw7NO
m1n96falwWeGe6zreU80R2Jge5nLxQtNhRoMPLLee1GpHwRC6lvqEqgdZ4LMBrD2
JqV7Gzg8Urmdh+hpDsyTCpeEWEzoMKxiFOX8PxwctqUhM4szEe5iQg2YQsg85Jw2
vG6M93N2xwDILh4rhEMbKjo+5ZmYn7c1RQvpGOSmpKOj940W/N7H2HBsFhdaJ1Kw
FW5pFv1211PaU5RV2YNb2dMeeMTT1N3e2VN4Dkadoxp47pb+725gNHEBEjmV9poG
Lp4IhzGAPnj8zVD88icQZSTaK3gUHMClxprJ0Pf84WEtiH7SeGu8BPYyu77+oNDe
yzcctDJNyCWXkzmaP/fe/HLc0TStbvNAJ5Tagp4BC75gzebeb4/n8RtRT0fKeDYL
pxpDPKDAPU7p1JSjxiWAtshqjBycWNY3Z49bA7/VhKBhnv8BDyBPGlYd7/4xrbGr
RK7DQNXJwaJaiNJ7p5PiaFxGzNyB0B9sThD/slSlEInIKb4h9YzWr0TV+NB62VnB
sDcN+tpLbRPz5/5cHGGfxR0+zKWpfyai8pzbmmaXEaKssjRYwyvcac5EZdgbWpbK
k7CqAjoWLA2P+tGeePNJOf5JYK6Vmdyh4clmuwM0zOiRJ9NlWUyMf3z7QYILs4RO
UAI+6opYlZEPAw==
=x3qT
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The timekeeping and timers departement provides:
- Time namespace support:
If a container migrates from one host to another then it expects
that clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime
these clocks can differ significantly on two hosts, in the worst
case time goes backwards which is a violation of the POSIX
requirements.
The time namespace addresses this problem. It allows to set offsets
for clock MONOTONIC and BOOTTIME once after creation and before
tasks are associated with the namespace. These offsets are taken
into account by timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided
by this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric
potential use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (ie where host time
offsets = 0) is in the noise and great effort was made to ensure
that especially in the VDSO. If time namespace is disabled in the
kernel configuration the code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
feature and kept on for more than a year addressing review
comments, finding better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure
that the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code"
* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
alarmtimer: Use wakeup source from alarmtimer platform device
alarmtimer: Make alarmtimer platform device child of RTC device
alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
hrtimer: Add missing sparse annotation for __run_timer()
lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
clocksource/drivers/exynos_mct: Rename Exynos to lowercase
clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
clocksource/drivers/bcm2835_timer: Fix memory leak of timer
clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
...
- remove ioremap_nocache given that is is equivalent to
ioremap everywhere
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl4vKHwLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMPGBAAuVNUZaZfWYHpiVP2oRcUQUguFiD3NTbknsyzV2oH
J9P0GfeENSKwE9OOhZ7XIjnCZAJwQgTK/ppQY5yiQ/KAtYyyXjXEJ6jqqjiTDInr
+3+I3t/LhkgrK7tMrb7ylTGa/d7KhaciljnOXC8+b75iddvM9I1z2pbHDbppZMS9
wT4RXL/cFtRb85AfOyPLybcka3f5P2gGvQz38qyimhJYEzHDXZu9VO1Bd20f8+Xf
eLBKX0o6yWMhcaPLma8tm0M0zaXHEfLHUKLSOkiOk+eHTWBZ3b/w5nsOQZYZ7uQp
25yaClbameAn7k5dHajduLGEJv//ZjLRWcN3HJWJ5vzO111aHhswpE7JgTZJSVWI
ggCVkytD3ESXapvswmACSeCIDMmiJMzvn6JvwuSMVB7a6e5mcqTuGo/FN+DrBF/R
IP+/gY/T7zIIOaljhQVkiEIIwiD/akYo0V9fheHTBnqcKEDTHV4WjKbeF6aCwcO+
b8inHyXZSKSMG//UlDuN84/KH/o1l62oKaB1uDIYrrL8JVyjAxctWt3GOt5KgSFq
wVz1lMw4kIvWtC/Sy2H4oB+RtODLp6yJDqmvmPkeJwKDUcd/1JKf0KsZ8j3FpGei
/rEkBEss0KBKyFAgBSRO2jIpdj2epgcBcsdB/r5mlhcn8L77AS6mHbA173kY4pQ/
Kdg=
=TUCJ
-----END PGP SIGNATURE-----
Merge tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap
Pull ioremap updates from Christoph Hellwig:
"Remove the ioremap_nocache API (plus wrappers) that are always
identical to ioremap"
* tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap:
remove ioremap_nocache and devm_ioremap_nocache
MIPS: define ioremap_nocache to ioremap
Pull RAS updates from Borislav Petkov:
- Misc fixes to the MCE code all over the place, by Jan H. Schönherr.
- Initial support for AMD F19h and other cleanups to amd64_edac, by
Yazen Ghannam.
- Other small cleanups.
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Make fam_ops static global
EDAC/amd64: Drop some family checks for newer systems
EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
x86/amd_nb: Add Family 19h PCI IDs
EDAC/mce_amd: Always load on SMCA systems
x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
x86/mce: Fix use of uninitialized MCE message string
x86/mce: Fix mce=nobootlog
x86/mce: Take action on UCNA/Deferred errors again
x86/mce: Remove mce_inject_log() in favor of mce_log()
x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
... and fold its function body into its single call site.
No functional changes:
# arch/x86/kernel/cpu/amd.o:
text data bss dec hex filename
5994 385 1 6380 18ec amd.o.before
5994 385 1 6380 18ec amd.o.after
md5:
99ec6daa095b502297884e949c520f90 amd.o.before.asm
99ec6daa095b502297884e949c520f90 amd.o.after.asm
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200123165811.5288-1-bp@alien8.de
From: Dave Hansen <dave.hansen@linux.intel.com>
MPX is being removed from the kernel due to a lack of support
in the toolchain going forward (gcc).
This removes all the remaining (dead at this point) MPX handling
code remaining in the tree. The only remaining code is the XSAVE
support for MPX state which is currently needd for KVM to handle
VMs which might use MPX.
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
From: Dave Hansen <dave.hansen@linux.intel.com>
MPX is being removed from the kernel due to a lack of support
in the toolchain going forward (gcc).
Remove the other user-visible ABI: signal handling. This code
should basically have been inactive after the prctl()s were
removed, but there may be some small ABI remnants from this code.
Remove it.
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
From: Dave Hansen <dave.hansen@linux.intel.com>
While testing my MPX removal series, Borislav noted compilation
failure with an allnoconfig build.
Turned out to be a missing include of insn.h in alternative.c.
With MPX, it got it implicitly from:
asm/mmu_context.h -> asm/mpx.h -> asm/insn.h
Fixes: c3d6324f84 ("x86/alternatives: Teach text_poke_bp() to emulate instructions")
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Previously, the assignment to the local variable 'now' took place
before the for loop. The loop is unconditional so it will be entered
at least once. The variable 'now' is reassigned in the loop and is not
used before reassigning. Therefore, the assignment before the loop is
unnecessary and can be removed.
No code changed:
# arch/x86/kernel/tsc_sync.o:
text data bss dec hex filename
3569 198 44 3811 ee3 tsc_sync.o.before
3569 198 44 3811 ee3 tsc_sync.o.after
md5:
36216de29b208edbcd34fed9fe7f7b69 tsc_sync.o.before.asm
36216de29b208edbcd34fed9fe7f7b69 tsc_sync.o.after.asm
[ bp: Massage commit message. ]
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200118171143.25178-1-mateusznosek0@gmail.com
Commit
334b0f4e9b ("x86/resctrl: Fix a deadlock due to inaccurate reference")
changed the argument to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
within mkdir_rdt_prepare(). That change resulted in an unused function
parameter to mkdir_rdt_prepare().
Clean up the unused function parameter in mkdir_rdt_prepare() and its
callers rdtgroup_mkdir_mon() and rdtgroup_mkdir_ctrl_mon().
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1578500886-21771-5-git-send-email-xiaochen.shen@intel.com
There is a race condition which results in a deadlock when rmdir and
mkdir execute concurrently:
$ ls /sys/fs/resctrl/c1/mon_groups/m1/
cpus cpus_list mon_data tasks
Thread 1: rmdir /sys/fs/resctrl/c1
Thread 2: mkdir /sys/fs/resctrl/c1/mon_groups/m1
3 locks held by mkdir/48649:
#0: (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
#1: (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c13b>] filename_create+0x7b/0x170
#2: (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70
4 locks held by rmdir/48652:
#0: (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
#1: (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c3cf>] do_rmdir+0x13f/0x1e0
#2: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffb4c86d5d>] vfs_rmdir+0x4d/0x120
#3: (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70
Thread 1 is deleting control group "c1". Holding rdtgroup_mutex,
kernfs_remove() removes all kernfs nodes under directory "c1"
recursively, then waits for sub kernfs node "mon_groups" to drop active
reference.
Thread 2 is trying to create a subdirectory "m1" in the "mon_groups"
directory. The wrapper kernfs_iop_mkdir() takes an active reference to
the "mon_groups" directory but the code drops the active reference to
the parent directory "c1" instead.
As a result, Thread 1 is blocked on waiting for active reference to drop
and never release rdtgroup_mutex, while Thread 2 is also blocked on
trying to get rdtgroup_mutex.
Thread 1 (rdtgroup_rmdir) Thread 2 (rdtgroup_mkdir)
(rmdir /sys/fs/resctrl/c1) (mkdir /sys/fs/resctrl/c1/mon_groups/m1)
------------------------- -------------------------
kernfs_iop_mkdir
/*
* kn: "m1", parent_kn: "mon_groups",
* prgrp_kn: parent_kn->parent: "c1",
*
* "mon_groups", parent_kn->active++: 1
*/
kernfs_get_active(parent_kn)
kernfs_iop_rmdir
/* "c1", kn->active++ */
kernfs_get_active(kn)
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
/* "c1", kn->active-- */
kernfs_break_active_protection(kn)
mutex_lock
rdtgroup_rmdir_ctrl
free_all_child_rdtgrp
sentry->flags = RDT_DELETED
rdtgroup_ctrl_remove
rdtgrp->flags = RDT_DELETED
kernfs_get(kn)
kernfs_remove(rdtgrp->kn)
__kernfs_remove
/* "mon_groups", sub_kn */
atomic_add(KN_DEACTIVATED_BIAS, &sub_kn->active)
kernfs_drain(sub_kn)
/*
* sub_kn->active == KN_DEACTIVATED_BIAS + 1,
* waiting on sub_kn->active to drop, but it
* never drops in Thread 2 which is blocked
* on getting rdtgroup_mutex.
*/
Thread 1 hangs here ---->
wait_event(sub_kn->active == KN_DEACTIVATED_BIAS)
...
rdtgroup_mkdir
rdtgroup_mkdir_mon(parent_kn, prgrp_kn)
mkdir_rdt_prepare(parent_kn, prgrp_kn)
rdtgroup_kn_lock_live(prgrp_kn)
atomic_inc(&rdtgrp->waitcount)
/*
* "c1", prgrp_kn->active--
*
* The active reference on "c1" is
* dropped, but not matching the
* actual active reference taken
* on "mon_groups", thus causing
* Thread 1 to wait forever while
* holding rdtgroup_mutex.
*/
kernfs_break_active_protection(
prgrp_kn)
/*
* Trying to get rdtgroup_mutex
* which is held by Thread 1.
*/
Thread 2 hangs here ----> mutex_lock
...
The problem is that the creation of a subdirectory in the "mon_groups"
directory incorrectly releases the active protection of its parent
directory instead of itself before it starts waiting for rdtgroup_mutex.
This is triggered by the rdtgroup_mkdir() flow calling
rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() with kernfs node of the
parent control group ("c1") as argument. It should be called with kernfs
node "mon_groups" instead. What is currently missing is that the
kn->priv of "mon_groups" is NULL instead of pointing to the rdtgrp.
Fix it by pointing kn->priv to rdtgrp when "mon_groups" is created. Then
it could be passed to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
instead. And then it operates on the same rdtgroup structure but handles
the active reference of kernfs node "mon_groups" to prevent deadlock.
The same changes are also made to the "mon_data" directories.
This results in some unused function parameters that will be cleaned up
in follow-up patch as the focus here is on the fix only in support of
backporting efforts.
Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-4-git-send-email-xiaochen.shen@intel.com
There is a race condition in the following scenario which results in an
use-after-free issue when reading a monitoring file and deleting the
parent ctrl_mon group concurrently:
Thread 1 calls atomic_inc() to take refcount of rdtgrp and then calls
kernfs_break_active_protection() to drop the active reference of kernfs
node in rdtgroup_kn_lock_live().
In Thread 2, kernfs_remove() is a blocking routine. It waits on all sub
kernfs nodes to drop the active reference when removing all subtree
kernfs nodes recursively. Thread 2 could block on kernfs_remove() until
Thread 1 calls kernfs_break_active_protection(). Only after
kernfs_remove() completes the refcount of rdtgrp could be trusted.
Before Thread 1 calls atomic_inc() and kernfs_break_active_protection(),
Thread 2 could call kfree() when the refcount of rdtgrp (sentry) is 0
instead of 1 due to the race.
In Thread 1, in rdtgroup_kn_unlock(), referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.
Thread 1 (rdtgroup_mondata_show) Thread 2 (rdtgroup_rmdir)
-------------------------------- -------------------------
rdtgroup_kn_lock_live
/*
* kn active protection until
* kernfs_break_active_protection(kn)
*/
rdtgrp = kernfs_to_rdtgroup(kn)
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_rmdir_ctrl
free_all_child_rdtgrp
/*
* sentry->waitcount should be 1
* but is 0 now due to the race.
*/
kfree(sentry)*[1]
/*
* Only after kernfs_remove()
* completes, the refcount of
* rdtgrp could be trusted.
*/
atomic_inc(&rdtgrp->waitcount)
/* kn->active-- */
kernfs_break_active_protection(kn)
rdtgroup_ctrl_remove
rdtgrp->flags = RDT_DELETED
/*
* Blocking routine, wait for
* all sub kernfs nodes to drop
* active reference in
* kernfs_break_active_protection.
*/
kernfs_remove(rdtgrp->kn)
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(
&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kernfs_unbreak_active_protection(kn)
kfree(rdtgrp)
mutex_lock
mon_event_read
rdtgroup_kn_unlock
mutex_unlock
/*
* Use-after-free: refer to earlier rdtgrp
* memory which was freed in [1].
*/
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
/* kn->active++ */
kernfs_unbreak_active_protection(kn)
kfree(rdtgrp)
Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in
rdtgroup_rmdir_ctrl() to ensure it has the accurate refcount of rdtgrp.
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-3-git-send-email-xiaochen.shen@intel.com
A resource group (rdtgrp) contains a reference count (rdtgrp->waitcount)
that indicates how many waiters expect this rdtgrp to exist. Waiters
could be waiting on rdtgroup_mutex or some work sitting on a task's
workqueue for when the task returns from kernel mode or exits.
The deletion of a rdtgrp is intended to have two phases:
(1) while holding rdtgroup_mutex the necessary cleanup is done and
rdtgrp->flags is set to RDT_DELETED,
(2) after releasing the rdtgroup_mutex, the rdtgrp structure is freed
only if there are no waiters and its flag is set to RDT_DELETED. Upon
gaining access to rdtgroup_mutex or rdtgrp, a waiter is required to check
for the RDT_DELETED flag.
When unmounting the resctrl file system or deleting ctrl_mon groups,
all of the subdirectories are removed and the data structure of rdtgrp
is forcibly freed without checking rdtgrp->waitcount. If at this point
there was a waiter on rdtgrp then a use-after-free issue occurs when the
waiter starts running and accesses the rdtgrp structure it was waiting
on.
See kfree() calls in [1], [2] and [3] in these two call paths in
following scenarios:
(1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp()
(2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()
There are several scenarios that result in use-after-free issue in
following:
Scenario 1:
-----------
In Thread 1, rdtgroup_tasks_write() adds a task_work callback
move_myself(). If move_myself() is scheduled to execute after Thread 2
rdt_kill_sb() is finished, referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.
Thread 1 (rdtgroup_tasks_write) Thread 2 (rdt_kill_sb)
------------------------------- ----------------------
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_move_task
__rdtgroup_move_task
/*
* Take an extra refcount, so rdtgrp cannot be freed
* before the call back move_myself has been invoked
*/
atomic_inc(&rdtgrp->waitcount)
/* Callback move_myself will be scheduled for later */
task_work_add(move_myself)
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
mutex_lock
rmdir_all_sub
/*
* sentry and rdtgrp are freed
* without checking refcount
*/
free_all_child_rdtgrp
kfree(sentry)*[1]
kfree(rdtgrp)*[2]
mutex_unlock
/*
* Callback is scheduled to execute
* after rdt_kill_sb is finished
*/
move_myself
/*
* Use-after-free: refer to earlier rdtgrp
* memory which was freed in [1] or [2].
*/
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kfree(rdtgrp)
Scenario 2:
-----------
In Thread 1, rdtgroup_tasks_write() adds a task_work callback
move_myself(). If move_myself() is scheduled to execute after Thread 2
rdtgroup_rmdir() is finished, referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.
Thread 1 (rdtgroup_tasks_write) Thread 2 (rdtgroup_rmdir)
------------------------------- -------------------------
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_move_task
__rdtgroup_move_task
/*
* Take an extra refcount, so rdtgrp cannot be freed
* before the call back move_myself has been invoked
*/
atomic_inc(&rdtgrp->waitcount)
/* Callback move_myself will be scheduled for later */
task_work_add(move_myself)
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
rdtgroup_kn_lock_live
atomic_inc(&rdtgrp->waitcount)
mutex_lock
rdtgroup_rmdir_ctrl
free_all_child_rdtgrp
/*
* sentry is freed without
* checking refcount
*/
kfree(sentry)*[3]
rdtgroup_ctrl_remove
rdtgrp->flags = RDT_DELETED
rdtgroup_kn_unlock
mutex_unlock
atomic_dec_and_test(
&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kfree(rdtgrp)
/*
* Callback is scheduled to execute
* after rdt_kill_sb is finished
*/
move_myself
/*
* Use-after-free: refer to earlier rdtgrp
* memory which was freed in [3].
*/
atomic_dec_and_test(&rdtgrp->waitcount)
&& (flags & RDT_DELETED)
kfree(rdtgrp)
If CONFIG_DEBUG_SLAB=y, Slab corruption on kmalloc-2k can be observed
like following. Note that "0x6b" is POISON_FREE after kfree(). The
corrupted bits "0x6a", "0x64" at offset 0x424 correspond to
waitcount member of struct rdtgroup which was freed:
Slab corruption (Not tainted): kmalloc-2k start=ffff9504c5b0d000, len=2048
420: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkjkkkkkkkkkkk
Single bit error detected. Probably bad RAM.
Run memtest86+ or a similar memory test tool.
Next obj: start=ffff9504c5b0d800, len=2048
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Slab corruption (Not tainted): kmalloc-2k start=ffff9504c58ab800, len=2048
420: 6b 6b 6b 6b 64 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkdkkkkkkkkkkk
Prev obj: start=ffff9504c58ab000, len=2048
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Fix this by taking reference count (waitcount) of rdtgrp into account in
the two call paths that currently do not do so. Instead of always
freeing the resource group it will only be freed if there are no waiters
on it. If there are waiters, the resource group will have its flags set
to RDT_DELETED.
It will be left to the waiter to free the resource group when it starts
running and finding that it was the last waiter and the resource group
has been removed (rdtgrp->flags & RDT_DELETED) since. (1) rdt_kill_sb()
-> rmdir_all_sub() -> free_all_child_rdtgrp() (2) rdtgroup_rmdir() ->
rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-2-git-send-email-xiaochen.shen@intel.com
Both functions call init_intel_cacheinfo() which computes L2 and L3 cache
sizes from CPUID(4). But then they also call cpu_detect_cache_sizes() a
bit later which computes ->x86_tlbsize and L2 size from CPUID(80000006).
However, the latter call is not needed because
- on these CPUs, CPUID(80000006).EBX for ->x86_tlbsize is reserved
- CPUID(80000006).ECX for the L2 size has the same result as CPUID(4)
Therefore, remove the latter call to simplify the code.
[ bp: Rewrite commit message. ]
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1579075257-6985-1-git-send-email-TonyWWang-oc@zhaoxin.com
Monitoring tools that want to find out which resctrl control and monitor
groups a task belongs to must currently read the "tasks" file in every
group until they locate the process ID.
Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this
information:
1) res:
mon:
resctrl is not available.
2) res:/
mon:
Task is part of the root resctrl control group, and it is not associated
to any monitor group.
3) res:/
mon:mon0
Task is part of the root resctrl control group and monitor group mon0.
4) res:group0
mon:
Task is part of resctrl control group group0, and it is not associated
to any monitor group.
5) res:group0
mon:mon1
Task is part of resctrl control group group0 and monitor group mon1.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jinshi Chen <jinshi.chen@intel.com>
Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com
When checking whether the reported lfb_size makes sense, the height
* stride result is page-aligned before seeing whether it exceeds the
reported size.
This doesn't work if height * stride is not an exact number of pages.
For example, as reported in the kernel bugzilla below, an 800x600x32 EFI
framebuffer gets skipped because of this.
Move the PAGE_ALIGN to after the check vs size.
Reported-by: Christopher Head <chead@chead.ca>
Tested-by: Christopher Head <chead@chead.ca>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
We carry a quirk in the x86 EFI code to switch back to an older
method of mapping the EFI runtime services memory regions, because
it was deemed risky at the time to implement a new method without
providing a fallback to the old method in case problems arose.
Such problems did arise, but they appear to be limited to SGI UV1
machines, and so these are the only ones for which the fallback gets
enabled automatically (via a DMI quirk). The fallback can be enabled
manually as well, by passing efi=old_map, but there is very little
evidence that suggests that this is something that is being relied
upon in the field.
Given that UV1 support is not enabled by default by the distros
(Ubuntu, Fedora), there is no point in carrying this fallback code
all the time if there are no other users. So let's move it into the
UV support code, and document that efi=old_map now requires this
support code to be enabled.
Note that efi=old_map has been used in the past on other SGI UV
machines to work around kernel regressions in production, so we
keep the option to enable it by hand, but only if the kernel was
built with UV support.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200113172245.27925-8-ardb@kernel.org
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- a resctrl fix for uninitialized objects found by debugobjects
- a resctrl memory leak fix
- fix the unintended re-enabling of the of SME and SEV CPU flags if
memory encryption was disabled at bootup via the MSR space"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained
x86/resctrl: Fix potential memory leak
x86/resctrl: Fix an imbalance in domain_remove_cpu()
Currently, there are three static keys in the resctrl file system:
rdt_mon_enable_key and rdt_alloc_enable_key indicate if the monitoring
feature and the allocation feature are enabled, respectively. The
rdt_enable_key is enabled when either the monitoring feature or the
allocation feature is enabled.
If no monitoring feature is present (either hardware doesn't support a
monitoring feature or the feature is disabled by the kernel command line
option "rdt="), rdt_enable_key is still enabled but rdt_mon_enable_key
is disabled.
MBM is a monitoring feature. The MBM overflow handler intends to
check if the monitoring feature is not enabled for fast return.
So check the rdt_mon_enable_key in it instead of the rdt_enable_key as
former is the more accurate check.
[ bp: Massage commit message. ]
Fixes: e33026831b ("x86/intel_rdt/mbm: Handle counter overflow")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1576094705-13660-1-git-send-email-xiaochen.shen@intel.com
New Zhaoxin family 7 CPUs are not affected by the SWAPGS vulnerability. So
mark these CPUs in the cpu vulnerability whitelist accordingly.
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1579227872-26972-3-git-send-email-TonyWWang-oc@zhaoxin.com
New Zhaoxin family 7 CPUs are not affected by SPECTRE_V2. So define a
separate cpu_vuln_whitelist bit NO_SPECTRE_V2 and add these CPUs to the cpu
vulnerability whitelist.
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1579227872-26972-2-git-send-email-TonyWWang-oc@zhaoxin.com
/proc/cpuinfo currently reports Hardware Lock Elision (HLE) feature to
be present on boot cpu even if it was disabled during the bootup. This
is because cpuinfo_x86->x86_capability HLE bit is not updated after TSX
state is changed via the new MSR IA32_TSX_CTRL.
Update the cached HLE bit also since it is expected to change after an
update to CPUID_CLEAR bit in MSR IA32_TSX_CTRL.
Fixes: 95c5824f75 ("x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default")
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/2529b99546294c893dfa1c89e2b3e46da3369a59.1578685425.git.pawan.kumar.gupta@linux.intel.com
When CONFIG_PROC_FS is disabled, the compiler warns about an unused
variable:
arch/x86/kernel/apic/x2apic_uv_x.c: In function 'uv_setup_proc_files':
arch/x86/kernel/apic/x2apic_uv_x.c:1546:8: error: unused variable 'name' [-Werror=unused-variable]
char *name = hubless ? "hubless" : "hubbed";
Simplify the code so this variable is no longer needed.
Fixes: 8785968bce ("x86/platform/uv: Add UV Hubbed/Hubless Proc FS Files")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191212140419.315264-1-arnd@arndb.de
If the SME and SEV features are present via CPUID, but memory encryption
support is not enabled (MSR 0xC001_0010[23]), the feature flags are cleared
using clear_cpu_cap(). However, if get_cpu_cap() is later called, these
feature flags will be reset back to present, which is not desired.
Change from using clear_cpu_cap() to setup_clear_cpu_cap() so that the
clearing of the flags is maintained.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.16.x-
Link: https://lkml.kernel.org/r/226de90a703c3c0be5a49565047905ac4e94e8f3.1579125915.git.thomas.lendacky@amd.com
Add the new PCI Device 18h IDs for AMD Family 19h systems. Note that
Family 19h systems will not have a new PCI root device ID.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-4-Yazen.Ghannam@amd.com
Add support for a new version of the Load Store unit bank type as
indicated by its McaType value, which will be present in future SMCA
systems.
Add the new (HWID, MCATYPE) tuple. Reuse the same name, since this is
logically the same to the user.
Also, add the new error descriptions to edac_mce_amd.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-2-Yazen.Ghannam@amd.com
Don't print an error message about VMX being disabled by BIOS if KVM,
the sole user of VMX, is disabled. E.g. if KVM is disabled and the MSR
is unlocked, the kernel will intentionally disable VMX when locking
feature control and then complain that "BIOS" disabled VMX.
Fixes: ef4d3bf198 ("x86/cpu: Clear VMX feature flag if VMX is not fully enabled")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200114202545.20296-1-sean.j.christopherson@intel.com
It is relatively easy to trigger the following boot splat on an Ice Lake
client platform. The call stack is like:
kernel BUG at kernel/timer/timer.c:1152!
Call Trace:
__queue_delayed_work
queue_delayed_work_on
therm_throt_process
intel_thermal_interrupt
...
The reason is that a CPU's thermal interrupt is enabled prior to
executing its hotplug onlining callback which will initialize the
throttling workqueues.
Such a race can lead to therm_throt_process() accessing an uninitialized
therm_work, leading to the above BUG at a very early bootup stage.
Therefore, unmask the thermal interrupt vector only after having setup
the workqueues completely.
[ bp: Heavily massage commit message and correct comment formatting. ]
Fixes: f6656208f0 ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200107004116.59353-1-chuansheng.liu@intel.com
con_init in tty/vt.c will now set conswitchp to dummy_con if it's unset.
Drop it from arch setup code.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20191218214506.49252-24-nivedita@alum.mit.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
VDSO support for time namespaces needs to set up a page with the same
layout as VVAR. That timens page will be placed on position of VVAR page
inside namespace. That page has vdso_data->seq set to 1 to enforce
the slow path and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce
the time namespace handling path.
To prepare the time namespace page the kernel needs to know the vdso_data
offset. Provide arch_get_vdso_data() helper for locating vdso_data on VVAR
page.
Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-22-dima@arista.com
Add a new feature flag, X86_FEATURE_MSR_IA32_FEAT_CTL, to track whether
IA32_FEAT_CTL has been initialized. This will allow KVM, and any future
subsystems that depend on IA32_FEAT_CTL, to rely purely on cpufeatures
to query platform support, e.g. allows a future patch to remove KVM's
manual IA32_FEAT_CTL MSR checks.
Various features (on platforms that support IA32_FEAT_CTL) are dependent
on IA32_FEAT_CTL being configured and locked, e.g. VMX and LMCE. The
MSR is always configured during boot, but only if the CPU vendor is
recognized by the kernel. Because CPUID doesn't incorporate the current
IA32_FEAT_CTL value in its reporting of relevant features, it's possible
for a feature to be reported as supported in cpufeatures but not truly
enabled, e.g. if the CPU supports VMX but the kernel doesn't recognize
the CPU.
As a result, without the flag, KVM would see VMX as supported even if
IA32_FEAT_CTL hasn't been initialized, and so would need to manually
read the MSR and check the various enabling bits to avoid taking an
unexpected #GP on VMXON.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-14-sean.j.christopherson@intel.com
Set the synthetic VMX cpufeatures, which need to be kept to preserve
/proc/cpuinfo's ABI, in the common IA32_FEAT_CTL initialization code.
Remove the vendor code that manually sets the synthetic flags.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-13-sean.j.christopherson@intel.com
Add support for generating VMX feature names in capflags.c and use the
resulting x86_vmx_flags to print the VMX flags in /proc/cpuinfo. Don't
print VMX flags if no bits are set in word 0, which holds Pin Controls.
Pin Control's INTR and NMI exiting are fundamental pillars of VMX, if
they are not supported then the CPU is broken, it does not actually
support VMX, or the kernel wasn't built with support for the target CPU.
Print the features in a dedicated "vmx flags" line to avoid polluting
the common "flags" and to avoid having to prefix all flags with "vmx_",
which results in horrendously long names.
Keep synthetic VMX flags in cpufeatures to preserve /proc/cpuinfo's ABI
for those flags. This means that "flags" and "vmx flags" will have
duplicate entries for tpr_shadow (virtual_tpr), vnmi, ept, flexpriority,
vpid and ept_ad, but caps the pollution of "flags" at those six VMX
features. The vendor-specific code that populates the synthetic flags
will be consolidated in a future patch to further minimize the lasting
damage.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-12-sean.j.christopherson@intel.com
Add an entry in struct cpuinfo_x86 to track VMX capabilities and fill
the capabilities during IA32_FEAT_CTL MSR initialization.
Make the VMX capabilities dependent on IA32_FEAT_CTL and
X86_FEATURE_NAMES so as to avoid unnecessary overhead on CPUs that can't
possibly support VMX, or when /proc/cpuinfo is not available.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-11-sean.j.christopherson@intel.com
Now that IA32_FEAT_CTL is always configured and locked for CPUs that are
known to support VMX[*], clear the VMX capability flag if the MSR is
unsupported or BIOS disabled VMX, i.e. locked IA32_FEAT_CTL and didn't
set the appropriate VMX enable bit.
[*] Because init_ia32_feat_ctl() is called from vendors ->c_init(), it's
still possible for IA32_FEAT_CTL to be left unlocked when VMX is
supported by the CPU. This is not fatal, and will be addressed in a
future patch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-9-sean.j.christopherson@intel.com
Use the recently added IA32_FEAT_CTL MSR initialization sequence to
opportunistically enable VMX support when running on a Zhaoxin CPU.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-8-sean.j.christopherson@intel.com
Use the recently added IA32_FEAT_CTL MSR initialization sequence to
opportunistically enable VMX support when running on a Centaur CPU.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-7-sean.j.christopherson@intel.com
Opportunistically initialize IA32_FEAT_CTL to enable VMX when the MSR is
left unlocked by BIOS. Configuring feature control at boot time paves
the way for similar enabling of other features, e.g. Software Guard
Extensions (SGX).
Temporarily leave equivalent KVM code in place in order to avoid
introducing a regression on Centaur and Zhaoxin CPUs, e.g. removing
KVM's code would leave the MSR unlocked on those CPUs and would break
existing functionality if people are loading kvm_intel on Centaur and/or
Zhaoxin. Defer enablement of the boot-time configuration on Centaur and
Zhaoxin to future patches to aid bisection.
Note, Local Machine Check Exceptions (LMCE) are also supported by the
kernel and enabled via feature control, but the kernel currently uses
LMCE if and only if the feature is explicitly enabled by BIOS. Keep
the current behavior to avoid introducing bugs, future patches can opt
in to opportunistic enabling if it's deemed desirable to do so.
Always lock IA32_FEAT_CTL if it exists, even if the CPU doesn't support
VMX, so that other existing and future kernel code that queries the MSR
can assume it's locked.
Start from a clean slate when constructing the value to write to
IA32_FEAT_CTL, i.e. ignore whatever value BIOS left in the MSR so as not
to enable random features or fault on the WRMSR.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-5-sean.j.christopherson@intel.com
As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL
are quite a mouthful, especially the VMX bits which must differentiate
between enabling VMX inside and outside SMX (TXT) operation. Rename the
MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to
make them a little friendlier on the eyes.
Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name
to match Intel's SDM, but a future patch will add a dedicated Kconfig,
file and functions for the MSR. Using the full name for those assets is
rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its
nomenclature is consistent throughout the kernel.
Opportunistically, fix a few other annoyances with the defines:
- Relocate the bit defines so that they immediately follow the MSR
define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL.
- Add whitespace around the block of feature control defines to make
it clear they're all related.
- Use BIT() instead of manually encoding the bit shift.
- Use "VMX" instead of "VMXON" to match the SDM.
- Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to
be consistent with the kernel's verbiage used for all other feature
control bits. Note, the SDM refers to the LMCE bit as LMCE_ON,
likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore
the (literal) one-off usage of _ON, the SDM is simply "wrong".
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
When writing a pid to file "tasks", a callback function move_myself() is
queued to this task to be called when the task returns from kernel mode
or exits. The purpose of move_myself() is to activate the newly assigned
closid and/or rmid associated with this task. This activation is done by
calling resctrl_sched_in() from move_myself(), the same function that is
called when switching to this task.
If this work is successfully queued but then the task enters PF_EXITING
status (e.g., receiving signal SIGKILL, SIGTERM) prior to the
execution of the callback move_myself(), move_myself() still calls
resctrl_sched_in() since the task status is not currently considered.
When a task is exiting, the data structure of the task itself will
be freed soon. Calling resctrl_sched_in() to write the register that
controls the task's resources is unnecessary and it implies extra
performance overhead.
Add check on task status in move_myself() and return immediately if the
task is PF_EXITING.
[ bp: Massage. ]
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1578500026-21152-1-git-send-email-xiaochen.shen@intel.com
The function mce_severity() is not required to update its msg argument.
In fact, mce_severity_amd() does not, which makes mce_no_way_out()
return uninitialized data, which may be used later for printing.
Assuming that implementations of mce_severity() either always or never
update the msg argument (which is currently the case), it is sufficient
to initialize the temporary variable in mce_no_way_out().
While at it, avoid printing a useless "Unknown".
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200103150722.20313-4-jschoenh@amazon.de
Since commit
8b38937b7a ("x86/mce: Do not enter deferred errors into the generic
pool twice")
the mce=nobootlog option has become mostly ineffective (after being only
slightly ineffective before), as the code is taking actions on MCEs left
over from boot when they have a usable address.
Move the check for MCP_DONTLOG a bit outward to make it effective again.
Also, since commit
011d826111 ("RAS: Add a Corrected Errors Collector")
the two branches of the remaining "if" at the bottom of machine_check_poll()
do same. Unify them.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200103150722.20313-3-jschoenh@amazon.de
Commit
fa92c58694 ("x86, mce: Support memory error recovery for both UCNA
and Deferred error in machine_check_poll")
added handling of UCNA and Deferred errors by adding them to the ring
for SRAO errors.
Later, commit
fd4cf79fcc ("x86/mce: Remove the MCE ring for Action Optional errors")
switched storage from the SRAO ring to the unified pool that is still
in use today. In order to only act on the intended errors, a filter
for MCE_AO_SEVERITY is used -- effectively removing handling of
UCNA/Deferred errors again.
Extend the severity filter to include UCNA/Deferred errors again.
Also, generalize the naming of the notifier from SRAO to UC to capture
the extended scope.
Note, that this change may cause a message like the following to appear,
as the same address may be reported as SRAO and as UCNA:
Memory failure: 0x5fe3284: already hardware poisoned
Technically, this is a return to previous behavior.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200103150722.20313-2-jschoenh@amazon.de
First, printk() is NMI-context safe now since the safe printk() has been
implemented and it already has an irq_work to make NMI-context safe.
Second, this NMI irq_work actually does not work if a NMI handler causes
panic by watchdog timeout. It has no chance to run in such case, while
the safe printk() will flush its per-cpu buffers before panicking.
While at it, repurpose the irq_work callback into a function which
concentrates the NMI duration checking and makes the code easier to
follow.
[ bp: Massage. ]
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
Use resource_size() rather than a verbose computation on
the end and start fields.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
<smpl>
@@ struct resource ptr; @@
- (ptr.end - ptr.start + 1)
+ resource_size(&ptr)
</smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1577900990-8588-10-git-send-email-Julia.Lawall@inria.fr
force_iret() was originally intended to prevent the return to user mode with
the SYSRET or SYSEXIT instructions, in cases where the register state could
have been changed to be incompatible with those instructions. The entry code
has been significantly reworked since then, and register state is validated
before SYSRET or SYSEXIT are used. force_iret() no longer serves its original
purpose and can be eliminated.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20191219115812.102620-1-brgerst@gmail.com
In __fpu__restore_sig(), fpu_fpregs_owner_ctx needs to be reset if the
FPU state was not fully restored. Otherwise the following may happen (on
the same CPU):
Task A Task B fpu_fpregs_owner_ctx
*active* A.fpu
__fpu__restore_sig()
ctx switch load B.fpu
*active* B.fpu
fpregs_lock()
copy_user_to_fpregs_zeroing()
copy_kernel_to_xregs() *modify*
copy_user_to_xregs() *fails*
fpregs_unlock()
ctx switch skip loading B.fpu,
*active* B.fpu
In the success case, fpu_fpregs_owner_ctx is set to the current task.
In the failure case, the FPU state might have been modified by loading
the init state.
In this case, fpu_fpregs_owner_ctx needs to be reset in order to ensure
that the FPU state of the following task is loaded from saved state (and
not skipped because it was the previous state).
Reset fpu_fpregs_owner_ctx after a failure during restore occurred, to
ensure that the FPU state for the next task is always loaded.
The problem was debugged-by Yu-cheng Yu <yu-cheng.yu@intel.com>.
[ bp: Massage commit message. ]
Fixes: 5f409e20b7 ("x86/fpu: Defer FPU state load until return to userspace")
Reported-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191220195906.plk6kpmsrikvbcfn@linutronix.de
To fix follwowing warning due to ORC sort moved to build time:
arch/x86/kernel/unwind_orc.c:210:12: warning: ‘orc_sort_cmp’ defined but not used [-Wunused-function]
arch/x86/kernel/unwind_orc.c:190:13: warning: ‘orc_sort_swap’ defined but not used [-Wunused-function]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/c9c81536-2afc-c8aa-c5f8-c7618ecd4f54@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a leftover. Page faults, just like most other exceptions,
are protected inside user_exit() / user_enter() calls in x86 entry code
when we fault from userspace. So this pair of calls is now superfluous.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Link: https://lkml.kernel.org/r/20191227163612.10039-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Have both xfeature_is_supervisor()/xfeature_is_user() return bool
because they are used only in boolean context.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191212210855.19260-3-yu-cheng.yu@intel.com
ioremap has provided non-cached semantics by default since the Linux 2.6
days, so remove the additional ioremap_nocache interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
set_cache_qos_cfg() is leaking memory when the given level is not
RDT_RESOURCE_L3 or RDT_RESOURCE_L2. At the moment, this function is
called with only valid levels but move the allocation after the valid
level checks in order to make it more robust and future proof.
[ bp: Massage commit message. ]
Fixes: 99adde9b37 ("x86/intel_rdt: Enable L2 CDP in MSR IA32_L2_QOS_CFG")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20200102165844.133133-1-shakeelb@google.com
Hoist the user_mode() case up because it is less code and can be dealt
with up-front like the other special cases UMIP and vm86.
This saves an indentation level for the kernel-mode #GP case and allows
to "unfold" the code more so that it is more readable.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: x86@kernel.org
Make #GP exceptions caused by out-of-bounds KASAN shadow accesses easier
to understand by computing the address of the original access and
printing that. More details are in the comments in the patch.
This turns an error like this:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault, probably for non-canonical address
0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
into this:
general protection fault, probably for non-canonical address
0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: maybe wild-memory-access in range
[0x00badbeefbadbee8-0x00badbeefbadbeef]
The hook is placed in architecture-independent code, but is currently
only wired up to the X86 exception handler because I'm not sufficiently
familiar with the address space layout and exception handling mechanisms
on other architectures.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191218231150.12139-4-jannh@google.com
Split __die() into __die_header() and __die_body(). This allows inserting
extra information below the header line that initiates the bug report.
Introduce a new function die_addr() that behaves like die(), but is for
faults only and uses __die_header() and __die_body() so that a future
commit can print extra information after the header line.
[ bp: Comment the KASAN-specific usage of gp_addr. ]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191218231150.12139-3-jannh@google.com
A frequent cause of #GP exceptions are memory accesses to non-canonical
addresses. Unlike #PF, #GP doesn't report a fault address in CR2, so the
kernel doesn't currently print the fault address for a #GP.
Luckily, the necessary infrastructure for decoding x86 instructions and
computing the memory address being accessed is already present. Hook
it up to the #GP handler so that the address operand of the faulting
instruction can be figured out and printed.
Distinguish two cases:
a) (Part of) the memory range being accessed lies in the non-canonical
address range; in this case, it is likely that the decoded address
is actually the one that caused the #GP.
b) The entire memory range of the decoded operand lies in canonical
address space; the #GP may or may not be related in some way to the
computed address. Print it, but with hedging language in the message.
While it is already possible to compute the faulting address manually by
disassembling the opcode dump and evaluating the instruction against the
register dump, this should make it slightly easier to identify crashes
at a glance.
Note that the operand length which comes from the instruction decoder
and is used to determine whether the access straddles into non-canonical
address space, is currently somewhat unreliable; but it should be good
enough, considering that Linux on x86-64 never maps the page directly
before the start of the non-canonical range anyway, and therefore the
case where a memory range begins in that page and potentially straddles
into the non-canonical range should be fairly uncommon.
In the case the address is still computed wrongly, it only influences
whether the error message claims that the access is canonical.
[ bp: Remove ambiguous "we", massage, reflow comments and spacing. ]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Tested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191218231150.12139-2-jannh@google.com
A system that supports resource monitoring may have multiple resources
while not all of these resources are capable of monitoring. Monitoring
related state is initialized only for resources that are capable of
monitoring and correspondingly this state should subsequently only be
removed from these resources that are capable of monitoring.
domain_add_cpu() calls domain_setup_mon_state() only when r->mon_capable
is true where it will initialize d->mbm_over. However,
domain_remove_cpu() calls cancel_delayed_work(&d->mbm_over) without
checking r->mon_capable resulting in an attempt to cancel d->mbm_over on
all resources, even those that never initialized d->mbm_over because
they are not capable of monitoring. Hence, it triggers a debugobjects
warning when offlining CPUs because those timer debugobjects are never
initialized:
ODEBUG: assert_init not available (active state 0) object type:
timer_list hint: 0x0
WARNING: CPU: 143 PID: 789 at lib/debugobjects.c:484
debug_print_object
Hardware name: HP Synergy 680 Gen9/Synergy 680 Gen9 Compute Module, BIOS I40 05/23/2018
RIP: 0010:debug_print_object
Call Trace:
debug_object_assert_init
del_timer
try_to_grab_pending
cancel_delayed_work
resctrl_offline_cpu
cpuhp_invoke_callback
cpuhp_thread_fun
smpboot_thread_fn
kthread
ret_from_fork
Fixes: e33026831b ("x86/intel_rdt/mbm: Handle counter overflow")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: tj@kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191211033042.2188-1-cai@lca.pw
Commit:
285a54efe3 ("x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception")
added an additional text_poke_sync() IPI to text_poke_bp_batch() to
handle the rare case where another CPU is still inside an INT3 handler
while we clear the global state.
Instead of spraying IPIs around, count the active INT3 handlers and
wait for them to go away before proceeding to clear/reuse the data.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86 kernels configured with CONFIG_PROC_KCORE=y and
CONFIG_KEXEC_CORE=n, the vmcoreinfo note in /proc/kcore is incomplete.
Specifically, it is missing arch-specific information like the KASLR
offset and whether 5-level page tables are enabled. This breaks
applications like drgn [1] and crash [2], which need this information
for live debugging via /proc/kcore.
This happens because:
1. CONFIG_PROC_KCORE selects CONFIG_CRASH_CORE.
2. kernel/crash_core.c (compiled if CONFIG_CRASH_CORE=y) calls
arch_crash_save_vmcoreinfo() to get the arch-specific parts of
vmcoreinfo. If it is not defined, then it uses a no-op fallback.
3. x86 defines arch_crash_save_vmcoreinfo() in
arch/x86/kernel/machine_kexec_*.c, which is only compiled if
CONFIG_KEXEC_CORE=y.
Therefore, an x86 kernel with CONFIG_CRASH_CORE=y and
CONFIG_KEXEC_CORE=n uses the no-op fallback and gets incomplete
vmcoreinfo data. This isn't relevant to kdump, which requires
CONFIG_KEXEC_CORE. It only affects applications which read vmcoreinfo at
runtime, like the ones mentioned above.
Fix it by moving arch_crash_save_vmcoreinfo() into two new
arch/x86/kernel/crash_core_*.c files, which are gated behind
CONFIG_CRASH_CORE.
1: 73dd7def12/libdrgn/program.c (L385)
2: 60a42d7092
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kairui Song <kasong@redhat.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/0589961254102cca23e3618b96541b89f2b249e2.1576858905.git.osandov@fb.com
Pull x86 RAS fixes from Borislav Petkov:
"Three urgent RAS fixes for the AMD side of things:
- initialize struct mce.bank so that calculated error severity on AMD
SMCA machines is correct
- do not send IPIs early during bank initialization, when interrupts
are disabled
- a fix for when only a subset of MCA banks are enabled, which led to
boot hangs on some new AMD CPUs"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix possibly incorrect severity calculation on AMD
x86/MCE/AMD: Allow Reserved types to be overwritten in smca_banks[]
x86/MCE/AMD: Do not use rdmsr_safe_on_cpu() in smca_configure()
Pull timer fixes from Ingo Molnar:
"Add HPET quirks for the Intel 'Coffee Lake H' and 'Ice Lake' platforms"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel: Disable HPET on Intel Ice Lake platforms
x86/intel: Disable HPET on Intel Coffee Lake H platforms
The mutex in mce_inject_log() became unnecessary with commit
5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver"),
though the original reason for its presence only vanished with commit
7298f08ea8 ("x86/mcelog: Get rid of RCU remnants").
Drop the mutex. And as that makes mce_inject_log() identical to mce_log(),
get rid of the former in favor of the latter.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191210000733.17979-7-jschoenh@amazon.de
In commit
b2f9d678e2 ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")
another call to mce_panic() was introduced. Pass the message of the
handled MCE to that instance of mce_panic() as well, as there doesn't
seem to be a reason not to.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191210000733.17979-6-jschoenh@amazon.de
throttle_active_work() is only called if CONFIG_SYSFS is set, otherwise
we get a harmless warning:
arch/x86/kernel/cpu/mce/therm_throt.c:238:13: error: 'throttle_active_work' \
defined but not used [-Werror=unused-function]
Mark the function as __maybe_unused to avoid the warning.
Fixes: f6656208f0 ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bberg@redhat.com
Cc: ckellner@redhat.com
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: hdegoede@redhat.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191210203925.3119091-1-arnd@arndb.de
The function mce_severity_amd_smca() requires m->bank to be initialized
for correct operation. Fix the one case, where mce_severity() is called
without doing so.
Fixes: 6bda529ec4 ("x86/mce: Grade uncorrected errors for SMCA-enabled systems")
Fixes: d28af26faa ("x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: https://lkml.kernel.org/r/20191210000733.17979-4-jschoenh@amazon.de
Each logical CPU in Scalable MCA systems controls a unique set of MCA
banks in the system. These banks are not shared between CPUs. The bank
types and ordering will be the same across CPUs on currently available
systems.
However, some CPUs may see a bank as Reserved/Read-as-Zero (RAZ) while
other CPUs do not. In this case, the bank seen as Reserved on one CPU is
assumed to be the same type as the bank seen as a known type on another
CPU.
In general, this occurs when the hardware represented by the MCA bank
is disabled, e.g. disabled memory controllers on certain models, etc.
The MCA bank is disabled in the hardware, so there is no possibility of
getting an MCA/MCE from it even if it is assumed to have a known type.
For example:
Full system:
Bank | Type seen on CPU0 | Type seen on CPU1
------------------------------------------------
0 | LS | LS
1 | UMC | UMC
2 | CS | CS
System with hardware disabled:
Bank | Type seen on CPU0 | Type seen on CPU1
------------------------------------------------
0 | LS | LS
1 | UMC | RAZ
2 | CS | CS
For this reason, there is a single, global struct smca_banks[] that is
initialized at boot time. This array is initialized on each CPU as it
comes online. However, the array will not be updated if an entry already
exists.
This works as expected when the first CPU (usually CPU0) has all
possible MCA banks enabled. But if the first CPU has a subset, then it
will save a "Reserved" type in smca_banks[]. Successive CPUs will then
not be able to update smca_banks[] even if they encounter a known bank
type.
This may result in unexpected behavior. Depending on the system
configuration, a user may observe issues enumerating the MCA
thresholding sysfs interface. The issues may be as trivial as sysfs
entries not being available, or as severe as system hangs.
For example:
Bank | Type seen on CPU0 | Type seen on CPU1
------------------------------------------------
0 | LS | LS
1 | RAZ | UMC
2 | CS | CS
Extend the smca_banks[] entry check to return if the entry is a
non-reserved type. Otherwise, continue so that CPUs that encounter a
known bank type can update smca_banks[].
Fixes: 68627a697c ("x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank type")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191121141508.141273-1-Yazen.Ghannam@amd.com
... because interrupts are disabled that early and sending IPIs can
deadlock:
BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
no locks held by swapper/1/0.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8106dda9>] copy_process+0x8b9/0x1ca0
softirqs last enabled at (0): [<ffffffff8106dda9>] copy_process+0x8b9/0x1ca0
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffffff8104703b>] start_secondary+0x3b/0x190
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.5.0-rc2+ #1
Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
Call Trace:
dump_stack
___might_sleep.cold.92
wait_for_completion
? generic_exec_single
rdmsr_safe_on_cpu
? wrmsr_on_cpus
mce_amd_feature_init
mcheck_cpu_init
identify_cpu
identify_secondary_cpu
smp_store_cpu_info
start_secondary
secondary_startup_64
The function smca_configure() is called only on the current CPU anyway,
therefore replace rdmsr_safe_on_cpu() with atomic rdmsr_safe() and avoid
the IPI.
[ bp: Update commit message. ]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/157252708836.3876.4604398213417262402.stgit@buzz
Remove two unused variables:
arch/x86/kernel/process.c: In function ‘__switch_to_xtra’:
arch/x86/kernel/process.c:618:31: warning: variable ‘next’ set but not used [-Wunused-but-set-variable]
618 | struct thread_struct *prev, *next;
| ^~~~
arch/x86/kernel/process.c:618:24: warning: variable ‘prev’ set but not used [-Wunused-but-set-variable]
618 | struct thread_struct *prev, *next;
|
They are never used and so can be removed.
Signed-off-by: yu kuai <yukuai3@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: yi.zhang@huawei.com
Cc: zhengbin13@huawei.com
Link: https://lkml.kernel.org/r/20191213121253.10072-1-yukuai3@huawei.com
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j
j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O
X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz
urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw
RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP
8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat
1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN
UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz
ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd
5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH
aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt
y6FM3lApOjk3OyaTJQ==
=YU4A
-----END PGP SIGNATURE-----
Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull FIELD_SIZEOF conversion from Kees Cook:
"A mostly mechanical treewide conversion from FIELD_SIZEOF() to
sizeof_field(). This avoids the redundancy of having 2 macros
(actually 3) doing the same thing, and consolidates on sizeof_field().
While "field" is not an accurate name, it is the common name used in
the kernel, and doesn't result in any unintended innuendo.
As there are still users of FIELD_SIZEOF() in -next, I will clean up
those during this coming development cycle and send the final old
macro removal patch at that time"
* tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use sizeof_field() macro
MIPS: OCTEON: Replace SIZEOF_FIELD() macro
Now that the orc_unwind and orc_unwind_ip tables are sorted at build time,
remove the boot time sorting pass.
No change in functionality.
[ mingo: Rewrote the changelog and code comments. ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-8-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Removal of code I accidentally applied when doing a minor fix up
to a patch, and then using "git commit -a --amend", which pulled
in some other changes I was playing with.
- Remove an used variable in trace_events_inject code
- Fix to function graph tracer when it traces a ftrace direct function.
It will now ignore tracing a function that has a ftrace direct
tramploine attached. This is needed for eBPF to use the ftrace direct
code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXfD/thQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoo2AP4j7ONw7BTmMyo+GdYqPPntBeDnClHK
vfMKrgK1j5BxYgEA7LgkwuUT9bcyLjfJVcyfeW67rB2PtmovKTWnKihFOwI=
=DZ6N
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Remove code I accidentally applied when doing a minor fix up to a
patch, and then using "git commit -a --amend", which pulled in some
other changes I was playing with.
- Remove an used variable in trace_events_inject code
- Fix function graph tracer when it traces a ftrace direct function.
It will now ignore tracing a function that has a ftrace direct
tramploine attached. This is needed for eBPF to use the ftrace direct
code.
* tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix function_graph tracer interaction with BPF trampoline
tracing: remove set but not used variable 'buffer'
module: Remove accidental change of module_enable_x()
Depending on type of BPF programs served by BPF trampoline it can call original
function. In such case the trampoline will skip one stack frame while
returning. That will confuse function_graph tracer and will cause crashes with
bad RIP. Teach graph tracer to skip functions that have BPF trampoline attached.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move the definition of acpi_get_wakeup_address() into sleep.c to break
linux/acpi.h's dependency (by way of asm/acpi.h) on asm/realmode.h.
Everyone and their mother includes linux/acpi.h, i.e. modifying
realmode.h results in a full kernel rebuild, which makes the already
inscrutable real mode boot code even more difficult to understand and is
positively rage inducing when trying to make changes to x86's boot flow.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Link: https://lkml.kernel.org/r/20191126165417.22423-13-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
None of the declarations in x86's acpi/sleep.h are in any way dependent
on the real mode boot code. Remove sleep.h's include of asm/realmode.h
to limit the dependencies on realmode.h to code that actually interacts
with the boot code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20191126165417.22423-11-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The inclusion of linux/vmalloc.h, which is required for its definition
of set_vm_flush_reset_perms(), is somehow dependent on asm/realmode.h
being included by asm/acpi.h. Explicitly include linux/vmalloc.h so
that a future patch can drop the realmode.h include from asm/acpi.h
without breaking the build.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191126165417.22423-5-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The inclusion of linux/vmalloc.h, which is required for its definition
of set_vm_flush_reset_perms(), is somehow dependent on asm/realmode.h
being included by asm/acpi.h. Explicitly include linux/vmalloc.h so
that a future patch can drop the realmode.h include from asm/acpi.h
without breaking the build.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191126165417.22423-4-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Explicitly include asm/realmode.h, which provides reserve_real_mode(),
instead of picking it up by an indirect include of asm/acpi.h. acpi.h
will soon stop including realmode.h so that changing realmode.h doesn't
require a full kernel rebuild.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Link: https://lkml.kernel.org/r/20191126165417.22423-3-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Untangle the somewhat incestous way of how VMALLOC_START is used all across the
kernel, but is, on x86, defined deep inside one of the lowest level page table headers.
It doesn't help that vmalloc.h only includes a single asm header:
#include <asm/page.h> /* pgprot_t */
So there was no existing cross-arch way to decouple address layout
definitions from page.h details. I used this:
#ifndef VMALLOC_START
# include <asm/vmalloc.h>
#endif
This way every architecture that wants to simplify page.h can do so.
- Also on x86 we had a couple of LDT related inline functions that used
the late-stage address space layout positions - but these could be
uninlined without real trouble - the end result is cleaner this way as
well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pat.h is a file whose main purpose is to provide the memtype_*() APIs.
PAT is the low level hardware mechanism - but the high level abstraction
is memtype.
So name the header <memtype.h> as well - this goes hand in hand with memtype.c
and memtype_interval.c.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update various comments, fix outright mistakes and meaningless descriptions.
Also harmonize the style across the file, both in form and in language.
Cc: linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In 20 years we accumulated 89 #include lines in setup.c,
but we only need 30 of them (!) ...
Get rid of the excessive ones, and while at it, sort the
remaining ones alphabetically.
Also get rid of the incomplete changelogs at the header of the file,
and explain better what this file does.
Cc: linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().
This patch is generated using following script:
EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"
git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do
if [[ "$file" =~ $EXCLUDE_FILES ]]; then
continue
fi
sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done
Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
Zhang Xiaoxu noted that physical address locations for MTRR were visible
to non-root users, which could be considered an information leak.
In discussing[1] the options for solving this, it sounded like just
moving the capable check into open() was the first step.
If this breaks userspace, then we will have a test case for the more
conservative approaches discussed in the thread. In summary:
- MTRR should check capabilities at open time (or retain the
checks on the opener's permissions for later checks).
- changing the DAC permissions might break something that expects to
open mtrr when not uid 0.
- if we leave the DAC permissions alone and just move the capable check
to the opener, we should get the desired protection. (i.e. check
against CAP_SYS_ADMIN not just the wider uid 0.)
- if that still breaks things, as in userspace expects to be able to
read other parts of the file as non-uid-0 and non-CAP_SYS_ADMIN, then
we need to censor the contents using the opener's permissions. For
example, as done in other /proc cases, like commit
51d7b12041 ("/proc/iomem: only expose physical resource addresses to privileged users").
[1] https://lore.kernel.org/lkml/201911110934.AC5BA313@keescook/
Reported-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-security-module@vger.kernel.org
Cc: Matthew Garrett <mjg59@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/201911181308.63F06502A1@keescook
Pull x86 fixes from Ingo Molnar:
"Various fixes:
- Fix the PAT performance regression that downgraded write-combining
device memory regions to uncached.
- There's been a number of bugs in 32-bit double fault handling -
hopefully all fixed now.
- Fix an LDT crash
- Fix an FPU over-optimization that broke with GCC9 code
optimizations.
- Misc cleanups"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/pat: Fix off-by-one bugs in interval tree search
x86/ioperm: Save an indentation level in tss_update_io_bitmap()
x86/fpu: Don't cache access to fpu_fpregs_owner_ctx
x86/entry/32: Remove unused 'restore_all_notrace' local label
x86/ptrace: Document FSBASE and GSBASE ABI oddities
x86/ptrace: Remove set_segment_reg() implementations for current
x86/traps: die() instead of panicking on a double fault
x86/doublefault/32: Rewrite the x86_32 #DF handler and unify with 64-bit
x86/doublefault/32: Move #DF stack and TSS to cpu_entry_area
x86/doublefault/32: Rename doublefault.c to doublefault_32.c
x86/traps: Disentangle the 32-bit and 64-bit doublefault code
lkdtm: Add a DOUBLE_FAULT crash type on x86
selftests/x86/single_step_syscall: Check SYSENTER directly
x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
Pull RAS fix from Borislav Petkov:
"One urgent fix for the thermal throttling machinery: the recent change
reworking the thermal notifications forgot to mask out read-only and
reserved bits in the thermal status MSRs, leading to exceptions while
writing those MSRs.
The fix takes care of masking out those bits first"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/therm_throt: Mask out read-only and reserved MSR bits
- improve dma-debug scalability (Eric Dumazet)
- tiny dma-debug cleanup (Dan Carpenter)
- check for vmap memory in dma_map_single (Kees Cook)
- check for dma_addr_t overflows in dma-direct when using
DMA offsets (Nicolas Saenz Julienne)
- switch the x86 sta2x11 SOC to use more generic DMA code
(Nicolas Saenz Julienne)
- fix arm-nommu dma-ranges handling (Vladimir Murzin)
- use __initdata in CMA (Shyam Saini)
- replace the bus dma mask with a limit (Nicolas Saenz Julienne)
- merge the remapping helpers into the main dma-direct flow (me)
- switch xtensa to the generic dma remap handling (me)
- various cleanups around dma_capable (me)
- remove unused dev arguments to various dma-noncoherent helpers (me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl3f+eULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPyPg/+PVHCrhmepudQQFHu6wfurE5U77iNnoUifvG+b5z5
5mHmTMkQwyox6rKDe8NuFApAhz1VJDSUgSelPmvTSOIEIGXCvX1p+GqRSVS5YQON
aLzGvbWKE8hCpaPdDHKYDauD1FZGMM8L2P5oOMF9X9fQ94xxRqfqJM6c8iD16Sgg
+aOgPNzTnxQHJFF/Dbt/mjJrKXWI+XF+bgUbH+l9yKa7Dd7ibmJR8yl9hs1jmp0H
1CZ+CizwnAs57rCd1a6Ybc6gj59tySc03NMnnbTko+KDxrcbD3Ee2tpqHVkkCjYz
Yl0m4FIpbotrpokL/FIS727bVvkjbWgoeM+kiVPoYzmZea3pq/tFDr6tp/BxDhFj
TZXSFfgQljlYMD3ppSoklFlfjGriVWV0tPO3arPXwuuMF5EX/IMQmvxei05jpc8n
iELNXOP9iZZkY4tLHy2hn2uWrxBRrS1WQwlLg9hahlNRzyfFSyHeP0zWlVDt+RgF
5CCbEI+HQcUqg1FApB30lQNWTn1+dJftrpKVBlgNBIyIa/z2rFbt8GdSnItxjfQX
/XX8EZbFvF6AcXkgURkYFIoKM/EbYShOSLcYA3PTUtcuTnF6Kk5eimySiGWZTVCS
prruSFDZJOvL3SnOIMIiYVmBdB7lEbDyLI/VYuhoECXEDCJpVmRktNkJNg4q6/E+
fjQ=
=e5wO
-----END PGP SIGNATURE-----
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux; tag 'dma-mapping-5.5' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- improve dma-debug scalability (Eric Dumazet)
- tiny dma-debug cleanup (Dan Carpenter)
- check for vmap memory in dma_map_single (Kees Cook)
- check for dma_addr_t overflows in dma-direct when using DMA offsets
(Nicolas Saenz Julienne)
- switch the x86 sta2x11 SOC to use more generic DMA code (Nicolas
Saenz Julienne)
- fix arm-nommu dma-ranges handling (Vladimir Murzin)
- use __initdata in CMA (Shyam Saini)
- replace the bus dma mask with a limit (Nicolas Saenz Julienne)
- merge the remapping helpers into the main dma-direct flow (me)
- switch xtensa to the generic dma remap handling (me)
- various cleanups around dma_capable (me)
- remove unused dev arguments to various dma-noncoherent helpers (me)
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux:
* tag 'dma-mapping-5.5' of git://git.infradead.org/users/hch/dma-mapping: (22 commits)
dma-mapping: treat dev->bus_dma_mask as a DMA limit
dma-direct: exclude dma_direct_map_resource from the min_low_pfn check
dma-direct: don't check swiotlb=force in dma_direct_map_resource
dma-debug: clean up put_hash_bucket()
powerpc: remove support for NULL dev in __phys_to_dma / __dma_to_phys
dma-direct: avoid a forward declaration for phys_to_dma
dma-direct: unify the dma_capable definitions
dma-mapping: drop the dev argument to arch_sync_dma_for_*
x86/PCI: sta2x11: use default DMA address translation
dma-direct: check for overflows on 32 bit DMA addresses
dma-debug: increase HASH_SIZE
dma-debug: reorder struct dma_debug_entry fields
xtensa: use the generic uncached segment support
dma-mapping: merge the generic remapping helpers into dma-direct
dma-direct: provide mmap and get_sgtable method overrides
dma-direct: remove the dma_handle argument to __dma_direct_alloc_pages
dma-direct: remove __dma_direct_free_pages
usb: core: Remove redundant vmap checks
kernel: dma-contiguous: mark CMA parameters __initdata/__initconst
dma-debug: add a schedule point in debug_dma_dump_mappings()
...
- PERAMAENT flag to ftrace_ops when attaching a callback to a function
As /proc/sys/kernel/ftrace_enabled when set to zero will disable all
attached callbacks in ftrace, this has a detrimental impact on live
kernel tracing, as it disables all that it patched. If a ftrace_ops
is registered to ftrace with the PERMANENT flag set, it will prevent
ftrace_enabled from being disabled, and if ftrace_enabled is already
disabled, it will prevent a ftrace_ops with PREMANENT flag set from
being registered.
- New register_ftrace_direct(). As eBPF would like to register its own
trampolines to be called by the ftrace nop locations directly,
without going through the ftrace trampoline, this function has been
added. This allows for eBPF trampolines to live along side of
ftrace, perf, kprobe and live patching. It also utilizes the ftrace
enabled_functions file that keeps track of functions that have been
modified in the kernel, to allow for security auditing.
- Allow for kernel internal use of ftrace instances. Subsystems in
the kernel can now create and destroy their own tracing instances
which allows them to have their own tracing buffer, and be able
to record events without worrying about other users from writing over
their data.
- New seq_buf_hex_dump() that lets users use the hex_dump() in their
seq_buf usage.
- Notifications now added to tracing_max_latency to allow user space
to know when a new max latency is hit by one of the latency tracers.
- Wider spread use of generic compare operations for use of bsearch and
friends.
- More synthetic event fields may be defined (32 up from 16)
- Use of xarray for architectures with sparse system calls, for the
system call trace events.
This along with small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXdwv4BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnB5AP91vsdHQjwE1+/UWG/cO+qFtKvn2QJK
QmBRIJNH/s+1TAD/fAOhgw+ojSK3o/qc+NpvPTEW9AEwcJL1wacJUn+XbQc=
=ztql
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New tracing features:
- New PERMANENT flag to ftrace_ops when attaching a callback to a
function.
As /proc/sys/kernel/ftrace_enabled when set to zero will disable
all attached callbacks in ftrace, this has a detrimental impact on
live kernel tracing, as it disables all that it patched. If a
ftrace_ops is registered to ftrace with the PERMANENT flag set, it
will prevent ftrace_enabled from being disabled, and if
ftrace_enabled is already disabled, it will prevent a ftrace_ops
with PREMANENT flag set from being registered.
- New register_ftrace_direct().
As eBPF would like to register its own trampolines to be called by
the ftrace nop locations directly, without going through the ftrace
trampoline, this function has been added. This allows for eBPF
trampolines to live along side of ftrace, perf, kprobe and live
patching. It also utilizes the ftrace enabled_functions file that
keeps track of functions that have been modified in the kernel, to
allow for security auditing.
- Allow for kernel internal use of ftrace instances.
Subsystems in the kernel can now create and destroy their own
tracing instances which allows them to have their own tracing
buffer, and be able to record events without worrying about other
users from writing over their data.
- New seq_buf_hex_dump() that lets users use the hex_dump() in their
seq_buf usage.
- Notifications now added to tracing_max_latency to allow user space
to know when a new max latency is hit by one of the latency
tracers.
- Wider spread use of generic compare operations for use of bsearch
and friends.
- More synthetic event fields may be defined (32 up from 16)
- Use of xarray for architectures with sparse system calls, for the
system call trace events.
This along with small clean ups and fixes"
* tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (51 commits)
tracing: Enable syscall optimization for MIPS
tracing: Use xarray for syscall trace events
tracing: Sample module to demonstrate kernel access to Ftrace instances.
tracing: Adding new functions for kernel access to Ftrace instances
tracing: Fix Kconfig indentation
ring-buffer: Fix typos in function ring_buffer_producer
ftrace: Use BIT() macro
ftrace: Return ENOTSUPP when DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not configured
ftrace: Rename ftrace_graph_stub to ftrace_stub_graph
ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
ftrace: Add helper find_direct_entry() to consolidate code
ftrace: Add another check for match in register_ftrace_direct()
ftrace: Fix accounting bug with direct->count in register_ftrace_direct()
ftrace/selftests: Fix spelling mistake "wakeing" -> "waking"
tracing: Increase SYNTH_FIELDS_MAX for synthetic_events
ftrace/samples: Add a sample module that implements modify_ftrace_direct()
ftrace: Add modify_ftrace_direct()
tracing: Add missing "inline" in stub function of latency_fsnotify()
tracing: Remove stray tab in TRACE_EVAL_MAP_FILE's help text
tracing: Use seq_buf_hex_dump() to dump buffers
...
ftracetest multiple_kprobes.tc testcase hits the following NULL pointer
exception:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
RIP: 0010:poke_int3_handler+0x39/0x100
Call Trace:
<IRQ>
do_int3+0xd/0xf0
int3+0x42/0x50
RIP: 0010:sched_clock+0x6/0x10
poke_int3_handler+0x39 was alternatives:958:
static inline void *text_poke_addr(struct text_poke_loc *tp)
{
return _stext + tp->rel_addr; <------ Here is line #958
}
This seems to be caused by tp (bp_patching.vec) being NULL but
bp_patching.nr_entries != 0. There is a small chance for this
to happen, because we have no synchronization between the zeroing
of bp_patching.nr_entries and before clearing bp_patching.vec.
Steve suggested we could fix this by adding sync_core(), because int3
is done with interrupts disabled, and the on_each_cpu() requires
all CPUs to have had their interrupts enabled.
[ mingo: Edited the comments and the changelog. ]
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: c0213b0ac0 ("x86/alternative: Batch of patch operations")
Link: https://lkml.kernel.org/r/157483421229.25881.15314414408559963162.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a few words describing how it is safe to overwrite the 4 bytes
after a kprobe. In specific it is possible the JMP.d32 required for
the optimized kprobe overwrites multiple instructions.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.401696663@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kprobes does something like:
register:
arch_arm_kprobe()
text_poke(INT3)
/* guarantees nothing, INT3 will become visible at some point, maybe */
kprobe_optimizer()
/* guarantees the bytes after INT3 are unused */
synchronize_rcu_tasks();
text_poke_bp(JMP32);
/* implies IPI-sync, kprobe really is enabled */
unregister:
__disarm_kprobe()
unoptimize_kprobe()
text_poke_bp(INT3 + tail);
/* implies IPI-sync, so tail is guaranteed visible */
arch_disarm_kprobe()
text_poke(old);
/* guarantees nothing, old will maybe become visible */
synchronize_rcu()
free-stuff
Now the problem is that on register, the synchronize_rcu_tasks() does
not imply sufficient to guarantee all CPUs have already observed INT3
(although in practice this is exceedingly unlikely not to have
happened) (similar to how MEMBARRIER_CMD_PRIVATE_EXPEDITED does not
imply MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
Worse, even if it did, we'd have to do 2 synchronize calls to provide
the guarantee we're looking for, the first to ensure INT3 is visible,
the second to guarantee nobody is then still using the instruction
bytes after INT3.
Similar on unregister; the synchronize_rcu() between
__unregister_kprobe_top() and __unregister_kprobe_bottom() does not
guarantee all CPUs are free of the INT3 (and observe the old text).
Therefore, sprinkle some IPI-sync love around. This guarantees that
all CPUs agree on the text and RCU once again provides the required
guaranteed.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.162172862@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
... because it calls the .init.text function text_poke_early(). That is
ok because it does call that function early, during boot.
Fixes: 9706f7c3531f ("x86/ftrace: Use text_poke()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191116204607.GC23231@zn.tnic
Employ the fact that all text must be within a s32 displacement of one
another to shrink the text_poke_loc::addr field. Make it relative to
_stext.
This then shrinks struct text_poke_loc to 16 bytes, and consequently
increases TP_VEC_MAX from 170 to 256.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.047052889@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Per the BUG_ON(len != insn.length) in text_poke_loc_init(), tp->len
must indeed be the same as text_opcode_size(tp->opcode). Use this to
remove this field from the structure.
Sadly, due to 8 byte alignment, this only increases the structure
padding.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.989922744@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move ftrace over to using the generic x86 text_poke functions; this
avoids having a second/different copy of that code around.
This also avoids ftrace violating the (new) W^X rule and avoids
fragmenting the kernel text page-tables, due to no longer having to
toggle them RW.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.761255803@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adding another text_poke_bp_batch() user made me realize the interface
is all sorts of wrong. The text poke vector should be internal to the
implementation.
This then results in a trivial interface:
text_poke_queue() - which has the 'normal' text_poke_bp() interface
text_poke_finish() - which takes no arguments and flushes any
pending text_poke()s.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.646280715@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Update the ACPICA code in the kernel to upstream revision 20191018
including:
* Fixes for Clang warnings (Bob Moore).
* Fix for possible overflow in get_tick_count() (Bob Moore).
* Introduction of acpi_unload_table() (Bob Moore).
* Debugger and utilities updates (Erik Schmauss).
* Fix for unloading tables loaded via configfs (Nikolaus Voss).
- Add support for EFI specific purpose memory to optionally allow
either application-exclusive or core-kernel-mm managed access to
differentiated memory (Dan Williams).
- Fix and clean up processing of the HMAT table (Brice Goglin,
Qian Cai, Tao Xu).
- Update the ACPI EC driver to make it work on systems with
hardware-reduced ACPI (Daniel Drake).
- Always build in support for the Generic Event Device (GED) to
allow one kernel binary to work both on systems with full
hardware ACPI and hardware-reduced ACPI (Arjan van de Ven).
- Fix the table unload mechanism to unregister platform devices
created when the given table was loaded (Andy Shevchenko).
- Rework the lid blacklist handling in the button driver and add
more lid quirks to it (Hans de Goede).
- Improve ACPI-based device enumeration for some platforms based
on Intel BayTrail SoCs (Hans de Goede).
- Add an OpRegion driver for the Cherry Trail Crystal Cove PMIC
and prevent handlers from being registered for unhandled PMIC
OpRegions (Hans de Goede).
- Unify ACPI _HID/_UID matching (Andy Shevchenko).
- Clean up documentation and comments (Cao jin, James Pack, Kacper
Piwiński).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl3dHNkSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx/NkP/2y6DWjslA6UW4gjZwaRBcjYoyWExMtQ
Z86goiRJtP+/NqOwm09wHFcV6FdZ4kitUno3UgMCDZJjrURapg1D0rxb1lSYtMzs
mGr2FBZlVsJ9erOVSzKj1x2afVhdgl0Rl0fxPzoKgCFt8tCJar6cXy4CVEQKdeLs
eUui2ksXMIEODGhpN/tr/fJqY4O4jlLmPY6gKWfFpSTsv6lnZmzcCxLf5EvUU7JW
O91/jXdWz4Vl6IdP32sce6dGDjkvwnY105c7HeBf5EQWUe9RHFuSex982qhCD8U+
iE+JzlhoYpUb03EktJSXbL++IKUHvoUpTanbhka6unMhazC86x0hDf7ruUtYo2Bk
V8347CFeQ1x2O5IabfJNnUfKaMYhYmOXIoFHJTLKFO5mcCJmP8KOOyDAYilC1psb
RJpl1fDoAhk7NqhMttyBqfxiotP0kMoKuqtAAl8Y0hTF0DwR9IfKntuTtp1yTGds
R4dpJrizUDzw1/o4fCWbc3dFZQR3NFGpL/EAyfPzqjGaeaBBkLoNYstqkal5XHwT
CILmQg2WHoNuQLXZ4NFFDrM2k2G+VUAjQdkYcb/MCOFbw+aTVPu1wyQq37RLtbMo
9UwGeeT6SXW3iA1nyMoM+YvitjmxS7gHPPPl+b9G6kBubAzBPp91Ra0Mj9dPIGRB
Evv5nzOIh8Hi
=7Cqr
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to upstream revision
20191018, add support for EFI specific purpose memory, update the ACPI
EC driver to make it work on systems with hardware-reduced ACPI,
improve ACPI-based device enumeration for some platforms, rework the
lid blacklist handling in the button driver and add more lid quirks to
it, unify ACPI _HID/_UID matching, fix assorted issues and clean up
the code and documentation.
Specifics:
- Update the ACPICA code in the kernel to upstream revision 20191018
including:
* Fixes for Clang warnings (Bob Moore)
* Fix for possible overflow in get_tick_count() (Bob Moore)
* Introduction of acpi_unload_table() (Bob Moore)
* Debugger and utilities updates (Erik Schmauss)
* Fix for unloading tables loaded via configfs (Nikolaus Voss)
- Add support for EFI specific purpose memory to optionally allow
either application-exclusive or core-kernel-mm managed access to
differentiated memory (Dan Williams)
- Fix and clean up processing of the HMAT table (Brice Goglin, Qian
Cai, Tao Xu)
- Update the ACPI EC driver to make it work on systems with
hardware-reduced ACPI (Daniel Drake)
- Always build in support for the Generic Event Device (GED) to allow
one kernel binary to work both on systems with full hardware ACPI
and hardware-reduced ACPI (Arjan van de Ven)
- Fix the table unload mechanism to unregister platform devices
created when the given table was loaded (Andy Shevchenko)
- Rework the lid blacklist handling in the button driver and add more
lid quirks to it (Hans de Goede)
- Improve ACPI-based device enumeration for some platforms based on
Intel BayTrail SoCs (Hans de Goede)
- Add an OpRegion driver for the Cherry Trail Crystal Cove PMIC and
prevent handlers from being registered for unhandled PMIC OpRegions
(Hans de Goede)
- Unify ACPI _HID/_UID matching (Andy Shevchenko)
- Clean up documentation and comments (Cao jin, James Pack, Kacper
Piwiński)"
* tag 'acpi-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
ACPI: OSI: Shoot duplicate word
ACPI: HMAT: use %u instead of %d to print u32 values
ACPI: NUMA: HMAT: fix a section mismatch
ACPI: HMAT: don't mix pxm and nid when setting memory target processor_pxm
ACPI: NUMA: HMAT: Register "soft reserved" memory as an "hmem" device
ACPI: NUMA: HMAT: Register HMAT at device_initcall level
device-dax: Add a driver for "hmem" devices
dax: Fix alloc_dax_region() compile warning
lib: Uplevel the pmem "region" ida to a global allocator
x86/efi: Add efi_fake_mem support for EFI_MEMORY_SP
arm/efi: EFI soft reservation to memblock
x86/efi: EFI soft reservation to E820 enumeration
efi: Common enable/disable infrastructure for EFI soft reservation
x86/efi: Push EFI_MEMMAP check into leaf routines
efi: Enumerate EFI_MEMORY_SP
ACPI: NUMA: Establish a new drivers/acpi/numa/ directory
ACPICA: Update version to 20191018
ACPICA: debugger: remove leading whitespaces when converting a string to a buffer
ACPICA: acpiexec: initialize all simple types and field units from user input
ACPICA: debugger: add field unit support for acpi_db_get_next_token
...
Pull perf updates from Ingo Molnar:
"The main kernel side changes in this cycle were:
- Various Intel-PT updates and optimizations (Alexander Shishkin)
- Prohibit kprobes on Xen/KVM emulate prefixes (Masami Hiramatsu)
- Add support for LSM and SELinux checks to control access to the
perf syscall (Joel Fernandes)
- Misc other changes, optimizations, fixes and cleanups - see the
shortlog for details.
There were numerous tooling changes as well - 254 non-merge commits.
Here are the main changes - too many to list in detail:
- Enhancements to core tooling infrastructure, perf.data, libperf,
libtraceevent, event parsing, vendor events, Intel PT, callchains,
BPF support and instruction decoding.
- There were updates to the following tools:
perf annotate
perf diff
perf inject
perf kvm
perf list
perf maps
perf parse
perf probe
perf record
perf report
perf script
perf stat
perf test
perf trace
- And a lot of other changes: please see the shortlog and Git log for
more details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (279 commits)
perf parse: Fix potential memory leak when handling tracepoint errors
perf probe: Fix spelling mistake "addrees" -> "address"
libtraceevent: Fix memory leakage in copy_filter_type
libtraceevent: Fix header installation
perf intel-bts: Does not support AUX area sampling
perf intel-pt: Add support for decoding AUX area samples
perf intel-pt: Add support for recording AUX area samples
perf pmu: When using default config, record which bits of config were changed by the user
perf auxtrace: Add support for queuing AUX area samples
perf session: Add facility to peek at all events
perf auxtrace: Add support for dumping AUX area samples
perf inject: Cut AUX area samples
perf record: Add aux-sample-size config term
perf record: Add support for AUX area sampling
perf auxtrace: Add support for AUX area sample recording
perf auxtrace: Move perf_evsel__find_pmu()
perf record: Add a function to test for kernel support for AUX area sampling
perf tools: Add kernel AUX area sampling definitions
perf/core: Make the mlock accounting simple again
perf report: Jump to symbol source view from total cycles view
...
seg_segment_reg() should be unreachable with task == current.
Rather than confusingly trying to make it work, just explicitly
disable this case.
(regset->get is used for current in the coredump code, but the ->set
interface is only used for ptrace, and you can't ptrace yourself.)
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A double fault has a decent chance of being recoverable by killing
the offending thread. Use die() so that we at least try to recover.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The old x86_32 doublefault_fn() was old and crufty, and it did not
even try to recover. do_double_fault() is much nicer. Rewrite the
32-bit double fault code to sanitize CPU state and call
do_double_fault(). This is mostly an exercise i386 archaeology.
With this patch applied, 32-bit double faults get a real stack trace,
just like 64-bit double faults.
[ mingo: merged the patch to a later kernel base. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are three problems with the current layout of the doublefault
stack and TSS. First, the TSS is only cacheline-aligned, which is
not enough -- if the hardware portion of the TSS (struct x86_hw_tss)
crosses a page boundary, horrible things happen [0]. Second, the
stack and TSS are global, so simultaneous double faults on different
CPUs will cause massive corruption. Third, the whole mechanism
won't work if user CR3 is loaded, resulting in a triple fault [1].
Let the doublefault stack and TSS share a page (which prevents the
TSS from spanning a page boundary), make it percpu, and move it into
cpu_entry_area. Teach the stack dump code about the doublefault
stack.
[0] Real hardware will read past the end of the page onto the next
*physical* page if a task switch happens. Virtual machines may
have any number of bugs, and I would consider it reasonable for
a VM to summarily kill the guest if it tries to task-switch to
a page-spanning TSS.
[1] Real hardware triple faults. At least some VMs seem to hang.
I'm not sure what's going on.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
doublefault.c now only contains 32-bit code. Rename it to
doublefault_32.c.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 64-bit doublefault handler is much nicer than the 32-bit one.
As a first step toward unifying them, make the 64-bit handler
self-contained. This should have no effect no functional effect
except in the odd case of x86_64 with CONFIG_DOUBLEFAULT=n in which
case it will change the logging a bit.
This also gets rid of CONFIG_DOUBLEFAULT configurability on 64-bit
kernels. It didn't do anything useful -- CONFIG_DOUBLEFAULT=n
didn't actually disable doublefault handling on x86_64.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 iopl updates from Ingo Molnar:
"This implements a nice simplification of the iopl and ioperm code that
Thomas Gleixner discovered: we can implement the IO privilege features
of the iopl system call by using the IO permission bitmap in
permissive mode, while trapping CLI/STI/POPF/PUSHF uses in user-space
if they change the interrupt flag.
This implements that feature, with testing facilities and related
cleanups"
[ "Simplification" may be an over-statement. The main goal is to avoid
the cli/sti of iopl by effectively implementing the IO port access
parts of iopl in terms of ioperm.
This may end up not workign well in case people actually depend on
cli/sti being available, or if there are mixed uses of iopl and
ioperm. We will see.. - Linus ]
* 'x86-iopl-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/ioperm: Fix use of deprecated config option
x86/entry/32: Clarify register saving in __switch_to_asm()
selftests/x86/iopl: Extend test to cover IOPL emulation
x86/ioperm: Extend IOPL config to control ioperm() as well
x86/iopl: Remove legacy IOPL option
x86/iopl: Restrict iopl() permission scope
x86/iopl: Fixup misleading comment
selftests/x86/ioperm: Extend testing so the shared bitmap is exercised
x86/ioperm: Share I/O bitmap if identical
x86/ioperm: Remove bitmap if all permissions dropped
x86/ioperm: Move TSS bitmap update to exit to user work
x86/ioperm: Add bitmap sequence number
x86/ioperm: Move iobitmap data into a struct
x86/tss: Move I/O bitmap data into a seperate struct
x86/io: Speedup schedule out of I/O bitmap user
x86/ioperm: Avoid bitmap allocation if no permissions are set
x86/ioperm: Simplify first ioperm() invocation logic
x86/iopl: Cleanup include maze
x86/tss: Fix and move VMX BUILD_BUG_ON()
x86/cpu: Unify cpu_init()
...
Pull x86 asm updates from Ingo Molnar:
"The main changes in this cycle were:
- Cross-arch changes to move the linker sections for NOTES and
EXCEPTION_TABLE into the RO_DATA area, where they belong on most
architectures. (Kees Cook)
- Switch the x86 linker fill byte from x90 (NOP) to 0xcc (INT3), to
trap jumps into the middle of those padding areas instead of
sliding execution. (Kees Cook)
- A thorough cleanup of symbol definitions within x86 assembler code.
The rather randomly named macros got streamlined around a
(hopefully) straightforward naming scheme:
SYM_START(name, linkage, align...)
SYM_END(name, sym_type)
SYM_FUNC_START(name)
SYM_FUNC_END(name)
SYM_CODE_START(name)
SYM_CODE_END(name)
SYM_DATA_START(name)
SYM_DATA_END(name)
etc - with about three times of these basic primitives with some
label, local symbol or attribute variant, expressed via postfixes.
No change in functionality intended. (Jiri Slaby)
- Misc other changes, cleanups and smaller fixes"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
x86/entry/64: Remove pointless jump in paranoid_exit
x86/entry/32: Remove unused resume_userspace label
x86/build/vdso: Remove meaningless CFLAGS_REMOVE_*.o
m68k: Convert missed RODATA to RO_DATA
x86/vmlinux: Use INT3 instead of NOP for linker fill bytes
x86/mm: Report actual image regions in /proc/iomem
x86/mm: Report which part of kernel image is freed
x86/mm: Remove redundant address-of operators on addresses
xtensa: Move EXCEPTION_TABLE to RO_DATA segment
powerpc: Move EXCEPTION_TABLE to RO_DATA segment
parisc: Move EXCEPTION_TABLE to RO_DATA segment
microblaze: Move EXCEPTION_TABLE to RO_DATA segment
ia64: Move EXCEPTION_TABLE to RO_DATA segment
h8300: Move EXCEPTION_TABLE to RO_DATA segment
c6x: Move EXCEPTION_TABLE to RO_DATA segment
arm64: Move EXCEPTION_TABLE to RO_DATA segment
alpha: Move EXCEPTION_TABLE to RO_DATA segment
x86/vmlinux: Move EXCEPTION_TABLE to RO_DATA segment
x86/vmlinux: Actually use _etext for the end of the text segment
vmlinux.lds.h: Allow EXCEPTION_TABLE to live in RO_DATA
...
Pull x86 fixes from Ingo Molnar:
"These are the fixes left over from the v5.4 cycle:
- Various low level 32-bit entry code fixes and improvements by Andy
Lutomirski, Peter Zijlstra and Thomas Gleixner.
- Fix 32-bit Xen PV breakage, by Jan Beulich"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3
x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise
selftests/x86/sigreturn/32: Invalidate DS and ES when abusing the kernel
selftests/x86/mov_ss_trap: Fix the SYSENTER test
x86/entry/32: Fix NMI vs ESPFIX
x86/entry/32: Unwind the ESPFIX stack earlier on exception entry
x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALL
x86/entry/32: Use %ss segment where required
x86/entry/32: Fix IRET exception
x86/cpu_entry_area: Add guard page for entry stack on 32bit
x86/pti/32: Size initial_page_table correctly
x86/doublefault/32: Fix stack canaries in the double fault handler
x86/xen/32: Simplify ring check in xen_iret_crit_fixup()
x86/xen/32: Make xen_iret_crit_fixup() independent of frame layout
x86/stackframe/32: Repair 32-bit Xen PV
Pull x86 PTI updates from Ingo Molnar:
"Fix reporting bugs of the MDS and TAA mitigation status, if one or
both are set via a boot option.
No change to mitigation behavior intended"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Fix redundant MDS mitigation message
x86/speculation: Fix incorrect MDS/TAA mitigation status
Pull x86 platform updates from Ingo Molnar:
"UV platform updates (with a 'hubless' variant) and Jailhouse updates
for better UART support"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/jailhouse: Only enable platform UARTs if available
x86/jailhouse: Improve setup data version comparison
x86/platform/uv: Account for UV Hubless in is_uvX_hub Ops
x86/platform/uv: Check EFI Boot to set reboot type
x86/platform/uv: Decode UVsystab Info
x86/platform/uv: Add UV Hubbed/Hubless Proc FS Files
x86/platform/uv: Setup UV functions for Hubless UV Systems
x86/platform/uv: Add return code to UV BIOS Init function
x86/platform/uv: Return UV Hubless System Type
x86/platform/uv: Save OEM_ID from ACPI MADT probe
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- A PAT series from Davidlohr Bueso, which simplifies the memtype
rbtree by using the interval tree helpers. (There's more cleanups
in this area queued up, but they didn't make the merge window.)
- Also flip over CONFIG_X86_5LEVEL to default-y. This might draw in a
few more testers, as all the major distros are going to have
5-level paging enabled by default in their next iterations.
- Misc cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/pat: Rename pat_rbtree.c to pat_interval.c
x86/mm/pat: Drop the rbt_ prefix from external memtype calls
x86/mm/pat: Do not pass 'rb_root' down the memtype tree helper functions
x86/mm/pat: Convert the PAT tree to a generic interval tree
x86/mm: Clean up the pmd_read_atomic() comments
x86/mm: Fix function name typo in pmd_read_atomic() comment
x86/cpu: Clean up intel_tlb_table[]
x86/mm: Enable 5-level paging support by default
Pull x86 kdump updates from Ingo Molnar:
"This solves a kdump artifact where encrypted memory contents are
dumped, instead of unencrypted ones.
The solution also happens to simplify the kdump code, to everyone's
delight"
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/crash: Align function arguments on opening braces
x86/kdump: Remove the backup region handling
x86/kdump: Always reserve the low 1M when the crashkernel option is specified
x86/crash: Add a forward declaration of struct kimage
Pull x86 hyperv updates from Ingo Molnar:
"Misc updates to the hyperv guest code:
- Rework clockevents initialization to better support hibernation
- Allow guests to enable InvariantTSC
- Micro-optimize send_ipi_one"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyperv: Initialize clockevents earlier in CPU onlining
x86/hyperv: Allow guests to enable InvariantTSC
x86/hyperv: Micro-optimize send_ipi_one()
Pull x86 cpu and fpu updates from Ingo Molnar:
- math-emu fixes
- CPUID updates
- sanity-check RDRAND output to see whether the CPU at least pretends
to produce random data
- various unaligned-access across cachelines fixes in preparation of
hardware level split-lock detection
- fix MAXSMP constraints to not allow !CPUMASK_OFFSTACK kernels with
larger than 512 NR_CPUS
- misc FPU related cleanups
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Align the x86_capability array to size of unsigned long
x86/cpu: Align cpu_caps_cleared and cpu_caps_set to unsigned long
x86/umip: Make the comments vendor-agnostic
x86/Kconfig: Rename UMIP config parameter
x86/Kconfig: Enforce limit of 512 CPUs with MAXSMP and no CPUMASK_OFFSTACK
x86/cpufeatures: Add feature bit RDPRU on AMD
x86/math-emu: Limit MATH_EMULATION to 486SX compatibles
x86/math-emu: Check __copy_from_user() result
x86/rdrand: Sanity-check RDRAND output
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Use XFEATURE_FP/SSE enum values instead of hardcoded numbers
x86/fpu: Shrink space allocated for xstate_comp_offsets
x86/fpu: Update stale variable name in comment
Pull x86 boot updates from Ingo Molnar:
"The main changes were:
- Extend the boot protocol to allow future extensions without hitting
the setup_header size limit.
- Add quirk to devicetree systems to disable the RTC unless it's
listed as a supported device.
- Fix ld.lld linker pedantry"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Introduce setup_indirect
x86/boot: Introduce kernel_info.setup_type_max
x86/boot: Introduce kernel_info
x86/init: Allow DT configured systems to disable RTC at boot time
x86/realmode: Explicitly set entry point via ENTRY in linker script
Pull x86 objtool, cleanup, and apic updates from Ingo Molnar:
"Objtool:
- Fix a gawk 5.0 incompatibility in gen-insn-attr-x86.awk. Most
distros are still on gawk 4.2.x.
Cleanup:
- Misc cleanups, plus the removal of obsolete code such as Calgary
IOMMU support, which code hasn't seen any real testing in a long
time and there's no known users left.
apic:
- Two changes: a cleanup and a fix for an (old) race for oneshot
threaded IRQ handlers"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn: Fix awk regexp warnings
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Remove unused asm/rio.h
x86: Fix typos in comments
x86/pci: Remove #ifdef __KERNEL__ guard from <asm/pci.h>
x86/pci: Remove pci_64.h
x86: Remove the calgary IOMMU driver
x86/apic, x86/uprobes: Correct parameter names in kernel-doc comments
x86/kdump: Remove the unused crash_copy_backup_region()
x86/nmi: Remove stale EDAC include leftover
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Rename misnamed functions
x86/ioapic: Prevent inconsistent state when moving an interrupt
* acpi-mm:
ACPI: HMAT: use %u instead of %d to print u32 values
ACPI: NUMA: HMAT: fix a section mismatch
ACPI: HMAT: don't mix pxm and nid when setting memory target processor_pxm
ACPI: NUMA: HMAT: Register "soft reserved" memory as an "hmem" device
ACPI: NUMA: HMAT: Register HMAT at device_initcall level
device-dax: Add a driver for "hmem" devices
dax: Fix alloc_dax_region() compile warning
lib: Uplevel the pmem "region" ida to a global allocator
x86/efi: Add efi_fake_mem support for EFI_MEMORY_SP
arm/efi: EFI soft reservation to memblock
x86/efi: EFI soft reservation to E820 enumeration
efi: Common enable/disable infrastructure for EFI soft reservation
x86/efi: Push EFI_MEMMAP check into leaf routines
efi: Enumerate EFI_MEMORY_SP
ACPI: NUMA: Establish a new drivers/acpi/numa/ directory