Commit Graph

12370 Commits

Author SHA1 Message Date
Josh Poimboeuf
cb76c93982 x86/dumpstack: Add get_stack_info() interface
valid_stack_ptr() is buggy: it assumes that all stacks are of size
THREAD_SIZE, which is not true for exception stacks.  So the
walk_stack() callbacks will need to know the location of the beginning
of the stack as well as the end.

Another issue is that in general the various features of a stack (type,
size, next stack pointer, description string) are scattered around in
various places throughout the stack dump code.

Encapsulate all that information in a single place with a new stack_info
struct and a get_stack_info() interface.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8164dd0db96b7e6a279fa17ae5e6dc375eecb4a9.1473905218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:13:15 +02:00
Josh Poimboeuf
9c00390757 x86/dumpstack: Simplify in_exception_stack()
in_exception_stack() does some bad, bad things just so the unwinder can
print different values for different areas of the debug exception stack.

There's no need to clarify where exactly on the stack it is.  Just print
"#DB" and be done with it.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e91cb410169dd576678dd427c35efb716fd0cee1.1473905218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:13:14 +02:00
Josh Poimboeuf
cfeeed279d x86/dumpstack: Allow preemption in show_stack_log_lvl() and dump_trace()
show_stack_log_lvl() and dump_trace() are already preemption safe:

- If they're running in irq or exception context, preemption is already
  disabled and the percpu stack pointers can be trusted.

- If they're running with preemption enabled, they must be running on
  the task stack anyway, so it doesn't matter if they're comparing the
  stack pointer against a percpu stack pointer from this CPU or another
  one: either way it won't match.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0ca0b1044eca97d4f0ec7c1619cf80b3b65560d.1473371307.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-14 17:23:30 +02:00
Josh Poimboeuf
5a8ff54c26 x86/dumpstack: Remove unnecessary stack pointer arguments
When calling show_stack_log_lvl() or dump_trace() with a regs argument,
providing a stack pointer or frame pointer is redundant.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>d
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1694e2e955e3b9a73a3c3d5ba2634344014dd550.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Josh Poimboeuf
4b8afafbe7 x86/dumpstack: Add get_stack_pointer() and get_frame_pointer()
The various functions involved in dumping the stack all do similar
things with regard to getting the stack pointer and the frame pointer
based on the regs and task arguments.  Create helper functions to
do that instead.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f448914885a35f333fe04da1b97a6c2cc1f80974.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Josh Poimboeuf
d438f5fda3 x86/dumpstack: Make printk_stack_address() more generally useful
Change printk_stack_address() to be useful when called by an unwinder
outside the context of dump_trace().

Specifically:

- printk_stack_address()'s 'data' argument is always used as the log
  level string.  Make that explicit.

- Call touch_nmi_watchdog().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/9fbe0db05bacf66d337c162edbf61450d0cff1e2.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Andy Lutomirski
6271cfdfc0 x86/mm: Improve stack-overflow #PF handling
If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6d6cf96b3fb9b4c9aa303817e1dc4de0c7c36487.1472603235.git.luto@kernel.org
[ Minor edit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:47:20 +02:00
Ingo Molnar
2b3061c77c Merge branch 'x86/mm' into x86/asm, to unify the two branches for simplicity
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:41:52 +02:00
Brian Gerst
ffcb043ba5 sched/x86: Fix thread_saved_pc()
thread_saved_pc() was using a completely bogus method to get the return
address.  Since switch_to() was previously inlined, there was no sane way
to know where on the stack the return address was stored.  Now with the
frame of a sleeping thread well defined, this can be implemented correctly.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-7-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:31:51 +02:00
Brian Gerst
616d24835e sched/x86: Pass kernel thread parameters in 'struct fork_frame'
Instead of setting up a fake pt_regs context, put the kernel thread
function pointer and arg into the unused callee-restored registers
of 'struct fork_frame'.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-6-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:31:50 +02:00
Brian Gerst
0100301bfd sched/x86: Rewrite the switch_to() code
Move the low-level context switch code to an out-of-line asm stub instead of
using complex inline asm.  This allows constructing a new stack frame for the
child process to make it seamlessly flow to ret_from_fork without an extra
test and branch in __switch_to().  It also improves code generation for
__schedule() by using the C calling convention instead of clobbering all
registers.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-5-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:31:41 +02:00
Brian Gerst
7b32aeadbc sched/x86: Add 'struct inactive_task_frame' to better document the sleeping task stack frame
Add 'struct inactive_task_frame', which defines the layout of the stack for
a sleeping process.  For now, the only defined field is the BP register
(frame pointer).

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-4-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:27:41 +02:00
Brian Gerst
163630191e sched/x86/64, kgdb: Clear GDB_PS on 64-bit
switch_to() no longer saves EFLAGS, so it's bogus to look for it on the
stack.  Set it to zero like 32-bit.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-3-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:27:40 +02:00
Brian Gerst
4e047aa7f2 sched/x86/32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs()
Match 64-bit and set gdb_regs[GDB_PC] to zero.  thread.ip is always the
same point in the scheduler (except for newly forked processes), and will
be removed in a future patch.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-2-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:27:40 +02:00
Josh Poimboeuf
13e25bab7e x86/dumpstack/ftrace: Don't print unreliable addresses in print_context_stack_bp()
When function graph tracing is enabled, print_context_stack_bp() can
report return_to_handler() as an unreliable address, which is confusing
and misleading: return_to_handler() is really only useful as a hint for
debugging, whereas print_context_stack_bp() users only care about the
actual 'reliable' call path.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c51aef578d8027791b38d2ad9bac0c7f499fde91.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:15:15 +02:00
Josh Poimboeuf
6f727b84e2 x86/dumpstack/ftrace: Mark function graph handler function as unreliable
When function graph tracing is enabled for a function, its return
address on the stack is replaced with the address of an ftrace handler
(return_to_handler).

Currently 'return_to_handler' can be reported as reliable.  That's not
ideal, and can actually be misleading.  When saving or dumping the
stack, you normally only care about what led up to that point (the call
path), rather than what will happen in the future (the return path).

That's especially true in the non-oops stack trace case, which isn't
used for debugging.  For example, in a perf profiling operation,
reporting return_to_handler() in the trace would just be confusing.

And in the oops case, where debugging is important, "unreliable" is also
more appropriate there because it serves as a hint that graph tracing
was involved, instead of trying to imply that return_to_handler() was
the real caller.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f8af15749c7d632d3e7f815995831d5b7f82950d.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:15:15 +02:00
Josh Poimboeuf
471bd10f5e ftrace/x86: Implement HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
Use the more reliable version of ftrace_graph_ret_addr() so we no longer
have to worry about the unwinder getting out of sync with the function
graph ret_stack index, which can happen if the unwinder skips any frames
before calling ftrace_graph_ret_addr().

This fixes this issue (and several others like it):

  $ cat /proc/self/stack
  [<ffffffff810489a2>] save_stack_trace_tsk+0x22/0x40
  [<ffffffff81311a89>] proc_pid_stack+0xb9/0x110
  [<ffffffff813127c4>] proc_single_show+0x54/0x80
  [<ffffffff812be088>] seq_read+0x108/0x3e0
  [<ffffffff812923d7>] __vfs_read+0x37/0x140
  [<ffffffff812929d9>] vfs_read+0x99/0x140
  [<ffffffff81293f28>] SyS_read+0x58/0xc0
  [<ffffffff818af97c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
  [<ffffffffffffffff>] 0xffffffffffffffff

  $ echo function_graph > /sys/kernel/debug/tracing/current_tracer

  $ cat /proc/self/stack
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff810394cc>] print_context_stack+0xfc/0x100
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff8103891b>] dump_trace+0x12b/0x350
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff810489a2>] save_stack_trace_tsk+0x22/0x40
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff81311a89>] proc_pid_stack+0xb9/0x110
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff813127c4>] proc_single_show+0x54/0x80
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff812be088>] seq_read+0x108/0x3e0
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff812923d7>] __vfs_read+0x37/0x140
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff812929d9>] vfs_read+0x99/0x140
  [<ffffffffffffffff>] 0xffffffffffffffff

Enabling function graph tracing causes the stack trace to change in two
ways:

First, the real call addresses are confusingly interspersed with
'return_to_handler' addresses.  This issue will be fixed by the next
patch.

Second, the stack trace is offset by two frames, because the unwinder
skipped the first two frames and got out of sync with the ret_stack
index.  This patch fixes this issue.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a6d623e36f8d08f9a17bd74d804d201177a23afd.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:15:15 +02:00
Josh Poimboeuf
408fe5de2f x86/dumpstack/ftrace: Convert dump_trace() callbacks to use ftrace_graph_ret_addr()
Convert print_context_stack() and print_context_stack_bp() to use the
arch-independent ftrace_graph_ret_addr() helper.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/56ec97cafc1bf2e34d1119e6443d897db406da86.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:15:14 +02:00
Josh Poimboeuf
9a7c348ba6 ftrace: Add return address pointer to ftrace_ret_stack
Storing this value will help prevent unwinders from getting out of sync
with the function graph tracer ret_stack.  Now instead of needing a
stateful iterator, they can compare the return address pointer to find
the right ret_stack entry.

Note that an array of 50 ftrace_ret_stack structs is allocated for every
task.  So when an arch implements this, it will add either 200 or 400
bytes of memory usage per task (depending on whether it's a 32-bit or
64-bit platform).

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a95cfcc39e8f26b89a430c56926af0bb217bc0a1.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:15:14 +02:00
Andy Lutomirski
e37e43a497 x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
This allows x86_64 kernels to enable vmapped stacks by setting
HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y
high level Kconfig option.

There are a couple of interesting bits:

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and them oops without reliably triggering a
double-fault, and no_context doesn't know about stack overflows.
The next patch will improve that case.

Thank you to Nadav and Brian for helping me pay enough attention to
the SDM to hopefully get this right.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c88f3e2920b18e6cc621d772a04a62c06869037e.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:11:42 +02:00
Borislav Petkov
556b672368 x86/entry: Remove outdated comment about SYSCALL targets
The comment probably meant some old AMD64 incarnation which most likely
never saw the light of day. STAR and LSTAR are two different registers
and STAR sets CS/SS(DS) selectors for *all* modes, not only 32-bit.

So simply remove that comment.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160823172356.15879-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 11:20:31 +02:00
Josh Poimboeuf
4950d6d48a x86/dumpstack: Remove 64-byte gap at end of irq stack
There has been a 64-byte gap at the end of the irq stack for at least 12
years.  It predates git history, and I can't find any good reason for
it.  Remove it.  What's the worst that could happen?

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/14f9281c5475cc44af95945ea7546bff2e3836db.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:33 +02:00
Josh Poimboeuf
72b4f6a5e9 x86/dumpstack: Fix x86_32 kernel_stack_pointer() previous stack access
On x86_32, when an interrupt happens from kernel space, SS and SP aren't
pushed and the existing stack is used.  So pt_regs is effectively two
words shorter, and the previous stack pointer is normally the memory
after the shortened pt_regs, aka '&regs->sp'.

But in the rare case where the interrupt hits right after the stack
pointer has been changed to point to an empty stack, like for example
when call_on_stack() is used, the address immediately after the
shortened pt_regs is no longer on the stack.  In that case, instead of
'&regs->sp', the previous stack pointer should be retrieved from the
beginning of the current stack page.

kernel_stack_pointer() wants to do that, but it forgets to dereference
the pointer.  So instead of returning a pointer to the previous stack,
it returns a pointer to the beginning of the current stack.

Note that it's probably outside of kernel_stack_pointer()'s scope to be
switching stacks at all.  The x86_64 version of this function doesn't do
it, and it would be better for the caller to do it if necessary.  But
that's a patch for another day.  This just fixes the original intent.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0788aa6a23 ("x86: Prepare removal of previous_esp from i386 thread_info structure")
Link: http://lkml.kernel.org/r/472453d6e9f6a2d4ab16aaed4935f43117111566.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:31 +02:00
Josh Poimboeuf
ae952ffdfd x86/head: Remove useless zeroed word
This zeroed word has no apparent purpose, so remove it.

Brian Gerst says:

  "FYI the word used to be the SS segment selector for the LSS
   instruction, which isn't needed in 64-bit mode."

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b056855c295bbb3825b97c1e9f7958539a4d6cf2.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:30 +02:00
Josh Poimboeuf
6225f3232a x86/dumpstack: Remove extra brackets around "<EOE>"
When starting the dump of an exception stack, it shows "<<EOE>>" instead
of "<EOE>".  print_trace_stack() already adds brackets, no need to add
them again.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/77f185fd5b81845869b400aa619415458df6b6cc.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:29 +02:00
Josh Poimboeuf
b32f96c75d x86/asm/head: Rename 'stack_start' -> 'initial_stack'
The 'stack_start' variable is similar in usage to 'initial_code' and
'initial_gs': they're all stored in head_64.S and they're all updated by
SMP and ACPI suspend before starting a CPU.

Rename it to 'initial_stack' to be consistent with the others.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87063d773a3212051b77e17b0ee427f6582a5050.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:29 +02:00
Josh Poimboeuf
bf255bdaad x86/dumpstack: Remove show_trace()
There are a bewildering array of options for dumping the stack.
Simplify things a little by removing show_trace(), which is unused.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fe02292eac9d409001ec0cf6d06f90ced242570d.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:27 +02:00
Jiri Olsa
7b0501b1e7 x86/smp: Fix __max_logical_packages value setup
Frank reported kernel panic when he disabled several cores in BIOS
via following option:

  Core Disable Bitmap(Hex)   [0]

with number 0xFFE, which leaves 16 CPUs in system (out of 48).

The kernel panic below goes along with following messages:

 smpboot: Max logical packages: 2^M
 smpboot: APIC(0) Converting physical 0 to logical package 0^M
 smpboot: APIC(20) Converting physical 1 to logical package 1^M
 smpboot: APIC(40) Package 2 exceeds logical package map^M
 smpboot: CPU 8 APICId 40 disabled^M
 smpboot: APIC(60) Package 3 exceeds logical package map^M
 smpboot: CPU 12 APICId 60 disabled^M
 ...
 general protection fault: 0000 [#1] SMP^M
 Modules linked in:^M
 CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
 Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
 task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
 RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
 ...
  [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
  [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
  [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
  [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
  [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
  [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
  [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
  [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
  [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
  [<ffffffff816dc190>] ? rest_init+0x80/0x80^M

The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).

The __max_logical_packages is computed as:

  DIV_ROUND_UP(total_cpus, ncpus);
  - ncpus being number of cores

With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12).

Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.

The fix is to remove logical_package_map bitmap completely
and keep and update the logical_packages number instead.

After we enumerate all the present CPUs, we check if the
enumerated logical packages count is within its computed
maximum from BIOS data.

If it's not the case, we set this maximum to the new enumerated
value and freeze any new addition of logical packages.

The freeze is because lot of init code like uncore/rapl/cqm
depends on having maximum logical package value set to allocate
their data, so we can't change it later on.

Prarit Bhargava tested the patch and confirms that it solves
the problem:

  From dmidecode:
          Core Count: 24
          Core Enabled: 24
          Thread Count: 48

Orig kernel boot log:

 [    0.464981] smpboot: Max logical packages: 19
 [    0.469861] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.477261] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.484760] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.492258] smpboot: APIC(c0) Converting physical 3 to logical package 3

1.  nr_cpus=8, should stop enumerating in package 0:

 [    0.533664] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.539596] smpboot: Max logical packages: 19

2.  max_cpus=8, should still enumerate all packages:

 [    0.526494] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.532428] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.538456] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.544486] smpboot: APIC(c0) Converting physical 3 to logical package 3
 [    0.550524] smpboot: Max logical packages: 19

3.  nr_cpus=49 ( 2 socket + 1 core on 3rd socket), should stop enumerating in
    package 2:

 [    0.521378] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.527314] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.533345] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.539368] smpboot: Max logical packages: 19

4.  maxcpus=49, should still enumerate all packages:

 [    0.525591] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.531525] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.537547] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.543579] smpboot: APIC(c0) Converting physical 3 to logical package 3
 [    0.549624] smpboot: Max logical packages: 19

5.  kdump (nr_cpus=1) works as well.

Reported-by: Frank Ramsay <framsay@redhat.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160815101700.GA30090@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 10:14:48 +02:00
Borislav Petkov
88b2f63402 x86/microcode/AMD: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y
Similar to:

  efaad554b4 ("x86/microcode/intel: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y")

... fix microcode loading from the initrd on AMD by adding the
randomization offset to the microcode patch container within the initrd.

Reported-and-tested-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-tip-commits@vger.kernel.org
Link: http://lkml.kernel.org/r/20160817113314.GA19221@nazgul.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 10:06:49 +02:00
Linus Torvalds
01ea443982 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "This is bigger than usual - the reason is partly a pent-up stream of
  fixes after the merge window and partly accidental.  The fixes are:

   - five patches to fix a boot failure on Andy Lutomirsky's laptop
   - four SGI UV platform fixes
   - KASAN fix
   - warning fix
   - documentation update
   - swap entry definition fix
   - pkeys fix
   - irq stats fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/x2apic, smp/hotplug: Don't use before alloc in x2apic_cluster_probe()
  x86/efi: Allocate a trampoline if needed in efi_free_boot_services()
  x86/boot: Rework reserve_real_mode() to allow multiple tries
  x86/boot: Defer setup_real_mode() to early_initcall time
  x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly
  x86/boot: Run reserve_bios_regions() after we initialize the memory map
  x86/irq: Do not substract irq_tlb_count from irq_call_count
  x86/mm: Fix swap entry comment and macro
  x86/mm/kaslr: Fix -Wformat-security warning
  x86/mm/pkeys: Fix compact mode by removing protection keys' XSAVE buffer manipulation
  x86/build: Reduce the W=1 warnings noise when compiling x86 syscall tables
  x86/platform/UV: Fix kernel panic running RHEL kdump kernel on UV systems
  x86/platform/UV: Fix problem with UV4 BIOS providing incorrect PXM values
  x86/platform/UV: Fix bug with iounmap() of the UV4 EFI System Table causing a crash
  x86/platform/UV: Fix problem with UV4 Socket IDs not being contiguous
  x86/entry: Clarify the RF saving/restoring situation with SYSCALL/SYSRET
  x86/mm: Disable preemption during CR3 read+write
  x86/mm/KASLR: Increase BRK pages for KASLR memory randomization
  x86/mm/KASLR: Fix physical memory calculation on KASLR memory randomization
  x86, kasan, ftrace: Put APIC interrupt handlers into .irqentry.text
2016-08-12 14:31:10 -07:00
Linus Torvalds
3bc6d8c155 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Misc fixes: a /dev/rtc regression fix, two APIC timer period
  calibration fixes, an ARM clocksource driver fix and a NOHZ
  power use regression fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
  x86/timers/apic: Inform TSC deadline clockevent device about recalibration
  x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
  timers: Fix get_next_timer_interrupt() computation
  clocksource/arm_arch_timer: Force per-CPU interrupt to be level-triggered
2016-08-12 13:55:06 -07:00
Denys Vlasenko
68187872c7 uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions
Since instruction decoder now supports EVEX-encoded instructions, two fixes
are needed to correctly handle them in uprobes.

Extended bits for MODRM.rm field need to be sanitized just like we do it
for VEX3, to avoid encoding wrong register for register-relative access.

EVEX has _two_ extended bits: b and x. Theoretically, EVEX.x should be
ignored by the CPU (since GPRs go only up to 15, not 31), but let's be
paranoid here: proper encoding for register-relative access
should have EVEX.x = 1.

Secondly, we should fetch vex.vvvv for EVEX too.
This is now super easy because instruction decoder populates
vex_prefix.bytes[2] for all flavors of (e)vex encodings, even for VEX2.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # v4.1+
Fixes: 8a764a875f ("x86/asm/decoder: Create artificial 3rd byte for 2-byte VEX")
Link: http://lkml.kernel.org/r/20160811154521.20469-1-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-12 08:29:24 +02:00
Sebastian Andrzej Siewior
d52c0569ba x86/apic/x2apic, smp/hotplug: Don't use before alloc in x2apic_cluster_probe()
I made a mistake while converting the driver to the hotplug state
machine and as a result x2apic_cluster_probe() was accessing
cpus_in_cluster before allocating it.

This patch fixes it by setting the cpumask after the allocation the
memory succeeded.

While at it, I marked two functions static which are only used within
this file.

Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6b2c28471d ("x86/x2apic: Convert to CPU hotplug state machine")
Link: http://lkml.kernel.org/r/1470924515-9444-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 16:35:50 +02:00
Andy Lutomirski
d0de0f685d x86/boot: Defer setup_real_mode() to early_initcall time
There's no need to run setup_real_mode() as early as we run it.
Defer it to the same early_initcall that sets up the page
permissions for the real mode code.

This should be a code size reduction.  More importantly, it give us
a longer window in which we can allocate the real mode trampoline.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:15:00 +02:00
Andy Lutomirski
18bc7bd523 x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly
The initialization process for trampoline_cr4_features and
mmu_cr4_features was confusing.  The intent is for mmu_cr4_features
and *trampoline_cr4_features to stay in sync, but
trampoline_cr4_features is NULL until setup_real_mode() runs.  The
old code synchronized *trampoline_cr4_features *twice*, once in
setup_real_mode() and once in setup_arch().  It also initialized
mmu_cr4_features in setup_real_mode(), which causes the actual value
of mmu_cr4_features to potentially depend on when setup_real_mode()
is called.

With this patch, mmu_cr4_features is initialized directly in
setup_arch(), and *trampoline_cr4_features is synchronized to
mmu_cr4_features when the trampoline is set up.

After this patch, it should be safe to defer setup_real_mode().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:15:00 +02:00
Andy Lutomirski
007b756053 x86/boot: Run reserve_bios_regions() after we initialize the memory map
reserve_bios_regions() is a quirk that reserves memory that we might
otherwise think is available.  There's no need to run it so early,
and running it before we have the memory map initialized with its
non-quirky inputs makes it hard to make reserve_bios_regions() more
intelligent.

Move it right after we populate the memblock state.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:14:59 +02:00
Aaron Lu
82ba4faca1 x86/irq: Do not substract irq_tlb_count from irq_call_count
Since commit:

  52aec3308d ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")

the TLB remote shootdown is done through call function vector. That
commit didn't take care of irq_tlb_count, which a later commit:

  fd0f586972 ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")

... tried to fix.

The fix assumes every increase of irq_tlb_count has a corresponding
increase of irq_call_count. So the irq_call_count is always bigger than
irq_tlb_count and we could substract irq_tlb_count from irq_call_count.

Unfortunately this is not true for the smp_call_function_single() case.
The IPI is only sent if the target CPU's call_single_queue is empty when
adding a csd into it in generic_exec_single. That means if two threads
are both adding flush tlb csds to the same CPU's call_single_queue, only
one IPI is sent. In other words, the irq_call_count is incremented by 1
but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
bigger than irq_call_count and the substract will produce a very large
irq_call_count value due to overflow.

Considering that:

  1) it's not worth to send more IPIs for the sake of accurate counting of
     irq_call_count in generic_exec_single();

  2) it's not easy to tell if the call function interrupt is for TLB
     shootdown in __smp_call_function_single_interrupt().

Not to exclude TLB shootdown from call function count seems to be the
simplest fix and this patch just does that.

This bug was found by LKP's cyclic performance regression tracking recently
with the vm-scalability test suite. I have bisected to commit:

  3dec0ba0be ("mm/rmap: share the i_mmap_rwsem")

This commit didn't do anything wrong but revealed the irq_call_count
problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
concurrent with multiple threads.  When remap_one is try_to_unmap_one(),
then multiple threads could queue flush TLB to the same CPU but only
one IPI will be sent.

Since the commit was added in Linux v3.19, the counting problem only
shows up from v3.19 onwards.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:14:59 +02:00
Dave Hansen
b79daf8589 x86/mm/pkeys: Fix compact mode by removing protection keys' XSAVE buffer manipulation
The Memory Protection Keys "rights register" (PKRU) is
XSAVE-managed, and is saved/restored along with the FPU state.

When kernel code accesses FPU regsisters, it does a delicate
dance with preempt.  Otherwise, the context switching code can
get confused as to whether the most up-to-date state is in the
registers themselves or in the XSAVE buffer.

But, PKRU is not a normal FPU register.  Using it does not
generate the normal device-not-available (#NM) exceptions which
means we can not manage it lazily, and the kernel completley
disallows using lazy mode when it is enabled.

The dance with preempt *only* occurs when managing the FPU
lazily.  Since we never manage PKRU lazily, we do not have to do
the dance with preempt; we can access it directly.  Doing it
this way saves a ton of complicated code (and is faster too).

Further, the XSAVES reenabling failed to patch a bit of code
in fpu__xfeature_set_state() the checked for compacted buffers.
That check caused fpu__xfeature_set_state() to silently refuse to
work when the kernel is using compacted XSAVE buffers.  This
broke execute-only and future pkey_mprotect() support when using
compact XSAVE buffers.

But, removing fpu__xfeature_set_state() gets rid of this issue,
in addition to the nice cleanup and speedup.

This fixes the same thing as a fix that Sai posted:

  https://lkml.org/lkml/2016/7/25/637

The fix that he posted is a much more obviously correct, but I
think we should just do this instead.

Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-Cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20160727232040.7D060DAD@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 16:12:26 +02:00
Mike Travis
5a52e8f822 x86/platform/UV: Fix kernel panic running RHEL kdump kernel on UV systems
The latest UV kernel support panics when RHEL7 kexec's the kdump kernel
to make a dumpfile.  This patch fixes the problem by turning off all UV
support if NUMA is off.

Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.577755634@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:55:39 +02:00
Mike Travis
22ac2bca92 x86/platform/UV: Fix problem with UV4 BIOS providing incorrect PXM values
There are some circumstances where the UV4 BIOS cannot provide the
correct Proximity Node values to associate with specific Sockets and
Physical Nodes.  The decision was made to remove these values from BIOS
and for the kernel to get these values from the standard ACPI tables.

Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.414210079@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:55:38 +02:00
Mike Travis
054f621fd5 x86/platform/UV: Fix problem with UV4 Socket IDs not being contiguous
The UV4 Socket IDs are not guaranteed to equate to Node values which
can cause the GAM (Global Addressable Memory) table lookups to fail.
Fix this by using an independent index into the GAM table instead of
the Socket ID to reference the base address.

Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.048755337@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:55:38 +02:00
Kees Cook
404f6aac9b x86: Apply more __ro_after_init and const
Guided by grsecurity's analogous __read_only markings in arch/x86,
this applies several uses of __ro_after_init to structures that are
only updated during __init, and const for some structures that are
never updated.  Additionally extends __init markings to some functions
that are only used during __init, and cleans up some missing C99 style
static initializers.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20160808232906.GA29731@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:55:05 +02:00
Thomas Garnier
c7d2361f75 x86/mm/KASLR: Fix physical memory calculation on KASLR memory randomization
Initialize KASLR memory randomization after max_pfn is initialized. Also
ensure the size is rounded up. It could create problems on machines
with more than 1Tb of memory on certain random addresses.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: kernel-hardening@lists.openwall.com
Fixes: 021182e52f ("Enable KASLR for physical mapping memory regions")
Link: http://lkml.kernel.org/r/1470762665-88032-1-git-send-email-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:45:19 +02:00
Arnd Bergmann
22cc1ca3c5 x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
Ville Syrjälä reports "The first time I run hwclock after rebooting
I get this:

 open("/dev/rtc", O_RDONLY)              = 3
 ioctl(3, PHN_SET_REGS or RTC_UIE_ON, 0) = 0
 select(4, [3], NULL, NULL, {10, 0})     = 0 (Timeout)
 ioctl(3, PHN_NOT_OH or RTC_UIE_OFF, 0)  = 0
 close(3)                                = 0

On all subsequent runs I get this:

 open("/dev/rtc", O_RDONLY)              = 3
 ioctl(3, PHN_SET_REGS or RTC_UIE_ON, 0) = -1 EINVAL (Invalid argument)
 ioctl(3, RTC_RD_TIME, 0x7ffd76b3ae70)   = -1 EINVAL (Invalid argument)
 close(3)                                = 0"

This was caused by a stupid typo in a patch that should have been
a simple rename to move around contents of a header file, but
accidentally wrote zeroes into the rtc rather than reading from
it:

  463a86304c ("char/genrtc: x86: remove remnants of asm/rtc.h")

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rtc-linux@googlegroups.com
Fixes: 463a86304c ("char/genrtc: x86: remove remnants of asm/rtc.h")
Link: http://lkml.kernel.org/r/20160809195528.1604312-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:37:06 +02:00
Ingo Molnar
fdbdfefbab Merge branch 'linus' into timers/urgent, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:36:23 +02:00
Nicolai Stange
6731b0d611 x86/timers/apic: Inform TSC deadline clockevent device about recalibration
This patch eliminates a source of imprecise APIC timer interrupts,
which imprecision may result in double interrupts or even late
interrupts.

The TSC deadline clockevent devices' configuration and registration
happens before the TSC frequency calibration is refined in
tsc_refine_calibration_work().

This results in the TSC clocksource and the TSC deadline clockevent
devices being configured with slightly different frequencies: the former
gets the refined one and the latter are configured with the inaccurate
frequency detected earlier by means of the "Fast TSC calibration using PIT".

Within the APIC code, introduce the notifier function
lapic_update_tsc_freq() which reconfigures all per-CPU TSC deadline
clockevent devices with the current tsc_khz.

Call it from the TSC code after TSC calibration refinement has happened.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christopher S. Hall <christopher.s.hall@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20160714152255.18295-3-nicstange@gmail.com
[ Pushed #ifdef CONFIG_X86_LOCAL_APIC into header, improved changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 12:38:12 +02:00
Nicolai Stange
1a9e4c564a x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
I noticed the following bug/misbehavior on certain Intel systems: with a
single task running on a NOHZ CPU on an Intel Haswell, I recognized
that I did not only get the one expected local_timer APIC interrupt, but
two per second at minimum. (!)

Further tracing showed that the first one precedes the programmed deadline
by up to ~50us and hence, it did nothing except for reprogramming the TSC
deadline clockevent device to trigger shortly thereafter again.

The reason for this is imprecise calibration, the timeout we program into
the APIC results in 'too short' timer interrupts. The core (hr)timer code
notices this (because it has a precise ktime source and sees the short
interrupt) and fixes it up by programming an additional very short
interrupt period.

This is obviously suboptimal.

The reason for the imprecise calibration is twofold, and this patch
fixes the first reason:

In setup_APIC_timer(), the registered clockevent device's frequency
is calculated by first dividing tsc_khz by TSC_DIVISOR and multiplying
it with 1000 afterwards:

  (tsc_khz / TSC_DIVISOR) * 1000

The multiplication with 1000 is done for converting from kHz to Hz and the
division by TSC_DIVISOR is carried out in order to make sure that the final
result fits into an u32.

However, with the order given in this calculation, the roundoff error
introduced by the division gets magnified by a factor of 1000 by the
following multiplication.

To fix it, reversing the order of the division and the multiplication a la:

  (tsc_khz * 1000) / TSC_DIVISOR

... reduces the roundoff error already.

Furthermore, if TSC_DIVISOR divides 1000, associativity holds:

  (tsc_khz * 1000) / TSC_DIVISOR = tsc_khz * (1000 / TSC_DIVISOR)

and thus, the roundoff error even vanishes and the whole operation can be
carried out within 32 bits.

The powers of two that divide 1000 are 2, 4 and 8. A value of 8 for
TSC_DIVISOR still allows for TSC frequencies up to
2^32 / 10^9ns * 8 = 34.4GHz which is way larger than anything to expect
in the next years.

Thus we also replace the current TSC_DIVISOR value of 32 by 8. Reverse
the order of the divison and the multiplication in the calculation of
the registered clockevent device's frequency.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christopher S. Hall <christopher.s.hall@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20160714152255.18295-2-nicstange@gmail.com
[ Improved changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 12:37:38 +02:00
Linus Torvalds
80fac0f577 * ARM bugfix and MSI injection support
* x86 nested virt tweak and OOPS fix
 * Simplify pvclock code (vdso bits acked by Andy Lutomirski).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXpKOnAAoJEL/70l94x66D5z4H/R2660Vy3brQrI8lGxCtkXJt
 AVe8PwI8nDfYJ/UkMZ2KcHPSvy+sHW2ydaZXYNqXHVBeTaUxiPW9rTgK61ebypGL
 1tPOgJ3kGZF6XEdAz6gS8LniNFc+D3W6Y6sRylkEsqPj39/hxe7QMoOMSCQ9imbW
 WMIx7/81i1EMw6oi+9FVtq+yHCpvyfFnD8t1TDsYWOReVn1J15SxbEs4Ih+hBMLz
 HZ5DEjp9cAmzeR7GLje5eH1t6TEEoNb1MNgFWuscoAsDf8D9DKqRB9s0hC+TLFYn
 oZbGSqjQwu3/VMblgedinH6X9MTm8V0zW29ToGnDcoO00AUmdlNmXSaZUhvT/Rs=
 =H5cD
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 - ARM bugfix and MSI injection support
 - x86 nested virt tweak and OOPS fix
 - Simplify pvclock code (vdso bits acked by Andy Lutomirski).

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  nvmx: mark ept single context invalidation as supported
  nvmx: remove comment about missing nested vpid support
  KVM: lapic: fix access preemption timer stuff even if kernel_irqchip=off
  KVM: documentation: fix KVM_CAP_X2APIC_API information
  x86: vdso: use __pvclock_read_cycles
  pvclock: introduce seqcount-like API
  arm64: KVM: Set cpsr before spsr on fault injection
  KVM: arm: vgic-irqfd: Workaround changing kvm_set_routing_entry prototype
  KVM: arm/arm64: Enable MSI routing
  KVM: arm/arm64: Enable irqchip routing
  KVM: Move kvm_setup_default/empty_irq_routing declaration in arch specific header
  KVM: irqchip: Convey devid to kvm_set_msi
  KVM: Add devid in kvm_kernel_irq_routing_entry
  KVM: api: Pass the devid in the msi routing entry
2016-08-06 09:18:21 -04:00
Linus Torvalds
c98f5827f8 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Two fixes and a cleanup-fix, to the syscall entry code and to ptrace"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/syscalls/64: Add compat_sys_keyctl for 32-bit userspace
  x86/ptrace: Stop setting TS_COMPAT in ptrace code
  x86/vdso: Error out if the vDSO isn't a valid DSO
2016-08-06 09:04:35 -04:00
Linus Torvalds
6c84239d59 RTC for 4.8
Cleanups:
  - huge cleanup of rtc-generic and char/genrtc this allowed to cleanup rtc-cmos,
   rtc-sh, rtc-m68k, rtc-powerpc and rtc-parisc
  - move mn10300 to rtc-cmos
 
 Subsystem:
  - fix wakealarms after hibernate
  - multiples fixes for rctest
  - simplify implementations of .read_alarm
 
 New drivers:
  - Maxim MAX6916
 
 Drivers:
  - ds1307: fix weekday
  - m41t80: add wakeup support
  - pcf85063: add support for PCF85063A variant
  - rv8803: extend i2c fix and other fixes
  - s35390a: fix alarm reading, this fixes instant reboot after shutdown for QNAP
    TS-41x
  - s3c: clock fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXokhIAAoJENiigzvaE+LCZqQP+wWzintN/N1u3dKiVB7iSdwq
 +S/jAXD9wW8OK9PI60/YUGRYeUXmZW9t4XYg1VKCxU9KpVC17LgOtDyXD8BufP1V
 uREJEzZw9O7zCCjeHp/ICFjBkc62Net6ZDOO+ZyXPNfddpS1Xq1uUgXLZc/202UR
 ID/kewu0pJRDnoxyqznWn9+8D33w/ygXs2slY2Ive0ONtjdgxGcsj2rNbb2RYn2z
 OP7br3lLg7qkFh4TtXb61eh/9GYIk6wzP/CrX5l/jH4SjQnrIk5g/X/Cd1qQ/qso
 JZzFoonOKvIp5Gw/+fZ9NP3YFcnkoRMv4NjZV8PAmsYLds+ibRiBcoB8u6FmiJV7
 WW5uopgPkfCGN5BV3+QHwJDVe+WlgnlzaT5zPUCcP5KWusDts4fWIgzP7vrtAzf4
 3OJLrgSGdBeOqWnJD21nxKUD27JOseX7D+BFtwxR4lMsXHqlHJfETpZ8gts1ZGH3
 2U353j/jkZvGWmc6dMcuxOXT2K4VqpYeIIqs0IcLu6hM9crtR89zPR2Iu1AilfDW
 h2NroF+Q//SgMMzWoTEG6Tn7RAc7MthgA/tRCFZF9CBMzNs988w0CTHnKsIHmjpU
 UKkMeJGAC9YrPYIcqrg0oYsmLUWXc8JuZbGJBnei3BzbaMTlcwIN9qj36zfq6xWc
 TMLpbWEoIsgFIZMP/hAP
 =rpGB
 -----END PGP SIGNATURE-----

Merge tag 'rtc-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "RTC for 4.8

  Cleanups:
   - huge cleanup of rtc-generic and char/genrtc this allowed to cleanup
     rtc-cmos, rtc-sh, rtc-m68k, rtc-powerpc and rtc-parisc
   - move mn10300 to rtc-cmos

  Subsystem:
   - fix wakealarms after hibernate
   - multiples fixes for rctest
   - simplify implementations of .read_alarm

  New drivers:
   - Maxim MAX6916

  Drivers:
   - ds1307: fix weekday
   - m41t80: add wakeup support
   - pcf85063: add support for PCF85063A variant
   - rv8803: extend i2c fix and other fixes
   - s35390a: fix alarm reading, this fixes instant reboot after
     shutdown for QNAP TS-41x
   - s3c: clock fixes"

* tag 'rtc-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (65 commits)
  rtc: rv8803: Clear V1F when setting the time
  rtc: rv8803: Stop the clock while setting the time
  rtc: rv8803: Always apply the I²C workaround
  rtc: rv8803: Fix read day of week
  rtc: rv8803: Remove the check for valid time
  rtc: rv8803: Kconfig: Indicate rx8900 support
  rtc: asm9260: remove .owner field for driver
  rtc: at91sam9: Fix missing spin_lock_init()
  rtc: m41t80: add suspend handlers for alarm IRQ
  rtc: m41t80: make it a real error message
  rtc: pcf85063: Add support for the PCF85063A device
  rtc: pcf85063: fix year range
  rtc: hym8563: in .read_alarm set .tm_sec to 0 to signal minute accuracy
  rtc: explicitly set tm_sec = 0 for drivers with minute accurancy
  rtc: s3c: Add s3c_rtc_{enable/disable}_clk in s3c_rtc_setfreq()
  rtc: s3c: Remove unnecessary call to disable already disabled clock
  rtc: abx80x: use devm_add_action_or_reset()
  rtc: m41t80: use devm_add_action_or_reset()
  rtc: fix a typo and reduce three empty lines to one
  rtc: s35390a: improve two comments in .set_alarm
  ...
2016-08-05 09:48:22 -04:00