Commit Graph

24058 Commits

Author SHA1 Message Date
Linus Torvalds
e46b4e2b46 Nothing major this round. Mostly small clean ups and fixes.
Some visible changes:
 
  A new flag was added to distinguish traces done in NMI context.
 
  Preempt tracer now shows functions where preemption is disabled but
  interrupts are still enabled.
 
 Other notes:
 
  Updates were done to function tracing to allow better performance
  with perf.
 
  Infrastructure code has been added to allow for a new histogram
  feature for recording live trace event histograms that can be
  configured by simple user commands. The feature itself was just
  finished, but needs a round in linux-next before being pulled.
  This only includes some infrastructure changes that will be needed.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW8/WPAAoJEKKk/i67LK/8wrAH/j2gU9ZfjVxTu8068TBGWRJP
 yvvzq0cK5evB3dsVuUmKKRfU52nSv4J1WcFF569X0RulSLylR0dHlcxFJMn4kkgR
 bm0AHRrqOf87ub3VimcpG146iVQij37l5A0SRoFbvSPLQx1KUW18v99x41Ji8dv6
 oWXRc6/YhdzEE7l0nUsVjmScQ4b2emsems3cxZzXOY+nRJsiim6i+VaDeatdyey1
 csLVqtRCs+x62TVtxG3+GhcLdRoPRbnHAGzrKDFIn1SrQaRXCc54wN5d2hWxjgNI
 1laOwaj070lnJiWfBLIP/K+lx+VKRx5/O0rKZX35foLUTqJJKSyjAbKXuMCcSAM=
 =2h2K
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Nothing major this round.  Mostly small clean ups and fixes.

  Some visible changes:

   - A new flag was added to distinguish traces done in NMI context.

   - Preempt tracer now shows functions where preemption is disabled but
     interrupts are still enabled.

  Other notes:

   - Updates were done to function tracing to allow better performance
     with perf.

   - Infrastructure code has been added to allow for a new histogram
     feature for recording live trace event histograms that can be
     configured by simple user commands.  The feature itself was just
     finished, but needs a round in linux-next before being pulled.

     This only includes some infrastructure changes that will be needed"

* tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
  tracing: Record and show NMI state
  tracing: Fix trace_printk() to print when not using bprintk()
  tracing: Remove redundant reset per-CPU buff in irqsoff tracer
  x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
  tracing: Fix crash from reading trace_pipe with sendfile
  tracing: Have preempt(irqs)off trace preempt disabled functions
  tracing: Fix return while holding a lock in register_tracer()
  ftrace: Use kasprintf() in ftrace_profile_tracefs()
  ftrace: Update dynamic ftrace calls only if necessary
  ftrace: Make ftrace_hash_rec_enable return update bool
  tracing: Fix typoes in code comment and printk in trace_nop.c
  tracing, writeback: Replace cgroup path to cgroup ino
  tracing: Use flags instead of bool in trigger structure
  tracing: Add an unreg_all() callback to trigger commands
  tracing: Add needs_rec flag to event triggers
  tracing: Add a per-event-trigger 'paused' field
  tracing: Add get_syscall_name()
  tracing: Add event record param to trigger_ops.func()
  tracing: Make event trigger functions available
  tracing: Make ftrace_event_field checking functions available
  ...
2016-03-24 10:52:25 -07:00
Linus Torvalds
3fa2fe2ce0 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree contains various perf fixes on the kernel side, plus three
  hw/event-enablement late additions:

   - Intel Memory Bandwidth Monitoring events and handling
   - the AMD Accumulated Power Mechanism reporting facility
   - more IOMMU events

  ... and a final round of perf tooling updates/fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  perf llvm: Use strerror_r instead of the thread unsafe strerror one
  perf llvm: Use realpath to canonicalize paths
  perf tools: Unexport some methods unused outside strbuf.c
  perf probe: No need to use formatting strbuf method
  perf help: Use asprintf instead of adhoc equivalents
  perf tools: Remove unused perf_pathdup, xstrdup functions
  perf tools: Do not include stringify.h from the kernel sources
  tools include: Copy linux/stringify.h from the kernel
  tools lib traceevent: Remove redundant CPU output
  perf tools: Remove needless 'extern' from function prototypes
  perf tools: Simplify die() mechanism
  perf tools: Remove unused DIE_IF macro
  perf script: Remove lots of unused arguments
  perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
  perf machine: Rename perf_event__preprocess_sample to machine__resolve
  perf tools: Add cpumode to struct perf_sample
  perf tests: Forward the perf_sample in the dwarf unwind test
  perf tools: Remove misplaced __maybe_unused
  perf list: Fix documentation of :ppp
  perf bench numa: Fix assertion for nodes bitfield
  ...
2016-03-24 10:02:14 -07:00
Linus Torvalds
d88f48e128 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes:

   - fix hotplug bugs
   - fix irq live lock
   - fix various topology handling bugs
   - fix APIC ACK ordering
   - fix PV iopl handling
   - fix speling
   - fix/tweak memcpy_mcsafe() return value
   - fix fbcon bug
   - remove stray prototypes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msr: Remove unused native_read_tscp()
  x86/apic: Remove declaration of unused hw_nmi_is_cpu_stuck
  x86/oprofile/nmi: Add missing hotplug FROZEN handling
  x86/hpet: Use proper mask to modify hotplug action
  x86/apic/uv: Fix the hotplug notifier
  x86/apb/timer: Use proper mask to modify hotplug action
  x86/topology: Use total_cpus not nr_cpu_ids for logical packages
  x86/topology: Fix Intel HT disable
  x86/topology: Fix logical package mapping
  x86/irq: Cure live lock in fixup_irqs()
  x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known()
  x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()
  x86/iopl: Fix iopl capability check on Xen PV
  x86/iopl/64: Properly context-switch IOPL on Xen PV
  selftests/x86: Add an iopl test
  x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe()
  x86/video: Don't assume all FB devices are PCI devices
  arch/x86/irq: Purge useless handler declarations from hw_irq.h
  x86: Fix misspellings in comments
2016-03-24 09:47:32 -07:00
Prarit Bhargava
9da77666d6 x86/msr: Remove unused native_read_tscp()
After e76b027 ("x86,vdso: Use LSL unconditionally for vgetcpu")
native_read_tscp() is unused in the kernel. The function can be removed like
native_read_tsc() was.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1458687968-9106-1-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-23 12:34:17 +01:00
Linus Torvalds
a24e3d414e Merge branch 'akpm' (patches from Andrew)
Merge third patch-bomb from Andrew Morton:

 - more ocfs2 changes

 - a few hotfixes

 - Andy's compat cleanups

 - misc fixes to fatfs, ptrace, coredump, cpumask, creds, eventfd,
   panic, ipmi, kgdb, profile, kfifo, ubsan, etc.

 - many rapidio updates: fixes, new drivers.

 - kcov: kernel code coverage feature.  Like gcov, but not
   "prohibitively expensive".

 - extable code consolidation for various archs

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
  ia64/extable: use generic search and sort routines
  x86/extable: use generic search and sort routines
  s390/extable: use generic search and sort routines
  alpha/extable: use generic search and sort routines
  kernel/...: convert pr_warning to pr_warn
  drivers: dma-coherent: use memset_io for DMA_MEMORY_IO mappings
  drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MAP
  memremap: add MEMREMAP_WC flag
  memremap: don't modify flags
  kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE
  mm/mprotect.c: don't imply PROT_EXEC on non-exec fs
  ipc/sem: make semctl setting sempid consistent
  ubsan: fix tree-wide -Wmaybe-uninitialized false positives
  kfifo: fix sparse complaints
  scripts/gdb: account for changes in module data structure
  scripts/gdb: add cmdline reader command
  scripts/gdb: add version command
  kernel: add kcov code coverage
  profile: hide unused functions when !CONFIG_PROC_FS
  hpwdt: use nmi_panic() when kernel panics in NMI handler
  ...
2016-03-22 17:09:14 -07:00
Linus Torvalds
b91d9c6716 * Build fixes for PPC KVM
* Miscellaneous bugfixes for ARM KVM
 * Cleanup of memory barrier and removal of redundant barriers
 * x86 fixes: page tracking oops, support for old buggy KVM nested on 4.5
 * Support for protection keys in guests
 * Lockdep fix
 * Another conversion to simple wait queues and raw spinlocks,
   backported from PREEMPT_RT
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJW8aaXAAoJEL/70l94x66D7voH/i2ytj6PbuWQQobSKDY38x8F
 MHDFJ5UgTFZPPt8cB8YiCl6Tu0C5I2mNOk0rfb+bcpM5C1U9IAnBbbupyUblp6K9
 1u+u+al8IlnOsoLzJSUXKDK5H4mEVrUnwVxTpZol5Ph5qQ8FvpbkxboMu3AGevO5
 PIUXucK7fP5WXVV3Nh4YnUnBkeYzuuXcqYcV/TjscNQ4NcMofElcgpBxmE498TTk
 rvhyuf2chEtY2DsDh3nzeYgxcGLpvE4/l5a+puEoOx4M5CH24wwne9LHAWJz6ofm
 H3XNhsCz3jIGmrNqqkGUUya5qSkCsq2ha7n+VDw+fiP1TKy3FtkrBYQrDj+ISuc=
 =UtvF
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "Second round of KVM changes for 4.6:

   - build fixes for PPC KVM
   - miscellaneous bugfixes for ARM KVM
   - cleanup of memory barrier and removal of redundant barriers
   - x86 fixes: page tracking oops, support for old buggy KVM nested on 4.5
   - support for protection keys in guests
   - lockdep fix
   - another conversion to simple wait queues and raw spinlocks,
     backported from PREEMPT_RT"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
  KVM: page_track: fix access to NULL slot
  KVM: PPC: do not compile in vfio.o unconditionally
  kvm, rt: change async pagefault code locking for PREEMPT_RT
  KVM/PPC: update the comment of memory barrier in the kvmppc_prepare_to_enter()
  KVM/x86: update the comment of memory barrier in the vcpu_enter_guest()
  KVM: Replace smp_mb() with smp_load_acquire() in the kvm_flush_remote_tlbs()
  KVM/x86: Call smp_wmb() before increasing tlbs_dirty
  KVM: Replace smp_mb() with smp_mb_after_atomic() in the kvm_make_all_cpus_request()
  KVM/x86: Replace smp_mb() with smp_store_mb/release() in the walk_shadow_page_lockless_begin/end()
  KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
  KVM, pkeys: expose CPUID/CR4 to guest
  KVM, pkeys: add pkeys support for permission_fault
  KVM, pkeys: introduce pkru_mask to cache conditions
  KVM, pkeys: save/restore PKRU when guest/host switches
  x86: pkey: introduce write_pkru() for KVM
  KVM, pkeys: add pkeys support for xsave state
  KVM, pkeys: disable pkeys for guests in non-paging mode
  KVM: x86: remove magic number with enum cpuid_leafs
  KVM: MMU: return page fault error code from permission_fault
  KVM: fix spin_lock_init order on x86
  ...
2016-03-22 16:28:22 -07:00
Ard Biesheuvel
29934b0fb8 x86/extable: use generic search and sort routines
Replace the arch specific versions of search_extable() and
sort_extable() with calls to the generic ones, which now support
relative exception tables as well.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Dmitry Vyukov
5c9a8750a6 kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on kernel side.  The complimentary
compiler support was added in gcc revision 231296.

We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen of basic blocks (e.g.  an invalid
input).  In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M).  Cost of
kcov depends only on number of executed basic blocks/edges.  On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Andy Lutomirski
f970165bee x86/compat: remove is_compat_task()
x86's is_compat_task always checked the current syscall type, not the
task type.  It has no non-arch users any more, so just remove it to
avoid confusion.

On x86, nothing should really be checking the task ABI.  There are
legitimate users for the syscall ABI and for the mm ABI.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Linus Torvalds
55fc733c7e xen: features and fixes for 4.6-rc0
- Make earlyprintk=xen work for HVM guests.
 - Remove module support for things never built as modules.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW8YhlAAoJEFxbo/MsZsTRiCgH/3GnL93s/BsRTrA/DYykYqA4
 rPPkDd6WMlEbrko6FY7o4IxUAZ50LWU8wjmDNVTT+R/VKlWCVh2+ksxjbH8e0nzr
 0QohAYTdG1VGmhkrwTjKj0wC3Nr8vaDqoBXAFRz4DiQdbobn12wzXjaVrbl8RRNc
 oHDqR+DfZj6ontqx8oFkkTk7CFKIGEOtm/uBhCAV2fNkQSUCzuQyKLlarxM6jq/2
 EnLR1bhnyoy5zNFzXxmaJgZOit8QFMKvPdDRknlp9yDHYJ5zolJkv+6FPiAG6SZs
 A8yAUPQoAD+ekAiV2DIhb8qoNNNdZXdRCFBpoudwX3GyyU4O2P5Intl0xEg17Nk=
 =L8S4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from David Vrabel:
 "Features and fixes for 4.6:

  - Make earlyprintk=xen work for HVM guests

  - Remove module support for things never built as modules"

* tag 'for-linus-4.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  drivers/xen: make platform-pci.c explicitly non-modular
  drivers/xen: make sys-hypervisor.c explicitly non-modular
  drivers/xen: make xenbus_dev_[front/back]end explicitly non-modular
  drivers/xen: make [xen-]ballon explicitly non-modular
  xen: audit usages of module.h ; remove unnecessary instances
  xen/x86: Drop mode-selecting ifdefs in startup_xen()
  xen/x86: Zero out .bss for PV guests
  hvc_xen: make early_printk work with HVM guests
  hvc_xen: fix xenboot for DomUs
  hvc_xen: add earlycon support
2016-03-22 12:55:17 -07:00
Paolo Bonzini
a6adb10622 KVM: page_track: fix access to NULL slot
This happens when doing the reboot test from virt-tests:

[  131.833653] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  131.842461] IP: [<ffffffffa0950087>] kvm_page_track_is_active+0x17/0x60 [kvm]
[  131.850500] PGD 0
[  131.852763] Oops: 0000 [#1] SMP
[  132.007188] task: ffff880075fbc500 ti: ffff880850a3c000 task.ti: ffff880850a3c000
[  132.138891] Call Trace:
[  132.141639]  [<ffffffffa092bd11>] page_fault_handle_page_track+0x31/0x40 [kvm]
[  132.149732]  [<ffffffffa093380f>] paging64_page_fault+0xff/0x910 [kvm]
[  132.172159]  [<ffffffffa092c734>] kvm_mmu_page_fault+0x64/0x110 [kvm]
[  132.179372]  [<ffffffffa06743c2>] handle_exception+0x1b2/0x430 [kvm_intel]
[  132.187072]  [<ffffffffa067a301>] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
...

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: 3d0c27ad6e
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 17:27:28 +01:00
Rik van Riel
9db284f303 kvm, rt: change async pagefault code locking for PREEMPT_RT
The async pagefault wake code can run from the idle task in exception
context, so everything here needs to be made non-preemptible.

Conversion to a simple wait queue and raw spinlock does the trick.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:38 +01:00
Lan Tianyu
0f127d12e4 KVM/x86: update the comment of memory barrier in the vcpu_enter_guest()
The barrier also orders the write to mode from any reads
to the page tables done and so update the comment.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:35 +01:00
Lan Tianyu
7bfdf21778 KVM/x86: Call smp_wmb() before increasing tlbs_dirty
Update spte before increasing tlbs_dirty to make sure no tlb flush
in lost after spte is zapped. This pairs with the barrier in the
kvm_flush_remote_tlbs().

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:32 +01:00
Lan Tianyu
36ca7e0a57 KVM/x86: Replace smp_mb() with smp_store_mb/release() in the walk_shadow_page_lockless_begin/end()
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:29 +01:00
Lan Tianyu
9753f52915 KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
There is already a barrier inside of kvm_flush_remote_tlbs() which can
help to make sure everyone sees our modifications to the page tables and
see changes to vcpu->mode here. So remove the smp_mb in the
kvm_mmu_commit_zap_page() and update the comment.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:27 +01:00
Huaitong Han
b9baba8614 KVM, pkeys: expose CPUID/CR4 to guest
X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE(bit 22).

This patch disables CPUID:PKU without ept, because pkeys is not yet
implemented for shadow paging.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:17 +01:00
Huaitong Han
be94f6b710 KVM, pkeys: add pkeys support for permission_fault
Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
register(16 domains), every domain has 2 bits(write disable bit, access
disable bit).

Static logic has been produced in update_pkru_bitmask, dynamic logic need
read pkey from page table entries, get pkru value, and deduce the correct
result.

[ Huaitong: Xiao helps to modify many sections. ]

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:23:37 +01:00
Huaitong Han
2d344105f5 KVM, pkeys: introduce pkru_mask to cache conditions
PKEYS defines a new status bit in the PFEC. PFEC.PK (bit 5), if some
conditions is true, the fault is considered as a PKU violation.
pkru_mask indicates if we need to check PKRU.ADi and PKRU.WDi, and
does cache some conditions for permission_fault.

[ Huaitong: Xiao helps to modify many sections. ]

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:06 +01:00
Xiao Guangrong
1be0e61c1f KVM, pkeys: save/restore PKRU when guest/host switches
Currently XSAVE state of host is not restored after VM-exit and PKRU
is managed by XSAVE so the PKRU from guest is still controlling the
memory access even if the CPU is running the code of host. This is
not safe as KVM needs to access the memory of userspace (e,g QEMU) to
do some emulation.

So we save/restore PKRU when guest/host switches.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:06 +01:00
Xiao Guangrong
9e90199c25 x86: pkey: introduce write_pkru() for KVM
KVM will use it to switch pkru between guest and host.

CC: Ingo Molnar <mingo@redhat.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:05 +01:00
Huaitong Han
17a511f878 KVM, pkeys: add pkeys support for xsave state
This patch adds pkeys support for xsave state.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:05 +01:00
Huaitong Han
ddba262891 KVM, pkeys: disable pkeys for guests in non-paging mode
Pkeys is disabled if CPU is in non-paging mode in hardware. However KVM
always uses paging mode to emulate guest non-paging, mode with TDP. To
emulate this behavior, pkeys needs to be manually disabled when guest
switches to non-paging mode.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:04 +01:00
Huaitong Han
e0b18ef718 KVM: x86: remove magic number with enum cpuid_leafs
This patch removes magic number with enum cpuid_leafs.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:04 +01:00
Paolo Bonzini
f13577e8aa KVM: MMU: return page fault error code from permission_fault
This will help in the implementation of PKRU, where the PK bit of the page
fault error code cannot be computed in advance (unlike I/D, R/W and U/S).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:20:54 +01:00
Paolo Bonzini
ef697a712a KVM: VMX: fix nested vpid for old KVM guests
Old KVM guests invoke single-context invvpid without actually checking
whether it is supported.  This was fixed by commit 518c8ae ("KVM: VMX:
Make sure single type invvpid is supported before issuing invvpid
instruction", 2010-08-01) and the patch after, but pre-2.6.36
kernels lack it including RHEL 6.

Reported-by: jmontleo@redhat.com
Tested-by: jmontleo@redhat.com
Cc: stable@vger.kernel.org
Fixes: 99b83ac893
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:46 +01:00
Paolo Bonzini
f6870ee9e5 KVM: VMX: avoid guest hang on invalid invvpid instruction
A guest executing an invalid invvpid instruction would hang
because the instruction pointer was not updated.

Reported-by: jmontleo@redhat.com
Tested-by: jmontleo@redhat.com
Cc: stable@vger.kernel.org
Fixes: 99b83ac893
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:42 +01:00
Paolo Bonzini
2849eb4f99 KVM: VMX: avoid guest hang on invalid invept instruction
A guest executing an invalid invept instruction would hang
because the instruction pointer was not updated.

Cc: stable@vger.kernel.org
Fixes: bfd0a56b90
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:38 +01:00
Srinivas Pandruvada
7b0fd56930 perf/x86/intel/rapl: Add missing Broadwell models
Added Broadwell-H and Broadwell-Server.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/1458517938-25308-1-git-send-email-srinivas.pandruvada@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 11:16:19 +01:00
Kan Liang
cb2252522a perf/x86/intel/uncore: Remove ev_sel_ext bit support for PCU
The ev_sel_ext in PCU_MSR_PMON_CTL is locked on some CPU models, so despite
it being documented in the SDM, if we write 1 to that bit then we can get a #GP
fault.

Which #GP the perf fuzzer happily triggered in Peter Zijlstra's testing.

Also, there are no public events which use that bit, so remove ev_sel_ext
bit support for PCU.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1458500301-3594-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 11:16:19 +01:00
Huang Rui
c7ab62bfbe perf/x86/amd/power: Add AMD accumulated power reporting mechanism
Introduce an AMD accumlated power reporting mechanism for the Family
15h, Model 60h processor that can be used to calculate the average
power consumed by a processor during a measurement interval. The
feature support is indicated by CPUID Fn8000_0007_EDX[12].

This feature will be implemented both in hwmon and perf. The current
design provides one event to report per package/processor power
consumption by counting each compute unit power value.

Here the gory details of how the computation is done:

* Tsample: compute unit power accumulator sample period
* Tref: the PTSC counter period (PTSC: performance timestamp counter)
* N: the ratio of compute unit power accumulator sample period to the
  PTSC period

* Jmax: max compute unit accumulated power which is indicated by
  MSR_C001007b[MaxCpuSwPwrAcc]

* Jx/Jy: compute unit accumulated power which is indicated by
  MSR_C001007a[CpuSwPwrAcc]

* Tx/Ty: the value of performance timestamp counter which is indicated
  by CU_PTSC MSR_C0010280[PTSC]
* PwrCPUave: CPU average power

i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].

ii. Read the full range of the cumulative energy value from the new
    MSR MaxCpuSwPwrAcc.
	Jmax = value returned.

iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.

iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.

v. Calculate the average power consumption for a compute unit over
time period (y-x). Unit of result is uWatt:

	if (Jy < Jx) // Rollover has occurred
		Jdelta = (Jy + Jmax) - Jx
	else
		Jdelta = Jy - Jx
	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)

Simple example:

  root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
    CHK     include/config/kernel.release
    CHK     include/generated/uapi/linux/version.h
    CHK     include/generated/utsrelease.h
    CHK     include/generated/timeconst.h
    CHK     include/generated/bounds.h
    CHK     include/generated/asm-offsets.h
    CALL    scripts/checksyscalls.sh
    CHK     include/generated/compile.h
    SKIPPED include/generated/compile.h
    Building modules, stage 2.
  Kernel: arch/x86/boot/bzImage is ready  (#40)
    MODPOST 4225 modules

   Performance counter stats for 'system wide':

              183.44 mWatts power/power-pkg/

       341.837270111 seconds time elapsed

  root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10

   Performance counter stats for 'system wide':

                0.18 mWatts power/power-pkg/

        10.012551815 seconds time elapsed

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: jacob.w.shin@gmail.com
Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
[ Fixed the modular build. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:37:15 +01:00
Huang Rui
01fe03ff1c x86/cpufeature, perf/x86: Add AMD Accumulated Power Mechanism feature flag
AMD CPU family 15h model 0x60 introduces a mechanism for measuring
accumulated power. It is used to report the processor power consumption
and support for it is indicated by CPUID Fn8000_0007_EDX[12].

Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wan Zongshun <Vincent.Wan@amd.com>
Cc: spg_linux_kernel@amd.com
Link: http://lkml.kernel.org/r/1452739808-11871-4-git-send-email-ray.huang@amd.com
[ Resolved conflict and moved the synthetic CPUID slot to 19. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:35:29 +01:00
Suravee Suthikulpanit
f8519155b4 perf/x86/amd: Add support for new IOMMU performance events
This patch adds new IOMMU performance event based on
the information in table 74 of the AMD I/O Virtualization Technology
(IOMMU) Specification (Document Id: 4882, Rev 2.62, Feb 2015)

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Cc: <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://support.amd.com/TechDocs/48882_IOMMU.pdf
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:35:28 +01:00
Huang Rui
8dfeae0d73 perf/x86/amd: Move nodes_per_socket into bsp_init_amd()
nodes_per_socket is static and it needn't be initialized many
times during every CPU core init. So move its initialization into
bsp_init_amd().

Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: spg_linux_kernel@amd.com
Link: http://lkml.kernel.org/r/1452739808-11871-2-git-send-email-ray.huang@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:22 +01:00
Peter Zijlstra
27348f382b perf/x86/cqm: Factor out some common code
Having the same code twice (and once quite ugly) is fragile.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:22 +01:00
Vikas Shivappa
e7ee3e8cb5 perf/x86/mbm: Add support for MBM counter overflow handling
This patch adds a per package timer which periodically updates the
memory bandwidth counters for the events that are currently active.

Current patch has a periodic timer every 1s since the SDM guarantees
that the counter will not overflow in 1s but this time can be definitely
improved by calibrating on the system. The overflow is really a function
of the max memory b/w that the socket can support, max counter value and
scaling factor.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/013b756c5006b1c4ca411f3ecf43ed52f19fbf87.1457723885.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:21 +01:00
Vikas Shivappa
2d4de8376f perf/x86/mbm: Implement RMID recycling
RMID could be allocated or deallocated as part of RMID recycling.

When an RMID is allocated for MBM event, the MBM counter needs to be
initialized because next time we read the counter we need the previous
value to account for total bytes that went to the memory controller.

Similarly, when RMID is deallocated we need to update the ->count
variable.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1457652732-4499-6-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:20 +01:00
Tony Luck
87f01cc2a2 perf/x86/mbm: Add memory bandwidth monitoring event management
Includes all the core infrastructure to measure the total_bytes and
bandwidth.

We have per socket counters for both total system wide L3 external
bytes and local socket memory-controller bytes. The OS does MSR writes
to MSR_IA32_QM_EVTSEL and MSR_IA32_QM_CTR to read the counters and
uses the IA32_PQR_ASSOC_MSR to associate the RMID with the task. The
tasks have a common RMID for CQM (cache quality of service monitoring)
and MBM. Hence most of the scheduling code is reused from CQM.

Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Restructured rmid_read to not have an obvious hole, removed MBM_CNTR_MAX as its unused. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/abd7aac9a18d93b95b985b931cf258df0164746d.1457723885.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:20 +01:00
Vikas Shivappa
33c3cc7acf perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init
The MBM init patch enumerates the Intel MBM (Memory b/w monitoring)
and initializes the perf events and datastructures for monitoring the
memory b/w.

Its based on original patch series by Tony Luck and Kanaka Juvva.

Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket and total b/w measures the total system bandwidth.

Extending the cache quality of service monitoring (CQM) we add two
more events to the perf infrastructure:

  intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
  intel_cqm_llc/total_bytes - total L3 external bytes sent

The tasks are associated with a Resouce Monitoring ID (RMID) just like
in CQM and OS uses a MSR write to indicate the RMID of the task during
scheduling.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1457652732-4499-4-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:19 +01:00
Vikas Shivappa
ada2f634cd perf/x86/cqm: Fix CQM memory leak and notifier leak
Fixes the hotcpu notifier leak and other global variable memory leaks
during CQM (cache quality of service monitoring) initialization.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1457652732-4499-3-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:19 +01:00
Vikas Shivappa
a223c1c7ab perf/x86/cqm: Fix CQM handling of grouping events into a cache_group
Currently CQM (cache quality of service monitoring) is grouping all
events belonging to same PID to use one RMID. However its not counting
all of these different events. Hence we end up with a count of zero
for all events other than the group leader.

The patch tries to address the issue by keeping a flag in the
perf_event.hw which has other CQM related fields. The field is updated
at event creation and during grouping.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
[peterz: Changed hw_perf_event::is_group_event to an int]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:18 +01:00
Peter Zijlstra
e8d8a90fc5 perf/x86/BTS: Fix RCU usage
This splat reminds us:

[ 8166.045595] [ INFO: suspicious RCU usage. ]

[ 8166.168972]  [<ffffffff81127837>] lockdep_rcu_suspicious+0xe7/0x120
[ 8166.175966]  [<ffffffff811e0bae>] perf_callchain+0x23e/0x250
[ 8166.182280]  [<ffffffff811dda3d>] perf_prepare_sample+0x27d/0x350
[ 8166.189082]  [<ffffffff8100f503>] intel_pmu_drain_bts_buffer+0x133/0x200

... that as the core code does, one should hold rcu_read_lock() over that
entire BTS event-output generation sequence as well.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:17 +01:00
Peter Zijlstra
c2872d381f perf/x86/ibs: Add IBS interrupt to the dynamic throttle
Interrupt throttling is normally only done against
sysctl_perf_event_sample_rate. This means that if that number is too
high (for whatever reason) you can lock up your machine.

We have, however, a dynamic throttling scheme too, but for that to
work, we need to add a callback to the interrupt handler, IBS did not
have this, so add it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:16 +01:00
Peter Zijlstra
5a50f52917 perf/x86/ibs: Fix race with IBS_STARTING state
While tracing the IBS bits I saw the NMI hitting between clearing
IBS_STARTING and the actual MSR writes to disable the counter.

Since IBS_STARTING was cleared, the handler assumed these were spurious
NMIs and because STOPPING wasn't set yet either, insta-triggered an
"Unknown NMI".

Cure this by clearing IBS_STARTING after disabling the hardware.

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:15 +01:00
Peter Zijlstra
0158b83f75 perf/x86/ibs: Fix IBS throttle
When the IBS IRQ handler get a !0 return from perf_event_overflow;
meaning it should throttle the event, it only disables it, it doesn't
call perf_ibs_stop().

This confuses the state machine, as we'll use pmu::start() ->
perf_ibs_start() to unthrottle.

Tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: dvyukov@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Link: http://lkml.kernel.org/r/20160311142346.GE6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 09:08:15 +01:00
Linus Torvalds
643ad15d47 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...
2016-03-20 19:08:56 -07:00
Linus Torvalds
24b5e20f11 Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "The main changes are:

   - Use separate EFI page tables when executing EFI firmware code.
     This isolates the EFI context from the rest of the kernel, which
     has security and general robustness advantages.  (Matt Fleming)

   - Run regular UEFI firmware with interrupts enabled.  This is already
     the status quo under other OSs.  (Ard Biesheuvel)

   - Various x86 EFI enhancements, such as the use of non-executable
     attributes for EFI memory mappings.  (Sai Praneeth Prakhya)

   - Various arm64 UEFI enhancements.  (Ard Biesheuvel)

   - ... various fixes and cleanups.

  The separate EFI page tables feature got delayed twice already,
  because it's an intrusive change and we didn't feel confident about
  it - third time's the charm we hope!"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU
  x86/efi: Only map kernel text for EFI mixed mode
  x86/efi: Map EFI_MEMORY_{XP,RO} memory region bits to EFI page tables
  x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()
  efi/arm*: Perform hardware compatibility check
  efi/arm64: Check for h/w support before booting a >4 KB granular kernel
  efi/arm: Check for LPAE support before booting a LPAE kernel
  efi/arm-init: Use read-only early mappings
  efi/efistub: Prevent __init annotations from being used
  arm64/vmlinux.lds.S: Handle .init.rodata.xxx and .init.bss sections
  efi/arm64: Drop __init annotation from handle_kernel_image()
  x86/mm/pat: Use _PAGE_GLOBAL bit for EFI page table mappings
  efi/runtime-wrappers: Run UEFI Runtime Services with interrupts enabled
  efi: Reformat GUID tables to follow the format in UEFI spec
  efi: Add Persistent Memory type name
  efi: Add NV memory attribute
  x86/efi: Show actual ending addresses in efi_print_memmap
  x86/efi/bgrt: Don't ignore the BGRT if the 'valid' bit is 0
  efivars: Use to_efivar_entry
  efi: Runtime-wrapper: Get rid of the rtc_lock spinlock
  ...
2016-03-20 18:58:18 -07:00
Linus Torvalds
26660a4046 Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'objtool' stack frame validation from Ingo Molnar:
 "This tree adds a new kernel build-time object file validation feature
  (ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
  It was written by and is maintained by Josh Poimboeuf.

  The motivation: there's a category of hard to find kernel bugs, most
  of them in assembly code (but also occasionally in C code), that
  degrades the quality of kernel stack dumps/backtraces.  These bugs are
  hard to detect at the source code level.  Such bugs result in
  incorrect/incomplete backtraces most of time - but can also in some
  rare cases result in crashes or other undefined behavior.

  The build time correctness checking is done via the new 'objtool'
  user-space utility that was written for this purpose and which is
  hosted in the kernel repository in tools/objtool/.  The tool's (very
  simple) UI and source code design is shaped after Git and perf and
  shares quite a bit of infrastructure with tools/perf (which tooling
  infrastructure sharing effort got merged via perf and is already
  upstream).  Objtool follows the well-known kernel coding style.

  Objtool does not try to check .c or .S files, it instead analyzes the
  resulting .o generated machine code from first principles: it decodes
  the instruction stream and interprets it.  (Right now objtool supports
  the x86-64 architecture.)

  From tools/objtool/Documentation/stack-validation.txt:

   "The kernel CONFIG_STACK_VALIDATION option enables a host tool named
    objtool which runs at compile time.  It has a "check" subcommand
    which analyzes every .o file and ensures the validity of its stack
    metadata.  It enforces a set of rules on asm code and C inline
    assembly code so that stack traces can be reliable.

    Currently it only checks frame pointer usage, but there are plans to
    add CFI validation for C files and CFI generation for asm files.

    For each function, it recursively follows all possible code paths
    and validates the correct frame pointer state at each instruction.

    It also follows code paths involving special sections, like
    .altinstructions, __jump_table, and __ex_table, which can add
    alternative execution paths to a given instruction (or set of
    instructions).  Similarly, it knows how to follow switch statements,
    for which gcc sometimes uses jump tables."

  When this new kernel option is enabled (it's disabled by default), the
  tool, if it finds any suspicious assembly code pattern, outputs
  warnings in compiler warning format:

    warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
    warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
    warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
    warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer

  ... so that scripts that pick up compiler warnings will notice them.
  All known warnings triggered by the tool are fixed by the tree, most
  of the commits in fact prepare the kernel to be warning-free.  Most of
  them are bugfixes or cleanups that stand on their own, but there are
  also some annotations of 'special' stack frames for justified cases
  such entries to JIT-ed code (BPF) or really special boot time code.

  There are two other long-term motivations behind this tool as well:

   - To improve the quality and reliability of kernel stack frames, so
     that they can be used for optimized live patching.

   - To create independent infrastructure to check the correctness of
     CFI stack frames at build time.  CFI debuginfo is notoriously
     unreliable and we cannot use it in the kernel as-is without extra
     checking done both on the kernel side and on the build side.

  The quality of kernel stack frames matters to debuggability as well,
  so IMO we can merge this without having to consider the live patching
  or CFI debuginfo angle"

* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  objtool: Only print one warning per function
  objtool: Add several performance improvements
  tools: Copy hashtable.h into tools directory
  objtool: Fix false positive warnings for functions with multiple switch statements
  objtool: Rename some variables and functions
  objtool: Remove superflous INIT_LIST_HEAD
  objtool: Add helper macros for traversing instructions
  objtool: Fix false positive warnings related to sibling calls
  objtool: Compile with debugging symbols
  objtool: Detect infinite recursion
  objtool: Prevent infinite recursion in noreturn detection
  objtool: Detect and warn if libelf is missing and don't break the build
  tools: Support relative directory path for 'O='
  objtool: Support CROSS_COMPILE
  x86/asm/decoder: Use explicitly signed chars
  objtool: Enable stack metadata validation on 64-bit x86
  objtool: Add CONFIG_STACK_VALIDATION option
  objtool: Add tool to perform compile-time stack metadata validation
  x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
  sched: Always inline context_switch()
  ...
2016-03-20 18:23:21 -07:00
Ard Biesheuvel
142b9e6c9d x86/kallsyms: fix GOLD link failure with new relative kallsyms table format
Commit 2213e9a66b ("kallsyms: add support for relative offsets in
kallsyms address table") changed the default kallsyms symbol table
format to use relative references rather than absolute addresses.

This reduces the size of the kallsyms symbol table by 50% on 64-bit
architectures, and further reduces the size of the relocation tables
used by relocatable kernels.  Since the memory footprint of the static
kernel image is always much smaller than 4 GB, these relative references
are assumed to be representable in 32 bits, even when the native word
size is 64 bits.

On 64-bit architectures, this obviously only works if the distance
between each relative reference and the chosen anchor point is
representable in 32 bits, and so the table generation code in
scripts/kallsyms.c scans the table for the lowest value that is covered
by the kernel text, and selects it as the anchor point.

However, when using the GOLD linker rather than the default BFD linker
to build the x86_64 kernel, the symbol phys_offset_64, which is the
result of arithmetic defined in the linker script, is emitted as a 'T'
rather than an 'A' type symbol, resulting in scripts/kallsyms.c to
mistake it for a suitable anchor point, even though it is far away from
the actual kernel image in the virtual address space.  This results in
out-of-range warnings from scripts/kallsyms.c and a broken build.

So let's align with the BFD linker, and emit the phys_offset_[32|64]
symbols as absolute symbols explicitly.  Note that the out of range
issue does not exist on 32-bit x86, but this patch changes both symbols
for symmetry.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-20 13:52:37 -07:00
Linus Torvalds
3c2de27d79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - Preparations of parallel lookups (the remaining main obstacle is the
   need to move security_d_instantiate(); once that becomes safe, the
   rest will be a matter of rather short series local to fs/*.c

 - preadv2/pwritev2 series from Christoph

 - assorted fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  splice: handle zero nr_pages in splice_to_pipe()
  vfs: show_vfsstat: do not ignore errors from show_devname method
  dcache.c: new helper: __d_add()
  don't bother with __d_instantiate(dentry, NULL)
  untangle fsnotify_d_instantiate() a bit
  uninline d_add()
  replace d_add_unique() with saner primitive
  quota: use lookup_one_len_unlocked()
  cifs_get_root(): use lookup_one_len_unlocked()
  nfs_lookup: don't bother with d_instantiate(dentry, NULL)
  kill dentry_unhash()
  ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
  autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
  configfs: move d_rehash() into configfs_create() for regular files
  ceph: don't bother with d_rehash() in splice_dentry()
  namei: teach lookup_slow() to skip revalidate
  namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
  lookup_one_len_unlocked(): use lookup_dcache()
  namei: simplify invalidation logics in lookup_dcache()
  namei: change calling conventions for lookup_{fast,slow} and follow_managed()
  ...
2016-03-19 18:52:29 -07:00