Commit Graph

20126 Commits

Author SHA1 Message Date
Linus Torvalds
e4e65676f2 Fixes and features for 3.18.
Apart from the usual cleanups, here is the summary of new features:
 
 - s390 moves closer towards host large page support
 
 - PowerPC has improved support for debugging (both inside the guest and
   via gdbstub) and support for e6500 processors
 
 - ARM/ARM64 support read-only memory (which is necessary to put firmware
   in emulated NOR flash)
 
 - x86 has the usual emulator fixes and nested virtualization improvements
   (including improved Windows support on Intel and Jailhouse hypervisor
   support on AMD), adaptive PLE which helps overcommitting of huge guests.
   Also included are some patches that make KVM more friendly to memory
   hot-unplug, and fixes for rare caching bugs.
 
 Two patches have trivial mm/ parts that were acked by Rik and Andrew.
 
 Note: I will soon switch to a subkey for signing purposes.  To verify
 future signed pull requests from me, please update my key with
 "gpg --recv-keys 9B4D86F2".  You should see 3 new subkeys---the
 one for signing will be a 2048-bit RSA key, 4E6B09D7.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJUL5sPAAoJEBvWZb6bTYbyfkEP/3MNhSyn6HCjPjtjLNPAl9KL
 WpExZSUFL2+4CztpdGIsek1BeJYHmqv3+c5S+WvaWVA1aqh2R7FT1D1ErBLjgLQq
 lq23IOr+XxmC3dXQUEEk+TlD+283UzypzEG4l4UD3JYg79fE3UrXAz82SeyewJDY
 x7aPYhkZG3RHu+wAyMPasG6E3zS5LySdUtGWbiPwz5BejrhBJoJdeb2WIL/RwnUK
 7ppSLB5EoFj/uMkuyeAAdAbdfSrhHA6faDZxNdxS9k9wGutrhhfUoQ49ONrKG4dV
 sFo1tSPTVgRs8QFYUZ2fJUPBAmUVddsgqh2K9d0NftGTq7b8YszaCsfFrs2/Y4MU
 YxssWEhxsfszerCu12bbAJrv6JBZYQ7TwGvI9L7P0iFU6IVw/djmukU4AkM9/e91
 YS/cue/PN+9Pn2ccXzL9J7xRtZb8FsOuRsCXTCmbOwDkLmrKPDBN2t3RUbeF+Eam
 ABrpWnLKX13kZSo4LKU+/niarzmPMp7odQfHVdr8ea0fiYLp4iN8puA20WaSPIgd
 CLvm+RAvXe5Lm91L4mpFotJ2uFyK6QlIYJV4FsgeWv/0D0qppWQi0Utb/aCNHCgy
 z8MyUMD48y7EpoQrFYr/7cddXIu0/NegnM8I1coVjIPEk4NfeebGUlCJ/V3D8wMG
 BgEfS2x6jRc5zB3hjwDr
 =iEVi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Fixes and features for 3.18.

  Apart from the usual cleanups, here is the summary of new features:

   - s390 moves closer towards host large page support

   - PowerPC has improved support for debugging (both inside the guest
     and via gdbstub) and support for e6500 processors

   - ARM/ARM64 support read-only memory (which is necessary to put
     firmware in emulated NOR flash)

   - x86 has the usual emulator fixes and nested virtualization
     improvements (including improved Windows support on Intel and
     Jailhouse hypervisor support on AMD), adaptive PLE which helps
     overcommitting of huge guests.  Also included are some patches that
     make KVM more friendly to memory hot-unplug, and fixes for rare
     caching bugs.

  Two patches have trivial mm/ parts that were acked by Rik and Andrew.

  Note: I will soon switch to a subkey for signing purposes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (157 commits)
  kvm: do not handle APIC access page if in-kernel irqchip is not in use
  KVM: s390: count vcpu wakeups in stat.halt_wakeup
  KVM: s390/facilities: allow TOD-CLOCK steering facility bit
  KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
  arm/arm64: KVM: Report correct FSC for unsupported fault types
  arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
  kvm: Fix kvm_get_page_retry_io __gup retval check
  arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
  kvm: x86: Unpin and remove kvm_arch->apic_access_page
  kvm: vmx: Implement set_apic_access_page_addr
  kvm: x86: Add request bit to reload APIC access page address
  kvm: Add arch specific mmu notifier for page invalidation
  kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
  kvm: Fix page ageing bugs
  kvm/x86/mmu: Pass gfn and level to rmapp callback.
  x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
  kvm: x86: use macros to compute bank MSRs
  KVM: x86: Remove debug assertion of non-PAE reserved bits
  kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
  kvm: Faults which trigger IO release the mmap_sem
  ...
2014-10-08 05:27:39 -04:00
Andi Kleen
2dee5c43da x86: Fix section conflict for numachip
A variable cannot be both __read_mostly and const. This
is a meaningless combination.

Just make it only const.

This fixes the LTO build with numachip enabled.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1411533139-25708-1-git-send-email-andi@firstfloor.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 11:18:49 +02:00
Ben Hutchings
0e6d3112a4 x86: Reject x32 executables if x32 ABI not supported
It is currently possible to execve() an x32 executable on an x86_64
kernel that has only ia32 compat enabled.  However all its syscalls
will fail, even _exit().  This usually causes it to segfault.

Change the ELF compat architecture check so that x32 executables are
rejected if we don't support the x32 ABI.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Link: http://lkml.kernel.org/r/1410120305.6822.9.camel@decadent.org.uk
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 11:17:42 +02:00
Bryan O'Donoghue
aece118e48 x86: Add cpu_detect_cache_sizes to init_intel() add Quark legacy_cache()
Intel processors which don't report cache information via cpuid(2)
or cpuid(4) need quirk code in the legacy_cache_size callback to
report this data. For Intel that callback is is intel_size_cache().

This patch enables calling of cpu_detect_cache_sizes() inside of
init_intel() and hence the calling of the legacy_cache callback in
intel_size_cache(). Adding this call will ensure that PIII Tualatin
currently in intel_size_cache() and Quark SoC X1000 being added to
intel_size_cache() in this patch will report their respective cache
sizes.

This model of calling cpu_detect_cache_sizes() is consistent with
AMD/Via/Cirix/Transmeta and Centaur.

Also added is a string to idenitfy the Quark as Quark SoC X1000
giving better and more descriptive output via /proc/cpuinfo

Adding cpu_detect_cache_sizes to init_intel() will enable calling
of intel_size_cache() on Intel processors which currently no code
can reach. Therefore this patch will also re-enable reporting
of PIII Tualatin cache size information as well as add
Quark SoC X1000 support.

Comment text and cache flow logic suggested by Thomas Gleixner

Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: davej@redhat.com
Cc: hmh@hmh.eng.br
Link: http://lkml.kernel.org/r/1412641189-12415-3-git-send-email-pure.logic@nexus-software.ie
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 10:07:46 +02:00
Bryan O'Donoghue
2075244f9b x86: Quark: Comment setup_arch() to document TLB/PGE bug
Quark SoC X1000 advertises Page Global Enable for it's
Translation Lookaside Buffer via cpuid. The silicon does not
in fact support PGE and hence will not flush the TLB when CR4.PGE
is rewritten. The Quark documentation makes clear the necessity to
instead rewrite CR3 in order to flush any TLB entries, irrespective
of the state of CR4.PGE or an individual PTE.PGE

See Intel Quark Core DevMan_001.pdf section 6.4.11

In setup.c setup_arch() the code will load_cr3() and then do a
__flush_tlb_all().

On Quark the entire TLB will be flushed at the load_cr3().
The __flush_tlb_all() have no effect and can be safely ignored.

Later on in the boot process we switch off the flag for cpu_has_pge()
which means that subsequent calls to __flush_tlb_all() will
call __flush_tlb() not __flush_tlb_global() flushing the TLB in the
correct way via load_cr3() not CR4.PGE rewrite

This patch documents the behaviour of flushing the TLB for Quark in
setup_arch()

Comment text suggested by Thomas Gleixner

Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: davej@redhat.com
Cc: hmh@hmh.eng.br
Link: http://lkml.kernel.org/r/1412641189-12415-2-git-send-email-pure.logic@nexus-software.ie
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 10:07:46 +02:00
Jan Beulich
5f1d919a8c x86: Improve cmpxchg8b_emu.S
- don't include unneeded headers
- drop redundant entry point label
- complete unwind annotations
- use .L prefix on local labels to not clutter the symbol table

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/5422917E0200007800038081@mail.emea.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 10:05:49 +02:00
Jan Beulich
3f63572187 x86: Improve cmpxchg16b_emu.S
- don't include unneeded headers
- don't open-code PER_CPU_VAR()
- drop redundant entry point label
- complete unwind annotations
- use .L prefix on local label to not clutter the symbol table

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/542290BC020000780003807D@mail.emea.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 10:05:49 +02:00
Linus Torvalds
74da38631a Tinification for 3.18
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJUL0J0AAoJEA7Zo9+K/4c9w40P/iMFPfCethdBtPz5rI88CVr2
 7yU99TdbEPoRJm+rU4ohvHdB73p2KWINIKvpSThvegvjXbEcKxQkdpVWHsFJZeHS
 bZiYmhjxdCBvJGLrYo5IwqH0PrSjokTPzMUekUCk7BkUKNJRaDjfUBHvUmKsinUR
 dQL+3KE3edy6W3DL+FOd0QZwSOgmOfEibTWpfmg+n16kFNa75Kg/QLwjYRvtQplP
 eElywDZN07IhAeBFqKhKvlKmDSAeqMd8RfoPPo9Ts+reeIrWYjVNbl9ISOqXqy2x
 JoLeZQmwSXj/C9Ehr5e+aId2eO8In5xueQfXP8SS8dCC7VLwRbnNgyAQQZEslEBk
 QH0GhT6GqTamBdiNI3I+usfs65cEaialXh2afcoLwGS/iGD8MhZ8Dt+m4iyXNxEZ
 kT9VA4974mPjJ1g0mDDnYIxNjxF43m+SD5K1sR/XGpMcA8NdqMUmvKNcbePCobVa
 WTutIemQqGipNeWE94XwZEbc0B+aWwH7eiZOBMVGhWsHInd7QeTBTbfZlctyBkzf
 AswgsFjC5FW05CWK6J1Lf/UI1FD9PmHMKpmQUPED1+7okDTfqGjKjdREWgZSixUt
 LIRfWqWEaNpRRBFbDyt0C+F4pBRPLiRDaOyNhwEdtXuVGKRXb1G3qX7nFOJAZo6G
 GDTZo9iIRNSfm/M4tJ+n
 =2VyW
 -----END PGP SIGNATURE-----

Merge tag 'tiny/for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux

Pull "tinification" patches from Josh Triplett.

Work on making smaller kernels.

* tag 'tiny/for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux:
  bloat-o-meter: Ignore syscall aliases SyS_ and compat_SyS_
  mm: Support compiling out madvise and fadvise
  x86: Support compiling out human-friendly processor feature names
  x86: Drop support for /proc files when !CONFIG_PROC_FS
  x86, boot: Don't compile early_serial_console.c when !CONFIG_EARLY_PRINTK
  x86, boot: Don't compile aslr.c when !CONFIG_RANDOMIZE_BASE
  x86, boot: Use the usual -y -n mechanism for objects in vmlinux
  x86: Add "make tinyconfig" to configure the tiniest possible kernel
  x86, platform, kconfig: move kvmconfig functionality to a helper
2014-10-07 08:51:59 -04:00
Rafael J. Wysocki
88b42a4883 Merge branch 'pm-genirq'
* pm-genirq:
  PM / genirq: Document rules related to system suspend and interrupts
  PCI / PM: Make PCIe PME interrupts wake up from suspend-to-idle
  x86 / PM: Set IRQCHIP_SKIP_SET_WAKE for IOAPIC IRQ chip objects
  genirq: Simplify wakeup mechanism
  genirq: Mark wakeup sources as armed on suspend
  genirq: Create helper for flow handler entry check
  genirq: Distangle edge handler entry
  genirq: Avoid double loop on suspend
  genirq: Move MASK_ON_SUSPEND handling into suspend_device_irqs()
  genirq: Make use of pm misfeature accounting
  genirq: Add sanity checks for PM options on shared interrupt lines
  genirq: Move suspend/resume logic into irq/pm code
  PM / sleep: Mechanism for aborting system suspends unconditionally
2014-10-07 01:17:21 +02:00
Andy Lutomirski
8c7aa698ba x86_64, entry: Filter RFLAGS.NT on entry from userspace
The NT flag doesn't do anything in long mode other than causing IRET
to #GP.  Oddly, CPL3 code can still set NT using popf.

Entry via hardware or software interrupt clears NT automatically, so
the only relevant entries are fast syscalls.

If user code causes kernel code to run with NT set, then there's at
least some (small) chance that it could cause trouble.  For example,
user code could cause a call to EFI code with NT set, and who knows
what would happen?  Apparently some games on Wine sometimes do
this (!), and, if an IRET return happens, they will segfault.  That
segfault cannot be handled, because signal delivery fails, too.

This patch programs the CPU to clear NT on entry via SYSCALL (both
32-bit and 64-bit, by my reading of the AMD APM), and it clears NT
in software on entry via SYSENTER.

To save a few cycles, this borrows a trick from Jan Beulich in Xen:
it checks whether NT is set before trying to clear it.  As a result,
it seems to have very little effect on SYSENTER performance on my
machine.

There's another minor bug fix in here: it looks like the CFI
annotations were wrong if CONFIG_AUDITSYSCALL=n.

Testers beware: on Xen, SYSENTER with NT set turns into a GPF.

I haven't touched anything on 32-bit kernels.

The syscall mask change comes from a variant of this patch by Anish
Bhatt.

Note to stable maintainers: there is no known security issue here.
A misguided program can set NT and cause the kernel to try and fail
to deliver SIGSEGV, crashing the program.  This patch fixes Far Cry
on Wine: https://bugs.winehq.org/show_bug.cgi?id=33275

Cc: <stable@vger.kernel.org>
Reported-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/395749a5d39a29bd3e4b35899cf3a3c1340e5595.1412189265.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-10-06 10:53:26 -07:00
Mukesh Rathor
a2ef5dc2c7 x86/xen: Set EFER.NX and EFER.SCE in PVH guests
This fixes two bugs in PVH guests:

  - Not setting EFER.NX means the NX bit in page table entries is
    ignored on Intel processors and causes reserved bit page faults on
    AMD processors.

  - After the Xen commit 7645640d6ff1 ("x86/PVH: don't set EFER_SCE for
    pvh guest") PVH guests are required to set EFER.SCE to enable the
    SYSCALL instruction.

Secondary VCPUs are started with pagetables with the NX bit set so
EFER.NX must be set before using any stack or data segment.
xen_pvh_cpu_early_init() is the new secondary VCPU entry point that
sets EFER before jumping to cpu_bringup_and_idle().

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-06 10:27:47 +01:00
Matt Fleming
75b128573b Merge branch 'next' into efi-next-merge
Conflicts:
	arch/x86/boot/compressed/eboot.c
2014-10-03 22:15:56 +01:00
Matt Fleming
60b4dc7720 efi: Delete the in_nmi() conditional runtime locking
commit 5dc3826d9f08 ("efi: Implement mandatory locking for UEFI Runtime
Services") implemented some conditional locking when accessing variable
runtime services that Ingo described as "pretty disgusting".

The intention with the !efi_in_nmi() checks was to avoid live-locks when
trying to write pstore crash data into an EFI variable. Such lockless
accesses are allowed according to the UEFI specification when we're in a
"non-recoverable" state, but whether or not things are implemented
correctly in actual firmware implementations remains an unanswered
question, and so it would seem sensible to avoid doing any kind of
unsynchronized variable accesses.

Furthermore, the efi_in_nmi() tests are inadequate because they don't
account for the case where we call EFI variable services from panic or
oops callbacks and aren't executing in NMI context. In other words,
live-locking is still possible.

Let's just remove the conditional locking altogether. Now we've got the
->set_variable_nonblocking() EFI variable operation we can abort if the
runtime lock is already held. Aborting is by far the safest option.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:03 +01:00
Andre Müller
77e21e87ac x86/efi: Adding efi_printks on memory allocationa and pci.reads
All other calls to allocate memory seem to make some noise already, with the
exception of two calls (for gop, uga) in the setup_graphics path.

The purpose is to be noisy on worrysome errors immediately.

commit fb86b2440d ("x86/efi: Add better error logging to EFI boot
stub") introduces printing false alarms for lots of hardware. Rather
than playing Whack a Mole with non-fatal exit conditions, try the other
way round.

This is per Matt Fleming's suggestion:

> Where I think we could improve things
> is by adding efi_printk() message in certain error paths. Clearly, not
> all error paths need such messages, e.g. the EFI_INVALID_PARAMETER path
> you highlighted above, but it makes sense for memory allocation and PCI
> read failures.

Link: http://article.gmane.org/gmane.linux.kernel.efi/4628
Signed-off-by: Andre Müller <andre.muller@web.de>
Cc: Ulf Winkelvos <ulf@winkelvos.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:03 +01:00
Mathias Krause
4e78eb0561 x86/efi: Mark initialization code as such
The 32 bit and 64 bit implementations differ in their __init annotations
for some functions referenced from the common EFI code. Namely, the 32
bit variant is missing some of the __init annotations the 64 bit variant
has.

To solve the colliding annotations, mark the corresponding functions in
efi_32.c as initialization code, too -- as it is such.

Actually, quite a few more functions are only used during initialization
and therefore can be marked __init. They are therefore annotated, too.
Also add the __init annotation to the prototypes in the efi.h header so
users of those functions will see it's meant as initialization code
only.

This patch also fixes the "prelog" typo. ("prologue" / "epilogue" might
be more appropriate but this is C code after all, not an opera! :D)

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:03 +01:00
Mathias Krause
0ce4605c9a x86/efi: Update comment regarding required phys mapped EFI services
Commit 3f4a7836e3 ("x86/efi: Rip out phys_efi_get_time()") left
set_virtual_address_map as the only runtime service needed with a
phys mapping but missed to update the preceding comment. Fix that.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:02 +01:00
Mathias Krause
6092068547 x86/efi: Unexport add_efi_memmap variable
This variable was accidentally exported, even though it's only used in
this compilation unit and only during initialization.

Remove the bogus export, make the variable static instead and mark it
as __initdata.

Fixes: 200001eb14 ("x86 boot: only pick up additional EFI memmap...")
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:02 +01:00
Mathias Krause
24ffd84b60 x86/efi: Remove unused efi_call* macros
Complement commit 62fa6e69a4 ("x86/efi: Delete most of the efi_call*
macros") and delete the stub macros for the !CONFIG_EFI case, too. In
fact, there are no EFI calls in this case so we don't need a dummy for
efi_call() even.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:02 +01:00
Laszlo Ersek
ace1d1218d x86: efi: Format EFI memory type & attrs with efi_md_typeattr_format()
An example log excerpt demonstrating the change:

Before the patch:

> efi: mem00: type=7, attr=0xf, range=[0x0000000000000000-0x000000000009f000) (0MB)
> efi: mem01: type=2, attr=0xf, range=[0x000000000009f000-0x00000000000a0000) (0MB)
> efi: mem02: type=7, attr=0xf, range=[0x0000000000100000-0x0000000000400000) (3MB)
> efi: mem03: type=2, attr=0xf, range=[0x0000000000400000-0x0000000000800000) (4MB)
> efi: mem04: type=10, attr=0xf, range=[0x0000000000800000-0x0000000000808000) (0MB)
> efi: mem05: type=7, attr=0xf, range=[0x0000000000808000-0x0000000000810000) (0MB)
> efi: mem06: type=10, attr=0xf, range=[0x0000000000810000-0x0000000000900000) (0MB)
> efi: mem07: type=4, attr=0xf, range=[0x0000000000900000-0x0000000001100000) (8MB)
> efi: mem08: type=7, attr=0xf, range=[0x0000000001100000-0x0000000001400000) (3MB)
> efi: mem09: type=2, attr=0xf, range=[0x0000000001400000-0x0000000002613000) (18MB)
> efi: mem10: type=7, attr=0xf, range=[0x0000000002613000-0x0000000004000000) (25MB)
> efi: mem11: type=4, attr=0xf, range=[0x0000000004000000-0x0000000004020000) (0MB)
> efi: mem12: type=7, attr=0xf, range=[0x0000000004020000-0x00000000068ea000) (40MB)
> efi: mem13: type=2, attr=0xf, range=[0x00000000068ea000-0x00000000068f0000) (0MB)
> efi: mem14: type=3, attr=0xf, range=[0x00000000068f0000-0x0000000006c7b000) (3MB)
> efi: mem15: type=6, attr=0x800000000000000f, range=[0x0000000006c7b000-0x0000000006c7d000) (0MB)
> efi: mem16: type=5, attr=0x800000000000000f, range=[0x0000000006c7d000-0x0000000006c85000) (0MB)
> efi: mem17: type=6, attr=0x800000000000000f, range=[0x0000000006c85000-0x0000000006c87000) (0MB)
> efi: mem18: type=3, attr=0xf, range=[0x0000000006c87000-0x0000000006ca3000) (0MB)
> efi: mem19: type=6, attr=0x800000000000000f, range=[0x0000000006ca3000-0x0000000006ca6000) (0MB)
> efi: mem20: type=10, attr=0xf, range=[0x0000000006ca6000-0x0000000006cc6000) (0MB)
> efi: mem21: type=6, attr=0x800000000000000f, range=[0x0000000006cc6000-0x0000000006d95000) (0MB)
> efi: mem22: type=5, attr=0x800000000000000f, range=[0x0000000006d95000-0x0000000006e22000) (0MB)
> efi: mem23: type=7, attr=0xf, range=[0x0000000006e22000-0x0000000007165000) (3MB)
> efi: mem24: type=4, attr=0xf, range=[0x0000000007165000-0x0000000007d22000) (11MB)
> efi: mem25: type=7, attr=0xf, range=[0x0000000007d22000-0x0000000007d25000) (0MB)
> efi: mem26: type=3, attr=0xf, range=[0x0000000007d25000-0x0000000007ea2000) (1MB)
> efi: mem27: type=5, attr=0x800000000000000f, range=[0x0000000007ea2000-0x0000000007ed2000) (0MB)
> efi: mem28: type=6, attr=0x800000000000000f, range=[0x0000000007ed2000-0x0000000007ef6000) (0MB)
> efi: mem29: type=7, attr=0xf, range=[0x0000000007ef6000-0x0000000007f00000) (0MB)
> efi: mem30: type=9, attr=0xf, range=[0x0000000007f00000-0x0000000007f02000) (0MB)
> efi: mem31: type=10, attr=0xf, range=[0x0000000007f02000-0x0000000007f06000) (0MB)
> efi: mem32: type=4, attr=0xf, range=[0x0000000007f06000-0x0000000007fd0000) (0MB)
> efi: mem33: type=6, attr=0x800000000000000f, range=[0x0000000007fd0000-0x0000000007ff0000) (0MB)
> efi: mem34: type=7, attr=0xf, range=[0x0000000007ff0000-0x0000000008000000) (0MB)

After the patch:

> efi: mem00: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x000000000009f000) (0MB)
> efi: mem01: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x000000000009f000-0x00000000000a0000) (0MB)
> efi: mem02: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000100000-0x0000000000400000) (3MB)
> efi: mem03: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000400000-0x0000000000800000) (4MB)
> efi: mem04: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000800000-0x0000000000808000) (0MB)
> efi: mem05: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000808000-0x0000000000810000) (0MB)
> efi: mem06: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000810000-0x0000000000900000) (0MB)
> efi: mem07: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000900000-0x0000000001100000) (8MB)
> efi: mem08: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000001100000-0x0000000001400000) (3MB)
> efi: mem09: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000001400000-0x0000000002613000) (18MB)
> efi: mem10: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000002613000-0x0000000004000000) (25MB)
> efi: mem11: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000004000000-0x0000000004020000) (0MB)
> efi: mem12: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000004020000-0x00000000068ea000) (40MB)
> efi: mem13: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x00000000068ea000-0x00000000068f0000) (0MB)
> efi: mem14: [Boot Code          |   |  |  |  |   |WB|WT|WC|UC] range=[0x00000000068f0000-0x0000000006c7b000) (3MB)
> efi: mem15: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c7b000-0x0000000006c7d000) (0MB)
> efi: mem16: [Runtime Code       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c7d000-0x0000000006c85000) (0MB)
> efi: mem17: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c85000-0x0000000006c87000) (0MB)
> efi: mem18: [Boot Code          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c87000-0x0000000006ca3000) (0MB)
> efi: mem19: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006ca3000-0x0000000006ca6000) (0MB)
> efi: mem20: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000006ca6000-0x0000000006cc6000) (0MB)
> efi: mem21: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006cc6000-0x0000000006d95000) (0MB)
> efi: mem22: [Runtime Code       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006d95000-0x0000000006e22000) (0MB)
> efi: mem23: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000006e22000-0x0000000007165000) (3MB)
> efi: mem24: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007165000-0x0000000007d22000) (11MB)
> efi: mem25: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007d22000-0x0000000007d25000) (0MB)
> efi: mem26: [Boot Code          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007d25000-0x0000000007ea2000) (1MB)
> efi: mem27: [Runtime Code       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ea2000-0x0000000007ed2000) (0MB)
> efi: mem28: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ed2000-0x0000000007ef6000) (0MB)
> efi: mem29: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ef6000-0x0000000007f00000) (0MB)
> efi: mem30: [ACPI Reclaim Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007f00000-0x0000000007f02000) (0MB)
> efi: mem31: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007f02000-0x0000000007f06000) (0MB)
> efi: mem32: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007f06000-0x0000000007fd0000) (0MB)
> efi: mem33: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000007fd0000-0x0000000007ff0000) (0MB)
> efi: mem34: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ff0000-0x0000000008000000) (0MB)

Both the type enum and the attribute bitmap are decoded, with the
additional benefit that the memory ranges line up as well.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:41:01 +01:00
Dave Young
a5a750a98f x86/efi: Clear EFI_RUNTIME_SERVICES if failing to enter virtual mode
If enter virtual mode failed due to some reason other than the efi call
the EFI_RUNTIME_SERVICES bit in efi.flags should be cleared thus users
of efi runtime services can check the bit and handle the case instead of
assume efi runtime is ok.

Per Matt, if efi call SetVirtualAddressMap fails we will be not sure
it's safe to make any assumptions about the state of the system. So
kernel panics instead of clears EFI_RUNTIME_SERVICES bit.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:59 +01:00
Dave Young
5ae3683c38 efi: Add kernel param efi=noruntime
noefi kernel param means actually disabling efi runtime, Per suggestion
from Leif Lindholm efi=noruntime should be better. But since noefi is
already used in X86 thus just adding another param efi=noruntime for
same purpose.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:59 +01:00
Dave Young
6ccc72b87b lib: Add a generic cmdline parse function parse_option_str
There should be a generic function to parse params like a=b,c
Adding parse_option_str in lib/cmdline.c which will return true
if there's specified option set in the params.

Also updated efi=old_map parsing code to use the new function

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:58 +01:00
Dave Young
b2e0a54a12 efi: Move noefi early param code out of x86 arch code
noefi param can be used for arches other than X86 later, thus move it
out of x86 platform code.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:58 +01:00
Josh Triplett
1282278ee0 efi-bgrt: Add error handling; inform the user when ignoring the BGRT
Gracefully handle failures to allocate memory for the image, which might
be arbitrarily large.

efi_bgrt_init can fail in various ways as well, usually because the
BIOS-provided BGRT structure does not match expectations.  Add
appropriate error messages rather than failing silently.

Reported-by: Srihari Vijayaraghavan <linux.bug.reporting@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81321
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:58 +01:00
Matt Fleming
5a17dae422 efi: Add efi= parameter parsing to the EFI boot stub
We need a way to customize the behaviour of the EFI boot stub, in
particular, we need a way to disable the "chunking" workaround, used
when reading files from the EFI System Partition.

One of my machines doesn't cope well when reading files in 1MB chunks to
a buffer above the 4GB mark - it appears that the "chunking" bug
workaround triggers another firmware bug. This was only discovered with
commit 4bf7111f50 ("x86/efi: Support initrd loaded above 4G"), and
that commit is perfectly valid. The symptom I observed was a corrupt
initrd rather than any kind of crash.

efi= is now used to specify EFI parameters in two very different
execution environments, the EFI boot stub and during kernel boot.

There is also a slight performance optimization by enabling efi=nochunk,
but that's offset by the fact that you're more likely to run into
firmware issues, at least on x86. This is the rationale behind leaving
the workaround enabled by default.

Also provide some documentation for EFI_READ_CHUNK_SIZE and why we're
using the current value of 1MB.

Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Roy Franz <roy.franz@linaro.org>
Cc: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:57 +01:00
Ard Biesheuvel
161485e827 efi: Implement mandatory locking for UEFI Runtime Services
According to section 7.1 of the UEFI spec, Runtime Services are not fully
reentrant, and there are particular combinations of calls that need to be
serialized. Use a spinlock to serialize all Runtime Services with respect
to all others, even if this is more than strictly needed.

We've managed to get away without requiring a runtime services lock
until now because most of the interactions with EFI involve EFI
variables, and those operations are already serialised with
__efivars->lock.

Some of the assumptions underlying the decision whether locks are
needed or not (e.g., SetVariable() against ResetSystem()) may not
apply universally to all [new] architectures that implement UEFI.
Rather than try to reason our way out of this, let's just implement at
least what the spec requires in terms of locking.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-10-03 18:40:57 +01:00
Juergen Gross
d1e9abd630 xen: eliminate scalability issues from initrd handling
Size restrictions native kernels wouldn't have resulted from the initrd
getting mapped into the initial mapping. The kernel doesn't really need
the initrd to be mapped, so use infrastructure available in Xen to avoid
the mapping and hence the restriction.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-03 12:34:52 +01:00
Pranith Kumar
2291059c85 locking,arch: Use ACCESS_ONCE() instead of cast to volatile in atomic_read()
Use the much more reader friendly ACCESS_ONCE() instead of the cast to volatile.
This is purely a stylistic change.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/1411482607-20948-1-git-send-email-bobby.prani@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03 06:06:23 +02:00
Wei Huang
cc6cd47e73 perf/x86: Tone down kernel messages when the PMU check fails in a virtual environment
PMU checking can fail due to various reasons. On native machine, this
is mostly caused by faulty hardware and it is reasonable to use
KERN_ERR in reporting. However, when kernel is running on virtualized
environment, this checking can fail if virtual PMU is not supported
(e.g. KVM on AMD host). It is annoying to see an error message on
splash screen, even though we know such failure is benign on
virtualized environment.

This patch checks if the kernel is running in a virtualized environment.
If so, it will use KERN_INFO in reporting, which reduces the syslog
priority of them. This patch was tested successfully on KVM.

Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1411617314-24659-1-git-send-email-wei@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03 06:04:41 +02:00
Andi Kleen
4f971248bc perf/x86/intel/uncore: Fix minor race in box set up
I was looking for the trinity oops cause in the uncore driver.
(so far didn't found it)

However I found this tiny race: when a box is set up two threads on the
same CPU, they may be setting up the box in parallel (e.g. with kernel
preemption). This could lead to the reference count being increasing
too much. Always recheck there is no existing cpu reference inside the lock.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1411424826-15629-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03 06:02:49 +02:00
Dave Hansen
728e5653e6 sched/x86: Fix up typo in topology detection
Commit:

  cebf15eb09 ("x86, sched: Add new topology for multi-NUMA-node CPUs")

some code to try to detect the situation where we have a NUMA node
inside of the "DIE" sched domain.

It detected this by looking for cpus which match_die() but do not match
NUMA nodes via topology_same_node().

I wrote it up as:

	if (match_die(c, o) == !topology_same_node(c, o))

which actually seemed to work some of the time, albiet
accidentally.

It should have been doing an &&, not an ==.

This code essentially chopped off the "DIE" domain on one of
Andrew Morton's systems.  He reported that this patch fixed his
issue.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Lan Tianyu <tianyu.lan@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/20140930214546.FD481CFF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03 05:46:52 +02:00
David S. Miller
739e4a758e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	net/netfilter/nfnetlink.c

Both r8152 and nfnetlink conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-02 11:25:43 -07:00
Paolo Bonzini
f439ed27f8 kvm: do not handle APIC access page if in-kernel irqchip is not in use
This fixes the following OOPS:

   loaded kvm module (v3.17-rc1-168-gcec26bc)
   BUG: unable to handle kernel paging request at fffffffffffffffe
   IP: [<ffffffff81168449>] put_page+0x9/0x30
   PGD 1e15067 PUD 1e17067 PMD 0
   Oops: 0000 [#1] PREEMPT SMP
    [<ffffffffa063271d>] ? kvm_vcpu_reload_apic_access_page+0x5d/0x70 [kvm]
    [<ffffffffa013b6db>] vmx_vcpu_reset+0x21b/0x470 [kvm_intel]
    [<ffffffffa0658816>] ? kvm_pmu_reset+0x76/0xb0 [kvm]
    [<ffffffffa064032a>] kvm_vcpu_reset+0x15a/0x1b0 [kvm]
    [<ffffffffa06403ac>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
    [<ffffffffa062e540>] kvm_vm_ioctl+0x200/0x780 [kvm]
    [<ffffffff81212170>] do_vfs_ioctl+0x2d0/0x4b0
    [<ffffffff8108bd99>] ? __mmdrop+0x69/0xb0
    [<ffffffff812123d1>] SyS_ioctl+0x81/0xa0
    [<ffffffff8112a6f6>] ? __audit_syscall_exit+0x1f6/0x2a0
    [<ffffffff817229e9>] system_call_fastpath+0x16/0x1b
   Code: c6 78 ce a3 81 4c 89 e7 e8 d9 80 ff ff 0f 0b 4c 89 e7 e8 8f f6 ff ff e9 fa fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74 27 f0
   RIP  [<ffffffff81193045>] put_page+0x5/0x50

when not using the in-kernel irqchip ("-machine kernel_irqchip=off"
with QEMU).  The fix is to make the same check in
kvm_vcpu_reload_apic_access_page that we already have
in vmx.c's vm_need_virtualize_apic_accesses().

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Fixes: 4256f43f9f
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-02 19:26:11 +02:00
Mathias Krause
5cfed7b335 Revert "crypto: aesni - disable "by8" AVX CTR optimization"
This reverts commit 7da4b29d49.

Now, that the issue is fixed, we can re-enable the code.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:40:28 +08:00
Herbert Xu
9561dccb45 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Merging the crypto tree for 3.17 to pull in the "by8" AVX CTR revert.
2014-10-02 14:37:20 +08:00
Mathias Krause
e3b3bb5ac1 crypto: aesni - remove unused defines in "by8" variant
The defines for xkey3, xkey6 and xkey9 are not used in the code. They're
probably left overs from merging the three source files for 128, 192 and
256 bit AES. They can safely be removed.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:35:03 +08:00
Mathias Krause
80dca4734b crypto: aesni - fix counter overflow handling in "by8" variant
The "by8" CTR AVX implementation fails to propperly handle counter
overflows. That was the reason it got disabled in commit 7da4b29d49
("crypto: aesni - disable "by8" AVX CTR optimization").

Fix the overflow handling by incrementing the counter block as a double
quad word, i.e. a 128 bit, and testing for overflows afterwards. We need
to use VPTEST to do so as VPADD* does not set the flags itself and
silently drops the carry bit.

As this change adds branches to the hot path, minor performance
regressions  might be a side effect. But, OTOH, we now have a conforming
implementation -- the preferable goal.

A tcrypt test on a SandyBridge system (i7-2620M) showed almost identical
numbers for the old and this version with differences within the noise
range. A dm-crypt test with the fixed version gave even slightly better
results for this version. So the performance impact might not be as big
as expected.

Tested-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:35:03 +08:00
Kees Cook
20cc28882b x86, boot, kaslr: Fix nuisance warning on 32-bit builds
Building 32-bit threw a warning on kASLR enabled builds:

arch/x86/boot/compressed/aslr.c: In function ‘mem_avoid_overlap’:
arch/x86/boot/compressed/aslr.c:198:17: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
   avoid.start = (u64)ptr;
                 ^

This fixes the warning; unsigned long should have been used here.

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20141001183632.GA11431@www.outflux.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-10-01 11:41:24 -07:00
Linus Torvalds
cd40fab6db Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "This has:

   - EFI revert to fix a boot regression
   - early_ioremap() fix for boot failure
   - KASLR fix for possible boot failures
   - EFI fix for corrupted string printing
   - remove a misleading EFI bootup 'failed!' error message

  Unfortunately it's all rather close to the merge window"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/efi: Truncate 64-bit values when calling 32-bit OutputString()
  x86/efi: Delete misleading efi_printk() error message
  Revert "efi/x86: efistub: Move shared dependencies to <asm/efi.h>"
  x86/kaslr: Avoid the setup_data area when picking location
  x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8
2014-09-27 14:23:13 -07:00
Alexei Starovoitov
749730ce42 bpf: enable bpf syscall on x64 and i386
done as separate commit to ease conflict resolution

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Linus Torvalds
d2865c7d4e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A CONFIG_STACK_GROWSUP=y fix, and a hotplug llc CPU mask fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix unreleased llc_shared_mask bit during CPU hotplug
  sched: Fix end_of_stack() and location of stack canary for architectures using CONFIG_STACK_GROWSUP
2014-09-26 08:38:09 -07:00
Ingo Molnar
29282ac0bd * Revert the static library changes from the merge window since they're
causing issues for Macbooks and Fedora + Grub2 - Matt Fleming
 
  * Delete the misleading "setup_efi_pci() failed!" message which some
    people are seeing when booting EFI - Matt Fleming
 
  * Fix printing strings from the 32-bit EFI boot stub by only passing
    32-bit addresses to the firmware - Matt Fleming
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUIzOFAAoJEC84WcCNIz1VpSwQAJhp9Yu60dWXyqRpV+ER1yau
 MWHXunkeaccbBXNABkzuDUqb2a6DRIJow+/n+dYjIGY6Nf9zzLQFQ+s/EsS+IiyY
 s4rRvAqzfGYk1d6xzgvEccrdD4fRP32kFKnSlZpfTRuJZHZieD+f2y6TP7D33Ja3
 HV/ivPQHZNjxgsExpcE8Rz/QyOZpqRacTr9Gr7IusBRL6IMCyycfCTmt6d5pf1iD
 kQGSGwIBvgN4xMqPUdxTNo31bQZA5ZeywNOh9WhdSCL7FAIDfG9TmXt/J7ckq5ax
 0f6X92qgCs3peLY+/szgSZ2LsZI6I/FM1udc2SFiIOPvCwwytcJ94Wro5pLYzZ4i
 SnqB2xLLEmsR2J3MXIeY0aVy2VtHT4bYRnXYNd9G0eaVfrlJ+4lgwqJavAJmtDZx
 88ey1R8LKRQr+ueSv/BnOvE6T2+38HrjrMooFQsPvolRR0S6MITBr8I2hoRASkUt
 YsA+7s6+tO2QBmQYrKCYSAi9A7onMA9Fh93dmv7XLqFw/SsfVm3RnrNhOVsO9kPC
 zIsWZoS+PGwb4RRvM2i7JAEqUCbuLpIAHYEU6gqprWm1ERHsX9mfFYfsJQHzHuOY
 rg6+wtWQ9MGxek8POac4d2mC+PwC4DA0AkaTTZQBcdAJu+h/gZNt4w7mpk5v4Th1
 QYr/otShMvc84Zd+RMeV
 =5mVJ
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent

Pull EFI fixes from Matt Fleming:

  * Revert the static library changes from the merge window since they're
    causing issues for Macbooks and Fedora + Grub2 (Matt Fleming)

  * Delete the misleading "setup_efi_pci() failed!" message which some
    people are seeing when booting EFI (Matt Fleming)

  * Fix printing strings from the 32-bit EFI boot stub by only passing
    32-bit addresses to the firmware (Matt Fleming)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-25 16:40:08 +02:00
Matt Fleming
115c6628a5 x86/efi: Truncate 64-bit values when calling 32-bit OutputString()
If we're executing the 32-bit efi_char16_printk() code path (i.e.
running on top of 32-bit firmware) we know that efi_early->text_output
will be a 32-bit value, even though ->text_output has type u64.

Unfortunately, we currently pass ->text_output directly to
efi_early->call() so for CONFIG_X86_32 the compiler will push a 64-bit
value onto the stack, causing the other parameters to be misaligned.

The way we handle this in the rest of the EFI boot stub is to pass
pointers as arguments to efi_early->call(), which automatically do the
right thing (pointers are 32-bit on CONFIG_X86_32, and we simply ignore
the upper 32-bits of the argument register if running in 64-bit mode
with 32-bit firmware).

This fixes a corruption bug when printing strings from the 32-bit EFI
boot stub.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=84241
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-09-24 21:56:46 +01:00
David S. Miller
4daaab4f0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-09-24 16:48:32 -04:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Linus Torvalds
2368a9426f Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes three issues:

   - if ccp is loaded on a machine without ccp, it will incorrectly
     activate causing all requests to fail.  Fixed by preventing ccp
     from loading if hardware isn't available.

   - not all IRQs were enabled for the qat driver, leading to potential
     stalls when it is used

   - disabled buggy AVX CTR implementation in aesni"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: aesni - disable "by8" AVX CTR optimization
  crypto: ccp - Check for CCP before registering crypto algs
  crypto: qat - Enable all 32 IRQs
2014-09-24 09:37:35 -07:00
Ben Hutchings
eeeda4cd06 x86/relocs: Make per_cpu_load_addr static
per_cpu_load_addr is only used for 64-bit relocations, but is
declared in both configurations of relocs.c - with different
types.  This has undefined behaviour in general.  GNU ld is
documented to use the larger size in this case, but other tools
may differ and some warn about this.

References: https://bugs.debian.org/748577
Reported-by: Michael Tautschnig <mt@debian.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 748577@bugs.debian.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411561812.3659.23.camel@decadent.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 15:17:47 +02:00
Oleg Nesterov
212be3b232 x86/lib/Makefile: Remove the unnecessary "+= thunk_64.o"
Trivial. We have "lib-y += thunk_$(BITS).o" at the start, no
need to add thunk_64.o if !CONFIG_X86_32.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921184232.GB23727@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 15:15:39 +02:00
Oleg Nesterov
0ad6e3c519 x86: Speed up ___preempt_schedule*() by using THUNK helpers
___preempt_schedule() does SAVE_ALL/RESTORE_ALL but this is
suboptimal, we do not need to save/restore the callee-saved
register. And we already have arch/x86/lib/thunk_*.S which
implements the similar asm wrappers, so it makes sense to
redefine ___preempt_schedule() as "THUNK ..." and remove
preempt.S altogether.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921184153.GA23727@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 15:15:38 +02:00
Mathias Krause
7da4b29d49 crypto: aesni - disable "by8" AVX CTR optimization
The "by8" implementation introduced in commit 22cddcc7df ("crypto: aes
- AES CTR x86_64 "by8" AVX optimization") is failing crypto tests as it
handles counter block overflows differently. It only accounts the right
most 32 bit as a counter -- not the whole block as all other
implementations do. This makes it fail the cryptomgr test #4 that
specifically tests this corner case.

As we're quite late in the release cycle, just disable the "by8" variant
for now.

Reported-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-09-24 21:15:31 +08:00
Wanpeng Li
03bd4e1f72 sched: Fix unreleased llc_shared_mask bit during CPU hotplug
The following bug can be triggered by hot adding and removing a large number of
xen domain0's vcpus repeatedly:

	BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [..] find_busiest_group
	PGD 5a9d5067 PUD 13067 PMD 0
	Oops: 0000 [#3] SMP
	[...]
	Call Trace:
	load_balance
	? _raw_spin_unlock_irqrestore
	idle_balance
	__schedule
	schedule
	schedule_timeout
	? lock_timer_base
	schedule_timeout_uninterruptible
	msleep
	lock_device_hotplug_sysfs
	online_store
	dev_attr_store
	sysfs_write_file
	vfs_write
	SyS_write
	system_call_fastpath

Last level cache shared mask is built during CPU up and the
build_sched_domain() routine takes advantage of it to setup
the sched domain CPU topology.

However, llc_shared_mask is not released during CPU disable,
which leads to an invalid sched domainCPU topology.

This patch fix it by releasing the llc_shared_mask correctly
during CPU disable.

Yasuaki also reported that this can happen on real hardware:

  https://lkml.org/lkml/2014/7/22/1018

His case is here:

	==
	Here is an example on my system.
	My system has 4 sockets and each socket has 15 cores and HT is
	enabled. In this case, each core of sockes is numbered as
	follows:

		 | CPU#
	Socket#0 | 0-14 , 60-74
	Socket#1 | 15-29, 75-89
	Socket#2 | 30-44, 90-104
	Socket#3 | 45-59, 105-119

	Then llc_shared_mask of CPU#30 has 0x3fff80000001fffc0000000.

	It means that last level cache of Socket#2 is shared with
	CPU#30-44 and 90-104.

	When hot-removing socket#2 and #3, each core of sockets is
	numbered as follows:

		 | CPU#
	Socket#0 | 0-14 , 60-74
	Socket#1 | 15-29, 75-89

	But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30
	remains having 0x3fff80000001fffc0000000.

	After that, when hot-adding socket#2 and #3, each core of
	sockets is numbered as follows:

		 | CPU#
	Socket#0 | 0-14 , 60-74
	Socket#1 | 15-29, 75-89
	Socket#2 | 30-59
	Socket#3 | 90-119

	Then llc_shared_mask of CPU#30 becomes
	0x3fff8000fffffffc0000000. It means that last level cache of
	Socket#2 is shared with CPU#30-59 and 90-104. So the mask has
	the wrong value.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Tested-by: Linn Crosetto <linn@hp.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 15:13:20 +02:00
Bryan O'Donoghue
ee1b5b165c x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 instead
Quark x1000 advertises PGE via the standard CPUID method
PGE bits exist in Quark X1000's PTEs. In order to flush
an individual PTE it is necessary to reload CR3 irrespective
of the PTE.PGE bit.

See Quark Core_DevMan_001.pdf section 6.4.11

This bug was fixed in Galileo kernels, unfixed vanilla kernels are expected to
crash and burn on this platform.

Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1411514784-14885-1-git-send-email-pure.logic@nexus-software.ie
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 15:06:15 +02:00
Lan Tianyu
2ed53c0d6c x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline during S3
With certain kernel configurations, CPU offline consumes more than
100ms during S3.

It's a timing related issue: native_cpu_die() would occasionally fall
into a 100ms sleep when the CPU idle loop thread marked the CPU state
to DEAD too slowly.

What native_cpu_die() does is that it polls the CPU state and waits
for 100ms if CPU state hasn't been marked to DEAD. The 100ms sleep
doesn't make sense and is purely historic.

To avoid such long sleeping, this patch adds a 'struct completion'
to each CPU, waits for the completion in native_cpu_die() and wakes
up the completion when the CPU state is marked to DEAD.

Tested on an Intel Xeon server with 48 cores, Ivybridge and on
Haswell laptops. The CPU offlining cost on these machines is
reduced from more than 100ms to less than 5ms. The system
suspend time is reduced by 2.3s on the servers.

Borislav and Prarit also helped to test the patch on an AMD
machine and a few systems of various sizes and configurations
(multi-socket, single-socket, no hyper threading, etc.). No
issues were seen.

Tested-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: srostedt@redhat.com
Cc: toshi.kani@hp.com
Cc: imammedo@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409039025-32310-1-git-send-email-tianyu.lan@intel.com
[ Improved a few minor details in the code, cleaned up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 15:02:06 +02:00
Stephane Eranian
521e8bac67 perf/x86/intel/uncore: Update support for client uncore IMC PMU
This patch restructures the memory controller (IMC) uncore PMU support
for client SNB/IVB/HSW processors. The main change is that it can now
cope with more than one PCI device ID per processor model. There are
many flavors of memory controllers for each processor. They have
different PCI device ID, yet they behave the same w.r.t. the memory
controller PMU that we are interested in.

The patch now supports two distinct memory controllers for IVB
processors: one for mobile, one for desktop.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140917090616.GA11281@quad
Cc: ak@linux.intel.com
Cc: kan.liang@intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:25 +02:00
Andi Kleen
b10fc1c3e3 perf/x86/intel/uncore: Fix PCU filter setup for Sandy/Ivy/Haswell EP
The PCU frequency band filters use 8 bit each in a register.
When setting up the value the shift value was not correctly
scaled, which resulted in all filters except for band 0 to
be zero. Fix the scaling.

This allows to correctly monitor multiple uncore frequency bands.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409872109-31645-5-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:24 +02:00
Andi Kleen
7e96ae1a89 perf/x86/intel/uncore: Add missing cbox filter flags on IvyBridge-EP uncore driver
The IvyBridge-EP uncore driver was missing three filter flags:
NC, ISOC, C6 which are useful in some cases. Support them in the same way
as the Haswell EP driver, by allowing to set them and exposing
them in the sysfs formats.

Also fix a typo in a define.

Relies on the Haswell EP driver to be applied earlier.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1409872109-31645-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:23 +02:00
Yan, Zheng
513d793e5f perf/x86/intel/uncore: Register the PMU only if the uncore pci device exists
Current code registers PMUs for all possible uncore pci devices.
This is not good because, on some machines, one or more uncore pci
devices can be missing. The missing pci device make corresponding
PMU unusable. Register the PMU only if the uncore device exists.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409872109-31645-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:22 +02:00
Yan, Zheng
e735b9db12 perf/x86/intel/uncore: Add Haswell-EP uncore support
The uncore subsystem in Haswell-EP is similar to Sandy/Ivy
Bridge-EP. There are some differences in config register
encoding and pci device IDs. The Haswell-EP uncore also
supports a few new events. Add the Haswell-EP driver to
the snbep split driver.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
[ Add missing break. Add imc events. Add cbox nc/isoc/c6. ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409872109-31645-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:21 +02:00
Andi Kleen
fdda3c4aac perf/x86/intel: Use Broadwell cache event list for Haswell
Use the newly added Broadwell cache event list for Haswell too.
All Haswell and Broadwell events and offcore masks used in these lists
are identical.

However Haswell is very different from the Sandy Bridge
list that was used previously. That fixes a wide range of mis-counting
cache events.

The node events are now only for retired memory events, so prefetching
and speculative memory accesses are not included. They are PEBS
capable now, which makes it much easier to sample for them, plus it's
possible to create address maps with -d.

The prefetch events are gone now. They way the hardware counts
them is very misleading (some prefetches included, others not), so
it seemed best to leave them out.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409683455-29168-5-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:20 +02:00
Andi Kleen
c46e665f03 perf/x86: Add INST_RETIRED.ALL workarounds
On Broadwell INST_RETIRED.ALL cannot be used with any period
that doesn't have the lowest 6 bits cleared. And the period
should not be smaller than 128.

Add a new callback to enforce this, and set it for Broadwell.

This is erratum BDM57 and BDM11.

How does this handle the case when an app requests a specific
period with some of the bottom bits set

The apps thinks it is sampling at X occurences per sample, when it is
in fact at X - 63 (worst case).

Short answer:

Any useful instruction sampling period needs to be 4-6 orders
of magnitude larger than 128, as an PMI every 128 instructions
would instantly overwhelm the system and be throttled.
So the +-64 error from this is really small compared to the
period, much smaller than normal system jitter.

Long answer:

<write up by Peter:>

IFF we guarantee perf_event_attr::sample_period >= 128.

Suppose we start out with sample_period=192; then we'll set period_left
to 192, we'll end up with left = 128 (we truncate the lower bits). We
get an interrupt, find that period_left = 64 (>0 so we return 0 and
don't get an overflow handler), up that to 128. Then we trigger again,
at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get
an overflow). We increment with sample_period so we get left = 128. We
fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an
overflow). And on and on.

So while the individual interrupts are 'wrong' we get then with
interval=256,128 in exactly the right ratio to average out at 192. And
this works for everything >=128.

So the num_samples*fixed_period thing is still entirely correct +- 127,
which is good enough I'd say, as you already have that error anyhow.

So no need to 'fix' the tools, al we need to do is refuse to create
INST_RETIRED:ALL events with sample_period < 128.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: Mark Davies <junk@eslaf.co.uk>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1409683455-29168-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:19 +02:00
Andi Kleen
86a349a28b perf/x86/intel: Add Broadwell core support
Add Broadwell support for Broadwell Client to perf.  This is very
similar to Haswell.  It uses a new cache event table, because there
were various changes there.

The constraint list has one new event that needs to be handled over
Haswell.

The PEBS event list is the same, so we reuse Haswell's.

[fengguang.wu: make intel_bdw_event_constraints[] static]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409683455-29168-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:18 +02:00
Andi Kleen
d86c8eaf95 perf/x86/intel: Document all Haswell models
Add names for each Haswell model as requested by Peter.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409683455-29168-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:16 +02:00
Andi Kleen
b76146851e perf/x86/intel: Remove incorrect model number from Haswell perf
71 is a Broadwell, not a Haswell. The model number was added
by mistake earlier.

Remove it for now, until it can be re-added later with
real Broadwell support.

In practice it does not cause a lot of issues because the Broadwell
PMU is very similar to Haswell, but some details were wrong,
and it's better to handle it correctly.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1409683455-29168-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:15 +02:00
Dave Hansen
cebf15eb09 x86, sched: Add new topology for multi-NUMA-node CPUs
I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS.  It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:

  161270fc1f ("x86/smp: Fix topology checks on AMD MCM CPUs")

Both these Intel and AMD systems break an assumption which is
being enforced by topology_sane(): a socket may not contain more
than one NUMA node.

AMD special-cased their system by looking for a cpuid flag.  The
Intel mode is dependent on BIOS options and I do not know of a
way which it is enumerated other than the tables being parsed
during the CPU bringup process.  In other words, we have to trust
the ACPI tables <shudder>.

This detects the situation where a NUMA node occurs at a place in
the middle of the "CPU" sched domains.  It replaces the default
topology with one that relies on the NUMA information from the
firmware (SRAT table) for all levels of sched domains above the
hyperthreads.

This also fixes a sysfs bug.  We used to freak out when we saw
the "mc" group cross a node boundary, so we stopped building the
MC group.  MC gets exported as the 'core_siblings_list' in
/sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
the same 'physical_package_id' to not be listed together in
'core_siblings_list'.  This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this patch, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35

Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Added LLC domain and s/match_mc/match_die/ ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: brice.goglin@gmail.com
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:14 +02:00
Mathias Krause
615f77511e x86/PCI: Mark PCI BIOS initialization code as such
The pci_find_bios() function is only ever called from initialization code,
therefore can be marked as such, too.  This, in turn, allows marking other
functions called only in this context as well.

The bios32_indirect variable can be marked as __initdata as it is only
referenced from __init functions now.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 06:46:27 -06:00
Mathias Krause
6af13bac77 x86/PCI: Constify pci_mmcfg_probes[] array
The pci_mmcfg_probes[] array is only ever read, therefore make it const.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 06:46:22 -06:00
Mathias Krause
776f7ad632 x86/PCI: Mark constants of pci_mmcfg_nvidia_mcp55() as __initconst
The constants in pci_mmcfg_nvidia_mcp55() need to be marked as __initconst
or they will remain in memory after init memory was released.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 06:46:17 -06:00
Mathias Krause
64474b5235 x86/PCI: Move __init annotation to the correct place
According to include/linux/init.h, the __init annotation should be added
immediately before the function name.  However, for quite a few functions
in mmconfig-shared.c this is not the case.  It's either before the return
type or even in the middle of it.  Beside gcc still getting it right, we
should change them to comply to the rules of include/linux/init.h.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 06:46:01 -06:00
Tang Chen
c24ae0dcd3 kvm: x86: Unpin and remove kvm_arch->apic_access_page
In order to make the APIC access page migratable, stop pinning it in
memory.

And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page.  When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:01 +02:00
Tang Chen
38b9917350 kvm: vmx: Implement set_apic_access_page_addr
Currently, the APIC access page is pinned by KVM for the entire life
of the guest.  We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.

This patch prepares to handle this for the case where there is no nested
virtualization, or where the nested guest does not have an APIC page of
its own.  All accesses to kvm->arch.apic_access_page are changed to go
through kvm_vcpu_reload_apic_access_page.

If the APIC access page is invalidated when the host is running, we update
the VMCS in the next guest entry.

If it is invalidated when the guest is running, the MMU notifier will force
an exit, after which we will handle everything as in the previous case.

If it is invalidated when a nested guest is running, the request will update
either the VMCS01 or the VMCS02.  Updating the VMCS01 is done at the
next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:01 +02:00
Tang Chen
4256f43f9f kvm: x86: Add request bit to reload APIC access page address
Currently, the APIC access page is pinned by KVM for the entire life
of the guest.  We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.

This patch prepares to handle this in generic code, through a new
request bit (that will be set by the MMU notifier) and a new hook
that is called whenever the request bit is processed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:00 +02:00
Tang Chen
fe71557afb kvm: Add arch specific mmu notifier for page invalidation
This will be used to let the guest run while the APIC access page is
not pinned.  Because subsequent patches will fill in the function
for x86, place the (still empty) x86 implementation in the x86.c file
instead of adding an inline function in kvm_host.h.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:59 +02:00
Andres Lagar-Cavilla
5712846808 kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:58 +02:00
Andres Lagar-Cavilla
8a9522d2fe kvm/x86/mmu: Pass gfn and level to rmapp callback.
Callbacks don't have to do extra computation to learn what the caller
(lvm_handle_hva_range()) knows very well. Useful for
debugging/tracing/printk/future.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:57 +02:00
Paolo Bonzini
c1118b3602 x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.
In that case, KVM will fail to patch VMCALL instructions to VMMCALL
as required on AMD processors.

The failure mode is currently a divide-by-zero exception, which obviously
is a KVM bug that has to be fixed.  However, picking the right instruction
between VMCALL and VMMCALL will be faster and will help if you cannot upgrade
the hypervisor.

Reported-by: Chris Webb <chris@arachsys.com>
Tested-by: Chris Webb <chris@arachsys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:57 +02:00
Chen Yucong
81760dccf8 kvm: x86: use macros to compute bank MSRs
Avoid open coded calculations for bank MSRs by using well-defined
macros that hide the index of higher bank MSRs.

No semantic changes.

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:56 +02:00
Nadav Amit
d5262739cb KVM: x86: Remove debug assertion of non-PAE reserved bits
Commit 346874c950 ("KVM: x86: Fix CR3 reserved bits") removed non-PAE
reserved bits which were not according to Intel SDM.  However, residue was left
in a debug assertion (CR3_NONPAE_RESERVED_BITS).  Remove it.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:55 +02:00
Tiejun Chen
b461966063 kvm: x86: fix two typos in comment
s/drity/dirty and s/vmsc01/vmcs01

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:53 +02:00
Nadav Amit
4566654bb9 KVM: vmx: Inject #GP on invalid PAT CR
Guest which sets the PAT CR to invalid value should get a #GP.  Currently, if
vmx supports loading PAT CR during entry, then the value is not checked.  This
patch makes the required check in that case.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:52 +02:00
Nadav Amit
040c8dc8a5 KVM: x86: emulating descriptor load misses long-mode case
In 64-bit mode a #GP should be delivered to the guest "if the code segment
descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit
set and the D-bit clear." - Intel SDM "Interrupt 13—General Protection
Exception (#GP)".

This patch fixes the behavior of CS loading emulation code. Although the
comment says that segment loading is not supported in long mode, this function
is executed in long mode, so the fix is necassary.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:52 +02:00
Liang Chen
77c3913b74 KVM: x86: directly use kvm_make_request again
A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:51 +02:00
Radim Krčmář
a70656b63a KVM: x86: count actual tlb flushes
- we count KVM_REQ_TLB_FLUSH requests, not actual flushes
  (KVM can have multiple requests for one flush)
- flushes from kvm_flush_remote_tlbs aren't counted
- it's easy to make a direct request by mistake

Solve these by postponing the counting to kvm_check_request().

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:50 +02:00
Marcelo Tosatti
bc6134942d KVM: nested VMX: disable perf cpuid reporting
Initilization of L2 guest with -cpu host, on L1 guest with -cpu host
triggers:

(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported

Nested VMX MSR load/store support is not sufficient to
allow perf for L2 guest.

Until properly fixed, trap CPUID and disable function 0xA.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:50 +02:00
Nadav Amit
a2b9e6c1a3 KVM: x86: Don't report guest userspace emulation error to userspace
Commit fc3a9157d3 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures to
userspace due to race-condition between a vmexit and the instruction emulator.
The same rational applies also to userspace applications that are permitted by
the guest OS to access MMIO area or perform PIO.

This patch extends the current behavior - of injecting a #UD instead of
reporting it to userspace - also for guest userspace code.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:49 +02:00
Paolo Bonzini
1f755a8275 kvm: Make init_rmode_tss() return 0 on success.
In init_rmode_tss(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_set_tss_addr(), and ret is redundant.

This patch removes the redundant variable, by making init_rmode_tss()
return 0 on success, -errno on failure.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:48 +02:00
Nadav Amit
dd598091de KVM: x86: Warn if guest virtual address space is not 48-bits
The KVM emulator code assumes that the guest virtual address space (in 64-bit)
is 48-bits wide.  Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctl if
userspace tries to create a guest that does not obey this restriction.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:48 +02:00
Matt Fleming
56394ab8c2 x86/efi: Delete misleading efi_printk() error message
A number of people are reporting seeing the "setup_efi_pci() failed!"
error message in what used to be a quiet boot,

  https://bugzilla.kernel.org/show_bug.cgi?id=81891

The message isn't all that helpful because setup_efi_pci() can return a
non-success error code for a variety of reasons, not all of them fatal.

Let's drop the return code from setup_efi_pci*() altogether, since
there's no way to process it in any meaningful way outside of the inner
__setup_efi_pci*() functions.

Reported-by: Darren Hart <dvhart@linux.intel.com>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Ulf Winkelvos <ulf@winkelvos.de>
Cc: Andre Müller <andre.muller@web.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-09-24 12:46:59 +01:00
Andy Lutomirski
f12c1f9002 x86/vdso: Fix vdso2c's special_pages[] error checking
Stephen Rothwell's compiler did something amazing: it unrolled a
loop, discovered that one iteration of that loop contained an
always-true test, and emitted a warning that will IMO only serve
to convince people to disable the warning.

That bogus warning caused me to wonder what prompted such an
absurdity from his compiler, and I discovered that the code in
question was, in fact, completely wrong -- I was looking things
up in the wrong array.

This affects 3.16 as well, but the only effect is to screw up
the error checking a bit.  vdso2c's output is unaffected.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/53d96ad5.80ywqrbs33ZBCQej%25akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 09:55:38 +02:00
Matt Fleming
84be880560 Revert "efi/x86: efistub: Move shared dependencies to <asm/efi.h>"
This reverts commit f23cf8bd5c ("efi/x86: efistub: Move shared
dependencies to <asm/efi.h>") as well as the x86 parts of commit
f4f75ad574 ("efi: efistub: Convert into static library").

The road leading to these two reverts is long and winding.

The above two commits were merged during the v3.17 merge window and
turned the common EFI boot stub code into a static library. This
necessitated making some symbols global in the x86 boot stub which
introduced new entries into the early boot GOT.

The problem was that we weren't fixing up the newly created GOT entries
before invoking the EFI boot stub, which sometimes resulted in hangs or
resets. This failure was reported by Maarten on his Macbook pro.

The proposed fix was commit 9cb0e39423 ("x86/efi: Fixup GOT in all
boot code paths"). However, that caused issues for Linus when booting
his Sony Vaio Pro 11. It was subsequently reverted in commit
f3670394c2.

So that leaves us back with Maarten's Macbook pro not booting.

At this stage in the release cycle the least risky option is to revert
the x86 EFI boot stub to the pre-merge window code structure where we
explicitly #include efi-stub-helper.c instead of linking with the static
library. The arm64 code remains unaffected.

We can take another swing at the x86 parts for v3.18.

Conflicts:
	arch/x86/include/asm/efi.h

Tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Tested-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Tested-by: Leif Lindholm <leif.lindholm@linaro.org> [arm64]
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>,
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-09-23 22:01:55 +01:00
Richard Guy Briggs
b4f0d3755c audit: x86: drop arch from __audit_syscall_entry() interface
Since the arch is found locally in __audit_syscall_entry(), there is no need to
pass it in as a parameter.  Delete it from the parameter list.

x86* was the only arch to call __audit_syscall_entry() directly and did so from
assembly code.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-audit@redhat.com
Signed-off-by: Eric Paris <eparis@redhat.com>

---

As this patch relies on changes in the audit tree, I think it
appropriate to send it through my tree rather than the x86 tree.
2014-09-23 16:21:28 -04:00
Eric Paris
91397401bb ARCH: AUDIT: audit_syscall_entry() should not require the arch
We have a function where the arch can be queried, syscall_get_arch().
So rather than have every single piece of arch specific code use and/or
duplicate syscall_get_arch(), just have the audit code use the
syscall_get_arch() code.

Based-on-patch-by: Richard Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: linux-alpha@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: microblaze-uclinux@itee.uq.edu.au
Cc: linux-mips@linux-mips.org
Cc: linux@lists.openrisc.net
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: linux-xtensa@linux-xtensa.org
Cc: x86@kernel.org
2014-09-23 16:21:26 -04:00
Eric Paris
4b4665e13c UM: implement syscall_get_arch()
This patch defines syscall_get_arch() for the um platform.  It adds a
new syscall.h header file to define this.  It copies the HOST_AUDIT_ARCH
definition from ptrace.h.  (that definition will be removed when we
switch audit to use this new syscall_get_arch() function)

Based-on-patch-by: Richard Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
2014-09-23 16:20:02 -04:00
David S. Miller
1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
David Vrabel
f955371ca9 x86: remove the Xen-specific _PAGE_IOMAP PTE flag
The _PAGE_IO_MAP PTE flag was only used by Xen PV guests to mark PTEs
that were used to map I/O regions that are 1:1 in the p2m.  This
allowed Xen to obtain the correct PFN when converting the MFNs read
from a PTE back to their PFN.

Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn()
returns the correct PFN by using a combination of the m2p and p2m to
determine if an MFN corresponds to a 1:1 mapping in the the p2m.

Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for
future uses of the PTE flag.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
2014-09-23 13:36:20 +00:00
David Vrabel
7f2f882245 x86/xen: do not use _PAGE_IOMAP PTE flag for I/O mappings
Since mfn_to_pfn() returns the correct PFN for identity mappings (as
used for MMIO regions), the use of _PAGE_IOMAP is not required in
pte_mfn_to_pfn().

Do not set the _PAGE_IOMAP flag in pte_pfn_to_mfn() and do not use it
in pte_mfn_to_pfn().

This will allow _PAGE_IOMAP to be removed, making it available for
future use.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-09-23 13:36:20 +00:00
David Vrabel
3166851142 x86: skip check for spurious faults for non-present faults
If a fault on a kernel address is due to a non-present page, then it
cannot be the result of stale TLB entry from a protection change (RO
to RW or NX to X).  Thus the pagetable walk in spurious_fault() can be
skipped.

See the initial if in spurious_fault() and the tests in
spurious_fault_check()) for the set of possible error codes checked
for spurious faults.  These are:

         IRUWP
Before   x00xx && ( 1xxxx || xxx1x )
After  ( 10001 || 00011 ) && ( 1xxxx || xxx1x )

Thus the new condition is a subset of the previous one, excluding only
non-present faults (I == 1 and W == 1 are mutually exclusive).

This avoids spurious_fault() oopsing in some cases if the pagetables
it attempts to walk are not accessible.  This obscures the location of
the original fault.

This also fixes a crash with Xen PV guests when they access entries in
the M2P corresponding to device MMIO regions.  The M2P is mapped
(read-only) by Xen into the kernel address space of the guest and this
mapping may contains holes for non-RAM regions.  Read faults will
result in calls to spurious_fault(), but because the page tables for
the M2P mappings are not accessible by the guest the pagetable walk
would fault.

This was not normally a problem as MMIO mappings would not normally
result in a M2P lookup because of the use of the _PAGE_IOMAP bit the
PTE.  However, removing the _PAGE_IOMAP bit requires M2P lookups for
MMIO mappings as well.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
2014-09-23 13:36:20 +00:00
Daniel Kiper
342cd340f6 xen/efi: Directly include needed headers
I discovered that some needed stuff is defined/declared in headers
which are not included directly. Currently it works but if somebody
remove required headers from currently included headers then build
will break. So, just in case directly include all needed headers.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-09-23 13:36:20 +00:00
Matt Rushton
4fbb67e3c8 xen/setup: Remap Xen Identity Mapped RAM
Instead of ballooning up and down dom0 memory this remaps the existing mfns
that were replaced by the identity map. The reason for this is that the
existing implementation ballooned memory up and and down which caused dom0
to have discontiguous pages. In some cases this resulted in the use of bounce
buffers which reduced network I/O performance significantly. This change will
honor the existing order of the pages with the exception of some boundary
conditions.

To do this we need to update both the Linux p2m table and the Xen m2p table.
Particular care must be taken when updating the p2m table since it's important
to limit table memory consumption and reuse the existing leaf pages which get
freed when an entire leaf page is set to the identity map. To implement this,
mapping updates are grouped into blocks with table entries getting cached
temporarily and then released.

On my test system before:
Total pages: 2105014
Total contiguous: 1640635

After:
Total pages: 2105014
Total contiguous: 2098904

Signed-off-by: Matthew Rushton <mrushton@amazon.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-09-23 13:36:18 +00:00
Josh Triplett
3cf6b0151b Merge branches 'tiny/bloat-o-meter-no-SyS', 'tiny/more-procless', 'tiny/no-advice', 'tiny/tinyconfig' and 'tiny/x86-boot-compressed-use-yn' into tiny/next 2014-09-22 23:14:40 -07:00
Linus Torvalds
f3670394c2 Revert "x86/efi: Fixup GOT in all boot code paths"
This reverts commit 9cb0e39423.

It causes my Sony Vaio Pro 11 to immediately reboot at startup.

Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-22 23:05:49 -07:00
Mathias Krause
4ac9cbfa35 x86/PCI: Mark DMI tables as initialization data
The DMI tables are only used in __init code, thereby can be marked as
initialization data, too.  The same is true for the callback functions
referenced from the DMI tables.

This moves ~9.6 kB of code and r/o data to the init sections, marking the
memory for release after initialization.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-22 14:24:41 -06:00
Zubair Lutfullah Kakakhel
8057b30814 x86: use generic dma-contiguous.h
dma-contiguous.h is now in asm-generic. Use that to avoid code
repetition in x86.

Signed-off-by: Zubair Lutfullah Kakakhel <Zubair.Kakakhel@imgtec.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: catalin.marinas@arm.com
Cc: will.deacon@arm.com
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: arnd@arndb.de
Cc: gregkh@linuxfoundation.org
Cc: m.szyprowski@samsung.com
Cc: x86@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-arch@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7359/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-09-22 13:35:52 +02:00
Linus Torvalds
b29f83aa8b PCI updates for v3.17:
Enumeration
     - Don't default exclusively to first video device (Bruno Prémont)
 
   PCI device hotplug
     - Remove "no hotplug settings from platform" warning (Bjorn Helgaas)
     - Add pci_ignore_hotplug() for VGA switcheroo (Bjorn Helgaas)
 
   Freescale i.MX6
     - Put LTSSM in "Detect" state before disabling (Lucas Stach)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUG7o9AAoJEFmIoMA60/r8hbYP/3gR3xHd2QKpkmBcM1lf1yiR
 osQQnAfRqEO4fzrpmOYrYbLIAOPwanK6Y36rmIYB+wHU2SUaffV7ZI9uW32shTud
 09+1N+OrSS6fwzVUWOuKsf1kv/jxpS+ic2fb+Qe1OXwJh5G+z1D9Kvd2EPLJdlgK
 ySyX4zSTrLni8CoclzREO7u82VVO5rTdvbujBxuvpOQTOdD5TFqV/uhb/y3gQz+u
 sG6IxUbdXsy4r24C6OnPrmmZ1Rk/lgCMyA+QSozc5Eu5PdGzcY9a6gcKlTnsbwBs
 qYLAb+/KCa3KgQh07NYmFfYdpoMZUXgSsEtD8gyvfJQHwUYwW8rsEMKxlSCQrzYr
 0OrpBSVTO6ta1r8SKOWtSYETQgPE3GUiJR1DuCyV+55RLZYp6Q8zH6dbgfWQbA/g
 R/kWHihR/tcD9YIlT99QrBppZtvG5nZ3y7aLSqdYYxEJqHE0tlbuxAu8hgwDf3Qp
 lKZJMyadLB1MS9lnrMj8DYqIOKbe62LOwcEYzhMJzaq8vCy+JWtjxOOgwBkT7P5v
 bhhYh3eqi5/MBONtw52V6RDUQId9vOLGHoiM5/akG4FFmWdhO9S0SbMBhAJyazKT
 n3IP5yj657XAi/fK939PUCQ3YuT5GqlCNcXDCqUzBZhGnt/Ln7LPmQI323519Lp4
 3vI3irFT0Fu2IFEBhc6l
 =xIYe
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI fixes from Bjorn Helgaas:
 "These fix:

   - Boot video device detection on dual-GPU Apple systems
   - Hotplug fiascos on VGA switcheroo with radeon & nouveau drivers
   - Boot hang on Freescale i.MX6 systems
   - Excessive "no hotplug settings from platform" warnings

  In particular:

  Enumeration
    - Don't default exclusively to first video device (Bruno Prémont)

  PCI device hotplug
    - Remove "no hotplug settings from platform" warning (Bjorn Helgaas)
    - Add pci_ignore_hotplug() for VGA switcheroo (Bjorn Helgaas)

  Freescale i.MX6
    - Put LTSSM in "Detect" state before disabling (Lucas Stach)"

* tag 'pci-v3.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  vgaarb: Drop obsolete #ifndef
  vgaarb: Don't default exclusively to first video device with mem+io
  ACPIPHP / radeon / nouveau: Remove acpi_bus_no_hotplug()
  PCI: Remove "no hotplug settings from platform" warning
  PCI: Add pci_ignore_hotplug() to ignore hotplug events for a device
  PCI: imx6: Put LTSSM in "Detect" state before disabling it
  MAINTAINERS: Add Lucas Stach as co-maintainer for i.MX6 PCI driver
2014-09-19 10:50:30 -07:00
Linus Torvalds
598a0c7d09 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Two kernel side fixes: a kprobes fix and a perf_remove_from_context()
  fix (which does not yet fix the migration bug which is WIP)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix a race condition in perf_remove_from_context()
  kprobes/x86: Free 'optinsn' cache when range check fails
2014-09-19 10:31:36 -07:00
Linus Torvalds
7a5e87867e Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes:

  EFI fixes, a build fix, a page table dumping (debug) fix and a clang
  build fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/arm64: Fix fdt-related memory reservation
  x86/mm: Apply the section attribute to the variable, not its type
  x86/efi: Fixup GOT in all boot code paths
  x86/efi: Only load initrd above 4g on second try
  x86-64, ptdump: Mark espfix area only if existent
  x86, irq: Fix build error caused by 9eabc99a63
2014-09-19 09:07:47 -07:00
David E. Box
ed2226bd4d x86/platform/intel/iosf: Add debugfs config option for IOSF
Makes the IOSF sideband available through debugfs. Allows
developers to experiment with using the sideband to provide
debug and analytical tools for units on the SoC.

Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1411017231-20807-4-git-send-email-david.e.box@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 13:08:43 +02:00
David E. Box
ced3ce760b x86/platform/intel/iosf: Add better description of IOSF driver in config
Adds better description of IOSF driver to determine when it
should be enabled. Also moves the Kconfig option to "Processor
type and features" menu from main configuration menu.

Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1411017231-20807-3-git-send-email-david.e.box@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 13:08:42 +02:00
David E. Box
849f5d8943 x86/platform/intel/iosf: Add Braswell PCI ID
Add Braswell PCI ID to list of supported ID's for the IOSF driver.

Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1411017231-20807-2-git-send-email-david.e.box@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 13:08:42 +02:00
Kees Cook
0cacbfbeb5 x86/kaslr: Avoid the setup_data area when picking location
The KASLR location-choosing logic needs to avoid the setup_data
list memory areas as well. Without this, it would be possible to
have the ASLR position stomp on the memory, ultimately causing
the boot to fail.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Baoquan He <bhe@redhat.com>
Cc: stable@vger.kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140911161931.GA12001@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 13:04:29 +02:00
Martin Kelly
9575a6a23a x86/platform/pmc_atom: Fix warning when CONFIG_DEBUG_FS=n
When compiling with CONFIG_DEBUG_FS=n, GCC emits an unused
variable warning for pmc_atom.c because "ret" is used only
within the CONFIG_DEBUG_FS block.

This patch adds a dummy #ifdef for pmc_dbgfs_register() when
CONFIG_DEBUG_FS=n to simplify the code and remove the warning.

Signed-off-by: Martin Kelly <martkell@amazon.com>
Acked-by: "Li, Aubrey" <aubrey.li@linux.intel.com>
Cc: vishwesh.m.rudramuni@intel.com
Link: http://lkml.kernel.org/r/1410963476-8360-1-git-send-email-martin@martingkelly.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 13:02:21 +02:00
Rakib Mullick
d286c3af48 x86/mce: Avoid showing repetitive message from intel_init_thermal()
intel_init_thermal() is called from a) at the time of system initializing
and b) at the time of system resume to initialize thermal
monitoring.

In case when thermal monitoring is handled by SMI, we get to know it via
printk(). Currently it gives the message at both cases, but its okay if
we get it only once and no need to get the same message at every time
system resumes.

So, limit showing this message only at system boot time by avoid showing
at system resume and reduce abusing kernel log buffer.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1411068135.5121.10.camel@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:56:05 +02:00
Ingo Molnar
43657ffb79 * Increase the number of early_ioremap slots to fix a regression with
earlyprintk=efi after recent changes to the ACPI code - Dave Young
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUG/pjAAoJEC84WcCNIz1VQToP+wfz21mZLPyurzCXycbrXi3I
 h2Jg99HhCe2TL1+9slSEJZvCFGsnspbkZtfA8OB0Ms7eomt3mY9yTwsfMVfyUZRM
 2DAWjK0eg9ZEAXcmU0tGbI4Iy9Kf/U0OPEsZ8ZKJkG1QGA+pHrKqhfdrKzP/pVV6
 iQ94OgBa1BZqljYureERvIourghN6hcwlJHqTLF66kzbFtL//2NQJIidHBcJRO0k
 CA/2U65WNEFXAtUenHHUnSM8hhN5H2NiSxPKg6p4iK6nsNcBL80rhEmMHYW80VHp
 PP7lXoCAbGARIfXbjPPNvn7aHL0qzpsF3fvNrtQRSkEwGjEod8gxHAqqzguLrK58
 G0HovhfUvEADXzjDiWCWUAIT/ryHutJLxXo6Fn2UijtAso+W/wUorqJlKNDG1dHX
 PRJKO0kk7TCKPBYxNRkds9Gt8g8uEoInr+V13qvURjwb6H9hFd8fyuL25bRUftdV
 TE4/pzm5F9YjLU2548RfXKcz5snSuOT9EYf3s4C8wTtm4tJv7GxQ+/vz000Hres3
 YlWaY2CywsrZfdBZaaF9gEtsjsdBVH+PspOGtodY0jjnz/I71H7oZNdgb8vslRLS
 6iH9+swCEm6cdytzqFXO1qp7DKLpxZJmwTHKhAvdSYzXZBvf1JMCQ6zdoXNfdNRR
 R8vYh1pvBd1tRJQ9tpJe
 =t4a9
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent

Pull EFI fix from Matt Fleming:

  * Increase the number of early_ioremap() slots to fix a regression with
    earlyprintk=efi after recent changes to the ACPI code (Dave Young)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:49:20 +02:00
Aaron Tomlin
a70857e46d sched: Add helper for task stack page overrun checking
This facility is used in a few places so let's introduce
a helper function to improve code readability.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: oleg@redhat.com
Cc: riel@redhat.com
Cc: prarit@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: mpe@ellerman.id.au
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-3-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:23 +02:00
Aaron Tomlin
d4311ff1a8 init/main.c: Give init_task a canary
Tasks get their end of stack set to STACK_END_MAGIC with the
aim to catch stack overruns. Currently this feature does not
apply to init_task. This patch removes this restriction.

Note that a similar patch was posted by Prarit Bhargava
some time ago but was never merged:

  http://marc.info/?l=linux-kernel&m=127144305403241&w=2

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:22 +02:00
Tang Chen
f51770ed46 kvm: Make init_rmode_identity_map() return 0 on success.
In init_rmode_identity_map(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_create_vcpu(), and ret is redundant.

This patch removes the redundant variable, and makes init_rmode_identity_map()
return 0 on success, -errno on failure.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-17 13:10:12 +02:00
Tang Chen
a255d4795f kvm: Remove ept_identity_pagetable from struct kvm_arch.
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.

In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized

Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.

NOTE: In the original code, ept identity pagetable page is pinned in memroy.
      As a result, it cannot be migrated/hot-removed. After this patch, since
      kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
      is no longer pinned in memory. And it can be migrated/hot-removed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-17 13:10:11 +02:00
Bruno Prémont
86fd887b7f vgaarb: Don't default exclusively to first video device with mem+io
Commit 20cde69402 ("x86, ia64: Move EFI_FB vga_default_device()
initialization to pci_vga_fixup()") moved boot video device detection from
efifb to x86 and ia64 pci/fixup.c.

For dual-GPU Apple computers above change represents a regression as code
in efifb did forcefully override vga_default_device while the merge did not
(vgaarb happens prior to PCI fixup).

To improve on initial device selection by vgaarb (it cannot know if PCI
device not behind bridges see/decode legacy VGA I/O or not), move the
screen_info based check from pci_video_fixup() to vgaarb's init function and
use it to refine/override decision taken while adding the individual PCI
VGA devices.  This way PCI fixup has no reason to adjust vga_default_device
anymore but can depend on its value for flagging shadowed VBIOS.

This has the nice benefit of removing duplicated code but does introduce a
#if defined() block in vgaarb.  Not all architectures have screen_info and
would cause compile to fail without it.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=84461
Reported-and-Tested-By: Andreas Noever <andreas.noever@gmail.com>
Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Matthew Garrett <matthew.garrett@nebula.com>
CC: stable@vger.kernel.org # v3.5+
2014-09-16 13:06:18 -06:00
Guo Hui Liu
105b21bbf6 KVM: x86: Use kvm_make_request when applicable
This patch replace the set_bit method by kvm_make_request
to make code more readable and consistent.

Signed-off-by: Guo Hui Liu <liuguohui@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-16 14:44:20 +02:00
Igor Mammedov
ce4b1b1650 x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
Hang is observed on virtual machines during CPU hotplug,
especially in big guests with many CPUs. (It reproducible
more often if host is over-committed).

It happens because master CPU gives up waiting on
secondary CPU and allows it to run wild. As result
AP causes locking or crashing system. For example
as described here:

  https://lkml.org/lkml/2014/3/6/257

If master CPU have sent STARTUP IPI successfully,
and AP signalled to master CPU that it's ready
to start initialization, make master CPU wait
indefinitely till AP is onlined.

To ensure that AP won't ever run wild, make it
wait at early startup till master CPU confirms its
intention to wait for AP. If AP doesn't respond in 10
seconds, the master CPU will timeout and cancel
AP onlining.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1403266991-12233-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 11:11:32 +02:00
Rasmus Villemoes
5c63008944 x86/kbuild: Eliminate duplicate command line options
The options -mno-mmx and -mno-sse are unconditionally added to
KBUILD_CFLAGS in both branches of an ifeq and through a
$(cc-option) further down. We can safely remove the first
instances.

In fact, since the -mno-mmx and -mno-sse options were introduced
simultaneous with the other two options in the $(cc-option)
[according to http://www.gnu.org/software/gcc/gcc-3.1/changes.html],
and since the former were unconditionally used, one can deduce that
only gcc versions knowing about all four are supported. So also
eliminate the $(cc-option) wrap.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Peter Foley <pefoley2@pefoley.com>
Link: http://lkml.kernel.org/r/1410365139-24440-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 10:33:02 +02:00
Lee, Chun-Yi
8477957555 x86/mm, hibernate: Do not assume the first e820 area to be RAM
In arch/x86/kernel/setup.c::trim_bios_range(), the codes
introduced by 1b5576e6 (base on d8a9e6a5), it updates the first
4Kb of memory to be E820_RESERVED region. That's because it's a
BIOS owned area but generally not listed in the E820 table:

  e820: BIOS-provided physical RAM map:
  BIOS-e820: [mem 0x0000000000000000-0x0000000000096fff] usable
  BIOS-e820: [mem 0x0000000000097000-0x0000000000097fff] reserved
  ...
  e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
  e820: remove [mem 0x000a0000-0x000fffff] usable

But the region of first 4Kb didn't register to nosave memory:

  PM: Registered nosave memory: [mem 0x00097000-0x00097fff]
  PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]

The code in e820_mark_nosave_regions() assumes the first e820
area to be RAM, so it causes the first 4Kb E820_RESERVED region
ignored when register to nosave. This patch removed assumption
of the first e820 area.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Link: http://lkml.kernel.org/r/1410491038-17576-1-git-send-email-jlee@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 09:54:31 +02:00
Luiz Capitulino
8b375f64dc x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data()
The setup_node_data() function allocates a pg_data_t object,
inserts it into the node_data[] array and initializes the
following fields: node_id, node_start_pfn and
node_spanned_pages.

However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
setup_node_data() is not used.

This causes a small glitch when running Linux as a hyperv numa
guest:

  SRAT: PXM 0 -> APIC 0x00 -> Node 0
  SRAT: PXM 0 -> APIC 0x01 -> Node 0
  SRAT: PXM 1 -> APIC 0x02 -> Node 1
  SRAT: PXM 1 -> APIC 0x03 -> Node 1
  SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
  SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
  SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
  NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
  Initmem setup node 0 [mem 0x00000000-0x7fffffff]
    NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
  Initmem setup node 1 [mem 0x80800000-0x1081fffff]
    NODE_DATA [mem 0x1081ea000-0x1081fdfff]
  crashkernel: memory value expected
   [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
   [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
  Zone ranges:
    DMA      [mem 0x00001000-0x00ffffff]
    DMA32    [mem 0x01000000-0xffffffff]
    Normal   [mem 0x100000000-0x1081fffff]
  Movable zone start for each node
  Early memory node ranges
    node   0: [mem 0x00001000-0x0009efff]
    node   0: [mem 0x00100000-0x7ffeffff]
    node   1: [mem 0x80200000-0xf7ffffff]
    node   1: [mem 0x100000000-0x1081fffff]
  On node 0 totalpages: 524174
    DMA zone: 64 pages used for memmap
    DMA zone: 21 pages reserved
    DMA zone: 3998 pages, LIFO batch:0
    DMA32 zone: 8128 pages used for memmap
    DMA32 zone: 520176 pages, LIFO batch:31
  On node 1 totalpages: 524288
    DMA32 zone: 7672 pages used for memmap
    DMA32 zone: 491008 pages, LIFO batch:31
    Normal zone: 520 pages used for memmap
    Normal zone: 33280 pages, LIFO batch:7

In this dmesg, the SRAT table reports that the memory range for
node 1 starts at 0x80200000.  However, the line starting with
"Initmem" reports that node 1 memory range starts at 0x80800000.
 The "Initmem" line is reported by setup_node_data() and is
wrong, because the kernel ends up using the range as reported in
the SRAT table.

This commit drops all that dead code from setup_node_data(),
renames it to alloc_node_data() and adds a printk() to
free_area_init_node() so that we report a node's memory range
accurately.

Here's the same dmesg section with this patch applied:

   SRAT: PXM 0 -> APIC 0x00 -> Node 0
   SRAT: PXM 0 -> APIC 0x01 -> Node 0
   SRAT: PXM 1 -> APIC 0x02 -> Node 1
   SRAT: PXM 1 -> APIC 0x03 -> Node 1
   SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
   SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
   SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
   NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
   NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
   NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
   crashkernel: memory value expected
    [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
    [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
   Zone ranges:
     DMA      [mem 0x00001000-0x00ffffff]
     DMA32    [mem 0x01000000-0xffffffff]
     Normal   [mem 0x100000000-0x1081fffff]
   Movable zone start for each node
   Early memory node ranges
     node   0: [mem 0x00001000-0x0009efff]
     node   0: [mem 0x00100000-0x7ffeffff]
     node   1: [mem 0x80200000-0xf7ffffff]
     node   1: [mem 0x100000000-0x1081fffff]
   Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
   On node 0 totalpages: 524174
     DMA zone: 64 pages used for memmap
     DMA zone: 21 pages reserved
     DMA zone: 3998 pages, LIFO batch:0
     DMA32 zone: 8128 pages used for memmap
     DMA32 zone: 520176 pages, LIFO batch:31
   Initmem setup node 1 [mem 0x80200000-0x1081fffff]
   On node 1 totalpages: 524288
     DMA32 zone: 7672 pages used for memmap
     DMA32 zone: 491008 pages, LIFO batch:31
     Normal zone: 520 pages used for memmap
     Normal zone: 33280 pages, LIFO batch:7

This commit was tested on a two node bare-metal NUMA machine and
Linux as a numa guest on hyperv and qemu/kvm.

PS: The wrong memory range reported by setup_node_data() seems to be
    harmless in the current kernel because it's just not used.  However,
    that bad range is used in kernel 2.6.32 to initialize the old boot
    memory allocator, which causes a crash during boot.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 08:55:10 +02:00
Yasuaki Ishimatsu
9661d5bcd0 x86/mm/hotplug: Modify PGD entry when removing memory
When hot-adding/removing memory, sync_global_pgds() is called
for synchronizing PGD to PGD entries of all processes MM.  But
when hot-removing memory, sync_global_pgds() does not work
correctly.

At first, sync_global_pgds() checks whether target PGD is none
or not.  And if PGD is none, the PGD is skipped.  But when
hot-removing memory, PGD may be none since PGD may be cleared by
free_pud_table().  So when sync_global_pgds() is called after
hot-removing memory, sync_global_pgds() should not skip PGD even
if the PGD is none.  And sync_global_pgds() must clear PGD
entries of all processes MM.

Currently sync_global_pgds() does not clear PGD entries of all
processes MM when hot-removing memory.  So when hot adding
memory which is same memory range as removed memory after
hot-removing memory, following call traces are shown:

 kernel BUG at arch/x86/mm/init_64.c:206!
 ...
 [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
 [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
 [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
 [<ffffffff815d03d9>] add_memory+0xb9/0x1b0
 [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
 [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
 [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
 [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
 [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
 [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
 [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
 [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
 [<ffffffff8107e02b>] process_one_work+0x17b/0x460
 [<ffffffff8107edfb>] worker_thread+0x11b/0x400
 [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
 [<ffffffff81085aef>] kthread+0xcf/0xe0
 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140

This patch clears PGD entries of all processes MM when
sync_global_pgds() is called after hot-removing memory

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 08:55:09 +02:00
Yasuaki Ishimatsu
5255e0a79f x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable()
When hot-adding memory after hot-removing memory, following call
traces are shown:

  kernel BUG at arch/x86/mm/init_64.c:206!
  ...
 [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
 [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
 [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
 [<ffffffff815d03d9>] add_memory+0xb9/0x1b0
 [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
 [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
 [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
 [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
 [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
 [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
 [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
 [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
 [<ffffffff8107e02b>] process_one_work+0x17b/0x460
 [<ffffffff8107edfb>] worker_thread+0x11b/0x400
 [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
 [<ffffffff81085aef>] kthread+0xcf/0xe0
 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140

The patch-set fixes the issue.

This patch (of 2):

remove_pagetable() gets start argument and passes the argument
to sync_global_pgds().  In this case, the argument must not be
modified.  If the argument is modified and passed to
sync_global_pgds(), sync_global_pgds() does not correctly
synchronize PGD to PGD entries of all processes MM since
synchronized range of memory [start, end] is wrong.

Unfortunately the start argument is modified in
remove_pagetable().  So this patch fixes the issue.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 08:55:08 +02:00
Peter Neubauer
2e151c70df x86: HPET force enable for e6xx based systems
As the Soekris net6501 and other e6xx based systems do not have
any ACPI implementation, HPET won't get enabled.
This patch enables HPET on such platforms.

[    0.430149] pci 0000:00:01.0: Force enabled HPET at 0xfed00000
[    0.644838] HPET: 3 timers in total, 0 timers will be used for per-cpu timer

Original patch by Peter Neubauer (http://www.mail-archive.com/soekris-tech@lists.soekris.com/msg06462.html)
slightly modified by Conrad Kostecki <ck@conrad-kostecki.de> and massaged
accoring to Thomas Gleixners <tglx@linutronix.de> by me.

Suggested-by: Conrad Kostecki <ck@conrad-kostecki.de>
Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Cc: Peter Neubauer <pneubauer@bluerwhite.org>
Link: http://lkml.kernel.org/r/5412D3A5.2030909@lsexperts.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-15 17:53:35 -07:00
Dave Young
3eddc69ffe x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8
3.16 kernel boot fail with earlyprintk=efi, it keeps scrolling at the
bottom line of screen.

Bisected, the first bad commit is below:
commit 86dfc6f339
Author: Lv Zheng <lv.zheng@intel.com>
Date:   Fri Apr 4 12:38:57 2014 +0800

    ACPICA: Tables: Fix table checksums verification before installation.

I did some debugging by enabling both serial and efi earlyprintk, below is
some debug dmesg, seems early_ioremap fails in scroll up function due to
no free slot, see below dmesg output:

  WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:116 __early_ioremap+0x90/0x1c4()
  __early_ioremap(ed00c800, 00000c80) not found slot
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc1+ #204
  Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.15 05/09/2013
  Call Trace:
    dump_stack+0x4e/0x7a
    warn_slowpath_common+0x75/0x8e
    ? __early_ioremap+0x90/0x1c4
    warn_slowpath_fmt+0x47/0x49
    __early_ioremap+0x90/0x1c4
    ? sprintf+0x46/0x48
    early_ioremap+0x13/0x15
    early_efi_map+0x24/0x26
    early_efi_scroll_up+0x6d/0xc0
    early_efi_write+0x1b0/0x214
    call_console_drivers.constprop.21+0x73/0x7e
    console_unlock+0x151/0x3b2
    ? vprintk_emit+0x49f/0x532
    vprintk_emit+0x521/0x532
    ? console_unlock+0x383/0x3b2
    printk+0x4f/0x51
    acpi_os_vprintf+0x2b/0x2d
    acpi_os_printf+0x43/0x45
    acpi_info+0x5c/0x63
    ? __acpi_map_table+0x13/0x18
    ? acpi_os_map_iomem+0x21/0x147
    acpi_tb_print_table_header+0x177/0x186
    acpi_tb_install_table_with_override+0x4b/0x62
    acpi_tb_install_standard_table+0xd9/0x215
    ? early_ioremap+0x13/0x15
    ? __acpi_map_table+0x13/0x18
    acpi_tb_parse_root_table+0x16e/0x1b4
    acpi_initialize_tables+0x57/0x59
    acpi_table_init+0x50/0xce
    acpi_boot_table_init+0x1e/0x85
    setup_arch+0x9b7/0xcc4
    start_kernel+0x94/0x42d
    ? early_idt_handlers+0x120/0x120
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0xf3/0x100

Quote reply from Lv.zheng about the early ioremap slot usage in this case:

"""
In early_efi_scroll_up(), 2 mapping entries will be used for the src/dst screen buffer.
In drivers/acpi/acpica/tbutils.c, we've improved the early table loading code in acpi_tb_parse_root_table().
We now need 2 mapping entries:
1. One mapping entry is used for RSDT table mapping. Each RSDT entry contains an address for another ACPI table.
2. For each entry in RSDP, we need another mapping entry to map the table to perform necessary check/override before installing it.

When acpi_tb_parse_root_table() prints something through EFI earlyprintk console, we'll have 4 mapping entries used.
The current 4 slots setting of early_ioremap() seems to be too small for such a use case.
"""

Thus increase the slot to 8 in this patch to fix this issue.
boot-time mappings become 512 page with this patch.

Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: <stable@vger.kernel.org> # v3.16
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-09-14 15:24:31 +01:00
Linus Torvalds
72d9310460 Make ARCH_HAS_FAST_MULTIPLIER a real config variable
It used to be an ad-hoc hack defined by the x86 version of
<asm/bitops.h> that enabled a couple of library routines to know whether
an integer multiply is faster than repeated shifts and additions.

This just makes it use the real Kconfig system instead, and makes x86
(which was the only architecture that did this) select the option.

NOTE! Even for x86, this really is kind of wrong.  If we cared, we would
probably not enable this for builds optimized for netburst (P4), where
shifts-and-adds are generally faster than multiplies.  This patch does
*not* change that kind of logic, though, it is purely a syntactic change
with no code changes.

This was triggered by the fact that we have other places that really
want to know "do I want to expand multiples by constants by hand or
not", particularly the hash generation code.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-13 11:14:53 -07:00
Frederic Weisbecker
3010279f0f x86: Tell irq work about self IPI support
x86 supports irq work self-IPIs when local apic is available. This is
partly known on runtime so lets implement arch_irq_work_has_interrupt()
accordingly.

This should be safely called after setup_arch().

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:38:29 +02:00
Peter Zijlstra
c5c38ef3d7 irq_work: Introduce arch_irq_work_has_interrupt()
The nohz full code needs irq work to trigger its own interrupt so that
the subsystem can work even when the tick is stopped.

Lets introduce arch_irq_work_has_interrupt() that archs can override to
tell about their support for this ability.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:38:07 +02:00
Linus Torvalds
7ee2d2d671 xen: bug fixes for 3.17-rc4
- Fix for PVHVM suspend/resume and migration
 - Don't pointlessly retry certain ballooning ops
 - Fix gntalloc when grefs have run out.
 - Fix PV boot if KSALR is enable or very large modules are used.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJUEZw4AAoJEFxbo/MsZsTRoHkIAJ5h287Wer2Yf4ALlFI47Esl
 AhcgVKi7WMJ6mBQUk/xpbSS3MZEuTuPJVbuiabJwRaZk9vX7w8q08yrAD8SOrCwY
 6iGooaWEZ96P5DPI5fmWBuOgt5f2wfqzAp//wl/dK/kzr6Kw63huTXa0H0fmd8BY
 Gy13pN4JEPDyvjjZWFvDcrcEUA9eTcTaXQZisBUGuGictXySr5ovz9G3EKOtJP0D
 vkmDzApijFEEjJwZMGazgipng4mwh94fW4+7bs57s6iREpMzOJDTRKrl9suaXdza
 5HA2ed7Au/qL8IwiQAFGxQpMBqNQxEGerlFfPBhRPuGQ+Ek3/pZF3BTU56w/nt4=
 =y/Ol
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.17-b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen bug fixes from David Vrabel:
 - fix for PVHVM suspend/resume and migration
 - don't pointlessly retry certain ballooning ops
 - fix gntalloc when grefs have run out.
 - fix PV boot if KSALR is enable or very large modules are used.

* tag 'stable/for-linus-3.17-b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: don't copy bogus duplicate entries into kernel page tables
  xen/gntalloc: safely delete grefs in add_grefs() undo path
  xen/gntalloc: fix oops after runnning out of grant refs
  xen/balloon: cancel ballooning if adding new memory failed
  xen/manage: Always freeze/thaw processes when suspend/resuming
2014-09-11 16:52:29 -07:00
Dave Hansen
9298b815ef x86: Add more disabled features
The original motivation for these patches was for an Intel CPU
feature called MPX.  The patch to add a disabled feature for it
will go in with the other parts of the support.

But, in the meantime, there are a few other features than MPX
that we can make assumptions about at compile-time based on
compile options.  Add them to disabled-features.h and check them
with cpu_feature_enabled().

Note that this gets rid of the last things that needed an #ifdef
CONFIG_X86_64 in cpufeature.h.  Yay!

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211524.C0EC332A@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-11 14:30:17 -07:00
Dave Hansen
381aa07a9b x86: Introduce disabled-features
I believe the REQUIRED_MASK aproach was taken so that it was
easier to consult in assembly (arch/x86/kernel/verify_cpu.S).
DISABLED_MASK does not have the same restriction, but I
implemented it the same way for consistency.

We have a REQUIRED_MASK... which does two things:
1. Keeps a list of cpuid bits to check in very early boot and
   refuse to boot if those are not present.
2. Consulted during cpu_has() checks, which allows us to
   optimize out things at compile-time.  In other words, if we
   *KNOW* we will not boot with the feature off, then we can
   safely assume that it will be present forever.

But, we don't have a similar mechanism for CPU features which
may be present but that we know we will not use.  We simply
use our existing mechanisms to repeatedly check the status of
the bit at runtime (well, the alternatives patching helps here
but it does not provide compile-time optimization).

Adding a feature to disabled-features.h allows the bit to be
checked via a new macro: cpu_feature_enabled().  Note that
for features in DISABLED_MASK, checks with this macro have
all of the benefits of an #ifdef.  Before, we would have done
this in a header:

#ifdef CONFIG_X86_INTEL_MPX
#define cpu_has_mpx cpu_has(X86_FEATURE_MPX)
#else
#define cpu_has_mpx 0
#endif

and this in the code:

	if (cpu_has_mpx)
		do_some_mpx_thing();

Now, just add your feature to DISABLED_MASK and you can do this
everywhere, and get the same benefits you would have from
#ifdefs:

	if (cpu_feature_enabled(X86_FEATURE_MPX))
		do_some_mpx_thing();

We need a new function and *not* a modification to cpu_has()
because there are cases where we actually need to check the CPU
itself, despite what features the kernel supports.  The best
example of this is a hypervisor which has no control over what
features its guests are using and where the guest does not depend
on the host for support.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211513.9E35E931@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-11 14:30:02 -07:00
Dave Hansen
c8128cceb4 x86: Axe the lightly-used cpu_has_pae
cpu_has_pae is only referenced in one place: the X86_32 kexec
code (in a file not even built on 64-bit).  It hardly warrants
its own macro, or the trouble we go to ensuring that it can't
be called in X86_64 code.

Axe the macro and replace it with a direct cpu feature check.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211511.AD76E774@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-11 14:30:01 -07:00
Paolo Bonzini
a183b638b6 KVM: x86: make apic_accept_irq tracepoint more generic
Initially the tracepoint was added only to the APIC_DM_FIXED case,
also because it reported coalesced interrupts that only made sense
for that case.  However, the coalesced argument is not used anymore
and tracing other delivery modes is useful, so hoist the call out
of the switch statement.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-11 11:51:02 +02:00
Tang Chen
73a6d94162 kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.
We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of
apic access page. So use this macro.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-11 11:10:22 +02:00
Stefan Bader
0b5a50635f x86/xen: don't copy bogus duplicate entries into kernel page tables
When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying
level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

A number of comments were also using the the wrong L3 offset for
level2_kernel_pgt.  These have been corrected.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
2014-09-10 15:23:42 +01:00
Waiman Long
6157c7e1bb locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
This patch removes the unused asm/rwlock.h and rwlock.S files.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1408037251-45918-3-git-send-email-Waiman.Long@hp.com
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-10 11:46:39 +02:00
Waiman Long
2ff810a7ef locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
As the x86 architecture now uses qrwlock for its read/write lock
implementation, it is no longer necessary to keep the old rwlock code
around. This patch removes the old rwlock code in the asm/spinlock.h
and asm/spinlock_types.h files. Now the ARCH_USE_QUEUE_RWLOCK
config parameter cannot be removed from x86/Kconfig or there will be
a compilation error.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1408037251-45918-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-10 11:46:38 +02:00
Daniel Borkmann
286aad3c40 net: bpf: be friendly to kmemcheck
Reported by Mikulas Patocka, kmemcheck currently barks out a
false positive since we don't have special kmemcheck annotation
for bitfields used in bpf_prog structure.

We currently have jited:1, len:31 and thus when accessing len
while CONFIG_KMEMCHECK enabled, kmemcheck throws a warning that
we're reading uninitialized memory.

As we don't need the whole bit universe for pages member, we
can just split it to u16 and use a bool flag for jited instead
of a bitfield.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 16:58:56 -07:00
Daniel Borkmann
738cbe72ad net: bpf: consolidate JIT binary allocator
Introduced in commit 314beb9bca ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") and later on replicated in aa2d2c73c2
("s390/bpf,jit: address randomize and write protect jit code") for
s390 architecture, write protection for BPF JIT images got added and
a random start address of the JIT code, so that it's not on a page
boundary anymore.

Since both use a very similar allocator for the BPF binary header,
we can consolidate this code into the BPF core as it's mostly JIT
independant anyway.

This will also allow for future archs that support DEBUG_SET_MODULE_RONX
to just reuse instead of reimplementing it.

JIT tested on x86_64 and s390x with BPF test suite.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 16:58:56 -07:00
Alexei Starovoitov
02ab695bb3 net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.

Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.

x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn

Note that old eBPF programs are binary compatible with new interpreter.

It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)

User space generated programs will use this instruction to load constants only.

To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 10:26:47 -07:00
Ingo Molnar
5ac385d835 * Fix early boot regression affecting x86 EFI boot stub when loading
initrds above 4GB - Yinghai Lu
 
  * Relocate GOT entries in the x86 EFI boot stub now that we have
    symbols with global visibility - Matt Fleming
 
  * fdt memory reservation fix for arm64 - Mark Salter
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUDqq/AAoJEC84WcCNIz1VgzEP/1Ax+XnXQjIRMGcR7gHolcan
 lzBzDL3afEp28LcWevmDZ9Bp4VRFjCRecg1gdI64HJhn+b7Ay1iPX/hUaIqfgPfb
 ptY8uAomNAwDyxC7z0S13GNiZZxPKB6eHnoV8t2Hi3uM8oUnkca/WTXHOyXs+gJG
 4fQZtXWn/T8j7vAXuHGSbdH1pF4HYf2vX9i0c7iWVIcKyl+Oe5xGMcql4BqPJnAz
 6hN9etyRMWF37CHZjD1pH0YHhRMJ6uuqUFvUQZt2q+OPUzgYVPv1Es6984r5q2CI
 HHQK2RSfHifYhNuLHuQo+8hOzz41pTriUrrDLDYk9SXDaJM4nHF6n2AXvra320P3
 Xa0TR87+DxOdCM+1s1LeLl/9wMrwz1tgx8m9St16yISnRcGkkJrWYeV9z4PXYsi5
 Qe1uGFS4eVWMAGVuaQgOP/olLAOxr1Vxwrnci+mg4Zh5LgohDZ4FBqbDdMeP3GIF
 vuI+yNnH9jxqmKZXD7wKtxVmS5s3vB3bH0+H8fFCMdBfUWqcM2CA0QJjSCsYGgkB
 mv5jaccRBk8WlI4KrDDuJ2BzM5prg59XTsO0m8oaloCk8b2OEvOte3XsF1DAkYh+
 DnMbESfyDJxc6OFzq6pzAFeY5JUbSgWe0AwnyDJ3Woo9qCCpSkQHImllohRuXVgO
 BruJYYr5r55mjhyDWzJk
 =DjZA
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent

Pull EFI fixes from Matt Fleming:

  * Fix early boot regression affecting x86 EFI boot stub when loading
    initrds above 4GB - Yinghai Lu

  * Relocate GOT entries in the x86 EFI boot stub now that we have
    symbols with global visibility - Matt Fleming

  * fdt memory reservation fix for arm64 - Mark Salter

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09 16:56:12 +02:00
Jan-Simon Möller
cc99535eb4 x86/mm: Apply the section attribute to the variable, not its type
This fixes a compilation error in clang in that a linker section
attribute can't be added to a type:

  arch/x86/mm/mmap.c:34:8: error: '__section__' attribute only applies to functions and global variables struct __read_mostly
  ...

By moving the section attribute to the variable declaration, the
desired effect is achieved.

Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409959005-11479-1-git-send-email-behanw@converseincode.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09 07:13:39 +02:00
Andi Kleen
a08b6769d4 perf/x86: Fix section mismatch in split uncore driver
The new split Intel uncore driver code that recently went
into tip added a section mismatch, which the build process
complains about.

uncore_pmu_register() can be called from uncore_pci_probe,()
which is not __init and can be called from pci driver ->probe.
I'm not fully sure if it's actually possible to call the probe
function later, but it seems safer to mark uncore_pmu_register
not __init.

This also fixes the warning.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1409332858-29039-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09 06:53:08 +02:00
Mathias Krause
066ce64c7e perf/x86/intel: Mark initialization code as such
A few of the initialization functions are missing the __init annotation.
Fix this and thereby allow ~680 additional bytes of code to be released
after initialization.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1409071785-26015-1-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09 06:53:06 +02:00
Ingo Molnar
bdea534db8 Linux 3.17-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUDOW+AAoJEHm+PkMAQRiGOXYH/00TPKm8PdM5cXXG2YYYv9eT
 W99K7KD2i0/qiVtlGgjjvB7fO3K0HcZusTd2jmVd8IWntXvauq7Zpw5YZkjwu4KX
 Y1HCwwCd2aw0FoqgrJhNP3+j5Cr1BD/HLtbffjCe+A3tppOIis4Bwt2wJOoYlXpS
 hU9Jxxc4lcRo8YKbffouDo7PIneWeJy8N+WGpUR5BfJIEK0ZZtCUqn3/3WLX4FYu
 fE6uiF/bACTpKXU/mo4dDbhZp439H/QdwQc9B0F8+8CBDMXKaNHrPV7kN36T2SWa
 fD4boikTsi/yh9Ks1fvHbvNq2N0ihoMnja+vLRyvjAcAQv2fKG3OZtYgFWSdghU=
 =Xknd
 -----END PGP SIGNATURE-----

Merge tag 'v3.17-rc4' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09 06:48:07 +02:00
Mel Gorman
59f6e2073c percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
A commit in linux-next was causing boot to fail and bisection
identified the patch 4ba2968420 ("percpu: Resolve ambiguities in
__get_cpu_var/cpumask_var_").  One of the changes in that patch looks
very suspicious.  Reverting the full patch fixes boot as does this
fixlet.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
2014-09-09 07:17:43 +09:00
Andy Lutomirski
1dcf74f6ed x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls
On KVM on my box, this reduces the overhead from an always-accept
seccomp filter from ~130ns to ~17ns.  Most of that comes from
avoiding IRET on every syscall when seccomp is enabled.

In extremely approximate hacked-up benchmarking, just bypassing IRET
saves about 80ns, so there's another 43ns of savings here from
simplifying the seccomp path.

The diffstat is also rather nice :)

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/a3dbd267ee990110478d349f78cccfdac5497a84.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-08 14:14:12 -07:00
Andy Lutomirski
54eea9957f x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
the syscall number into regs->orig_ax prior to any possible tracing
and syscall execution.  This is user-visible ABI used by ptrace
syscall emulation and seccomp.

For fastpath syscalls, there's no good reason not to do the same
thing.  It's even slightly simpler than what we're currently doing.
It probably has no measureable performance impact.  It should have
no user-visible effect.

The purpose of this patch is to prepare for two-phase syscall
tracing, in which the first phase might modify the saved RAX without
leaving the fast path.  This change is just subtle enough that I'm
keeping it separate.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/01218b493f12ae2f98034b78c9ae085e38e94350.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-08 14:14:08 -07:00
Andy Lutomirski
e0ffbaabc4 x86: Split syscall_trace_enter into two phases
This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

The intent is that phase 1 can be called from the syscall fast path.

In this implementation, phase1 can handle any combination of
TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
unless seccomp requests a ptrace event, in which case phase2 is
forced.

In principle, this could yield a big speedup for TIF_NOHZ as well as
for TIF_SECCOMP if syscall exit work were similarly split up.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2df320a600020fda055fccf2b668145729dd0c04.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-08 14:14:03 -07:00