The protocol for suspending or migrating an LPAR requires all present
processor threads to enter H_JOIN. So if we have threads offline, we
have to temporarily bring them up. This can race with administrator
actions such as SMT state changes. As of dfd718a2ed ("powerpc/rtas:
Fix a potential race between CPU-Offline & Migration"),
rtas_ibm_suspend_me() accounts for this, but errors out with -EBUSY
for what almost certainly is a transient condition in any reasonable
scenario.
Callers of rtas_ibm_suspend_me() already retry when -EAGAIN is
returned, and it is typical during a migration for that to happen
repeatedly for several minutes polling the H_VASI_STATE hcall result
before proceeding to the next stage.
So return -EAGAIN instead of -EBUSY when this race is
encountered. Additionally: logging this event is still appropriate but
use pr_info instead of pr_err; and remove use of unlikely() while here
as this is not a hot path at all.
Fixes: dfd718a2ed ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge our fixes branch into next, this brings in a number of commits
that fix bugs we don't want to hit in next, in particular the fix for
CVE-2019-12817.
One fix for a regression in my commit adding KUAP (Kernel User Access
Prevention) on Radix, which incorrectly touched the AMR in the early machine
check handler.
Thanks to:
Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdF00aAAoJEFHr6jzI4aWAT2UQAJCnXrBsNJd7WikZE8NzwdmM
G6bioGCSPgNuWDwaxgpi6RSilET3poBBt+NpttgOslZtzif/5mrLIuYqwQYgOTbL
Oa4CnzVBHnBDFKqcqe/Sm7cKuvd7KO8RVbyfhNuQbm1y9Nqr3vPYKwQ6CTz7bth4
AatNvjP12Ag8hDwk3VpOOiG88jKpj/N3V7PLNWOt9jn8B3rCWm5/7xZ84VSNWdRQ
/MvdGAcFAboywZMj44u8mBpT7+EueFa/vVbpCj8gv9QhRSSGwSL1jZ5wNu2Iv6D+
IxxZqdO3KHJVixEAC4fs5KWCuA84uhjlRMkP2BXTgKNZT3qXaLx0e8Qv9okg/xAU
dAuZEQ0cv+gxdCblEiVZ+jjG0LQsntwXJwnsCeWjcHQr6S0umd2utFLl1N3HTqfx
QhgatD5pTGvGU2WHO4+dhXeh0nITVfcB2E3cM0DHUgCESc1BGmK0MtS1kHYiQptt
BMY5Y92D3vndmnoLTZzQ2DFj5of2u49+y0Cpti7RhJN9yV836bPGm1K8GnropHz8
7HHYS4hV3HBFUlYH7zHLp4BMNg3nkdTK+WTR6HwFFSREzM59NZtVg5xJVk0j66GK
mZIJoVOSQ0Sac03xYqwtdxdupxoulXy+khBcjC56OxxOEMIfjS66ZnawTDhI2jVf
EI7VE3Y4hzrA4pMTw9fp
=I22i
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One fix for a regression in my commit adding KUAP (Kernel User Access
Prevention) on Radix, which incorrectly touched the AMR in the early
machine check handler.
Thanks to Nicholas Piggin"
* tag 'powerpc-5.2-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/exception: Fix machine check early corrupting AMR
The early machine check runs in real mode, so locking is unnecessary.
Worse, the windup does not restore AMR, so this can result in a false
KUAP fault after a recoverable machine check hits inside a user copy
operation.
Fix this similarly to HMI by just avoiding the kuap lock in the
early machine check handler (it will be set by the late handler that
runs in virtual mode if that runs). If the virtual mode handler is
reached, it will lock and restore the AMR.
Fixes: 890274c2dc ("powerpc/64s: Implement KUAP for Radix MMU")
Cc: Russell Currey <ruscur@russell.cc>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The return value is fixed. Remove it and amend the callers.
[ tglx: Fixup arm/bL_switcher and powerpc/rtas ]
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20190613064813.8102-2-namit@vmware.com
Seven fixes, all for bugs introduced this cycle.
The commit to add KASAN support broke booting on 32-bit SMP machines, due to a
refactoring that moved some setup out of the secondary CPU path.
A fix for another 32-bit SMP bug introduced by the fast syscall entry
implementation for 32-bit BOOKE. And a build fix for the same commit.
Our change to allow the DAWR to be force enabled on Power9 introduced a bug in
KVM, where we clobber r3 leading to a host crash.
The same commit also exposed a previously unreachable bug in the nested KVM
handling of DAWR, which could lead to an oops in a nested host.
One of the DMA reworks broke the b43legacy WiFi driver on some people's
powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit.
A fix for TLB flushing in KVM introduced a new bug, as it neglected to also
flush the ERAT, this could lead to memory corruption in the guest.
Thanks to:
Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry Finger, Michael
Neuling, Suraj Jitindar Singh.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdDg4YAAoJEFHr6jzI4aWACKYP/RG1cqYDjEWz0N9bxjAOanx6
z//hPZqZrObEORx0mek07LNj6JDy4eL7CB9WaEudJjHt7mYugLYq0g7hUMVvBWnB
irFEuzGJ8EgWl1aMbmz+fgf49PBIuroy2o/4pyzzQXoDaw44QyUaCke2VEBskQNG
RW64C2rDVrPgpRHzBB9EZVNv7svmo6ERJsEpRvqP3PZG1ZxgXW+DXbEdSmJCcgAt
8oI+z6frRv+0ez+nge7TULo8DuheShfxc7l0jFrd48i35v2qB/IowPr8cof9fRwM
TqnB+3dZXHPKPz6J9mz80p9ZDe1omLzg6i9EiR2/7a3XGpRBo7kCg3Iri7N5pu0j
LotK9l1+mXWLy5P6lOHH5/tEHv52Wqsvh5IetpNJ2tgXp3MzbOc1/Ut9h7Ag7cRw
WRa7tNXQ5Ud8uPM1Pds8Ymhd+/nZ9RItjGcu6S095/OGpM1FJR9a0QnfUHMyfyuX
kAGrJDWcAkCd/Q9tKHsQotuZAFmRCQe4JFkzTiGzwdjYWYgtTA1c/eIv3+SG7eLV
1dsaIYzIS56b+Qz2Qc/pKHwho+I9o505Y7LFXxlCGXDDjyI72ioTQDwiSBzaZdc9
ORwNchLfpXNpiNXRoRqAnqmhWxavYmA6oJ13RDBiMBxIUWHynVbEzLlX9fPNdBFj
Kw3Zd15znokXBzU+1mDE
=Ju1y
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"This is a frustratingly large batch at rc5. Some of these were sent
earlier but were missed by me due to being distracted by other things,
and some took a while to track down due to needing manual bisection on
old hardware. But still we clearly need to improve our testing of KVM,
and of 32-bit, so that we catch these earlier.
Summary: seven fixes, all for bugs introduced this cycle.
- The commit to add KASAN support broke booting on 32-bit SMP
machines, due to a refactoring that moved some setup out of the
secondary CPU path.
- A fix for another 32-bit SMP bug introduced by the fast syscall
entry implementation for 32-bit BOOKE. And a build fix for the same
commit.
- Our change to allow the DAWR to be force enabled on Power9
introduced a bug in KVM, where we clobber r3 leading to a host
crash.
- The same commit also exposed a previously unreachable bug in the
nested KVM handling of DAWR, which could lead to an oops in a
nested host.
- One of the DMA reworks broke the b43legacy WiFi driver on some
people's powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit.
- A fix for TLB flushing in KVM introduced a new bug, as it neglected
to also flush the ERAT, this could lead to memory corruption in the
guest.
Thanks to: Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry
Finger, Michael Neuling, Suraj Jitindar Singh"
* tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode
KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
powerpc/32: fix build failure on book3e with KVM
powerpc/booke: fix fast syscall entry on SMP
powerpc/32s: fix initial setup of segment registers on secondary CPU
When the firmware does PCI BAR resource allocation, it passes the assigned
addresses and flags (prefetch/64bit/...) via the "reg" property of
a PCI device device tree node so the kernel does not need to do
resource allocation.
The flags are stored in resource::flags - the lower byte stores
PCI_BASE_ADDRESS_SPACE/etc bits and the other bytes are IORESOURCE_IO/etc.
Some flags from PCI_BASE_ADDRESS_xxx and IORESOURCE_xxx are duplicated,
such as PCI_BASE_ADDRESS_MEM_PREFETCH/PCI_BASE_ADDRESS_MEM_TYPE_64/etc.
When parsing the "reg" property, we copy the prefetch flag but we skip
on PCI_BASE_ADDRESS_MEM_TYPE_64 which leaves the flags out of sync.
The missing IORESOURCE_MEM_64 flag comes into play under 2 conditions:
1. we remove PCI_PROBE_ONLY for pseries (by hacking pSeries_setup_arch()
or by passing "/chosen/linux,pci-probe-only");
2. we request resource alignment (by passing pci=resource_alignment=
via the kernel cmd line to request PAGE_SIZE alignment or defining
ppc_md.pcibios_default_alignment which returns anything but 0). Note that
the alignment requests are ignored if PCI_PROBE_ONLY is enabled.
With 1) and 2), the generic PCI code in the kernel unconditionally
decides to:
- reassign the BARs in pci_specified_resource_alignment() (works fine)
- write new BARs to the device - this fails for 64bit BARs as the generic
code looks at IORESOURCE_MEM_64 (not set) and writes only lower 32bits
of the BAR and leaves the upper 32bit unmodified which breaks BAR mapping
in the hypervisor.
This fixes the issue by copying the flag. This is useful if we want to
enforce certain BAR alignment per platform as handling subpage sized BARs
is proven to cause problems with hotplug (SLOF already aligns BARs to 64k).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Based on 1 normalized pattern(s):
gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 58 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081207.556988620@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 2 normalized pattern(s):
this source code is licensed under the gnu general public license
version 2 see the file copying for more details
this source code is licensed under general public license version 2
see
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 52 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
powerpc hardware triggers watchpoint before executing the instruction.
To make trigger-after-execute behavior, kernel emulates the
instruction. If the instruction is 'load something into non-volatile
register', exception handler should restore emulated register state
while returning back, otherwise there will be register state
corruption. eg, adding a watchpoint on a list can corrput the list:
# cat /proc/kallsyms | grep kthread_create_list
c00000000121c8b8 d kthread_create_list
Add watchpoint on kthread_create_list->prev:
# perf record -e mem:0xc00000000121c8c0
Run some workload such that new kthread gets invoked. eg, I just
logged out from console:
list_add corruption. next->prev should be prev (c000000001214e00), \
but was c00000000121c8b8. (next=c00000000121c8b8).
WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
...
NIP __list_add_valid+0xb4/0xc0
LR __list_add_valid+0xb0/0xc0
Call Trace:
__list_add_valid+0xb0/0xc0 (unreliable)
__kthread_create_on_node+0xe0/0x260
kthread_create_on_node+0x34/0x50
create_worker+0xe8/0x260
worker_thread+0x444/0x560
kthread+0x160/0x1a0
ret_from_kernel_thread+0x5c/0x70
List corruption happened because it uses 'load into non-volatile
register' instruction:
Snippet from __kthread_create_on_node:
c000000000136be8: addis r29,r2,-19
c000000000136bec: ld r29,31424(r29)
if (!__list_add_valid(new, prev, next))
c000000000136bf0: mr r3,r30
c000000000136bf4: mr r5,r28
c000000000136bf8: mr r4,r29
c000000000136bfc: bl c00000000059a2f8 <__list_add_valid+0x8>
Register state from WARN_ON():
GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075
GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000
GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa
GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00
GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370
GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000
GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628
GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0
Watchpoint hit at 0xc000000000136bec.
addis r29,r2,-19
=> r29 = 0xc000000001344e00 + (-19 << 16)
=> r29 = 0xc000000001214e00
ld r29,31424(r29)
=> r29 = *(0xc000000001214e00 + 31424)
=> r29 = *(0xc00000000121c8c0)
0xc00000000121c8c0 is where we placed a watchpoint and thus this
instruction was emulated by emulate_step. But because handle_dabr_fault
did not restore emulated register state, r29 still contains stale
value in above register state.
Fixes: 5aae8a5370 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: stable@vger.kernel.org # 2.6.36+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Previously, only IBAT1 and IBAT2 were used to map kernel linear mem.
Since commit 63b2bc6195 ("powerpc/mm/32s: Use BATs for
STRICT_KERNEL_RWX"), we may have all 8 BATs used for mapping
kernel text. But the suspend/restore functions only save/restore
BATs 0 to 3, and clears BATs 4 to 7.
Make suspend and restore functions respectively save and reload
the 8 BATs on CPUs having MMU_FTR_USE_HIGH_BATS feature.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
One fix for a regression introduced by our 32-bit KASAN support, which broke
booting on machines with "bootx" early debugging enabled.
A fix for a bug which broke kexec on 32-bit, introduced by changes to the 32-bit
STRICT_KERNEL_RWX support in v5.1.
Finally two fixes going to stable for our THP split/collapse handling,
discovered by Nick. The first fixes random crashes and/or corruption in guests
under sufficient load.
Thanks to:
Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu Malaterre.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdBOivAAoJEFHr6jzI4aWA/T0P/1pmr4JLWzXOPQOGOwS2mmu5
HhXBFuDflpGiWS31syOhfKhiE2qlwdIcGaclSo1wAgUnMKp+sxVEagF6DEt484r3
DXs3eRyrGu5vQT7Q6yReuT3Kw2ZR474a5ob00WGAQosBKyJF4ZHWz16ETVWMdAMQ
TknEEU3hOUnMWWIEvLnZOKT7eJcmzj5IYy1OtLHBjWiHHizGC8PSxdVhiRcD/O6R
F6C7XrFb7RRj5ran6gxwMbcTvjgu922TSQPOCw93qnXYWLfvWDUXC4yCqY21oHnr
b3zgJNgIdSoMYxE8pfOH7Y+eaJrbzgnhlS5OJNEz/4NOfGQnJXYcSF8QO6eeVoKM
L2SkT1Ov+QMmZQjMC5e9OAe7DFHfM59RYFg11eaUqfiaObRsmwu8rqjngITV5Ede
Ydq3W39XQkjB3aQ4qb0MnEBbVgyQ/y6/T5hoHRlvnb5byFk5Pd3jpub56sq87UGM
M4GD3YdmT5eBqzGxApddyIiS839PZpdw7g/Ivtp2GYCtiNNZcJqaXvN7IwfcVF3c
YJrCfhNTUJPjICIL9k9v2RQovGuGZqtM3BHc8PmyVvDxTqtEbh8B9GEg+QC58xp/
s+UI2sEd/Vg6NVKluIJHau7aEmJlaxYIshqsIhYvPgJRN6KrFMhdfPNx7D+zMlu8
nbM6Z1n2VFk9Cxb0vA9K
=MXJI
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix for a regression introduced by our 32-bit KASAN support, which
broke booting on machines with "bootx" early debugging enabled.
A fix for a bug which broke kexec on 32-bit, introduced by changes to
the 32-bit STRICT_KERNEL_RWX support in v5.1.
Finally two fixes going to stable for our THP split/collapse handling,
discovered by Nick. The first fixes random crashes and/or corruption
in guests under sufficient load.
Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
Malaterre"
* tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
powerpc/64s: Fix THP PMD collapse serialisation
powerpc: Fix kexec failure on book3s/32
Build failure was introduced by the commit identified below,
due to missed macro expension leading to wrong called function's name.
arch/powerpc/kernel/head_fsl_booke.o: In function `SystemCall':
arch/powerpc/kernel/head_fsl_booke.S:416: undefined reference to `kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1'
Makefile:1052: recipe for target 'vmlinux' failed
The called function should be kvmppc_handler_8_0x01B(). This patch fixes it.
Reported-by: Paul Mackerras <paulus@ozlabs.org>
Fixes: 1a4b739bbb ("powerpc/32: implement fast entry for syscalls on BOOKE")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Otherwise, the following warning is encountered:
WARNING: vmlinux.o(.text+0x3dc6): Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup()
The function start_here_multiplatform() references
the function __init .early_setup().
This is often because start_here_multiplatform lacks a __init
annotation or the annotation of .early_setup is wrong.
Fixes: 56c46bba9b ("powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX")
Cc: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use r10 instead of r9 to calculate CPU offset as r9 contains
the value from SRR1 which is used later.
Fixes: 1a4b739bbb ("powerpc/32: implement fast entry for syscalls on BOOKE")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch referenced below moved the loading of segment registers
out of load_up_mmu() in order to do it earlier in the boot sequence.
However, the secondary CPU still needs it to be done when loading up
the MMU.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Fixes: 215b823707 ("powerpc/32s: set up an early static hash table for KASAN")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Allow external callers to force the cacheinfo code to release all its
references to cache nodes, e.g. before processing device tree updates
post-migration, and to rebuild the hierarchy afterward.
CPU online/offline must be blocked by callers; enforce this.
Fixes: 410bccf978 ("powerpc/pseries: Partition migration in the kernel")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The declaration for pfn_is_nosave is only available in
kernel/power/power.h. Since this function can be override in arch,
expose it globally. Having a prototype will make sure to avoid warning
(sometime treated as error with W=1) such as:
arch/powerpc/kernel/suspend.c:18:5: error: no previous prototype for 'pfn_is_nosave' [-Werror=missing-prototypes]
This moves the declaration into a globally visible header file and add
missing include to avoid a warning on powerpc.
Also remove the duplicated prototypes since not required anymore.
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When booting through OF, setup_disp_bat() does nothing because
disp_BAT are not set. By change, it used to work because BOOTX
buffer is mapped 1:1 at address 0x81000000 by the bootloader, and
btext_setup_display() sets virt addr same as phys addr.
But since commit 215b823707 ("powerpc/32s: set up an early static
hash table for KASAN."), a temporary page table overrides the
bootloader mapping.
This 0x81000000 is also problematic with the newly implemented
Kernel Userspace Access Protection (KUAP) because it is within user
address space.
This patch fixes those issues by properly setting disp_BAT through
a call to btext_prepare_BAT(), allowing setup_disp_bat() to
properly setup BAT3 for early bootx screen buffer access.
Reported-by: Mathieu Malaterre <malat@debian.org>
Fixes: 215b823707 ("powerpc/32s: set up an early static hash table for KASAN.")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the old days, _PAGE_EXEC didn't exist on 6xx aka book3s/32.
Therefore, allthough __mapin_ram_chunk() was already mapping kernel
text with PAGE_KERNEL_TEXT and the rest with PAGE_KERNEL, the entire
memory was executable. Part of the memory (first 512kbytes) was
mapped with BATs instead of page table, but it was also entirely
mapped as executable.
In commit 385e89d5b2 ("powerpc/mm: add exec protection on
powerpc 603"), we started adding exec protection to some 6xx, namely
the 603, for pages mapped via pagetables.
Then, in commit 63b2bc6195 ("powerpc/mm/32s: Use BATs for
STRICT_KERNEL_RWX"), the exec protection was extended to BAT mapped
memory, so that really only the kernel text could be executed.
The problem here is that kexec is based on copying some code into
upper part of memory then executing it from there in order to install
a fresh new kernel at its definitive location.
However, the code is position independant and first part of it is
just there to deactivate the MMU and jump to the second part. So it
is possible to run this first part inplace instead of running the
copy. Once the MMU is off, there is no protection anymore and the
second part of the code will just run as before.
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Fixes: 63b2bc6195 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
While the TIF_SYSCALL_EMU is set in ptrace_resume independent of any
architecture, currently only powerpc and x86 unset the TIF_SYSCALL_EMU
flag in ptrace_disable which gets called from ptrace_detach.
Let's move the clearing of TIF_SYSCALL_EMU flag to __ptrace_unlink
which gets executed from ptrace_detach and also keep it along with
or close to clearing of TIF_SYSCALL_TRACE.
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Based on 1 normalized pattern(s):
distribute under gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 8 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.475576622@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 59 temple place suite 330 boston ma 02111
1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 136 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 of the license this program
is distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 100 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141900.918357685@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation 51 franklin street fifth floor boston ma 02110
1301 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 67 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.953658117@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A minor fix to our IMC PMU code to print a less confusing error message when the
driver can't initialise properly.
A fix for a bug where a user requesting an unsupported branch sampling filter
can corrupt PMU state, preventing the PMU from counting properly.
And finally a fix for a bug in our support for kexec_file_load(), which
prevented loading a kernel and initramfs. Most versions of kexec don't yet use
kexec_file_load().
Thanks to:
Anju T Sudhakar, Dave Young, Madhavan Srinivasan, Ravi Bangoria, Thiago Jung
Bauermann.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJc86afAAoJEFHr6jzI4aWAIC8P/R/1Y8ro94tmu58MvqNmG80F
Rc/+/1dY023X/itMHikSzPzs1UnQ1sjSKOOk3WOJcgVBE6Ks9Q7peeWD6wnSwYUq
PwvXAZNmsD0Bgh9Oazj7M4dMSZHq7zJzQ8WzopltVueKgx941BLwYPRZG5zFXfcV
k2Zaln87wfOuyfl/gQyWuG2ikziHJl9EvS3IIFejAnnogkFMti/FAe5bYXP3IT4W
c4whLQJMNLeaeA7zeFBUvotWCZVfCq+Li9QTKWYJIPIZqN3kSBP3wHuUC8wEFuXa
32+qu7RW5NfbuJERtCJYXqjO9C9A/QgOAm2n0LF2vBzo5CwBIqiR69PRuVzzUekU
NhYsHbbwviRl1By7pxUCNm48TW+3utm903etEp+PPV3FYdG+E2hlhs8NoUEdiN5X
PWzEMqUqneN6+HnwpEeeiDQef19XTaGr3PUd8CtJYJQ4rSeyvnBASjGKfVuBW6FL
85aCwsmAYdapJOyhK5OSaDEWwMnpLBClJKqNrdH7y9fFm2lFHSJSNxHERBtSQcfx
Ra+F7M0tyPRJFmVfXiX8u2ZYvYBLUIYUV1+mJHokEFOR9tW+Oy1lV3UMlXzWZXqb
CY3qpCk0ix1rLjv8VpDoVQE2LkMhhUGqsgymSFueIwsrQhfhb/ymiVmaEH202tTM
DyvBWmVuI9vyqR3Vt9GQ
=ilhV
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A minor fix to our IMC PMU code to print a less confusing error
message when the driver can't initialise properly.
A fix for a bug where a user requesting an unsupported branch sampling
filter can corrupt PMU state, preventing the PMU from counting
properly.
And finally a fix for a bug in our support for kexec_file_load(),
which prevented loading a kernel and initramfs. Most versions of kexec
don't yet use kexec_file_load().
Thanks to: Anju T Sudhakar, Dave Young, Madhavan Srinivasan, Ravi
Bangoria, Thiago Jung Bauermann"
* tag 'powerpc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kexec: Fix loading of kernel + initramfs with kexec_file_load()
powerpc/perf: Fix MMCRA corruption by bhrb_filter
powerpc/powernv: Return for invalid IMC domain
On POWER9, if the hypervisor supports XIVE exploitation mode, the
guest OS will unconditionally requests for the XIVE interrupt mode
even if XIVE was deactivated with the kernel command line xive=off.
Later on, when the spapr XIVE init code handles xive=off, it disables
XIVE and tries to fall back on the legacy mode XICS.
This discrepency causes a kernel panic because the hypervisor is
configured to provide the XIVE interrupt mode to the guest :
kernel BUG at arch/powerpc/sysdev/xics/xics-common.c:135!
...
NIP xics_smp_probe+0x38/0x98
LR xics_smp_probe+0x2c/0x98
Call Trace:
xics_smp_probe+0x2c/0x98 (unreliable)
pSeries_smp_probe+0x40/0xa0
smp_prepare_cpus+0x62c/0x6ec
kernel_init_freeable+0x148/0x448
kernel_init+0x2c/0x148
ret_from_kernel_thread+0x5c/0x68
Look for xive=off during prom_init and don't ask for XIVE in this
case. One exception though: if the host only supports XIVE, we still
want to boot so we ignore xive=off.
Similarly, have the spapr XIVE init code to looking at the interrupt
mode negotiated during CAS, and ignore xive=off if the hypervisor only
supports XIVE.
Fixes: eac1e731b5 ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.20
Reported-by: Pavithra R. Prakash <pavrampu@in.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit eab00a208e ("powerpc: Move `path` variable inside
DEBUG_PROM") DEBUG_PROM sentinels were added to silence a warning
(treated as error with W=1):
arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable]
Rework the original patch and simplify the code, by removing the
variable ‘path’ completely. Fix line over 90 characters.
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Based on 1 normalized pattern(s):
licensed under gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 99 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170027.163048684@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 02111 1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1334 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.
The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.
The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:
force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.
This also makes it clear force_sig passes current into force_sig_info.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
675 mass ave cambridge ma 02139 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 441 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071858.739733335@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b6664ba42f ("s390, kexec_file: drop arch_kexec_mem_walk()")
changed kexec_add_buffer() to skip searching for a memory location if
kexec_buf.mem is already set, and use the address that is there.
In powerpc code we reuse a kexec_buf variable for loading both the
kernel and the initramfs by resetting some of the fields between those
uses, but not mem. This causes kexec_add_buffer() to try to load the
kernel at the same address where initramfs will be loaded, which is
naturally rejected:
# kexec -s -l --initrd initramfs vmlinuz
kexec_file_load failed: Invalid argument
Setting the mem field before every call to kexec_add_buffer() fixes
this regression.
Fixes: b6664ba42f ("s390, kexec_file: drop arch_kexec_mem_walk()")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge yet more updates from Andrew Morton:
"A few final bits:
- large changes to vmalloc, yielding large performance benefits
- tweak the console-flush-on-panic code
- a few fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
panic: add an option to replay all the printk message in buffer
initramfs: don't free a non-existent initrd
fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
mm/vmalloc.c: keep track of free blocks for vmap allocation
One fix going back to stable, for a bug on 32-bit introduced when we added
support for THREAD_INFO_IN_TASK.
A fix for a typo in a recent rework of our hugetlb code that leads to crashes on
64-bit when using hugetlbfs with a 4K PAGE_SIZE.
Two fixes for our recent rework of the address layout on 64-bit hash CPUs, both
only triggered when userspace tries to access addresses outside the user or
kernel address ranges.
Finally a fix for a recently introduced double free in an error path in our
cacheinfo code.
Thanks to:
Aneesh Kumar K.V, Christophe Leroy, Sachin Sant, Tobin C. Harding.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJc3+WRAAoJEFHr6jzI4aWAlN4QAIP7UnQUApYzQX8YrMJEiip7
1Jtez373pgDEX21McmaznyYRy04gtLWVRV0D5vRsTCG5RHBOVt2KPXe0PrzyjJws
/AxS9aRuMgM+VkLkD9c6DzWsuBLP8kJMfuTbmY7+C0tByQQT9Xfp+K7/gC5kGlK4
igyTGZvALlEVilsjTQO/FhADWisYIHM+y2/e0ocxLlK5F+TxdJKpESyUzjMOT78i
SmAn1qufZedV7ZH91VmvLm8f8MSqg+t1NTwpAnXuS58z2eSNisRhUcP43K4XdAez
QTGODTgUGGzLsbnIq8KJB/X4YVLwtTR2UM3p6ubTwlDuVkBfrFPUHowywDmxoPHZ
Kh6KWlRFAL49sGmYMJpfkaEiyN+J+VLN2HMdNLW2HgnROzDSOXCXcjuAuBE9nAKe
kvMV7Zwb4mQ0j0QUxHuugbBRzUpAcDg+PNZKKb9P/jIMWlMhGyb9fjGxC1yJfDtB
Kq03zJ+0lzvDCr1XbcDhSv4Fu6Dxxfi7GX9BXlzHTzsGFwZKICj9sLXDXXjFh4h8
CDNAeWtol1WE2xj315LL6WTlONUMQJ3wDAzd5VJw7SewadDzpK57vuwba88IN9hd
2PmH7FF8VyrksKmqt19ZsF0X7n7JL3+xbOcV9OT0r0DjkkM1fC6zo+JT6YYIER6A
FE+pFQRQJndKRzHqvyBJ
=2265
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix going back to stable, for a bug on 32-bit introduced when we
added support for THREAD_INFO_IN_TASK.
A fix for a typo in a recent rework of our hugetlb code that leads to
crashes on 64-bit when using hugetlbfs with a 4K PAGE_SIZE.
Two fixes for our recent rework of the address layout on 64-bit hash
CPUs, both only triggered when userspace tries to access addresses
outside the user or kernel address ranges.
Finally a fix for a recently introduced double free in an error path
in our cacheinfo code.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Sachin Sant, Tobin C.
Harding"
* tag 'powerpc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/cacheinfo: Remove double free
powerpc/mm/hash: Fix get_region_id() for invalid addresses
powerpc/mm: Drop VM_BUG_ON in get_region_id()
powerpc/mm: Fix crashes with hugepages & 4K pages
powerpc/32s: fix flush_hash_pages() on SMP
Currently on panic, kernel will lower the loglevel and print out pending
printk msg only with console_flush_on_panic().
Add an option for users to configure the "panic_print" to replay all
dmesg in buffer, some of which they may have never seen due to the
loglevel setting, which will help panic debugging .
[feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()]
Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: use logbuf lock to protect the console log index]
Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Aaro Koskinen <aaro.koskinen@nokia.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs mount updates from Al Viro:
"Propagation of new syscalls to other architectures + cosmetic change
from Christian (fscontext didn't follow the convention for anon inode
names)"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]
uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
uapi, fsopen: use square brackets around "fscontext" [ver #2]
kfree() after kobject_put(). Who ever wrote this was on crack.
Fixes: 7e8039795a ("powerpc/cacheinfo: Fix kobject memleak")
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Wire up the mount API syscalls on non-x86 arches.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.
Link: http://lkml.kernel.org/r/20190413224438.10802-3-okaya@kernel.org
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Florian Westphal <fw@strlen.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
place. We need to eliminate potential issues beforehand.
If it is enabled for powerpc, the following modpost warnings are
reported:
WARNING: vmlinux.o(.text.unlikely+0x20): Section mismatch in reference from the function .prom_getprop() to the function .init.text:.call_prom()
The function .prom_getprop() references the function __init .call_prom().
This is often because .prom_getprop lacks a __init annotation or the annotation of .call_prom is wrong.
WARNING: vmlinux.o(.text.unlikely+0x3c): Section mismatch in reference from the function .prom_getproplen() to the function .init.text:.call_prom()
The function .prom_getproplen() references the function __init .call_prom().
This is often because .prom_getproplen lacks a __init annotation or the annotation of .call_prom is wrong.
Link: http://lkml.kernel.org/r/20190423034959.13525-9-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights:
- Support for Kernel Userspace Access/Execution Prevention (like
SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents the kernel
from accidentally accessing userspace outside copy_to/from_user(), or
ever executing userspace.
- KASAN support on 32-bit.
- Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to use the
same address ranges we use with the Radix MMU.
- A rewrite into C of large parts of our idle handling code for 64-bit Book3S
(ie. power8 & power9).
- A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup in the
null_syscall benchmark.
- On 64-bit bare metal we have support for recovering from errors with the time
base (our clocksource), however if that fails currently we hang in __delay()
and never crash. We now have support for detecting that case and short
circuiting __delay() so we at least panic() and reboot.
- Add support for optionally enabling the DAWR on Power9, which had to be
disabled by default due to a hardware erratum. This has the effect of
enabling hardware breakpoints for GDB, the downside is a badly behaved
program could crash the machine by pointing the DAWR at cache inhibited
memory. This is opt-in obviously.
- xmon, our crash handler, gets support for a read only mode where operations
that could change memory or otherwise disturb the system are disabled.
Plus many clean-ups, reworks and minor fixes etc.
Thanks to:
Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Ben Hutchings,
Bo YU, Breno Leitao, Cédric Le Goater, Christopher M. Riedl, Christoph
Hellwig, Colin Ian King, David Gibson, Ganesh Goudar, Gautham R. Shenoy,
George Spelvin, Greg Kroah-Hartman, Greg Kurz, Horia Geantă, Jagadeesh
Pagadala, Joel Stanley, Joe Perches, Julia Lawall, Laurentiu Tudor, Laurent
Vivier, Lukas Bulwahn, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu
Malaterre, Michael Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch,
Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi
Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith, Sukadev
Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler, Valentin
Schneider, Wei Yongjun, Wen Yang, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJc1WbwAAoJEFHr6jzI4aWAv5cP/iDskai4Az/GCa6yLj4b+det
7mc7tTOaEzhUtvfrYYfHgvvdNNzo1ETv7rqTdZqtWJ3xfwdeowLFXXZwSywZKUDB
bi4pcl2v55Qlf9kxgx9RDr6+4fTwGG4nhO2qPDJDR1umEih9mG/2HJ7d+Wnq6Va2
E9srd+R6Fa0ty88+9vzBtdyllnDK1XHu3ahsxCH62aRm79ucuVrxyydWmbbs5lJe
a7g/OQIPgZmObHhfXvw9DFkOvkp5Pm6hfHOeyQH2nTB5X6k0judWv00uoHTJgOuP
DKxZtDhaGnajUfuhQYboDPOuFjY7lkfgEXaagyZsjdudqridTMmv1iU1o7iy8BT4
AId4DyJbvFFgqRJkCwKzhKRRHPfFMfM7KTJ38GPZuPmniuULk9uiIy6JyY0tXO+l
UQEclPzOTPkAE12FBaOBuqZqTRuBQuokWQF8ZDPOxbNAixHgFoRd4Z9diNwCPpLu
+KoyCwd2Gm5DyX+mC85sWG28IPKi9Hhhw2XBOA5F4A2kH6uFa1BnERSRGYomx+pc
BvEXHglf/vgV0XUQZfDCsiOecIKYuWxgre0/liLhhU5qMss2pxHczzffH4KtdykS
9y7o3mVRcS7Moitbmb6SAJoQxbR5QhzfN832DbSd6jEfKdg1ytZlfHTG0WZYHKDs
PHs6V1N+cQANdukutrJz
=cUkd
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Slightly delayed due to the issue with printk() calling
probe_kernel_read() interacting with our new user access prevention
stuff, but all fixed now.
The only out-of-area changes are the addition of a cpuhp_state, small
additions to Documentation and MAINTAINERS updates.
Highlights:
- Support for Kernel Userspace Access/Execution Prevention (like
SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents
the kernel from accidentally accessing userspace outside
copy_to/from_user(), or ever executing userspace.
- KASAN support on 32-bit.
- Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to
use the same address ranges we use with the Radix MMU.
- A rewrite into C of large parts of our idle handling code for
64-bit Book3S (ie. power8 & power9).
- A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup
in the null_syscall benchmark.
- On 64-bit bare metal we have support for recovering from errors
with the time base (our clocksource), however if that fails
currently we hang in __delay() and never crash. We now have support
for detecting that case and short circuiting __delay() so we at
least panic() and reboot.
- Add support for optionally enabling the DAWR on Power9, which had
to be disabled by default due to a hardware erratum. This has the
effect of enabling hardware breakpoints for GDB, the downside is a
badly behaved program could crash the machine by pointing the DAWR
at cache inhibited memory. This is opt-in obviously.
- xmon, our crash handler, gets support for a read only mode where
operations that could change memory or otherwise disturb the system
are disabled.
Plus many clean-ups, reworks and minor fixes etc.
Thanks to: Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey
Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
Anton Blanchard, Ben Hutchings, Bo YU, Breno Leitao, Cédric Le Goater,
Christopher M. Riedl, Christoph Hellwig, Colin Ian King, David Gibson,
Ganesh Goudar, Gautham R. Shenoy, George Spelvin, Greg Kroah-Hartman,
Greg Kurz, Horia Geantă, Jagadeesh Pagadala, Joel Stanley, Joe
Perches, Julia Lawall, Laurentiu Tudor, Laurent Vivier, Lukas Bulwahn,
Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael
Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch, Nicholas Piggin,
Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi
Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith,
Sukadev Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler,
Valentin Schneider, Wei Yongjun, Wen Yang, YueHaibing"
* tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (205 commits)
powerpc/64s: Use early_mmu_has_feature() in set_kuap()
powerpc/book3s/64: check for NULL pointer in pgd_alloc()
powerpc/mm: Fix hugetlb page initialization
ocxl: Fix return value check in afu_ioctl()
powerpc/mm: fix section mismatch for setup_kup()
powerpc/mm: fix redundant inclusion of pgtable-frag.o in Makefile
powerpc/mm: Fix makefile for KASAN
powerpc/kasan: add missing/lost Makefile
selftests/powerpc: Add a signal fuzzer selftest
powerpc/booke64: set RI in default MSR
ocxl: Provide global MMIO accessors for external drivers
ocxl: move event_fd handling to frontend
ocxl: afu_irq only deals with IRQ IDs, not offsets
ocxl: Allow external drivers to use OpenCAPI contexts
ocxl: Create a clear delineation between ocxl backend & frontend
ocxl: Don't pass pci_dev around
ocxl: Split pci.c
ocxl: Remove some unused exported symbols
ocxl: Remove superfluous 'extern' from headers
ocxl: read_pasid never returns an error, so make it void
...
Pull speculation mitigation update from Ingo Molnar:
"This adds the "mitigations=" bootline option, which offers a
cross-arch set of options that will work on x86, PowerPC and s390 that
will map to the arch specific option internally"
* 'core-speculation-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
s390/speculation: Support 'mitigations=' cmdline option
powerpc/speculation: Support 'mitigations=' cmdline option
x86/speculation: Support 'mitigations=' cmdline option
cpu/speculation: Add 'mitigations=' cmdline option
On TOD/TB errors timebase register stops/freezes until HMI error recovery
gets TOD/TB back into running state. On successful recovery, TB starts
running again and udelay() that relies on TB value continues to function
properly. But in case when HMI fails to recover from TOD/TB errors, the
TB register stay freezed. With TB not running the __delay() function
keeps looping and never return. If __delay() is called while in panic
path then system hangs and never reboots after panic.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since the enabling and disabling of IRQs within preempt_schedule_irq()
is contained in a need_resched() loop, we don't need the outer arch
code loop.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[mpe: Rebase since CURRENT_THREAD_INFO() removal]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PROM_SCRATCH_SIZE is same as sizeof(prom_scratch)
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This can be helpful for debugging problems with the security feature
flags, especially on guests where the flags come from the hypervisor
via an hcall and so can't be observed in the device tree.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
"Reconciling" in terms of interrupt handling, is to bring the soft irq
mask state in to synch with the hardware, after an interrupt causes
MSR[EE] to be cleared (while the soft mask may be enabled, and hard
irqs not marked disabled).
General kernel code should not be called while unreconciled, because
local_irq_disable, etc. manipulations can cause surprising irq traces,
and it's fragile because the soft irq code does not really expect to
be called in this situation.
When exiting from an interrupt, MSR[EE] is cleared to prevent races,
but soft irq state is enabled for the returned-to context, so this is
now an unreconciled state. restore_math is called in this state, and
that can be ftraced, and the ftrace subsystem disables local irqs.
Mark restore_math and its callees as notrace. Restore a sanity check
in the soft irq code that had to be disabled for this case, by commit
4da1f79227 ("powerpc/64: Disable irq restore warning for now").
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch drops__irq_offset_value which has not been used since
commit 9c4cb82515 ("powerpc: Remove use of CONFIG_PPC_MERGE")
This removes a sparse warning.
Fixes: 9c4cb82515 ("powerpc: Remove use of CONFIG_PPC_MERGE")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Compared to ifdefs, IS_ENABLED() provide a cleaner code and allows
to detect compilation failure regardless of the selected options.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use cpu_has_feature() instead of opencoding
Use IS_ENABLED() instead of #ifdef for CONFIG_TAU_AVERAGE
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CPU_FTR_ALTIVEC is only set when CONFIG_ALTIVEC is selected, so
the ifdef is unnecessary.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To avoid ifdefs, define a empty static inline mm_iommu_init() function
when CONFIG_SPAPR_TCE_IOMMU is not selected.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To avoid #ifdefs, define an static inline fadump_cleanup() function
when CONFIG_FADUMP is not selected
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
No need to add dummy frames when calling trace_hardirqs_on or
trace_hardirqs_off. GCC properly handles empty stacks.
In addition, powerpc doesn't set CONFIG_FRAME_POINTER, therefore
__builtin_return_address(1..) returns NULL at all time. So the
dummy frames are definitely unneeded here.
In the meantime, avoid reading memory for loading r1 with a value
we already know.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As syscalls are now handled via a fast entry path, syscall related
actions can be removed from the generic transfer_to_handler path.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements a fast entry for syscalls.
Syscalls don't have to preserve non volatile registers except LR.
This patch then implement a fast entry for syscalls, where
volatile registers get clobbered.
As this entry is dedicated to syscall it always sets MSR_EE
and warns in case MSR_EE was previously off
It also assumes that the call is always from user, system calls are
unexpected from kernel.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements a fast entry for syscalls.
Syscalls don't have to preserve non volatile registers except LR.
This patch then implement a fast entry for syscalls, where
volatile registers get clobbered.
As this entry is dedicated to syscall it always sets MSR_EE
and warns in case MSR_EE was previously off
It also assumes that the call is always from user, system calls are
unexpected from kernel.
The overall series improves null_syscall selftest by 12,5% on an 83xx
and by 17% on a 8xx.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[text mostly copied from benh's RFC/WIP]
ppc32 are still doing something rather gothic and wrong on 32-bit
which we stopped doing on 64-bit a while ago.
We have that thing where some handlers "copy" the EE value from the
original stack frame into the new MSR before transferring to the
handler.
Thus for a number of exceptions, we enter the handlers with interrupts
enabled.
This is rather fishy, some of the stuff that handlers might do early
on such as irq_enter/exit or user_exit, context tracking, etc...
should be run with interrupts off afaik.
Generally our handlers know when to re-enable interrupts if needed.
The problem we were having is that we assumed these interrupts would
return with interrupts enabled. However that isn't the case.
Instead, this patch changes things so that we always enter exception
handlers with interrupts *off* with the notable exception of syscalls
which are special (and get a fast path).
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXC_XFER_TEMPLATE() is not called with COPY_EE anymore so
we can get rid of copyee parameters and related COPY_EE and NOCOPY
macros.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[splited out from benh RFC patch]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All exceptions handlers know when to reenable interrupts, so
it is safer to enter all of them with MSR_EE unset, except
for syscalls.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[splited out from benh RFC patch]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
syscalls are expected to be entered with MSR_EE set. Lets
make it inconditional by forcing MSR_EE on syscalls.
This patch adds EXC_XFER_SYS for that.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[splited out from benh RFC patch]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
SPEFloatingPointException() is the only exception handler which 'forgets' to
re-enable interrupts. This patch makes sure it does.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Refactor exception entry macros by using the ones defined in head_32.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch splits NORMAL_EXCEPTION_PROLOG in the same way as in
head_8xx.S and head_32.S and renames it EXCEPTION_PROLOG() as well
to match head_32.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unlike said in the comment, r1 is not reused by the critical
exception handler, as it uses a dedicated critirq_ctx stack.
Decrementing r1 early is then unneeded.
Should the above be valid, the code is crap buggy anyway as
r1 gets some intermediate values that would jeopardise the
whole process (for instance after mfspr r1,SPRN_SPRG_THREAD)
Using SPRN_SPRG_SCRATCH2 to save r1 is then not needed, r11 can be
used instead. This avoids one mtspr and one mfspr and makes the
prolog closer to what's done on 6xx and 8xx.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
6xx/8xx EXC_XFER_TEMPLATE() macro adds a i##n symbol which is
unused and can be removed.
40x and booke EXC_XFER_TEMPLATE() macros takes msr from the caller
while the 6xx/8xx version uses only MSR_KERNEL as msr value.
This patch modifies the 6xx/8xx version to make it similar to the
40x and booke versions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As preparation for using head_32.h for head_40x.S, move
LOAD_MSR_KERNEL() there and use it to load r10 with MSR_KERNEL value.
In the mean time, this patch modifies it so that it takes into account
the size of the passed value to determine if 'li' can be used or if
'lis/ori' is needed instead of using the size of MSR_KERNEL. This is
done by using gas macro.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXCEPTION_PROLOG is similar in head_8xx.S and head_32.S
This patch creates head_32.h and moves EXCEPTION_PROLOG macro
into it. It also converts it from a GCC macro to a GAS macro
in order to ease refactorisation with 40x later, since
GAS macros allows the use of #ifdef/#else/#endif inside it.
And it also has the advantage of not requiring the uggly "; \"
at the end of each line.
This patch also moves EXCEPTION() and EXC_XFER_XXXX() macros which
are also similar while adding START_EXCEPTION() out of EXCEPTION().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reduce #ifdef mess by defining a helper to print
hash info at startup.
In the meantime, remove the display of hash table address
to reduce leak of non necessary information.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
KASAN requires early activation of hash table, before memblock()
functions are available.
This patch implements an early hash_table statically defined in
__initdata.
During early boot, a single page table is used.
For hash32, when doing the final init, one page table is allocated
for each PGD entry because of the _PAGE_HASHPTE flag which can't be
common to several virt pages. This is done after memblock get
available but before switching to the final hash table, otherwise
there are issues with TLB flushing due to the shared entries.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For KASAN, hash table handling will be activated early for
accessing to KASAN shadow areas.
In order to avoid any modification of the hash functions while
they are still used with the early hash table, the code patching
is moved out of MMU_init_hw() and put close to the big-bang switch
to the final hash table.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds KASAN support for PPC32. The following patch
will add an early activation of hash table for book3s. Until
then, a warning will be raised if trying to use KASAN on an
hash 6xx.
To support KASAN, this patch initialises that MMU mapings for
accessing to the KASAN shadow area defined in a previous patch.
An early mapping is set as soon as the kernel code has been
relocated at its definitive place.
Then the definitive mapping is set once paging is initialised.
For modules, the shadow area is allocated at module_alloc().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All files containing functions run before kasan_early_init() is called
must have KASAN instrumentation disabled.
For those file, branch profiling also have to be disabled otherwise
each if () generates a call to ftrace_likely_update().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit 400c47d81c ("powerpc32: memset: only use dcbz once cache is
enabled"), memset() can be used before activation of the cache,
so no need to use memset_io() for zeroing the BSS.
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In kernel/cputable.c, explicitly use memcpy() instead of *y = *x;
This will allow GCC to replace it with __memcpy() when KASAN is
selected.
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When KASAN is active, the string functions in lib/ are doing the
KASAN checks. This is too early for prom_init.
This patch implements dedicated string functions for prom_init,
which will be compiled in with KASAN disabled.
Size of prom_init before the patch:
text data bss dec hex filename
12060 488 6960 19508 4c34 arch/powerpc/kernel/prom_init.o
Size of prom_init after the patch:
text data bss dec hex filename
12460 488 6960 19908 4dc4 arch/powerpc/kernel/prom_init.o
This increases the size of prom_init a bit, but as prom_init is
in __init section, it is freed after boot anyway.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes CONFIG_CMDLINE defined at all time. It avoids
having to enclose related code inside #ifdef CONFIG_CMDLINE
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CONFIG_KASAN implements wrappers for memcpy() memmove() and memset()
Those wrappers are doing the verification then call respectively
__memcpy() __memmove() and __memset(). The arches are therefore
expected to rename their optimised functions that way.
For files on which KASAN is inhibited, #defines are used to allow
them to directly call optimised versions of the functions without
going through the KASAN wrappers.
See commit 393f203f5f ("x86_64: kasan: add interceptors for
memset/memmove/memcpy functions") for details.
Other string / mem functions do not (yet) have kasan wrappers,
we therefore have to fallback to the generic versions when
KASAN is active, otherwise KASAN checks will be skipped.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Fixups to keep selftests working]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In preparation of KASAN, move early_init() into a separate
file in order to allow deactivation of KASAN for that function.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
No need to have this in asm/page.h, move it into asm/hugetlb.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 67fda38f0d ("powerpc/mm: Move slb_addr_linit to
early_init_mmu") moved slb_addr_limit init out of setup_arch().
Commit 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t
for radix") brought it back into setup_arch() by error.
This patch reverts that erroneous regress.
Fixes: 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a kernel crash that happens if rt_sigreturn() is called inside
a transactional block.
This crash happens if the kernel hits an in-kernel page fault when
accessing userspace memory, usually through copy_ckvsx_to_user(). A
major page fault calls might_sleep() function, which can cause a task
reschedule. A task reschedule (switch_to()) reclaim and recheckpoint
the TM states, but, in the signal return path, the checkpointed memory
was already reclaimed, thus the exception stack has MSR that points to
MSR[TS]=0.
When the code returns from might_sleep() and a task reschedule
happened, then this task is returned with the memory recheckpointed,
and CPU MSR[TS] = suspended.
This means that there is a side effect at might_sleep() if it is
called with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0.
This side effect can cause a TM bad thing, since at the exception
entrance, the stack saves MSR[TS]=0, and this is what will be used at
RFID, but, the processor has MSR[TS] = Suspended, and this transition
will be invalid and a TM Bad thing will be raised, causing the
following crash:
Unexpected TM Bad Thing exception at c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=800000010280b033
cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70]
pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc
lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0
sp: c0000004263ebc40
msr: 8000000302a03031
current = 0xc000000415050300
paca = 0xc00000003ffc4080 irqmask: 0x03 irq_happened: 0x01
pid = 25006, comm = sigfuz
Linux version 5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019
WARNING: exception is not recoverable, can't continue
enter ? for help
[c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable)
[c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430
[c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74
--- Exception: c00 (System Call) at 00007fffbaac400c
SP (7fffeca90f40) is in userspace
The solution for this problem is running the sigreturn code with
regs->msr[TS] disabled, thus, avoiding hitting the side effect above.
This does not seem to be a problem since regs->msr will be replaced by
the ucontext value, so, it is being flushed already. In this case, it
is flushed earlier.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Print more information about MCE error whether it is an hardware or
software error.
Some of the MCE errors can be easily categorized as hardware or
software errors e.g. UEs are due to hardware error, where as error
triggered due to invalid usage of tlbie is a pure software bug. But
not all the MCE errors can be easily categorize into either software
or hardware. There are errors like multihit errors which are usually
result of a software bug, but in some rare cases a hardware failure
can cause a multihit error. In past, we have seen case where after
replacing faulty chip, multihit errors stopped occurring. Same with
parity errors, which are usually due to faulty hardware but there are
chances where multihit can also cause an parity error. Such errors are
difficult to determine what really caused it. Hence this patch
classifies MCE errors into following four categorize:
1. Hardware error:
UE and Link timeout failure errors.
2. Probable hardware error (some chance of software cause)
SLB/ERAT/TLB Parity errors.
3. Software error
Invalid tlbie form.
4. Probable software error (some chance of hardware cause)
SLB/ERAT/TLB Multihit errors.
Sample output:
MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR: 000001001b6e0320 [Recovered]
MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [00007fffa309dc60]
MCE: CPU80: Probable Software error (some chance of hardware cause)
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently all machine check errors are printed as severe errors which
isn't correct. Print soft errors as warning instead of severe errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When analysing sources of OS jitter, I noticed that doorbells cannot be
traced.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit 2bf1071a8d ("powerpc/64s: Remove POWER9 DD1 support") the
function __switch_to remove usage for 'dummy_copy_buffer'. Since it is
not used anywhere else, remove it completely.
This remove the following warning:
arch/powerpc/kernel/process.c:1156:17: error: 'dummy_copy_buffer' defined but not used
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently error return from kobject_init_and_add() is not followed by
a call to kobject_put(). This means there is a memory leak.
Add call to kobject_put() in error path of kobject_init_and_add().
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Towards the goal of removing cc-ldoption, it seems that --hash-style=
was added to binutils 2.17.50.0.2 in 2006. The minimal required
version of binutils for the kernel according to
Documentation/process/changes.rst is 2.20.
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code.
Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.
The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.
This is not a strict translation to C code, there are some
significant differences:
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them itself.
- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
or change MSR is restored, because it's now simple to do. ESL=1
sleeps that do not lose GPRs can use this optimization too.
- KVM secondary entry and cede is now more of a call/return style
rather than branchy. nap_state_lost is not required because KVM
always returns via NVGPR restoring path.
- KVM secondary wakeup from offline sequence is moved entirely into
the offline wakeup, which avoids a hwsync in the normal idle wakeup
path.
Performance measured with context switch ping-pong on different
threads or cores, is possibly improved a small amount, 1-3% depending
on stop state and core vs thread test for shallow states. Deep states
it's in the noise compared with other latencies.
KVM improvements:
- Idle sleepers now always return to caller rather than branch out
to KVM first.
- This allows optimisations like very fast return to caller when no
state has been lost.
- KVM no longer requires nap_state_lost because it controls NVGPR
save/restore itself on the way in and out.
- The heavy idle wakeup KVM request check can be moved out of the
normal host idle code and into the not-performance-critical offline
code.
- KVM nap code now returns from where it is called, which makes the
flow a bit easier to follow.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.
Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reported-by: Ravikumar Bangoria <ravi.bangoria@in.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reported-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This comes a bit late, but should be in 5.1 anyway: we want the newly
added system calls to be synchronized across all architectures in
the release.
I hope that in the future, any newly added system calls can be added
to all architectures at the same time, and tested there while they
are in linux-next, avoiding dependencies between the architecture
maintainer trees and the tree that contains the new system call.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcv2aZAAoJEGCrR//JCVIncu4QALpTBqbjSu9u1/nXRGMLWo9J
uToSBohDvsKW7wMkHcr1dU75ERIX9gqIY5pJWDrwzBdGDt02/oiy6WofXZDv4WkR
Sp4YncdTeZENi0nNN+mrGDzNrcvBJd0FRc1MSLgPzfKXgf8P1oRzEsOaJVlGY5hS
A8rNNUYE37m6rhTS59tNxzGvQcq3J7Q9ZRc0xjbSqIFngYVfQQiVbQCqd8RI6s9W
+Hek+e5VF5HQnzhmTT9MQM4TsxMRMNfzrYpjhhayuLJ3CHROJPX7x9pZEGdyusQS
5rDZxKes9SKTFS9QqycSyJkoP0awxrVrjqD1zFkWOJht0c3UCQAmw6GD7rlJkGPB
vofuzmPzMq5XaZ8vpTucWNL+0ymzRXhhQ6esV39vRwxztRc4/DCy5MHDnrPK5yXb
olPbltMAlHMaY5KePI/3jwpkcmzZjz9SNOKQ9/9tFlB5+RVF2qQdUgRMPE+XYa4H
pRrZChrEAf6ZjINGeLlIVtpTlBFPl1LRF7UkOy7TYBvtRqukduXYpOFPb1XspQUl
flIdBLOY3iF33o0eQnz10BEMxlblFhTj0SQrt0684kili7TjsWDaT+hPZSd72hhi
Wey9l39kaexV2Sh7XZ6oUe205ay3R8sTn0Ic2+CnZaboeOuYlLYc8/w2HkTeTYmu
9f3HAlX4Qu6RuX8bxLO0
=Y7Kd
-----END PGP SIGNATURE-----
Merge tag 'syscalls-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull syscall numbering updates from Arnd Bergmann:
"arch: add pidfd and io_uring syscalls everywhere
This comes a bit late, but should be in 5.1 anyway: we want the newly
added system calls to be synchronized across all architectures in the
release.
I hope that in the future, any newly added system calls can be added
to all architectures at the same time, and tested there while they are
in linux-next, avoiding dependencies between the architecture
maintainer trees and the tree that contains the new system call"
* tag 'syscalls-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
arch: add pidfd and io_uring syscalls everywhere
This helps in debugging. We can look at the dmesg to find out
different kernel mapping details.
On 4K config this shows
kernel vmalloc start = 0xc000100000000000
kernel IO start = 0xc000200000000000
kernel vmemmap start = 0xc000300000000000
On 64K config:
kernel vmalloc start = 0xc008000000000000
kernel IO start = 0xc00a000000000000
kernel vmemmap start = 0xc00c000000000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, our mm_context_t on book3s64 include all hash specific
context details like slice mask and subpage protection details. We
can skip allocating these with radix translation. This will help us to save
8K per mm_context with radix translation.
With the patch applied we have
sizeof(mm_context_t) = 136
sizeof(struct hash_mm_context) = 8288
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Avoid #ifdef in generic code. Also enables us to do this specific to
MMU translation mode on book3s64
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We want to switch to allocating them runtime only when hash translation is
enabled. Add helpers so that both book3s and nohash can be adapted to
upcoming change easily.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements Kernel Userspace Access Protection for
book3s/32.
Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.
The previous patch modifies the page protection so that RW user
pages are RW for Key 0 and RO for Key 1, and it sets Key 0 for
both user and kernel.
This patch changes userspace segment registers are set to Ku 0
and Ks 1. When kernel needs to write to RW pages, the associated
segment register is then changed to Ks 0 in order to allow write
access to the kernel.
In order to avoid having the read all segment registers when
locking/unlocking the access, some data is kept in the thread_struct
and saved on stack on exceptions. The field identifies both the
first unlocked segment and the first segment following the last
unlocked one. When no segment is unlocked, it contains value 0.
As the hash_page() function is not able to easily determine if a
protfault is due to a bad kernel access to userspace, protfaults
need to be handled by handle_page_fault when KUAP is set.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h,
and adapt allow_user_access() to do nothing when to == NULL]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch prepares Kernel Userspace Access Protection for
book3s/32.
Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.
book3s/32 provides the following values for PP bits:
PP00 provides RW for Key 0 and NA for Key 1
PP01 provides RW for Key 0 and RO for Key 1
PP10 provides RW for all
PP11 provides RO for all
Today PP10 is used for RW pages and PP11 for RO pages, and user
segment register's Kp and Ks are set to 1. This patch modifies
page protection to use PP01 for RW pages and sets user segment
registers to Kp 0 and Ks 0.
This will allow to setup Userspace write access protection by
settng Ks to 1 in the following patch.
Kernel space segment registers remain unchanged.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To implement Kernel Userspace Execution Prevention, this patch
sets NX bit on all user segments on kernel entry and clears NX bit
on all user segments on kernel exit.
Note that powerpc 601 doesn't have the NX bit, so KUEP will not
work on it. A warning is displayed at startup.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds ASM macros for saving, restoring and checking
the KUAP state, and modifies setup_32 to call them on exceptions
from kernel.
The macros are defined as empty by default for when CONFIG_PPC_KUAP
is not selected and/or for platforms which don't handle (yet) KUAP.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
syscalls are from user only, so we can account time without checking
whether returning to kernel or user as it will only be user.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Kernel Userspace Access Prevention utilises a feature of the Radix MMU
which disallows read and write access to userspace addresses. By
utilising this, the kernel is prevented from accessing user data from
outside of trusted paths that perform proper safety checks, such as
copy_{to/from}_user() and friends.
Userspace access is disabled from early boot and is only enabled when
performing an operation like copy_{to/from}_user(). The register that
controls this (AMR) does not prevent userspace from accessing itself,
so there is no need to save and restore when entering and exiting
userspace.
When entering the kernel from the kernel we save AMR and if it is not
blocking user access (because eg. we faulted doing a user access) we
reblock user access for the duration of the exception (ie. the page
fault) and then restore the AMR when returning back to the kernel.
This feature can be tested by using the lkdtm driver (CONFIG_LKDTM=y)
and performing the following:
# (echo ACCESS_USERSPACE) > [debugfs]/provoke-crash/DIRECT
If enabled, this should send SIGSEGV to the thread.
We also add paranoid checking of AMR in switch and syscall return
under CONFIG_PPC_KUAP_DEBUG.
Co-authored-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some platforms (i.e. Radix MMU) need per-CPU initialisation for KUP.
Any platforms that only want to do KUP initialisation once
globally can just check to see if they're running on the boot CPU, or
check if whatever setup they need has already been performed.
Note that this is only for 64-bit.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements a framework for Kernel Userspace Access
Protection.
Then subarches will have the possibility to provide their own
implementation by providing setup_kuap() and
allow/prevent_user_access().
Some platforms will need to know the area accessed and whether it is
accessed from read, write or both. Therefore source, destination and
size and handed over to the two functions.
mpe: Rename to allow/prevent rather than unlock/lock, and add
read/write wrappers. Drop the 32-bit code for now until we have an
implementation for it. Add kuap to pt_regs for 64-bit as well as
32-bit. Don't split strings, use pr_crit_ratelimited().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds a skeleton for Kernel Userspace Protection
functionnalities like Kernel Userspace Access Protection and Kernel
Userspace Execution Prevention
The subsequent implementation of KUAP for radix makes use of a MMU
feature in order to patch out assembly when KUAP is disabled or
unsupported. This won't work unless there's an entry point for KUP
support before the feature magic happens, so for PPC64 setup_kup() is
called early in setup.
On PPC32, feature_fixup() is done too early to allow the same.
Suggested-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to implement KUAP (Kernel Userspace Access Protection) on
Power9 we will be using the AMR, and therefore indirectly the
UAMOR/AMOR.
So save/restore these regs in the idle code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Without restoring the IAMR after idle, execution prevention on POWER9
with Radix MMU is overwritten and the kernel can freely execute
userspace without faulting.
This is necessary when returning from any stop state that modifies
user state, as well as hypervisor state.
To test how this fails without this patch, load the lkdtm driver and
do the following:
$ echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
which won't fault, then boot the kernel with powersave=off, where it
will fault. Applying this patch will fix this.
Fixes: 3b10d0095a ("powerpc/mm/radix: Prevent kernel execution of user space")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a flag so that the DAWR can be enabled on P9 via:
echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
The DAWR was previously force disabled on POWER9 in:
9654153158 powerpc: Disable DAWR in the base POWER9 CPU features
Also see Documentation/powerpc/DAWR-POWER9.txt
This is a dangerous setting, USE AT YOUR OWN RISK.
Some users may not care about a bad user crashing their box
(ie. single user/desktop systems) and really want the DAWR. This
allows them to force enable DAWR.
This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine once again safe from crashing.
Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.
For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.
To double check the DAWR is working, run this kernel selftest:
tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add support to hwpoison the pages upon hitting machine check
exception.
This patch queues the address where UE is hit to percpu array
and schedules work to plumb it into memory poison infrastructure.
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
[mpe: Combine #ifdefs, drop PPC_BIT8(), and empty inline stub]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With STRICT_KERNEL_RWX enabled anything marked __init is placed at a 16M
boundary. This is necessary so that it can be repurposed later with
different permissions. However, in kernels with text larger than 16M,
this pushes early_setup past 32M, incapable of being reached by the
branch instruction.
Fix this by setting the CTR and branching there instead.
Fixes: 1e0fc9d1eb ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
Signed-off-by: Russell Currey <ruscur@russell.cc>
[mpe: Fix it to work on BE by using DOTSYM()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add the io_uring and pidfd_send_signal system calls to all architectures.
These system calls are designed to handle both native and compat tasks,
so all entries are the same across architectures, only arm-compat and
the generic tale still use an old format.
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> (s390)
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
A minor build fix for 64-bit FLATMEM configs.
A fix for a boot failure on 32-bit powermacs.
My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on 64-bit
kernels, ie. compat mode, which is only used on big endian.
The rewrite of the SLB code we merged in 4.20 missed the fact that the 0x380
exception is also used with the Radix MMU to report out of range accesses. This
could lead to an oops if userspace tried to read from addresses outside the user
or kernel range.
Thanks to:
Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcsVzhAAoJEFHr6jzI4aWAJuAP/2oLukNIIiF2UW/18xIXfvxR
ZA9JljVqcKUHEUR4W+Y673xL4ZKtGGF79P+bzSvh8fUTMJ9cIN9mLO7eGGoDNqTn
XhZX/jxJOh34tbHPYYbi9kYqWpZQKN4WuCjMQSPBCHOHMdx/0yn0wKgriOW1cuzG
AQqDRHcRX4h1QT9o/hnsCAsdcnLEntdBBCTTHL1dZ8BucuUopjL+7cV0wf4qFIui
e9SXOEl7yV03JGurmWcipE4mj9SrUioZJyHg6rJs70tlCUHFM24LQEFNIM4WczuF
GoPfzXi5nNPrOzC3aF/v77hT5t4zD2sPRV2DuKABGsS+gfPoK8sIZC3mo7Vk5y+j
gsbmkQSZt8/wVhRuAA0m0N6Aqg1J8NjhxoDfyM8kj0FzPe75D662VIgGSx15oMkl
3olt/9uDyPetxuZ7tmmnFC8wkcmyaGpVurVz9xnqpt6c2r0KI+16R6Mk4OiT/e2p
KNVBFkqRTp23ETpI8J9HUk9OtFIHqE9Zwzk2YOrX5yuLHByEwMq1T4qn2RuQsJqx
RWPJagSalGLmM6dqDGe08gQl9rovkYKleGxNIAJuJB9rIxZQke86d2+S0eSUQbAW
WWhP8SU0LJ5gmhEeZi5MntcuG+gcENwkz2UBK5nVDBVLFxGuBTPQATavW+w1bSi+
SSEMXx8dNAOvsrqrZ97I
=pfZc
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A minor build fix for 64-bit FLATMEM configs.
A fix for a boot failure on 32-bit powermacs.
My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on
64-bit kernels, ie. compat mode, which is only used on big endian.
The rewrite of the SLB code we merged in 4.20 missed the fact that the
0x380 exception is also used with the Radix MMU to report out of range
accesses. This could lead to an oops if userspace tried to read from
addresses outside the user or kernel range.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas
Piggin"
* tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
powerpc/64s/radix: Fix radix segment exception handling
powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
powerpc/32: Fix early boot failure with RTAS built-in
Commit 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
broke the radix-mode segment exception handler. In radix mode, this is
exception is not an SLB miss, rather it signals that the EA is outside
the range translated by any page table.
The commit lost the radix feature alternate code patch, which can
cause faults to some EAs to kernel BUG at arch/powerpc/mm/slb.c:639!
The original radix code would send faults to slb_miss_large_addr,
which would end up faulting due to slb_addr_limit being 0. This patch
sends radix directly to do_bad_slb_fault, which is a bit clearer.
Fixes: 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
Cc: stable@vger.kernel.org # v4.20+
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit b5b4453e79 ("powerpc/vdso64: Fix CLOCK_MONOTONIC
inconsistencies across Y2038") changed the type of wtom_clock_sec
to s64 on PPC64. Therefore, VDSO32 needs to read it with a 4 bytes
shift in order to retrieve the lower part of it.
Fixes: b5b4453e79 ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 2d4f567103 ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then free the rest of unused spaces
back to the page allocator.
kernel_init
kvm_guest_init
kvm_free_tmp
free_reserved_area
free_unref_page
free_unref_page_prepare
With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel. As the
result, kmemleak scan will trigger a panic when it scans the .bss
section with unmapped pages.
This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.
Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Qian Cai <cai@lca.pw>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0df977eafc ("powerpc/6xx: Don't use SPRN_SPRG2 for storing
stack pointer while in RTAS") changes the code to use a field in
thread struct to store the stack pointer while in RTAS instead of
using SPRN_SPRG2. It therefore converts all places which were
manipulating SPRN_SPRG2 to use that field. During early startup, the
zeroing of SPRN_SPRG2 has been replaced by a zeroing of that field in
thread struct. But at least in start_here, that's done wrongly because
it used the physical address of the fields while MMU is on at that
time.
So the virtual address of the field should be used instead, but in
the meantime, thread struct has already been zeroed and initialised
so we can just drop this initialisation.
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Fixes: 0df977eafc ("powerpc/6xx: Don't use SPRN_SPRG2 for storing stack pointer while in RTAS")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When I updated the spectre_v2 reporting to handle software count cache
flush I got the logic wrong when there's no software count cache
enabled at all.
The result is that on systems with the software count cache flush
disabled we print:
Mitigation: Indirect branch cache disabled, Software count cache flush
Which correctly indicates that the count cache is disabled, but
incorrectly says the software count cache flush is enabled.
The root of the problem is that we are trying to handle all
combinations of options. But we know now that we only expect to see
the software count cache flush enabled if the other options are false.
So split the two cases, which simplifies the logic and fixes the bug.
We were also missing a space before "(hardware accelerated)".
The result is we see one of:
Mitigation: Indirect branch serialisation (kernel only)
Mitigation: Indirect branch cache disabled
Mitigation: Software count cache flush
Mitigation: Software count cache flush (hardware accelerated)
Fixes: ee13cb249f ("powerpc/64s: Add support for software count cache flush")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Not only the 603 but all 6xx need SPRN_SPRG_PGDIR to be initialised at
startup. This patch move it from __setup_cpu_603() to start_here()
and __secondary_start(), close to the initialisation of SPRN_THREAD.
Previously, virt addr of PGDIR was retrieved from thread struct.
Now that it is the phys addr which is stored in SPRN_SPRG_PGDIR,
hash_page() shall not convert it to phys anymore.
This patch removes the conversion.
Fixes: 93c4a162b0 ("powerpc/6xx: Store PGDIR physical address in a SPRG")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jakub Drnec reported:
Setting the realtime clock can sometimes make the monotonic clock go
back by over a hundred years. Decreasing the realtime clock across
the y2k38 threshold is one reliable way to reproduce. Allegedly this
can also happen just by running ntpd, I have not managed to
reproduce that other than booting with rtc at >2038 and then running
ntp. When this happens, anything with timers (e.g. openjdk) breaks
rather badly.
And included a test case (slightly edited for brevity):
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
long get_time(void) {
struct timespec tp;
clock_gettime(CLOCK_MONOTONIC, &tp);
return tp.tv_sec + tp.tv_nsec / 1000000000;
}
int main(void) {
long last = get_time();
while(1) {
long now = get_time();
if (now < last) {
printf("clock went backwards by %ld seconds!\n", last - now);
}
last = now;
sleep(1);
}
return 0;
}
Which when run concurrently with:
# date -s 2040-1-1
# date -s 2037-1-1
Will detect the clock going backward.
The root cause is that wtom_clock_sec in struct vdso_data is only a
32-bit signed value, even though we set its value to be equal to
tk->wall_to_monotonic.tv_sec which is 64-bits.
Because the monotonic clock starts at zero when the system boots the
wall_to_montonic.tv_sec offset is negative for current and future
dates. Currently on a freshly booted system the offset will be in the
vicinity of negative 1.5 billion seconds.
However if the wall clock is set past the Y2038 boundary, the offset
from wall to monotonic becomes less than negative 2^31, and no longer
fits in 32-bits. When that value is assigned to wtom_clock_sec it is
truncated and becomes positive, causing the VDSO assembly code to
calculate CLOCK_MONOTONIC incorrectly.
That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
it is not meant to do. Worse, if the time is then set back before the
Y2038 boundary CLOCK_MONOTONIC will jump backward.
We can fix it simply by storing the full 64-bit offset in the
vdso_data, and using that in the VDSO assembly code. We also shuffle
some of the fields in vdso_data to avoid creating a hole.
The original commit that added the CLOCK_MONOTONIC support to the VDSO
did actually use a 64-bit value for wtom_clock_sec, see commit
a7f290dad3 ("[PATCH] powerpc: Merge vdso's and add vdso support to
32 bits kernel") (Nov 2005). However just 3 days later it was
converted to 32-bits in commit 0c37ec2aa8 ("[PATCH] powerpc: vdso
fixes (take #2)"), and the bug has existed since then AFAICS.
Fixes: 0c37ec2aa8 ("[PATCH] powerpc: vdso fixes (take #2)")
Cc: stable@vger.kernel.org # v2.6.15+
Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
Reported-by: Jakub Drnec <jaydee@email.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
One fix to prevent runtime allocation of 16GB pages when running in a VM (as
opposed to bare metal), because it doesn't work.
A small fix to our recently added KCOV support to exempt some more code from
being instrumented.
Plus a few minor build fixes, a small dead code removal and a defconfig update.
Thanks to:
Alexey Kardashevskiy, Aneesh Kumar K.V, Christophe Leroy, Jason Yan, Joel
Stanley, Mahesh Salgaonkar, Mathieu Malaterre.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcjNHCAAoJEFHr6jzI4aWAJVAP/21RUgDvqAAW55jTwihH6Eit
q6l1mJ30zwARz+UYWssqMe7qIYmnjWDeapgpZncZE3P6f3VMmepJrr75zca0LJhC
ixWqNJOcQgUu9civDwwpaqKQvyY0CYCdF5mu1rA1RNZ2kTeuCMw7zYPPpM84UGkq
IPFe3EgWAOURFeaQUGpH16klJVbPISq/1RCtsAkR4QifD4auM+EDYq+ML69LInc4
m7mi2CpPQDGZyCepFL0zdfOI43zrtWerG0UwCxPbGPYzvT+T3mvxU2unV1NcYn6/
obNYB5V0OCz4gUiu7aLoHnYZx2zK8fi1lTjSrB7XhWdi4ftEfRP3TrUntHWo420n
FC3+ibbjS3Cr8y7eubXgEAAKh74M1xzBF2bdAEHQ/QmqHZLcG+mnUihOq/g8mCp1
LsTKvkzXilov752wKSwdjvSNbU29a2KRaXSXAEgWJvsAQbZAidGRzX7CA9XeHQPp
kRCWHTwzXM0E31oi5rGAk2F1l4EK12QLdk1m0DF96ZanX7xG/UK6MpDNut2y51Wr
KsWPYhUhI6pc9xt+Fts0zehDWAtfttn7RTvE+34dkaZURGl3rQkjsKt1lQ+scRYX
fuSAnpTinE46e6APezwjCELtHDAzOCZvOnh9RVPe+F//KEF8LcNQv6TLhQoukRAe
ldJEhSReJfo3/agqGJ6v
=6cp4
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix to prevent runtime allocation of 16GB pages when running in a
VM (as opposed to bare metal), because it doesn't work.
A small fix to our recently added KCOV support to exempt some more
code from being instrumented.
Plus a few minor build fixes, a small dead code removal and a
defconfig update.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Christophe Leroy,
Jason Yan, Joel Stanley, Mahesh Salgaonkar, Mathieu Malaterre"
* tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Include <asm/nmi.h> header file to fix a warning
powerpc/powernv: Fix compile without CONFIG_TRACEPOINTS
powerpc/mm: Disable kcov for SLB routines
powerpc: remove dead code in head_fsl_booke.S
powerpc/configs: Sync skiroot defconfig
powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
Make sure to include <asm/nmi.h> to provide the following prototype:
hv_nmi_check_nonrecoverable.
Remove the following warning treated as error (W=1):
arch/powerpc/kernel/traps.c:393:6: error: no previous prototype for 'hv_nmi_check_nonrecoverable'
Fixes: ccd477028a ("powerpc/64s: Fix HV NMI vs HV interrupt recoverability test")
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add check for the return value of memblock_alloc*() functions and call
panic() in case of error. The panic message repeats the one used by
panicing memblock allocators with adjustment of parameters to include
only relevant ones.
The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.
@@
expression ptr, size, align;
@@
ptr = memblock_alloc(size, align);
+ if (!ptr)
+ panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);
[anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
[rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
[rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
[akpm@linux-foundation.org: fix xtensa printk warning]
Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: Guo Ren <ren_guo@c-sky.com> [c-sky]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Reviewed-by: Juergen Gross <jgross@suse.com> [Xen]
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Max Filippov <jcmvbkbc@gmail.com> [xtensa]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memblock_alloc_base() function tries to allocate a memory up to the
limit specified by its max_addr parameter and panics if the allocation
fails. Replace its usage with memblock_phys_alloc_range() and make the
callers check the return value and panic in case of error.
Link: http://lkml.kernel.org/r/1548057848-15136-10-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com> [c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com> [Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since only the virtual address of allocated blocks is used, lets use
functions returning directly virtual address.
Those functions have the advantage of also zeroing the block.
[rppt@linux.ibm.com: powerpc: remove duplicated alloc_stack() function]
Link: http://lkml.kernel.org/r/20190226064032.GA5873@rapoport-lnx
[rppt@linux.ibm.com: updated error message in alloc_stack() to be more verbose]
[rppt@linux.ibm.com: convereted several additional call sites ]
Link: http://lkml.kernel.org/r/1548057848-15136-3-git-send-email-rppt@linux.ibm.com
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com> [c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com> [Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
- some of the rest of MM
- various misc things
- dynamic-debug updates
- checkpatch
- some epoll speedups
- autofs
- rapidio
- lib/, lib/lzo/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
samples/mic/mpssd/mpssd.h: remove duplicate header
kernel/fork.c: remove duplicated include
include/linux/relay.h: fix percpu annotation in struct rchan
arch/nios2/mm/fault.c: remove duplicate include
unicore32: stop printing the virtual memory layout
MAINTAINERS: fix GTA02 entry and mark as orphan
mm: create the new vm_fault_t type
arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
arch: simplify several early memory allocations
openrisc: simplify pte_alloc_one_kernel()
sh: prefer memblock APIs returning virtual address
microblaze: prefer memblock API returning virtual address
powerpc: prefer memblock APIs returning virtual address
lib/lzo: separate lzo-rle from lzo
lib/lzo: implement run-length encoding
lib/lzo: fast 8-byte copy on arm64
lib/lzo: 64-bit CTZ on arm64
lib/lzo: tidy-up ifdefs
ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
ipc: annotate implicit fall through
...
There are several early memory allocations in arch/ code that use
memblock_phys_alloc() to allocate memory, convert the returned physical
address to the virtual address and then set the allocated memory to
zero.
Exactly the same behaviour can be achieved simply by calling
memblock_alloc(): it allocates the memory in the same way as
memblock_phys_alloc(), then it performs the phys_to_virt() conversion
and clears the allocated memory.
Replace the longer sequence with a simpler call to memblock_alloc().
Link: http://lkml.kernel.org/r/1546248566-14910-6-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "memblock: simplify several early memory allocation", v4.
These patches simplify some of the early memory allocations by replacing
usage of older memblock APIs with newer and shinier ones.
Quite a few places in the arch/ code allocated memory using a memblock
API that returns a physical address of the allocated area, then
converted this physical address to a virtual one and then used memset(0)
to clear the allocated range.
More recent memblock APIs do all the three steps in one call and their
usage simplifies the code.
It's important to note that regardless of API used, the core allocation
is nearly identical for any set of memblock allocators: first it tries
to find a free memory with all the constraints specified by the caller
and then falls back to the allocation with some or all constraints
disabled.
The first three patches perform the conversion of call sites that have
exact requirements for the node and the possible memory range.
The fourth patch is a bit one-off as it simplifies openrisc's
implementation of pte_alloc_one_kernel(), and not only the memblock
usage.
The fifth patch takes care of simpler cases when the allocation can be
satisfied with a simple call to memblock_alloc().
The sixth patch removes one-liner wrappers for memblock_alloc on arm and
unicore32, as suggested by Christoph.
This patch (of 6):
There are a several places that allocate memory using memblock APIs that
return a physical address, convert the returned address to the virtual
address and frequently also memset(0) the allocated range.
Update these places to use memblock allocators already returning a
virtual address. Use memblock functions that clear the allocated memory
instead of calling memset(0) where appropriate.
The calls to memblock_alloc_base() that were not followed by memset(0)
are replaced with memblock_alloc_try_nid_raw(). Since the latter does
not panic() when the allocation fails, the appropriate panic() calls are
added to the call sites.
Link: http://lkml.kernel.org/r/1546248566-14910-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mark Salter <msalter@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Notable changes:
- Enable THREAD_INFO_IN_TASK to move thread_info off the stack.
- A big series from Christoph reworking our DMA code to use more of the generic
infrastructure, as he said:
"This series switches the powerpc port to use the generic swiotlb and
noncoherent dma ops, and to use more generic code for the coherent direct
mapping, as well as removing a lot of dead code."
- Increase our vmalloc space to 512T with the Hash MMU on modern CPUs, allowing
us to support machines with larger amounts of total RAM or distance between
nodes.
- Two series from Christophe, one to optimise TLB miss handlers on 6xx, and
another to optimise the way STRICT_KERNEL_RWX is implemented on some 32-bit
CPUs.
- Support for KCOV coverage instrumentation which means we can run syzkaller
and discover even more bugs in our code.
And as always many clean-ups, reworks and minor fixes etc.
Thanks to:
Alan Modra, Alexey Kardashevskiy, Alistair Popple, Andrea Arcangeli, Andrew
Donnellan, Aneesh Kumar K.V, Aravinda Prasad, Balbir Singh, Brajeswar Ghosh,
Breno Leitao, Christian Lamparter, Christian Zigotzky, Christophe Leroy,
Christoph Hellwig, Corentin Labbe, Daniel Axtens, David Gibson, Diana Craciun,
Firoz Khan, Gustavo A. R. Silva, Igor Stoppa, Joe Lawrence, Joel Stanley,
Jonathan Neuschäfer, Jordan Niethe, Laurent Dufour, Madhavan Srinivasan, Mahesh
Salgaonkar, Mark Cave-Ayland, Masahiro Yamada, Mathieu Malaterre, Matteo Croce,
Meelis Roos, Michael W. Bringmann, Nathan Chancellor, Nathan Fontenot, Nicholas
Piggin, Nick Desaulniers, Nicolai Stange, Oliver O'Halloran, Paul Mackerras,
Peter Xu, PrasannaKumar Muralidharan, Qian Cai, Rashmica Gupta, Reza Arbab,
Robert P. J. Day, Russell Currey, Sabyasachi Gupta, Sam Bobroff, Sandipan Das,
Sergey Senozhatsky, Souptick Joarder, Stewart Smith, Tyrel Datwyler, Vaibhav
Jain, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcgRJlAAoJEFHr6jzI4aWAL9oP+gPlrZgyaAg/51lmubLtlbtk
QuGU8EiuJZoJD1OHrMPtppBOY7rQZOxJe58AoPig8wTvs+j/TxJ25fmiZncnf5U2
PC8QAjbj0UmQHgy+K30sUeOnDg9tdkHKHJ5/ecjJcvykkqsjyMnV7biFQ1cOA0HT
LflXHEEtiG9P9u7jZoAhtnfpgn1/l9mhTYMe26J1fqvC0164qMDFaXDTQXyDfyvG
gmuqccGMawSk7IdagmQxwXtwyfwOnarmGn+n31XKRejApGZ/pjiEA23JOJOaJcia
m76Jy3roao6sEtCUNpBFXEtwOy9POy3OiGy6yg/9896tDMvG84OuO6ltV1nFGawL
PmwE+ug63L4g/HWxZyAeb26T2oTTp/YIaKQPtsq4d286pvg/qr2KPNzFoAEhmJqU
yLrebv276pVeiLpLmCLPvcPj9t76vWKZaUm0FoE+zUDg7Rl7Alow8A/c4tdjOI6y
QwpbCiYseyiJ32lCZZdbN7Cy6+iM6vb3i1oNKc8MVqhBGTwLJnTU0ruPBSvCaRvD
NoQWO1RWpNu/BuivuLEKS9q3AoxenGwiqowxGhdVmI3Oc9jGWcEYlduR00VDYPVp
/RCfwtTY5NyC++h5cnbz8aLJ1hBXG5m79CXfprV+zPWeiLPCaMT6w9Y5QUS2wqA+
EZ734NknDJOjaHc4cGdZ
=Z9bb
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Enable THREAD_INFO_IN_TASK to move thread_info off the stack.
- A big series from Christoph reworking our DMA code to use more of
the generic infrastructure, as he said:
"This series switches the powerpc port to use the generic swiotlb
and noncoherent dma ops, and to use more generic code for the
coherent direct mapping, as well as removing a lot of dead
code."
- Increase our vmalloc space to 512T with the Hash MMU on modern
CPUs, allowing us to support machines with larger amounts of total
RAM or distance between nodes.
- Two series from Christophe, one to optimise TLB miss handlers on
6xx, and another to optimise the way STRICT_KERNEL_RWX is
implemented on some 32-bit CPUs.
- Support for KCOV coverage instrumentation which means we can run
syzkaller and discover even more bugs in our code.
And as always many clean-ups, reworks and minor fixes etc.
Thanks to: Alan Modra, Alexey Kardashevskiy, Alistair Popple, Andrea
Arcangeli, Andrew Donnellan, Aneesh Kumar K.V, Aravinda Prasad, Balbir
Singh, Brajeswar Ghosh, Breno Leitao, Christian Lamparter, Christian
Zigotzky, Christophe Leroy, Christoph Hellwig, Corentin Labbe, Daniel
Axtens, David Gibson, Diana Craciun, Firoz Khan, Gustavo A. R. Silva,
Igor Stoppa, Joe Lawrence, Joel Stanley, Jonathan Neuschäfer, Jordan
Niethe, Laurent Dufour, Madhavan Srinivasan, Mahesh Salgaonkar, Mark
Cave-Ayland, Masahiro Yamada, Mathieu Malaterre, Matteo Croce, Meelis
Roos, Michael W. Bringmann, Nathan Chancellor, Nathan Fontenot,
Nicholas Piggin, Nick Desaulniers, Nicolai Stange, Oliver O'Halloran,
Paul Mackerras, Peter Xu, PrasannaKumar Muralidharan, Qian Cai,
Rashmica Gupta, Reza Arbab, Robert P. J. Day, Russell Currey,
Sabyasachi Gupta, Sam Bobroff, Sandipan Das, Sergey Senozhatsky,
Souptick Joarder, Stewart Smith, Tyrel Datwyler, Vaibhav Jain,
YueHaibing"
* tag 'powerpc-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits)
powerpc/32: Clear on-stack exception marker upon exception return
powerpc: Remove export of save_stack_trace_tsk_reliable()
powerpc/mm: fix "section_base" set but not used
powerpc/mm: Fix "sz" set but not used warning
powerpc/mm: Check secondary hash page table
powerpc: remove nargs from __SYSCALL
powerpc/64s: Fix unrelocated interrupt trampoline address test
powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
powerpc/fsl: Fix the flush of branch predictor.
powerpc/powernv: Make opal log only readable by root
powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C
powerpc/64s: Fix data interrupts vs d-side MCE reentrancy
powerpc/64s: Prepare to handle data interrupts vs d-side MCE reentrancy
powerpc/64s: system reset interrupt preserve HSRRs
powerpc/64s: Fix HV NMI vs HV interrupt recoverability test
powerpc/mm/hash: Handle mmap_min_addr correctly in get_unmapped_area topdown search
powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
selftests/powerpc: Remove duplicate header
powerpc sstep: Add support for modsd, modud instructions
...
Here is the big char/misc driver patch pull request for 5.1-rc1.
The largest thing by far is the new habanalabs driver for their AI
accelerator chip. For now it is in the drivers/misc directory but will
probably move to a new directory soon along with other drivers of this
type.
Other than that, just the usual set of individual driver updates and
fixes. There's an "odd" merge in here from the DRM tree that they asked
me to do as the MEI driver is starting to interact with the i915 driver,
and it needed some coordination. All of those patches have been
properly acked by the relevant subsystem maintainers.
All of these have been in linux-next with no reported issues, most for
quite some time.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXH+dPQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ym1fACgvpZAxjNzoRQJ6f06tc8ujtPk9rUAnR+tCtrZ
9e3l7H76oe33o96Qjhor
=8A2k
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big char/misc driver patch pull request for 5.1-rc1.
The largest thing by far is the new habanalabs driver for their AI
accelerator chip. For now it is in the drivers/misc directory but will
probably move to a new directory soon along with other drivers of this
type.
Other than that, just the usual set of individual driver updates and
fixes. There's an "odd" merge in here from the DRM tree that they
asked me to do as the MEI driver is starting to interact with the i915
driver, and it needed some coordination. All of those patches have
been properly acked by the relevant subsystem maintainers.
All of these have been in linux-next with no reported issues, most for
quite some time"
* tag 'char-misc-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (219 commits)
habanalabs: adjust Kconfig to fix build errors
habanalabs: use %px instead of %p in error print
habanalabs: use do_div for 64-bit divisions
intel_th: gth: Fix an off-by-one in output unassigning
habanalabs: fix little-endian<->cpu conversion warnings
habanalabs: use NULL to initialize array of pointers
habanalabs: fix little-endian<->cpu conversion warnings
habanalabs: soft-reset device if context-switch fails
habanalabs: print pointer using %p
habanalabs: fix memory leak with CBs with unaligned size
habanalabs: return correct error code on MMU mapping failure
habanalabs: add comments in uapi/misc/habanalabs.h
habanalabs: extend QMAN0 job timeout
habanalabs: set DMA0 completion to SOB 1007
habanalabs: fix validation of WREG32 to DMA completion
habanalabs: fix mmu cache registers init
habanalabs: disable CPU access on timeouts
habanalabs: add MMU DRAM default page mapping
habanalabs: Dissociate RAZWI info from event types
misc/habanalabs: adjust Kconfig to fix build errors
...
Merge misc updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
proc: more robust bulk read test
proc: test /proc/*/maps, smaps, smaps_rollup, statm
proc: use seq_puts() everywhere
proc: read kernel cpu stat pointer once
proc: remove unused argument in proc_pid_lookup()
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
fs/proc/self.c: code cleanup for proc_setup_self()
proc: return exit code 4 for skipped tests
mm,mremap: bail out earlier in mremap_to under map pressure
mm/sparse: fix a bad comparison
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
writeback: fix inode cgroup switching comment
mm/huge_memory.c: fix "orig_pud" set but not used
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
mm/memcontrol.c: fix bad line in comment
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm/compaction: pass pgdat to too_many_isolated() instead of zone
mm: remove zone_lru_lock() function, access ->lru_lock directly
...
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.
As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.
Link: http://lkml.kernel.org/r/20190114125903.24845-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.
1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"
This patch (of 2):
At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This code is dead. Just remove it.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As tglx points out, there are no in-tree module users of
save_stack_trace_tsk_reliable() and its x86 counterpart is not
exported, so remove the powerpc symbol export.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The __SYSCALL macro's arguments are system call number,
system call entry name and number of arguments for the
system call.
Argument- nargs in __SYSCALL(nr, entry, nargs) is neither
calculated nor used anywhere. So it would be better to
keep the implementaion as __SYSCALL(nr, entry). This will
unifies the implementation with some other architetures
too.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The recent commit got this test wrong, it declared the assembler
symbols the wrong way, and also used the wrong symbol name
(xxx_start rather than start_xxx, see asm/head-64.h).
Fixes: ccd477028a ("powerpc/64s: Fix HV NMI vs HV interrupt recoverability test")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The commit identified below adds MC_BTB_FLUSH macro only when
CONFIG_PPC_FSL_BOOK3E is defined. This results in the following error
on some configs (seen several times with kisskb randconfig_defconfig)
arch/powerpc/kernel/exceptions-64e.S:576: Error: Unrecognized opcode: `mc_btb_flush'
make[3]: *** [scripts/Makefile.build:367: arch/powerpc/kernel/exceptions-64e.o] Error 1
make[2]: *** [scripts/Makefile.build:492: arch/powerpc/kernel] Error 2
make[1]: *** [Makefile:1043: arch/powerpc] Error 2
make: *** [Makefile:152: sub-make] Error 2
This patch adds a blank definition of MC_BTB_FLUSH for other cases.
Fixes: 10c5e83afd ("powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)")
Cc: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Handlers for interrupts that set DAR / DSISR, set MSR[RI] before those
SPRs are read. If a d-side machine check hits in this window, DAR /
DSISR will be clobbered silently, leading to random corruption.
Fix this by having handlers save those registers before setting MSR[RI].
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A subsequent fix for data interrupts (those that set DAR / DSISR)
requires some interrupt macros to be open-coded, and also requires
the 0x300 interrupt handler to be moved out-of-line.
This patch does that without changing behaviour, which makes the later
fix a smaller change.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Code that uses HSRR registers is not required to clear MSR[RI] by
convention, however the system reset NMI itself may use HSRR
registers (e.g., to call OPAL) and clobber them.
Rather than introduce the requirement to clear RI in order to use
HSRRs, have system reset interrupt save and restore HSRRs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
HV interrupts that use HSRR registers do not enter with MSR[RI] clear,
but their entry code is not recoverable vs NMI, due to shared use of
HSPRG1 as a scratch register to save r13.
This means that a system reset or machine check that hits in HSRR
interrupt entry can cause r13 to be silently corrupted.
Fix this by marking NMIs non-recoverable if they land in HV interrupt
ranges.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some stack pointers used to also be thread_info pointers
and were called tp. Now that they are only stack pointers,
rename them sp.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that current_thread_info is located at the beginning of 'current'
task struct, CURRENT_THREAD_INFO macro is not really needed any more.
This patch replaces it by loads of the value at PACA_THREAD_INFO(r13).
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Add PACA_THREAD_INFO rather than using PACACURRENT]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that thread_info is similar to task_struct, its address is in r2
so CURRENT_THREAD_INFO() macro is useless. This patch removes it.
This patch also moves the 'tovirt(r2, r2)' down just before the
reactivation of MMU translation, so that we keep the physical address
of 'current' in r2 until then. It avoids a few calls to tophys().
At the same time, as the 'cpu' field is not anymore in thread_info,
TI_CPU is renamed TASK_CPU by this patch.
It also allows to get rid of a couple of
'#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE' as ACCOUNT_CPU_USER_ENTRY()
and ACCOUNT_CPU_USER_EXIT() are empty when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not defined.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Fix a missed conversion of TI_CPU idle_6xx.S]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The table of pointers 'current_set' has been used for retrieving
the stack and current. They used to be thread_info pointers as
they were pointing to the stack and current was taken from the
'task' field of the thread_info.
Now, the pointers of 'current_set' table are now both pointers
to task_struct and pointers to thread_info.
As they are used to get current, and the stack pointer is
retrieved from current's stack field, this patch changes
their type to task_struct, and renames secondary_ti to
secondary_current.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
thread_info is not anymore in the stack, so the entire stack
can now be used.
There is also no risk anymore of corrupting task_cpu(p) with a
stack overflow so the patch removes the test.
When doing this, an explicit test for NULL stack pointer is
needed in validate_sp() as it is not anymore implicitely covered
by the sizeof(thread_info) gap.
In the meantime, with the previous patch all pointers to the stacks
are not anymore pointers to thread_info so this patch changes them
to void*
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch activates CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.
Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are leaked,
making a number of attacks more difficult.
This has the following consequences:
- thread_info is now located at the beginning of task_struct.
- The 'cpu' field is now in task_struct, and only exists when
CONFIG_SMP is active.
- thread_info doesn't have anymore the 'task' field.
This patch:
- Removes all recopy of thread_info struct when the stack changes.
- Changes the CURRENT_THREAD_INFO() macro to point to current.
- Selects CONFIG_THREAD_INFO_IN_TASK.
- Modifies raw_smp_processor_id() to get ->cpu from current without
including linux/sched.h to avoid circular inclusion and without
including asm/asm-offsets.h to avoid symbol names duplication
between ASM constants and C constants.
- Modifies klp_init_thread_info() to take a task_struct pointer
argument.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add task_stack.h to livepatch.h to fix build fails]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Make sure CURRENT_THREAD_INFO() is used with r1 which is the virtual
address of the stack, in order to ease the switch to r2 when we enable
THREAD_INFO_IN_TASK, as we have no register having the phys address of
current.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rather than using the thread info use task_stack_page() to initialise
paca->kstack, that way it will work with THREAD_INFO_IN_TASK.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Update a few comments that talk about current_thread_info() in
preparation for THREAD_INFO_IN_TASK.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have a few places that use current_thread_info()->task to access
current. This won't work with THREAD_INFO_IN_TASK so fix them now.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A few places use CURRENT_THREAD_INFO, or the C version, to find the
stack. This will no longer work with THREAD_INFO_IN_TASK so change
them to find the stack in other ways.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The purpose of the pointer given to call_do_softirq() and
call_do_irq() is to point the new stack. Currently that's the same
thing as the thread_info, but won't be with THREAD_INFO_IN_TASK.
So change the parameter to void* and rename it 'sp'.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch renames THREAD_INFO to TASK_STACK, because it is in fact
the offset of the pointer to the stack in task_struct so this pointer
will not be impacted by the move of THREAD_INFO.
Also make it available on 64-bit, as we'll need it there when we
activate THREAD_INFO_IN_TASK.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make available on 64-bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[text copied from commit 9bbd4c56b0
("arm64: prep stack walkers for THREAD_INFO_IN_TASK")]
When CONFIG_THREAD_INFO_IN_TASK is selected, task stacks may be freed
before a task is destroyed. To account for this, the stacks are
refcounted, and when manipulating the stack of another task, it is
necessary to get/put the stack to ensure it isn't freed and/or re-used
while we do so.
This patch reworks the powerpc stack walking code to account for this.
When CONFIG_THREAD_INFO_IN_TASK is not selected these perform no
refcounting, and this should only be a structural change that does not
affect behaviour.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Move try_get_task_stack() below tsk == NULL check in show_stack()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When moving to CONFIG_THREAD_INFO_IN_TASK, the thread_info 'cpu' field
gets moved into task_struct and only defined when CONFIG_SMP is set.
This patch ensures that TI_CPU is only used when CONFIG_SMP is set and
that task_struct 'cpu' field is not used directly out of SMP code.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since only the virtual address of allocated blocks is used,
lets use functions returning directly virtual address.
Those functions have the advantage of also zeroing the block.
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In __secondary_start() we load the thread_info of the idle task of the
secondary CPU from current_set[cpu], and then convert it into a stack
pointer before storing that back to paca->kstack.
As pointed out in commit f761622e59 ("powerpc: Initialise
paca->kstack before early_setup_secondary") it's important that we
initialise paca->kstack before calling the MMU setup code, in
particular slb_initialize(), because it will bolt the SLB entry for
the kstack into the SLB.
However we have already setup paca->kstack in cpu_idle_thread_init(),
since commit 3b5750644b ("[POWERPC] Bolt in SLB entry for kernel
stack on secondary cpus") (May 2008).
It's also in cpu_idle_thread_init() that we initialise current_set[cpu]
with the thread_info pointer, so there is no issue of the timing being
different between the two.
Therefore the initialisation of paca->kstack in __setup_secondary() is
completely redundant, so remove it.
This has the added benefit of removing code that runs in real mode,
and is therefore restricted by the RMO, and so opens the way for us to
enable THREAD_INFO_IN_TASK.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently in system_call_exit() we have an optimisation where we
disable MSR_RI (recoverable interrupt) and MSR_EE (external interrupt
enable) in a single mtmsrd instruction.
Unfortunately this will no longer work with THREAD_INFO_IN_TASK,
because then the load of TI_FLAGS might fault and faulting with MSR_RI
clear is treated as an unrecoverable exception which leads to a
panic().
So change the code to only clear MSR_EE prior to loading TI_FLAGS,
leaving the clear of MSR_RI until later. We have some latitude in
where do the clear of MSR_RI. A bit of experimentation has shown that
this location gives the least slow down.
This still causes a noticeable slow down in our null_syscall
performance. On a Power9 DD2.2:
Before After Delta Delta %
955 cycles 999 cycles -44 -4.6%
On the plus side this does simplify the code somewhat, because we
don't have to reenable MSR_RI on the restore_math() or
syscall_exit_work() paths which was necessitated previously by the
optimisation.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kcov provides kernel coverage data that's useful for fuzzing tools like
syzkaller.
Wire up kcov support on powerpc. Disable kcov instrumentation on the same
files where we currently disable gcov and UBSan instrumentation, plus some
additional exclusions which appear necessary to boot on book3e machines.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Daniel Axtens <dja@axtens.net> # e6500
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On 8xx, large pages (512kb or 8M) are used to map kernel linear
memory. Aligning to 8M reduces TLB misses as only 8M pages are used
in that case. We make 8M the default for data.
This patchs allows the user to do it via Kconfig.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements handling of STRICT_KERNEL_RWX with
large TLBs directly in the TLB miss handlers.
To do so, etext and sinittext are aligned on 512kB boundaries
and the miss handlers use 512kB pages instead of 8Mb pages for
addresses close to the boundaries.
It sets RO PP flags for addresses under sinittext.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
setibat() and clearibat() allows to manipulate IBATs independently
of DBATs.
update_bats() allows to update bats after init. This is done
with MMU off.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CONFIG_STRICT_KERNEL_RWX requires a special alignment
for DATA for some subarches. Today it is just defined
as an #ifdef in vmlinux.lds.S
In order to get more flexibility, this patch moves the
definition of this alignment in Kconfig
On some subarches, CONFIG_STRICT_KERNEL_RWX will
require a special alignment of _etext.
This patch also adds a configuration item for it in Kconfig
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the time being, initial MMU setup allows 24 Mbytes
of DATA and 8 Mbytes of code.
Some debug setup like CONFIG_KASAN generate huge
kernels with text size over the 8M limit and data over the
24 Mbytes limit.
Here is an 8xx kernel compiled with CONFIG_KASAN_INLINE for
one of my boards:
[root@po16846vm linux-powerpc]# size -x vmlinux
text data bss dec hex filename
0x111019c 0x41b0d4 0x490de0 26984528 19bc050 vmlinux
This patch maps up to 32 Mbytes code based on _einittext symbol
and allows 32 Mbytes of memory instead of 24.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
40x/booke have another path to reach 3f from transfer_to_handler,
make sure it also calls ACCOUNT_CPU_USER_ENTRY() when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For pages without _PAGE_USER, PP field is 00
For pages with _PAGE_USER, PP field is 10 for RW and 11 for RO.
This patch sets _PAGE_USER to 0x002 and _PAGE_RW to 0x001
is order to simplify TLB handling by reducing amount of shifts.
The location of _PAGE_PRESENT and _PAGE_HASHPTE doesn't matter
as they are only SW related flags.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PAGE_ACCESSED is only needed for CONFIG_SWAP. When CONFIG_SWAP
is not set, just ignore it. If CONFIG_SWAP is set and PAGE_ACCESSED
is not, let's take a minor fault.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PP bits take user access into account, so no need to check _PAGE_USER
here. A DSI or ISI will be generated if needed.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PAGE_DIRTY corresponds to the C bit. If writing on
a page for which the C bit is not set, a DataStoreTLBMiss
is generated. No need to check it in DataLoadTLBMiss.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_PAGE_RW and _PAGE_DIRTY do not matter for ITLB misses.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
ITLB miss on kernel pages only occur with CONFIG_MODULES and
CONFIG_DEBUG_PAGEALLOC.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit c62ce9ef97 ("powerpc: remove remaining bits from
CONFIG_APUS"), tophys() has become a pure constant operation.
PAGE_OFFSET is known at compile time so the physical address
can be builtin directly.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use SPRN_SPRG2 to store the current thread PGDIR and
avoid reading thread_struct.pgdir at every TLB miss.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When calling RTAS, the stack pointer is stored in SPRN_SPRG2
in order to be able to restore it in case of machine check in RTAS.
As machine check is not a perfomance critical path, this patch
frees SPRN_SPRG2 by using a field in thread struct instead.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is no reason to re-read each time the pointer at
location 0xf0 as it is fixed and known.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Looks like book3s/32 doesn't set RI on machine check, so
checking RI before calling die() will always be fatal
allthought this is not an issue in most cases.
Fixes: b96672dd84 ("powerpc: Machine check interrupt is a non-maskable interrupt")
Fixes: daf00ae71d ("powerpc/traps: restore recoverability of machine_check interrupts")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
MSR[RI] has already been cleared a few lines above.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When no machine description matches, display it clearly
before looping forever.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In cpufeatures_process_feature(), if a provided CPU feature is unknown and
enable_unknown is false, we erroneously print that the feature is being
enabled and return true, even though no feature has been enabled, and
may also set feature bits based on the last entry in the match table.
Fix this so that we only set feature bits from the match table if we have
actually enabled a feature from that table, and when failing to enable an
unknown feature, always print the "not enabling" message and return false.
Coincidentally, some older gccs (<GCC 7), when invoked with
-fsanitize-coverage=trace-pc, cause a spurious uninitialised variable
warning in this function:
arch/powerpc/kernel/dt_cpu_ftrs.c: In function ‘cpufeatures_process_feature’:
arch/powerpc/kernel/dt_cpu_ftrs.c:686:7: warning: ‘m’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (m->cpu_ftr_bit_mask)
An upcoming patch will enable support for kcov, which requires this option.
This patch avoids the warning.
Fixes: 5a61ef74f2 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Reported-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[ajd: add commit message]
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
The xmon debugger IPI handler waits in the callback function while
xmon is still active. This means they don't complete the IPI, and the
initiator always times out waiting for them.
Things manage to work after the timeout because there is some fallback
logic to keep NMI IPI state sane in case of the timeout, but this is a
bit ugly.
This patch changes NMI IPI back to half-asynchronous (i.e., wait for
everyone to call in, do not wait for IPI function to complete), but
the complexity is avoided by going one step further and allowing new
IPIs to be issued before the IPI functions to all complete.
If synchronization against that is required, it is left up to the
caller, but current callers don't require that. In fact with the
timeout handling, callers must be able to cope with this already.
Fixes: 5b73151fff ("powerpc: NMI IPI make NMI IPIs fully sychronous")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The NMI IPI timeout logic is broken, if __smp_send_nmi_ipi() times out
on the first condition, delay_us will be zero which will send it into
the second spin loop with no timeout so it will spin forever.
Fixes: 5b73151fff ("powerpc: NMI IPI make NMI IPIs fully sychronous")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We weren't using SYSCALL_DEFINE for sys_switch_endian(), which means
it wasn't able to be traced by CONFIG_FTRACE_SYSCALLS.
By using the macro we create the right metadata and the syscall is
visible. eg:
# cd /sys/kernel/debug/tracing
# echo 1 | tee events/syscalls/sys_*_switch_endian/enable
# ~/switch_endian_test
# cat trace
...
switch_endian_t-3604 [009] .... 315.175164: sys_switch_endian()
switch_endian_t-3604 [009] .... 315.175167: sys_switch_endian -> 0x5555aaaa5555aaaa
switch_endian_t-3604 [009] .... 315.175169: sys_switch_endian()
switch_endian_t-3604 [009] .... 315.175169: sys_switch_endian -> 0x5555aaaa5555aaaa
Fixes: 529d235a0e ("powerpc: Add a proper syscall for switching endianness")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 8792468da5 "powerpc: Add the ability to save FPU without
giving it up" unexpectedly removed the MSR_FE0 and MSR_FE1 bits from
the bitmask used to update the MSR of the previous thread in
__giveup_fpu() causing a KVM-PR MacOS guest to lockup and panic the
host kernel.
Leaving FE0/1 enabled means unrelated processes might receive FPEs
when they're not expecting them and crash. In particular if this
happens to init the host will then panic.
eg (transcribed):
qemu-system-ppc[837]: unhandled signal 8 at 12cc9ce4 nip 12cc9ce4 lr 12cc9ca4 code 0
systemd[1]: unhandled signal 8 at 202f02e0 nip 202f02e0 lr 001003d4 code 0
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Reinstate these bits to the MSR bitmask to enable MacOS guests to run
under 32-bit KVM-PR once again without issue.
Fixes: 8792468da5 ("powerpc: Add the ability to save FPU without giving it up")
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds a debugfs interface to force scheduling a recovery event.
This can be used to recover a specific PE or schedule a "special" recovery
even that checks for errors at the PHB level.
To force a recovery of a normal PE, use:
echo '<#pe>:<#phb>' > /sys/kernel/debug/powerpc/eeh_force_recover
To force a scan for broken PHBs:
echo 'hwcheck' > /sys/kernel/debug/powerpc/eeh_force_recover
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently when we detect an error we automatically invoke the EEH recovery
handler. This can be annoying when debugging EEH problems, or when working
on EEH itself so this patch adds a debugfs knob that will prevent a
recovery event from being queued up when an issue is detected.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a helper to find the pci_controller structure based on the domain
number / phb id.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To use this function at all #define DEBUG needs to be set in eeh_cache.c.
Considering that printing at pr_debug is probably not all that useful since
it adds the additional hurdle of requiring you to enable the debug print if
dynamic_debug is in use so this patch bumps it to pr_info.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Adds a debugfs file that can be read to view the contents of the EEH
address cache. This is pretty similar to the existing
eeh_addr_cache_print() function, but that function is intended to debug
issues inside of the kernel since it's #ifdef`ed out by default, and writes
into the kernel log.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The EEH address cache is used to map a physical MMIO address back to a PCI
device. It's useful to know when it's being manipulated, but currently this
requires recompiling with #define DEBUG set. This is pointless since we
have dynamic_debug nowdays, so remove the #ifdef guard and add a pr_debug()
for the remove case too.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There's no need to the custom getter/setter functions so we should remove
them in favour of using the generic one. While we're here, change the type
of eeh_max_freeze to u32 and print the value in decimal rather than
hex because printing it in hex makes no sense.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GCC 8 warns about the logic in vr_get/set(), which with -Werror breaks
the build:
In function ‘user_regset_copyin’,
inlined from ‘vr_set’ at arch/powerpc/kernel/ptrace.c:628:9:
include/linux/regset.h:295:4: error: ‘memcpy’ offset [-527, -529] is
out of the bounds [0, 16] of object ‘vrsave’ with type ‘union
<anonymous>’ [-Werror=array-bounds]
arch/powerpc/kernel/ptrace.c: In function ‘vr_set’:
arch/powerpc/kernel/ptrace.c:623:5: note: ‘vrsave’ declared here
} vrsave;
This has been identified as a regression in GCC, see GCC bug 88273.
However we can avoid the warning and also simplify the logic and make
it more robust.
Currently we pass -1 as end_pos to user_regset_copyout(). This says
"copy up to the end of the regset".
The definition of the regset is:
[REGSET_VMX] = {
.core_note_type = NT_PPC_VMX, .n = 34,
.size = sizeof(vector128), .align = sizeof(vector128),
.active = vr_active, .get = vr_get, .set = vr_set
},
The end is calculated as (n * size), ie. 34 * sizeof(vector128).
In vr_get/set() we pass start_pos as 33 * sizeof(vector128), meaning
we can copy up to sizeof(vector128) into/out-of vrsave.
The on-stack vrsave is defined as:
union {
elf_vrreg_t reg;
u32 word;
} vrsave;
And elf_vrreg_t is:
typedef __vector128 elf_vrreg_t;
So there is no bug, but we rely on all those sizes lining up,
otherwise we would have a kernel stack exposure/overwrite on our
hands.
Rather than relying on that we can pass an explict end_pos based on
the sizeof(vrsave). The result should be exactly the same but it's
more obviously not over-reading/writing the stack and it avoids the
compiler warning.
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Mathieu Malaterre <malat@debian.org>
Cc: stable@vger.kernel.org
Tested-by: Mathieu Malaterre <malat@debian.org>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds an "in_guest" parameter to machine_check_print_event_info()
so that we can avoid trying to translate guest NIP values into
symbolic form using the host kernel's symbol table.
Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is no good reason for this helper, just opencode it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we've switched all the powerpc nommu and swiotlb methods to
use the generic dma_direct_* calls we can remove these ops vectors
entirely and rely on the common direct mapping bypass that avoids
indirect function calls entirely. This also allows to remove a whole
lot of boilerplate code related to setting up these operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Switch the streaming DMA mapping and ownership transfer methods to the
functionally identical dma_direct_ versions. Factor the cache
maintainance helpers into the form expected by the common code for that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The generic code allows a few nice things such as node local allocations
and dipping into the CMA area. The lookup of the right zone for a given
dma mask works a little different, but the results should be the same.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The only user left is powerpc, but even there the generic dma-direct
version works just as well, given that we guarantee that the swiotlb
buffer must always be addressable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This function is largely identical to the generic version used
everywhere else. Replace it with the generic version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This function is identical to the generic dma_direct_get_required_mask,
except that the generic version also takes the bus_dma_mask account,
which could lead to incorrect results in the powerpc version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The coherent cache version of this function already is functionally
identicall to the default version, and by defining the
arch_dma_coherent_to_pfn hook the same is ture for the noncoherent
version as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the standard portable helper instead of the powerpc specific one,
which is about to go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of letting the architecture supply all of dma_set_mask just
give it an additional hook selected by Kconfig.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The max_direct_dma_addr duplicates the bus_dma_mask field in struct
device. Use the generic field instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pci_dma_dev_setup_swiotlb is only used by the fsl_pci code, and closely
related to it, so fsl_pci.c seems like a better place for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This function is only used by the Cell iommu code, which can keep track
if it is using the iommu internally just as good.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All iommu capable platforms now always use the iommu code with the
internal bypass, so there is not need for this magic anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The ppc_md and pci_controller_ops methods are unused now and can be
removed. The dma_nommu implementation is generic to the generic one
except for using max_pfn instead of calling into the memblock API,
and all other dma_map_ops instances implement a method of their own.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This gets rid of a lot of clumsy code and finally allows us to mark
dma_iommu_ops const.
Includes fixes from Michael Ellerman.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a new iommu_bypass flag to struct dev_archdata so that the dma_iommu
implementation can handle the direct mapping transparently instead of
switiching ops around. Setting of this flag is controlled by new
pci_controller_ops method.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
vio_dma_mapping_ops currently does a lot of indirect calls through
dma_iommu_ops, which not only make the code harder to follow but are
also expensive in the post-spectre world. Unwind the indirect calls
by calling the ppc_iommu_* or iommu_* APIs directly applicable, or
just use the dma_iommu_* methods directly where we can.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This series finally gets us to the point of having system calls with
64-bit time_t on all architectures, after a long time of incremental
preparation patches.
There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.
The following system calls are now added on all 32-bit architectures
using the same system call numbers:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
Each one of these corresponds directly to an existing system call
that includes a 'struct timespec' argument, or a structure containing
a timespec or (in case of clock_adjtime) timeval. Not included here
are new versions of getitimer/setitimer and getrusage/waitid, which
are planned for the future but only needed to make a consistent API
rather than for correct operation beyond y2038. These four system
calls are based on 'timeval', and it has not been finally decided
what the replacement kernel interface will use instead.
So far, I have done a lot of build testing across most architectures,
which has found a number of bugs. Runtime testing so far included
testing LTP on 32-bit ARM with the existing system calls, to ensure
we do not regress for existing binaries, and a test with a 32-bit
x86 build of LTP against a modified version of the musl C library
that has been adapted to the new system call interface [3].
This library can be used for testing on all architectures supported
by musl-1.1.21, but it is not how the support is getting integrated
into the official musl release. Official musl support is planned
but will require more invasive changes to the library.
Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcXf7/AAoJEGCrR//JCVInPSUP/RhsQSCKMGtONB/vVICQhwep
PybhzBSpHWFxszzTi6BEPN1zS9B069G9mDollRBYZCckyPqL/Bv6sI/vzQZdNk01
Q6Nw92OnNE1QP8owZ5TjrZhpbtopWdqIXjsbGZlloUemvuJP2JwvKovQUcn5CPTQ
jbnqU04CVyFFJYVxAnGJ+VSeWNrjW/cm/m+rhLFjUcwW7Y3aodxsPqPP6+K9hY9P
yIWfcH42WBeEWGm1RSBOZOScQl4SGCPUAhFydl/TqyEQagyegJMIyMOv9wZ5AuTT
xK644bDVmNsrtJDZDpx+J8hytXCk1LrnKzkHR/uK80iUIraF/8D7PlaPgTmEEjko
XcrywEkvkXTVU3owCm2/sbV+8fyFKzSPipnNfN1JNxEX71A98kvMRtPjDueQq/GA
Yh81rr2YLF2sUiArkc2fNpENT7EGhrh1q6gviK3FB8YDgj1kSgPK5wC/X0uolC35
E7iC2kg4NaNEIjhKP/WKluCaTvjRbvV+0IrlJLlhLTnsqbA57ZKCCteiBrlm7wQN
4csUtCyxchR9Ac2o/lj+Mf53z68Zv74haIROp18K2dL7ZpVcOPnA3XHeauSAdoyp
wy2Ek6ilNvlNB+4x+mRntPoOsyuOUGv7JXzB9JvweLWUd9G7tvYeDJQp/0YpDppb
K4UWcKnhtEom0DgK08vY
=IZVb
-----END PGP SIGNATURE-----
Merge tag 'y2038-new-syscalls' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull y2038 - time64 system calls from Arnd Bergmann:
This series finally gets us to the point of having system calls with 64-bit
time_t on all architectures, after a long time of incremental preparation
patches.
There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.
The following system calls are now added on all 32-bit architectures using
the same system call numbers:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
Each one of these corresponds directly to an existing system call that
includes a 'struct timespec' argument, or a structure containing a timespec
or (in case of clock_adjtime) timeval. Not included here are new versions
of getitimer/setitimer and getrusage/waitid, which are planned for the
future but only needed to make a consistent API rather than for correct
operation beyond y2038. These four system calls are based on 'timeval', and
it has not been finally decided what the replacement kernel interface will
use instead.
So far, I have done a lot of build testing across most architectures, which
has found a number of bugs. Runtime testing so far included testing LTP on
32-bit ARM with the existing system calls, to ensure we do not regress for
existing binaries, and a test with a 32-bit x86 build of LTP against a
modified version of the musl C library that has been adapted to the new
system call interface [3]. This library can be used for testing on all
architectures supported by musl-1.1.21, but it is not how the support is
getting integrated into the official musl release. Official musl support is
planned but will require more invasive changes to the library.
Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
The system call tables have diverged a bit over the years, and a number
of the recent additions never made it into all architectures, for one
reason or another.
This is an attempt to clean it up as far as we can without breaking
compatibility, doing a number of steps:
- Add system calls that have not yet been integrated into all
architectures but that we definitely want there. This includes
{,f}statfs64() and get{eg,eu,g,p,u,pp}id() on alpha, which have
been missing traditionally.
- The s390 compat syscall handling is cleaned up to be more like
what we do on other architectures, while keeping the 31-bit
pointer extension. This was merged as a shared branch by the
s390 maintainers and is included here in order to base the other
patches on top.
- Add the separate ipc syscalls on all architectures that
traditionally only had sys_ipc(). This version is done without
support for IPC_OLD that is we have in sys_ipc. The
new semtimedop_time64 syscall will only be added here, not
in sys_ipc
- Add syscall numbers for a couple of syscalls that we probably
don't need everywhere, in particular pkey_* and rseq,
for the purpose of symmetry: if it's in asm-generic/unistd.h,
it makes sense to have it everywhere. I expect that any future
system calls will get assigned on all platforms together, even
when they appear to be specific to a single architecture.
- Prepare for having the same system call numbers for any future
calls. In combination with the generated tables, this hopefully
makes it easier to add new calls across all architectures
together.
All of the above are technically separate from the y2038 work,
but are done as preparation before we add the new 64-bit time_t
system calls everywhere, providing a common baseline set of system
calls.
I expect that glibc and other libraries that want to use 64-bit
time_t will require linux-5.1 kernel headers for building in
the future, and at a much later point may also require linux-5.1
or a later version as the minimum kernel at runtime. Having a
common baseline then allows the removal of many architecture or
kernel version specific workarounds.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcXf6XAAoJEGCrR//JCVInIm4P/AlkMmQRa/B2ziWMW6PifPoI
v18r44017rA1BPENyZvumJUdM5mDvNofOW8F2DYQ7Uiys2YtXenwe/Cf8LHn2n6c
TMXGQryQpvNmfDCyU+0UjF8m2+poFMrL4aRTXtjODh1YTsPNgeDC+KFMCAAtZmZd
cVbXFudtbdYKD/pgCX4SI1CWAMBiXe2e+ukPdJVr+iqusCMTApf+GOuyvDBZY9s/
vURb+tIS87HZ/jehWfZFSuZt+Gu7b3ijUXNC8v9qSIxNYekw62vBNl6F09HE79uB
Bv4OujAODqKvI9gGyydBzLJNzaMo0ryQdusyqcJHT7MY/8s+FwcYAXyTlQ3DbbB4
2u/c+58OwJ9Zk12p4LXZRA47U+vRhQt2rO4+zZWs2txNNJY89ZvCm/Z04KOiu5Xz
1Nnj607KGzthYRs2gs68AwzGGyf0uykIQ3RcaJLIBlX1Nd8BWO0ZgAguCvkXbQMX
XNXJTd92HmeuKKpiO0n/M4/mCeP0cafBRPCZbKlHyTl0Jeqd/HBQEO9Z8Ifwyju3
mXz9JCR9VlPCkX605keATbjtPGZf3XQtaXlQnezitDudXk8RJ33EpPcbhx76wX7M
Rux37ByqEOzk4wMGX9YQyNU7z7xuVg4sJAa2LlJqYeKXHtym+u3gG7SGP5AsYjmk
6mg2+9O2yZuLhQtOtrwm
=s4wf
-----END PGP SIGNATURE-----
Merge tag 'y2038-syscall-cleanup' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull preparatory work for y2038 changes from Arnd Bergmann:
System call unification and cleanup
The system call tables have diverged a bit over the years, and a number of
the recent additions never made it into all architectures, for one reason
or another.
This is an attempt to clean it up as far as we can without breaking
compatibility, doing a number of steps:
- Add system calls that have not yet been integrated into all architectures
but that we definitely want there. This includes {,f}statfs64() and
get{eg,eu,g,p,u,pp}id() on alpha, which have been missing traditionally.
- The s390 compat syscall handling is cleaned up to be more like what we
do on other architectures, while keeping the 31-bit pointer
extension. This was merged as a shared branch by the s390 maintainers
and is included here in order to base the other patches on top.
- Add the separate ipc syscalls on all architectures that traditionally
only had sys_ipc(). This version is done without support for IPC_OLD
that is we have in sys_ipc. The new semtimedop_time64 syscall will only
be added here, not in sys_ipc
- Add syscall numbers for a couple of syscalls that we probably don't need
everywhere, in particular pkey_* and rseq, for the purpose of symmetry:
if it's in asm-generic/unistd.h, it makes sense to have it everywhere. I
expect that any future system calls will get assigned on all platforms
together, even when they appear to be specific to a single architecture.
- Prepare for having the same system call numbers for any future calls. In
combination with the generated tables, this hopefully makes it easier to
add new calls across all architectures together.
All of the above are technically separate from the y2038 work, but are done
as preparation before we add the new 64-bit time_t system calls everywhere,
providing a common baseline set of system calls.
I expect that glibc and other libraries that want to use 64-bit time_t will
require linux-5.1 kernel headers for building in the future, and at a much
later point may also require linux-5.1 or a later version as the minimum
kernel at runtime. Having a common baseline then allows the removal of many
architecture or kernel version specific workarounds.
This adds 21 new system calls on each ABI that has 32-bit time_t
today. All of these have the exact same semantics as their existing
counterparts, and the new ones all have macro names that end in 'time64'
for clarification.
This gets us to the point of being able to safely use a C library
that has 64-bit time_t in user space. There are still a couple of
loose ends to tie up in various areas of the code, but this is the
big one, and should be entirely uncontroversial at this point.
In particular, there are four system calls (getitimer, setitimer,
waitid, and getrusage) that don't have a 64-bit counterpart yet,
but these can all be safely implemented in the C library by wrapping
around the existing system calls because the 32-bit time_t they
pass only counts elapsed time, not time since the epoch. They
will be dealt with later.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
The time, stime, utime, utimes, and futimesat system calls are only
used on older architectures, and we do not provide y2038 safe variants
of them, as they are replaced by clock_gettime64, clock_settime64,
and utimensat_time64.
However, for consistency it seems better to have the 32-bit architectures
that still use them call the "time32" entry points (leaving the
traditional handlers for the 64-bit architectures), like we do for system
calls that now require two versions.
Note: We used to always define __ARCH_WANT_SYS_TIME and
__ARCH_WANT_SYS_UTIME and only set __ARCH_WANT_COMPAT_SYS_TIME and
__ARCH_WANT_SYS_UTIME32 for compat mode on 64-bit kernels. Now this is
reversed: only 64-bit architectures set __ARCH_WANT_SYS_TIME/UTIME, while
we need __ARCH_WANT_SYS_TIME32/UTIME32 for 32-bit architectures and compat
mode. The resulting asm/unistd.h changes look a bit counterintuitive.
This is only a cleanup patch and it should not change any behavior.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
This is the big flip, where all 32-bit architectures set COMPAT_32BIT_TIME
and use the _time32 system calls from the former compat layer instead
of the system calls that take __kernel_timespec and similar arguments.
The temporary redirects for __kernel_timespec, __kernel_itimerspec
and __kernel_timex can get removed with this.
It would be easy to split this commit by architecture, but with the new
generated system call tables, it's easy enough to do it all at once,
which makes it a little easier to check that the changes are the same
in each table.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.
The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.
Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.
In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
'regno' is directly controlled by user space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.
On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
register number that would be read or written. This register number is
called 'regno' which is part of the 'addr' syscall parameter.
This 'regno' value is checked against the maximum pt_regs structure size,
and then used to dereference it, which matches the initial part of a
Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
is returned to userspace in the GETREGS case.
This patch sanitizes 'regno' before using it to dereference pt_reg.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When building a 32 bit powerpc kernel with Binutils 2.31.1 this warning
is emitted:
powerpc-linux-gnu-ld: warning: orphan section `.branch_lt' from
`arch/powerpc/kernel/head_44x.o' being placed in section `.branch_lt'
As of binutils commit 2d7ad24e8726 ("Support PLT16 relocs against local
symbols")[1], 32 bit targets can produce .branch_lt sections in their
output.
Include these symbols in the .data section as the ppc64 kernel does.
[1] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commitdiff;h=2d7ad24e8726ba4c45c9e67be08223a146a837ce
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Alan Modra <amodra@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, eeh_pe_reset_full() will only attempt to reset a PE more
than once if activating the reset state and deactivating it both
succeed, but later polling shows that it hasn't become active.
Change this so that it will try up to three times for any reason other
than an unrecoverable slot error and adjust the message generation so
that it's clear weather the reset has ultimately succeeded or failed.
This allows the reset to succeed in some situations where it would
currently fail.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, the EEH recovery process considers passed-through devices
as if they were not EEH-aware, which can cause them to be removed as
part of recovery. Because device removal requires cooperation from
the guest, this may lead to the process stalling or deadlocking.
Also, if devices are removed on the host side, they will be removed
from their IOMMU group, making recovery in the guest impossible.
Therefore, alter the recovery process so that passed-through devices
are not removed but are instead left frozen (and marked isolated)
until the guest performs it's own recovery. If firmware thaws a
passed-through PE because it's parent PE has been thawed (because it
was not passed through), re-freeze it.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a parameter to eeh_clear_pe_frozen_state() that allows
passed-through PEs to be excluded. Update callers to always pass true
so that there is no change in behaviour.
This is to prepare for follow-up work for passed-through devices.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a parameter to eeh_pe_state_clear() that allows passed-through PEs
to be excluded. Update callers to always pass true so that there is no
change in behaviour.
Also refactor to use direct traversal, to allow the removal of some
boilerplate.
This is to prepare for follow-up work for passed-through devices.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_unfreeze_pe() performs two operations: unfreezing a PE (which may
cause firmware to unfreeze child PEs as well) and de-isolating the PE
and it's children.
To simplify this and support future work, separate out the
de-isolation and perform it at the call sites (when necessary).
There should be no change in behaviour.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 'clear_sw_state' parameter for eeh_pe_clear_frozen_state() is
redundant because it has no effect (except in the rare case of a
hardware error part way through unfreezing a tree of PEs, where it
would dangerously allow partial de-isolation before returning
failure).
It is passed down to __eeh_pe_clear_frozen_state(), and from there to
eeh_unfreeze_pe(), where it causes EEH_PE_ISOLATED to be removed
from the state of each PE during the traversal. However, when the
traversal finishes, EEH_PE_ISOLATED is unconditionally removed by a
call to eeh_pe_state_clear() regardless of the parameter's value.
So remove the flag and pass false to eeh_unfreeze_pe() (to avoid the
rare case described above, as it was before the flag was introduced).
Also, perform the recursion directly in the function and eliminate a
bit of boilerplate.
There should be no change in functionality, except as mentioned above.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To match its x86 counterpart, save_stack_trace_tsk_reliable() should
return -EINVAL in cases that it is currently returning 1. No caller is
currently differentiating non-zero error codes, but let's keep the
arch-specific implementations consistent.
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mostly cosmetic changes:
- Group common stack pointer code at the top
- Simplify the first frame logic
- Code stackframe iteration into for...loop construct
- Check for trace->nr_entries overflow before adding any into the array
Suggested-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The bottom-most stack frame (the first to be unwound) may be largely
uninitialized, for the "Power Architecture 64-Bit ELF V2 ABI" only
requires its backchain pointer to be set.
The reliable stack tracer should be careful when verifying this frame:
skip checks on STACK_FRAME_LR_SAVE and STACK_FRAME_MARKER offsets that
may contain uninitialized residual data.
Fixes: df78d3f614 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The ppc64 specific implementation of the reliable stacktracer,
save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
trace" whenever it finds an exception frame on the stack. Stack frames
are classified as exception frames if the STACK_FRAME_REGS_MARKER
magic, as written by exception prologues, is found at a particular
location.
However, as observed by Joe Lawrence, it is possible in practice that
non-exception stack frames can alias with prior exception frames and
thus, that the reliable stacktracer can find a stale
STACK_FRAME_REGS_MARKER on the stack. It in turn falsely reports an
unreliable stacktrace and blocks any live patching transition to
finish. Said condition lasts until the stack frame is
overwritten/initialized by function call or other means.
In principle, we could mitigate this by making the exception frame
classification condition in save_stack_trace_tsk_reliable() stronger:
in addition to testing for STACK_FRAME_REGS_MARKER, we could also take
into account that for all exceptions executing on the kernel stack
- their stack frames's backlink pointers always match what is saved
in their pt_regs instance's ->gpr[1] slot and that
- their exception frame size equals STACK_INT_FRAME_SIZE, a value
uncommonly large for non-exception frames.
However, while these are currently true, relying on them would make
the reliable stacktrace implementation more sensitive towards future
changes in the exception entry code. Note that false negatives, i.e.
not detecting exception frames, would silently break the live patching
consistency model.
Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
rely on STACK_FRAME_REGS_MARKER as well.
Make the exception exit code clear the on-stack
STACK_FRAME_REGS_MARKER for those exceptions running on the "normal"
kernel stack and returning to kernelspace: because the topmost frame
is ignored by the reliable stack tracer anyway, returns to userspace
don't need to take care of clearing the marker.
Furthermore, as I don't have the ability to test this on Book 3E or 32
bits, limit the change to Book 3S and 64 bits.
Fixes: df78d3f614 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove linux/rtc.h which is included more than once
Signed-off-by: Brajeswar Ghosh <brajeswar.linux@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The IPC system call handling is highly inconsistent across architectures,
some use sys_ipc, some use separate calls, and some use both. We also
have some architectures that require passing IPC_64 in the flags, and
others that set it implicitly.
For the addition of a y2038 safe semtimedop() system call, I chose to only
support the separate entry points, but that requires first supporting
the regular ones with their own syscall numbers.
The IPC_64 is now implied by the new semctl/shmctl/msgctl system
calls even on the architectures that require passing it with the ipc()
multiplexer.
I'm not adding the new semtimedop() or semop() on 32-bit architectures,
those will get implemented using the new semtimedop_time64() version
that gets added along with the other time64 calls.
Three 64-bit architectures (powerpc, s390 and sparc) get semtimedop().
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Adopt nvram module to reduce code duplication. This means CONFIG_NVRAM
becomes available to PPC64 builds. Previously it was only available to
PPC32 builds because it depended on CONFIG_GENERIC_NVRAM.
The IOC_NVRAM_GET_OFFSET ioctl as implemented on PPC64 validates the
offset returned by pmac_get_partition(). Do the same in the nvram module.
Note that the old PPC32 generic_nvram module lacked this test.
So when CONFIG_PPC32 && CONFIG_PPC_PMAC, the IOC_NVRAM_GET_OFFSET ioctl
would have returned 0 (always). But when CONFIG_PPC64 && CONFIG_PPC_PMAC,
the IOC_NVRAM_GET_OFFSET ioctl would have returned -1 (which is -EPERM)
when the requested partition was not found.
With this patch, the result is now -EINVAL on both PPC32 and PPC64 when
the requested PowerMac NVRAM partition is not found. This is a userspace-
visible change, in the non-existent partition case, which would be in
an error path for an IOC_NVRAM_GET_OFFSET ioctl syscall.
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Switch PPC32 kernels from the generic_nvram module to the nvram module.
Also fix a theoretical bug where CHRP omits the chrp_nvram_init() call
when CONFIG_NVRAM_MODULE=m.
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove the nvram_read_byte() and nvram_write_byte() declarations in
powerpc/include/asm/nvram.h and use the cross-platform static functions
in linux/nvram.h instead.
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recently in commit fbf508da74 ("powerpc: split compat syscall table
out from native table") we changed the layout of the system call
table. Instead of having two entries for each syscall number, one for
the regular entry point and one for the compat entry point, we now
have separate tables for regular and compat entry points.
This inadvertently broke syscall tracing (CONFIG_FTRACE_SYSCALLS),
because our implementation of arch_syscall_addr() knew about the
layout of the table (it did nr * 2).
We can fix it just by dropping our version of arch_syscall_addr() and
using the generic version which does:
return (unsigned long)sys_call_table[nr];
Fixes: fbf508da74 ("powerpc: split compat syscall table out from native table")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On Power9 machines (64-bit Book3S), we can be running with either the
Hash table or Radix tree MMU enabled. So add some text to the __die()
output to tell us which is enabled, for the case where all you have is
the oops output and no other information.
Example output:
kernel BUG at drivers/misc/lkdtm/bugs.c:63!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: kvm vmx_crypto binfmt_misc ip_tables x_tables
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The page size the kernel is built with is useful info when debugging a
crash, so add it to the output in __die().
Result looks like eg:
kernel BUG at drivers/misc/lkdtm/bugs.c:63!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: vmx_crypto kvm binfmt_misc ip_tables
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Using pr_cont() risks having our output interleaved with other output
from other CPUs. Instead print everything in a single printk() call.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the ld documentation under Builtin Functions:
BLOCK(exp)
This is a synonym for ALIGN, for compatibility with older linker scripts.
Clang's linker (lld) doesn't know about BLOCK so remove this use of
it.
Suggested-by: George Rimar <grimar@accesssoftek.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch_early_irq_init() does nothing different than the weak
arch_early_irq_init() in kernel/softirq.c
Fixes: 089fb442f3 ("powerpc: Use ARCH_IRQ_INIT_FLAGS")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use a CONSOLE_LOGLEVEL_DEBUG macro for console_loglevel rather
than a naked number.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit e1c3743e1a ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
moved a code block around and this block uses a 'msr' variable outside of
the CONFIG_PPC_TRANSACTIONAL_MEM, however the 'msr' variable is declared
inside a CONFIG_PPC_TRANSACTIONAL_MEM block, causing a possible error when
CONFIG_PPC_TRANSACTION_MEM is not defined.
error: 'msr' undeclared (first use in this function)
This is not causing a compilation error in the mainline kernel, because
'msr' is being used as an argument of MSR_TM_ACTIVE(), which is defined as
the following when CONFIG_PPC_TRANSACTIONAL_MEM is *not* set:
#define MSR_TM_ACTIVE(x) 0
This patch just fixes this issue avoiding the 'msr' variable usage outside
the CONFIG_PPC_TRANSACTIONAL_MEM block, avoiding trusting in the
MSR_TM_ACTIVE() definition.
Cc: stable@vger.kernel.org
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Fixes: e1c3743e1a ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 8c8c10b90d ("powerpc/8xx: fix handling of early NULL pointer
dereference") moved the loading of r6 earlier in the code. As some
functions are called inbetween, r6 needs to be loaded again with the
address of swapper_pg_dir in order to set PTE pointers for
the Abatron BDI.
Fixes: 8c8c10b90d ("powerpc/8xx: fix handling of early NULL pointer dereference")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
The jump label is controlled by HAVE_JUMP_LABEL, which is defined
like this:
#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
# define HAVE_JUMP_LABEL
#endif
We can improve this by testing 'asm goto' support in Kconfig, then
make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
match to the real kernel capability.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Including (in no particular order):
- Page table code for AMD IOMMU now supports large pages where
smaller page-sizes were mapped before. VFIO had to work around
that in the past and I included a patch to remove it (acked by
Alex Williamson)
- Patches to unmodularize a couple of IOMMU drivers that would
never work as modules anyway.
- Work to unify the the iommu-related pointers in
'struct device' into one pointer. This work is not finished
yet, but will probably be in the next cycle.
- NUMA aware allocation in iommu-dma code
- Support for r8a774a1 and r8a774c0 in the Renesas IOMMU driver
- Scalable mode support for the Intel VT-d driver
- PM runtime improvements for the ARM-SMMU driver
- Support for the QCOM-SMMUv2 IOMMU hardware from Qualcom
- Various smaller fixes and improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJcKkEoAAoJECvwRC2XARrjCCoQAJxsgaAF5Z0s7z8j2A9SkaGp
SIMnUAI5mDOdyhTOAI+eehpRzg5UVYt/JjFYnHz8HWqbSc8YOvDvHafmhMFIwYvO
hq5knbs6ns2jJNFO+M4dioDq+3THdqkGIF5xoHdGTP7cn9+XyQ8lAoHo0RuL122U
PJGqX7Cp4XnFP4HMb3uQYhVeBV7mU+XqAdB+4aDnQkzI5LkQCRr74GcqOm+Rlnyc
cmQWc2arUMjgc1TJIrex8dx9dT6lq8kOmhyEg/IjHeGaZyJ3HqA+30XDDLEExN0G
MeVawuxJz40HgXlkXr+iZTQtIFYkXdKvJH6rptMbOfbDeDz+YZ01TbtAMMH9o4jX
yxjjMjdcWTsWYQ/MHHdsoMP34cajCi/EYPMNksbycw+E3Y+X/bSReCoWC0HUK8/+
Z4TpZ9mZVygtJR+QNZ+pE9oiJpb4sroM10zTnbMoVHNnvfsO01FYk7FMPkolSKLw
zB4MDswQYgchoFR9Z4ZB4PycYTzeafLKYgDPDoD1vIJgDavuidwvDWDRTDc+aMWM
siIIewq19To9jDJkVjX4dsT/p99KVKgAR/Ps6jjWkAroha7g6GcmlYZHIJnyop04
jiaSXUsk8aRucP/CRz5xdMmaGoN7BsNmpUjcrquc6Povk/6gvXvpY04oCs1+gNMX
ipL9E3GTFCVBubRFrksv
=DT9A
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
- Page table code for AMD IOMMU now supports large pages where smaller
page-sizes were mapped before. VFIO had to work around that in the
past and I included a patch to remove it (acked by Alex Williamson)
- Patches to unmodularize a couple of IOMMU drivers that would never
work as modules anyway.
- Work to unify the the iommu-related pointers in 'struct device' into
one pointer. This work is not finished yet, but will probably be in
the next cycle.
- NUMA aware allocation in iommu-dma code
- Support for r8a774a1 and r8a774c0 in the Renesas IOMMU driver
- Scalable mode support for the Intel VT-d driver
- PM runtime improvements for the ARM-SMMU driver
- Support for the QCOM-SMMUv2 IOMMU hardware from Qualcom
- Various smaller fixes and improvements
* tag 'iommu-updates-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (78 commits)
iommu: Check for iommu_ops == NULL in iommu_probe_device()
ACPI/IORT: Don't call iommu_ops->add_device directly
iommu/of: Don't call iommu_ops->add_device directly
iommu: Consolitate ->add/remove_device() calls
iommu/sysfs: Rename iommu_release_device()
dmaengine: sh: rcar-dmac: Use device_iommu_mapped()
xhci: Use device_iommu_mapped()
powerpc/iommu: Use device_iommu_mapped()
ACPI/IORT: Use device_iommu_mapped()
iommu/of: Use device_iommu_mapped()
driver core: Introduce device_iommu_mapped() function
iommu/tegra: Use helper functions to access dev->iommu_fwspec
iommu/qcom: Use helper functions to access dev->iommu_fwspec
iommu/of: Use helper functions to access dev->iommu_fwspec
iommu/mediatek: Use helper functions to access dev->iommu_fwspec
iommu/ipmmu-vmsa: Use helper functions to access dev->iommu_fwspec
iommu/dma: Use helper functions to access dev->iommu_fwspec
iommu/arm-smmu: Use helper functions to access dev->iommu_fwspec
ACPI/IORT: Use helper functions to access dev->iommu_fwspec
iommu: Introduce wrappers around dev->iommu_fwspec
...
Mostly clean ups although whilst Doug's was chasing down a odd
lockdep warning he also did some work to improved debugger resilience
when some CPUs fail to respond to the round up request.
The main changes are:
* Fixing a lockdep warning on architectures that cannot use an NMI for
the round up plus related changes to make CPU round up and all CPU
backtrace more resilient.
* Constify the arch ops tables
* A couple of other small clean ups
Two of the three patchsets here include changes that spill over into
arch/. Changes in the arch space are relatively narrow in scope
(and directly related to kgdb). Didn't get comprehensive acks but
all impacted maintainers were Cc:ed in good time.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAlwonoUbHGRhbmllbC50
aG9tcHNvbkBsaW5hcm8ub3JnAAoJEHzjJV0594ihmooP/1uzSMGQIoQMB8XeU/jT
Da2iILybi6hGp7ILA27d0yN3tsJBxWGWs8wzNdzMo3NQ3J0o4foAUnS/R0Vjkg9w
uphe5EA4HDsIrH05OouNb984BeEgNaC9HSqtyr9fXuh024NboULFKIm7REYm+QHT
C5SrBtmonL1xE7FmAhudLWjl7ZlvxM6DJeoVViH4kKq0raTiILt6VJaGl9JfcAdL
m9GEf9r/nh0sCq3GNgyc0y4BvHed+Kxzy1fsIi3jE6t8elaYYR72gNRQ5LaFxcnQ
F04/UtH75qB4rqYsqqV1q0rFi+tj+p9wYTmxixaGWsVDX4Gb5KXuLWJhaRb5IvwC
bdq/0IAXRr4vUL3y0tFWfCj7pHGaVc/gfXi8aieRXLGAZG+tdfuu99NCiulIZTfc
QqZz12Z+99/qi6dK7dBQtaN8SyPeB1QXKWefeGo2Bt5QqiBmcKHxsQYMUo3nkf3J
UXHpj4LG6Ldsi/w8VZfvXmM0/vbO/jrus9m+X2v+4tJyisjrsyv0FRnREI4avfbC
l09P1ajv7RrAaxtab0smV9krqWZ/mSn0zcgcaD6RdKe0+SwsiP/CEx1z1Wb1MH9c
wjEiClXjdVB39YVT0YVfG2Ho7qH8WRErxVyNb/f4QKHMXL1Mu91hFWhBBpUOGUj2
7Jrq2zK1uWramtt7GBDpHYYH
=Aqlc
-----END PGP SIGNATURE-----
Merge tag 'kgdb-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Mostly clean ups although while Doug's was chasing down a odd lockdep
warning he also did some work to improved debugger resilience when
some CPUs fail to respond to the round up request.
The main changes are:
- Fixing a lockdep warning on architectures that cannot use an NMI
for the round up plus related changes to make CPU round up and all
CPU backtrace more resilient.
- Constify the arch ops tables
- A couple of other small clean ups
Two of the three patchsets here include changes that spill over into
arch/. Changes in the arch space are relatively narrow in scope (and
directly related to kgdb). Didn't get comprehensive acks but all
impacted maintainers were Cc:ed in good time"
* tag 'kgdb-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kgdb/treewide: constify struct kgdb_arch arch_kgdb_ops
mips/kgdb: prepare arch_kgdb_ops for constness
kdb: use bool for binary state indicators
kdb: Don't back trace on a cpu that didn't round up
kgdb: Don't round up a CPU that failed rounding up before
kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
kgdb: Remove irq flags from roundup
- Rework of the kprobe/uprobe and synthetic events to consolidate all
the dynamic event code. This will make changes in the future easier.
- Partial rewrite of the function graph tracing infrastructure.
This will allow for multiple users of hooking onto functions
to get the callback (return) of the function. This is the ground
work for having kprobes and function graph tracer using one code base.
- Clean up of the histogram code that will facilitate adding more
features to the histograms in the future.
- Addition of str_has_prefix() and a few use cases. There currently
is a similar function strstart() that is used in a few places, but
only returns a bool and not a length. These instances will be
removed in the future to use str_has_prefix() instead.
- A few other various clean ups as well.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXCawlBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhbcAQCFeT0fWWTUxofBQz5jqsHaRnVg21+9
X4sTldYRYEn4YgEAmWOyiwq7zvrsAu4ZwkNBMeqxn3tVymYHiGOGe3Y4BAw=
=u96o
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Rework of the kprobe/uprobe and synthetic events to consolidate all
the dynamic event code. This will make changes in the future easier.
- Partial rewrite of the function graph tracing infrastructure. This
will allow for multiple users of hooking onto functions to get the
callback (return) of the function. This is the ground work for having
kprobes and function graph tracer using one code base.
- Clean up of the histogram code that will facilitate adding more
features to the histograms in the future.
- Addition of str_has_prefix() and a few use cases. There currently is
a similar function strstart() that is used in a few places, but only
returns a bool and not a length. These instances will be removed in
the future to use str_has_prefix() instead.
- A few other various clean ups as well.
* tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
tracing: Use the return of str_has_prefix() to remove open coded numbers
tracing: Have the historgram use the result of str_has_prefix() for len of prefix
tracing: Use str_has_prefix() instead of using fixed sizes
tracing: Use str_has_prefix() helper for histogram code
string.h: Add str_has_prefix() helper function
tracing: Make function ‘ftrace_exports’ static
tracing: Simplify printf'ing in seq_print_sym
tracing: Avoid -Wformat-nonliteral warning
tracing: Merge seq_print_sym_short() and seq_print_sym_offset()
tracing: Add hist trigger comments for variable-related fields
tracing: Remove hist trigger synth_var_refs
tracing: Use hist trigger's var_ref array to destroy var_refs
tracing: Remove open-coding of hist trigger var_ref management
tracing: Use var_refs[] for hist trigger reference checking
tracing: Change strlen to sizeof for hist trigger static strings
tracing: Remove unnecessary hist trigger struct field
tracing: Fix ftrace_graph_get_ret_stack() to use task and not current
seq_buf: Use size_t for len in seq_buf_puts()
seq_buf: Make seq_buf_puts() null-terminate the buffer
arm64: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack
...
checkpatch.pl reports the following:
WARNING: struct kgdb_arch should normally be const
#28: FILE: arch/mips/kernel/kgdb.c:397:
+struct kgdb_arch arch_kgdb_ops = {
This report makes sense, as all other ops struct, this
one should also be const. This patch does the change.
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Paul Burton <paul.burton@mips.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
When I had lockdep turned on and dropped into kgdb I got a nice splat
on my system. Specifically it hit:
DEBUG_LOCKS_WARN_ON(current->hardirq_context)
Specifically it looked like this:
sysrq: SysRq : DEBUG
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(current->hardirq_context)
WARNING: CPU: 0 PID: 0 at .../kernel/locking/lockdep.c:2875 lockdep_hardirqs_on+0xf0/0x160
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0 #27
pstate: 604003c9 (nZCv DAIF +PAN -UAO)
pc : lockdep_hardirqs_on+0xf0/0x160
...
Call trace:
lockdep_hardirqs_on+0xf0/0x160
trace_hardirqs_on+0x188/0x1ac
kgdb_roundup_cpus+0x14/0x3c
kgdb_cpu_enter+0x53c/0x5cc
kgdb_handle_exception+0x180/0x1d4
kgdb_compiled_brk_fn+0x30/0x3c
brk_handler+0x134/0x178
do_debug_exception+0xfc/0x178
el1_dbg+0x18/0x78
kgdb_breakpoint+0x34/0x58
sysrq_handle_dbg+0x54/0x5c
__handle_sysrq+0x114/0x21c
handle_sysrq+0x30/0x3c
qcom_geni_serial_isr+0x2dc/0x30c
...
...
irq event stamp: ...45
hardirqs last enabled at (...44): [...] __do_softirq+0xd8/0x4e4
hardirqs last disabled at (...45): [...] el1_irq+0x74/0x130
softirqs last enabled at (...42): [...] _local_bh_enable+0x2c/0x34
softirqs last disabled at (...43): [...] irq_exit+0xa8/0x100
---[ end trace adf21f830c46e638 ]---
Looking closely at it, it seems like a really bad idea to be calling
local_irq_enable() in kgdb_roundup_cpus(). If nothing else that seems
like it could violate spinlock semantics and cause a deadlock.
Instead, let's use a private csd alongside
smp_call_function_single_async() to round up the other CPUs. Using
smp_call_function_single_async() doesn't require interrupts to be
enabled so we can remove the offending bit of code.
In order to avoid duplicating this across all the architectures that
use the default kgdb_roundup_cpus(), we'll add a "weak" implementation
to debug_core.c.
Looking at all the people who previously had copies of this code,
there were a few variants. I've attempted to keep the variants
working like they used to. Specifically:
* For arch/arc we passed NULL to kgdb_nmicallback() instead of
get_irq_regs().
* For arch/mips there was a bit of extra code around
kgdb_nmicallback()
NOTE: In this patch we will still get into trouble if we try to round
up a CPU that failed to round up before. We'll try to round it up
again and potentially hang when we try to grab the csd lock. That's
not new behavior but we'll still try to do better in a future patch.
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
The function kgdb_roundup_cpus() was passed a parameter that was
documented as:
> the flags that will be used when restoring the interrupts. There is
> local_irq_save() call before kgdb_roundup_cpus().
Nobody used those flags. Anyone who wanted to temporarily turn on
interrupts just did local_irq_enable() and local_irq_disable() without
looking at them. So we can definitely remove the flags.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Fixed the following build warning:
powerpc-linux-gnu-ld: warning: orphan section `__btb_flush_fixup' from
`arch/powerpc/kernel/head_44x.o' being placed in section
`__btb_flush_fixup'.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A huge update this time, but a lot of that is just consolidating or
removing code:
- provide a common DMA_MAPPING_ERROR definition and avoid indirect
calls for dma_map_* error checking
- use direct calls for the DMA direct mapping case, avoiding huge
retpoline overhead for high performance workloads
- merge the swiotlb dma_map_ops into dma-direct
- provide a generic remapping DMA consistent allocator for architectures
that have devices that perform DMA that is not cache coherent. Based
on the existing arm64 implementation and also used for csky now.
- improve the dma-debug infrastructure, including dynamic allocation
of entries (Robin Murphy)
- default to providing chaining scatterlist everywhere, with opt-outs
for the few architectures (alpha, parisc, most arm32 variants) that
can't cope with it
- misc sparc32 dma-related cleanups
- remove the dma_mark_clean arch hook used by swiotlb on ia64 and
replace it with the generic noncoherent infrastructure
- fix the return type of dma_set_max_seg_size (Niklas Söderlund)
- move the dummy dma ops for not DMA capable devices from arm64 to
common code (Robin Murphy)
- ensure dma_alloc_coherent returns zeroed memory to avoid kernel data
leaks through userspace. We already did this for most common
architectures, but this ensures we do it everywhere.
dma_zalloc_coherent has been deprecated and can hopefully be
removed after -rc1 with a coccinelle script.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlwctQgLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMxgQ//dBpAfS4/J76CdAbYry2zqgcOUU9hIrD6NHiEMWov
ltJxyvEl3LsUmIdEj3aCrYL9jZN0qsnCzn5BVj2c3jDIVgD64fAr7HDf/PbEEfKb
j6/GgEnVLPZV+sQMvhNA5jOzHrkseaqPa4/pNLFZ/l8jnuZ2d+btusDWJpMoVDer
TXVwtIfgeIu0gTygYOShLYXd5qptWKWsZEpbTZOO2sE6+x+ZJX7yQYUxYDTlcOIj
JWVO2l5QNHPc5T9o2at+6L5aNUvnZOxT79sWgyZLn0Kc+FagKAVwfLqUEl0v7foG
8k/xca5/8p3afB1DfrIrtplJqis7cVgdyGxriwuuoO8X4F0nPyWwpGmxsBhrWwwl
xTqC4UorEJ7QwoP6Azopk/vYI2QXIUBLjuCJCuFXZj9+2BGf4IfvBY1S2cLM9qLs
HMcxQonuXJii044KEFS96ePEuiT+igVINweIFBKWcgNCEG0UQtyL6RQ1U5297ipF
JiWZAqD+p9X52UdKS+oKfAiZEekMXn6Xyo97+YCiNpfOo0GP5eEcwhL+JpY4AiRq
apPXtsRy2o1s8yfjdraUIM2Mc2n62vFKb35oUbGCd/QO9piPrFQHl6T0HHcHk4YR
XrUXcHieFZBCYqh7ZVa4RL8Msq1wvGuTL4Dxl43mXdsMoUFRR6eSNWLoAV4IpOLZ
WgA=
=in72
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping
Pull DMA mapping updates from Christoph Hellwig:
"A huge update this time, but a lot of that is just consolidating or
removing code:
- provide a common DMA_MAPPING_ERROR definition and avoid indirect
calls for dma_map_* error checking
- use direct calls for the DMA direct mapping case, avoiding huge
retpoline overhead for high performance workloads
- merge the swiotlb dma_map_ops into dma-direct
- provide a generic remapping DMA consistent allocator for
architectures that have devices that perform DMA that is not cache
coherent. Based on the existing arm64 implementation and also used
for csky now.
- improve the dma-debug infrastructure, including dynamic allocation
of entries (Robin Murphy)
- default to providing chaining scatterlist everywhere, with opt-outs
for the few architectures (alpha, parisc, most arm32 variants) that
can't cope with it
- misc sparc32 dma-related cleanups
- remove the dma_mark_clean arch hook used by swiotlb on ia64 and
replace it with the generic noncoherent infrastructure
- fix the return type of dma_set_max_seg_size (Niklas Söderlund)
- move the dummy dma ops for not DMA capable devices from arm64 to
common code (Robin Murphy)
- ensure dma_alloc_coherent returns zeroed memory to avoid kernel
data leaks through userspace. We already did this for most common
architectures, but this ensures we do it everywhere.
dma_zalloc_coherent has been deprecated and can hopefully be
removed after -rc1 with a coccinelle script"
* tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping: (73 commits)
dma-mapping: fix inverted logic in dma_supported
dma-mapping: deprecate dma_zalloc_coherent
dma-mapping: zero memory returned from dma_alloc_*
sparc/iommu: fix ->map_sg return value
sparc/io-unit: fix ->map_sg return value
arm64: default to the direct mapping in get_arch_dma_ops
PCI: Remove unused attr variable in pci_dma_configure
ia64: only select ARCH_HAS_DMA_COHERENT_TO_PFN if swiotlb is enabled
dma-mapping: bypass indirect calls for dma-direct
vmd: use the proper dma_* APIs instead of direct methods calls
dma-direct: merge swiotlb_dma_ops into the dma_direct code
dma-direct: use dma_direct_map_page to implement dma_direct_map_sg
dma-direct: improve addressability error reporting
swiotlb: remove dma_mark_clean
swiotlb: remove SWIOTLB_MAP_ERROR
ACPI / scan: Refactor _CCA enforcement
dma-mapping: factor out dummy DMA ops
dma-mapping: always build the direct mapping code
dma-mapping: move dma_cache_sync out of line
dma-mapping: move various slow path functions out of line
...
Notable changes:
- Mitigations for Spectre v2 on some Freescale (NXP) CPUs.
- A large series adding support for pass-through of Nvidia V100 GPUs to guests
on Power9.
- Another large series to enable hardware assistance for TLB table walk on
MPC8xx CPUs.
- Some preparatory changes to our DMA code, to make way for further cleanups
from Christoph.
- Several fixes for our Transactional Memory handling discovered by fuzzing the
signal return path.
- Support for generating our system call table(s) from a text file like other
architectures.
- A fix to our page fault handler so that instead of generating a WARN_ON_ONCE,
user accesses of kernel addresses instead print a ratelimited and
appropriately scary warning.
- A cosmetic change to make our unhandled page fault messages more similar to
other arches and also more compact and informative.
- Freescale updates from Scott:
"Highlights include elimination of legacy clock bindings use from dts
files, an 83xx watchdog handler, fixes to old dts interrupt errors, and
some minor cleanup."
And many clean-ups, reworks and minor fixes etc.
Thanks to:
Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao, Christian Lamparter,
Christophe Leroy, Christoph Hellwig, Daniel Axtens, Darren Stevens, David
Gibson, Diana Craciun, Dmitry V. Levin, Firoz Khan, Geert Uytterhoeven, Greg
Kurz, Gustavo Romero, Hari Bathini, Joel Stanley, Kees Cook, Madhavan
Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal
Suchánek, Naveen N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras,
Ram Pai, Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam
Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen Rothwell,
Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian Tang, Yue Haibing.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcJLwZAAoJEFHr6jzI4aWAAv4P/jMvP52lA90i2E8G72LOVSF1
33DbE/Okib3VfmmMcXZpgpEfwIcEmJcIj86WWcLWzBfXLunehkgwh+AOfBLwqWch
D08+RR9EZb7ppvGe91hvSgn4/28CWVKAxuDviSuoE1OK8lOTncu889r2+AxVFZiY
f6Al9UPlB3FTJonNx8iO4r/GwrPigukjbzp1vkmJJg59LvNUrMQ1Fgf9D3cdlslH
z4Ff9zS26RJy7cwZYQZI4sZXJZmeQ1DxOZ+6z6FL/nZp/O4WLgpw6C6o1+vxo1kE
9ZnO/3+zIRhoWiXd6OcOQXBv3NNCjJZlXh9HHAiL8m5ZqbmxrStQWGyKW/jjEZuK
wVHxfUT19x9Qy1p+BH3XcUNMlxchYgcCbEi5yPX2p9ZDXD6ogNG7sT1+NO+FBTww
ueCT5PCCB/xWOccQlBErFTMkFXFLtyPDNFK7BkV7uxbH0PQ+9guCvjWfBZti6wjD
/6NK4mk7FpmCiK13Y1xjwC5OqabxLUYwtVuHYOMr5TOPh8URUPS4+0pIOdoYDM6z
Ensrq1CC843h59MWADgFHSlZ78FRtZlG37JAXunjLbqGupLOvL7phC9lnwkylHga
2hWUWFeOV8HFQBP4gidZkLk64pkT9LzqHgdgIB4wUwrhc8r2mMZGdQTq5H7kOn3Q
n9I48PWANvEC0PBCJ/KL
=cr6s
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Mitigations for Spectre v2 on some Freescale (NXP) CPUs.
- A large series adding support for pass-through of Nvidia V100 GPUs
to guests on Power9.
- Another large series to enable hardware assistance for TLB table
walk on MPC8xx CPUs.
- Some preparatory changes to our DMA code, to make way for further
cleanups from Christoph.
- Several fixes for our Transactional Memory handling discovered by
fuzzing the signal return path.
- Support for generating our system call table(s) from a text file
like other architectures.
- A fix to our page fault handler so that instead of generating a
WARN_ON_ONCE, user accesses of kernel addresses instead print a
ratelimited and appropriately scary warning.
- A cosmetic change to make our unhandled page fault messages more
similar to other arches and also more compact and informative.
- Freescale updates from Scott:
"Highlights include elimination of legacy clock bindings use from
dts files, an 83xx watchdog handler, fixes to old dts interrupt
errors, and some minor cleanup."
And many clean-ups, reworks and minor fixes etc.
Thanks to: Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan,
Aneesh Kumar K.V, Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao,
Christian Lamparter, Christophe Leroy, Christoph Hellwig, Daniel
Axtens, Darren Stevens, David Gibson, Diana Craciun, Dmitry V. Levin,
Firoz Khan, Geert Uytterhoeven, Greg Kurz, Gustavo Romero, Hari
Bathini, Joel Stanley, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal Suchánek, Naveen
N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras, Ram Pai,
Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam
Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen
Rothwell, Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian
Tang, Yue Haibing"
* tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (201 commits)
Revert "powerpc/fsl_pci: simplify fsl_pci_dma_set_mask"
powerpc/zImage: Also check for stdout-path
powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y
macintosh: Use of_node_name_{eq, prefix} for node name comparisons
ide: Use of_node_name_eq for node name comparisons
powerpc: Use of_node_name_eq for node name comparisons
powerpc/pseries/pmem: Convert to %pOFn instead of device_node.name
powerpc/mm: Remove very old comment in hash-4k.h
powerpc/pseries: Fix node leak in update_lmb_associativity_index()
powerpc/configs/85xx: Enable CONFIG_DEBUG_KERNEL
powerpc/dts/fsl: Fix dtc-flagged interrupt errors
clk: qoriq: add more compatibles strings
powerpc/fsl: Use new clockgen binding
powerpc/83xx: handle machine check caused by watchdog timer
powerpc/fsl-rio: fix spelling mistake "reserverd" -> "reserved"
powerpc/fsl_pci: simplify fsl_pci_dma_set_mask
arch/powerpc/fsl_rmu: Use dma_zalloc_coherent
vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver
vfio_pci: Allow regions to add own capabilities
vfio_pci: Allow mapping extra regions
...
single-stepping fixes, improved tracing, various timer and vGIC
fixes
* x86: Processor Tracing virtualization, STIBP support, some correctness fixes,
refactorings and splitting of vmx.c, use the Hyper-V range TLB flush hypercall,
reduce order of vcpu struct, WBNOINVD support, do not use -ftrace for __noclone
functions, nested guest support for PAUSE filtering on AMD, more Hyper-V
enlightenments (direct mode for synthetic timers)
* PPC: nested VFIO
* s390: bugfixes only this time
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJcH0vFAAoJEL/70l94x66Dw/wH/2FZp1YOM5OgiJzgqnXyDbyf
dNEfWo472MtNiLsuf+ZAfJojVIu9cv7wtBfXNzW+75XZDfh/J88geHWNSiZDm3Fe
aM4MOnGG0yF3hQrRQyEHe4IFhGFNERax8Ccv+OL44md9CjYrIrsGkRD08qwb+gNh
P8T/3wJEKwUcVHA/1VHEIM8MlirxNENc78p6JKd/C7zb0emjGavdIpWFUMr3SNfs
CemabhJUuwOYtwjRInyx1y34FzYwW3Ejuc9a9UoZ+COahUfkuxHE8u+EQS7vLVF6
2VGVu5SA0PqgmLlGhHthxLqVgQYo+dB22cRnsLtXlUChtVAq8q9uu5sKzvqEzuE=
=b4Jx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- selftests improvements
- large PUD support for HugeTLB
- single-stepping fixes
- improved tracing
- various timer and vGIC fixes
x86:
- Processor Tracing virtualization
- STIBP support
- some correctness fixes
- refactorings and splitting of vmx.c
- use the Hyper-V range TLB flush hypercall
- reduce order of vcpu struct
- WBNOINVD support
- do not use -ftrace for __noclone functions
- nested guest support for PAUSE filtering on AMD
- more Hyper-V enlightenments (direct mode for synthetic timers)
PPC:
- nested VFIO
s390:
- bugfixes only this time"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: x86: Add CPUID support for new instruction WBNOINVD
kvm: selftests: ucall: fix exit mmio address guessing
Revert "compiler-gcc: disable -ftracer for __noclone functions"
KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines
KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry
KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp()
KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte()
KVM: Make kvm_set_spte_hva() return int
KVM: Replace old tlb flush function with new one to flush a specified range.
KVM/MMU: Add tlb flush with range helper function
KVM/VMX: Add hv tlb range flush support
x86/hyper-v: Add HvFlushGuestAddressList hypercall support
KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
KVM: x86: Disable Intel PT when VMXON in L1 guest
KVM: x86: Set intercept for Intel PT MSRs read/write
KVM: x86: Implement Intel PT MSRs read/write emulation
...
In the end, we ended up with quite a lot more than I expected:
- Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
kernel-side support to come later)
- Support for per-thread stack canaries, pending an update to GCC that
is currently undergoing review
- Support for kexec_file_load(), which permits secure boot of a kexec
payload but also happens to improve the performance of kexec
dramatically because we can avoid the sucky purgatory code from
userspace. Kdump will come later (requires updates to libfdt).
- Optimisation of our dynamic CPU feature framework, so that all
detected features are enabled via a single stop_machine() invocation
- KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
they can benefit from global TLB entries when KASLR is not in use
- 52-bit virtual addressing for userspace (kernel remains 48-bit)
- Patch in LSE atomics for per-cpu atomic operations
- Custom preempt.h implementation to avoid unconditional calls to
preempt_schedule() from preempt_enable()
- Support for the new 'SB' Speculation Barrier instruction
- Vectorised implementation of XOR checksumming and CRC32 optimisations
- Workaround for Cortex-A76 erratum #1165522
- Improved compatibility with Clang/LLD
- Support for TX2 system PMUS for profiling the L3 cache and DMC
- Reflect read-only permissions in the linear map by default
- Ensure MMIO reads are ordered with subsequent calls to Xdelay()
- Initial support for memory hotplug
- Tweak the threshold when we invalidate the TLB by-ASID, so that
mremap() performance is improved for ranges spanning multiple PMDs.
- Minor refactoring and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJcE4TmAAoJELescNyEwWM0Nr0H/iaU7/wQSzHyNXtZoImyKTul
Blu2ga4/EqUrTU7AVVfmkl/3NBILWlgQVpY6tH6EfXQuvnxqD7CizbHyLdyO+z0S
B5PsFUH2GLMNAi48AUNqGqkgb2knFbg+T+9IimijDBkKg1G/KhQnRg6bXX32mLJv
Une8oshUPBVJMsHN1AcQknzKariuoE3u0SgJ+eOZ9yA2ZwKxP4yy1SkDt3xQrtI0
lojeRjxcyjTP1oGRNZC+BWUtGOT35p7y6cGTnBd/4TlqBGz5wVAJUcdoxnZ6JYVR
O8+ob9zU+4I0+SKt80s7pTLqQiL9rxkKZ5joWK1pr1g9e0s5N5yoETXKFHgJYP8=
=sYdt
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 festive updates from Will Deacon:
"In the end, we ended up with quite a lot more than I expected:
- Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
kernel-side support to come later)
- Support for per-thread stack canaries, pending an update to GCC
that is currently undergoing review
- Support for kexec_file_load(), which permits secure boot of a kexec
payload but also happens to improve the performance of kexec
dramatically because we can avoid the sucky purgatory code from
userspace. Kdump will come later (requires updates to libfdt).
- Optimisation of our dynamic CPU feature framework, so that all
detected features are enabled via a single stop_machine()
invocation
- KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
they can benefit from global TLB entries when KASLR is not in use
- 52-bit virtual addressing for userspace (kernel remains 48-bit)
- Patch in LSE atomics for per-cpu atomic operations
- Custom preempt.h implementation to avoid unconditional calls to
preempt_schedule() from preempt_enable()
- Support for the new 'SB' Speculation Barrier instruction
- Vectorised implementation of XOR checksumming and CRC32
optimisations
- Workaround for Cortex-A76 erratum #1165522
- Improved compatibility with Clang/LLD
- Support for TX2 system PMUS for profiling the L3 cache and DMC
- Reflect read-only permissions in the linear map by default
- Ensure MMIO reads are ordered with subsequent calls to Xdelay()
- Initial support for memory hotplug
- Tweak the threshold when we invalidate the TLB by-ASID, so that
mremap() performance is improved for ranges spanning multiple PMDs.
- Minor refactoring and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (125 commits)
arm64: kaslr: print PHYS_OFFSET in dump_kernel_offset()
arm64: sysreg: Use _BITUL() when defining register bits
arm64: cpufeature: Rework ptr auth hwcaps using multi_entry_cap_matches
arm64: cpufeature: Reduce number of pointer auth CPU caps from 6 to 4
arm64: docs: document pointer authentication
arm64: ptr auth: Move per-thread keys from thread_info to thread_struct
arm64: enable pointer authentication
arm64: add prctl control for resetting ptrauth keys
arm64: perf: strip PAC when unwinding userspace
arm64: expose user PAC bit positions via ptrace
arm64: add basic pointer authentication support
arm64/cpufeature: detect pointer authentication
arm64: Don't trap host pointer auth use to EL2
arm64/kvm: hide ptrauth from guests
arm64/kvm: consistently handle host HCR_EL2 flags
arm64: add pointer authentication register bits
arm64: add comments about EC exception levels
arm64: perf: Treat EXCLUDE_EL* bit definitions as unsigned
arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field
arm64: enable per-task stack canaries
...
Freescale updates from Scott:
"Highlights include elimination of legacy clock bindings use from dts
files, an 83xx watchdog handler, fixes to old dts interrupt errors, and
some minor cleanup."
The structure of the ret_stack array on the task struct is going to
change, and accessing it directly via the curr_ret_stack index will no
longer give the ret_stack entry that holds the return address. To access
that, architectures must now use ftrace_graph_get_ret_stack() to get the
associated ret_stack that matches the saved return address.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
HMIs will crash the kernel due to
BRANCH_LINK_TO_FAR(hmi_exception_realmode)
Calling into the OPD instead of the actual code.
Fixes: 2337d20728 ("powerpc/64: CONFIG_RELOCATABLE support for hmi interrupts")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Use DOTSYM() rather than #ifdef]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Convert string compares of DT node names to use of_node_name_eq helper
instead. This removes direct access to the node name pointer.
A couple of open coded iterating thru the child node names are converted
to use for_each_child_of_node() instead.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the watchdog timer is set in interrupt mode, it causes a
machine check when it times out. The purpose of this mode is to
ease debugging, not to crash the kernel and reboot the machine.
This patch implements a special handling for that, in order to not
crash the kernel if the watchdog times out while in interrupt or
within the idle task.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[scottwood: added missing #include]
Signed-off-by: Scott Wood <oss@buserror.net>
The powernv platform registers IOMMU groups and adds devices to them
from the pci_controller_ops::setup_bridge() hook except one case when
virtual functions (SRIOV VFs) are added from a bus notifier.
The pseries platform registers IOMMU groups from
the pci_controller_ops::dma_bus_setup() hook and adds devices from
the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
used for powernv does not add devices for pseries though as
__of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
Both platforms use iommu_add_device() which takes a device and expects
it to have a valid IOMMU table struct with an iommu_table_group pointer
which in turn points the iommu_group struct (which represents
an IOMMU group). Although the helper seems easy to use, it relies on
some pre-existing device configuration and associated data structures
which it does not really need.
This simplifies iommu_add_device() to take the table_group pointer
directly. Pseries already has a table_group pointer handy and the bus
notified is not used anyway. For powernv, this copies the existing bus
notifier, makes it work for powernv only which means an easy way of
getting to the table_group pointer. This was tested on VFs but should
also support physical PCI hotplug.
Since iommu_add_device() receives the table_group pointer directly,
pseries does not do TCE cache invalidation (the hypervisor does) nor
allow multiple groups per a VFIO container (in other words sharing
an IOMMU table between partitionable endpoints), this removes
iommu_table_group_link from pseries.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This new memory does not have page structs as it is not plugged to
the host so gup() will fail anyway.
This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of API can still be used;
- mm_iommu_is_devmem() to know if the physical address is one of thise
new regions which we must avoid unpinning of.
This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
if the memory is device memory to avoid pfn_to_page().
This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
does delayed pages dirtying.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
System call table generation script must be run to gener-
ate unistd_32/64.h and syscall_table_32/64/c32/spu.h files.
This patch will have changes which will invokes the script.
This patch will generate unistd_32/64.h and syscall_table-
_32/64/c32/spu.h files by the syscall table generation
script invoked by parisc/Makefile and the generated files
against the removed files must be identical.
The generated uapi header file will be included in uapi/-
asm/unistd.h and generated system call table header file
will be included by kernel/systbl.S file.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The system call tables are in different format in all
architecture and it will be difficult to manually add or
modify the system calls in the respective files. To make
it easy by keeping a script and which will generate the
uapi header and syscall table file. This change will also
help to unify the implementation across all architectures.
The system call table generation script is added in
syscalls directory which contain the script to generate
both uapi header file and system call table files.
The syscall.tbl file will be the input for the scripts.
syscall.tbl contains the list of available system calls
along with system call number and corresponding entry point.
Add a new system call in this architecture will be possible
by adding new entry in the syscall.tbl file.
Adding a new table entry consisting of:
- System call number.
- ABI.
- System call name.
- Entry point name.
- Compat entry name, if required.
syscallhdr.sh and syscalltbl.sh will generate uapi header-
unistd_32/64.h and syscall_table_32/64/c32/spu.h files
respectively. File syscall_table_32/64/c32/spu.h is incl-
uded by syscall.S - the real system call table. Both *.sh
files will parse the content syscall.tbl to generate the
header and table files.
ARM, s390 and x86 architecuture does have similar support.
I leverage their implementation to come up with a generic
solution.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PowerPC uses a syscall table with native and compat calls
interleaved, which is a slightly simpler way to define two
matching tables.
As we move to having the tables generated, that advantage
is no longer important, but the interleaved table gets in
the way of using the same scripts as on the other archit-
ectures.
Split out a new compat_sys_call_table symbol that contains
all the compat calls, and leave the main table for the nat-
ive calls, to more closely match the method we use every-
where else.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the macro definition for compat_sys_sigsuspend from
asm/systbl.h to the file which it is getting included.
One of the patch in this patch series is generating uapi
header and syscall table files. In order to come up with
a common implimentation across all architecture, we need
to do this change.
This change will simplify the implementation of system
call table generation script and help to come up a common
implementation across all architecture.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a TM Bad Thing bug that can be caused when you return from a
signal context in a suspended transaction but with ucontext MSR[TS] unset.
This forces regs->msr[TS] to be set at syscall entrance (since the CPU
state is transactional). It also calls treclaim() to flush the transaction
state, which is done based on the live (mfmsr) MSR state.
Since user context MSR[TS] is not set, then restore_tm_sigcontexts() is not
called, thus, not executing recheckpoint, keeping the CPU state as not
transactional. When calling rfid, SRR1 will have MSR[TS] set, but the CPU
state is non transactional, causing the TM Bad Thing with the following
stack:
[ 33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
pc: c00000000000c47c: fast_exception_return+0xac/0xb4
lr: 00003fff865f442c
sp: 3fffd9dce3e0
msr: 8000000102a03031
current = 0xc00000041f68b700
paca = 0xc00000000fb84800 softe: 0 irq_happened: 0x01
pid = 1721, comm = tm-signal-sigre
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
WARNING: exception is not recoverable, can't continue
The same problem happens on 32-bits signal handler, and the fix is very
similar, if tm_recheckpoint() is not executed, then regs->msr[TS] should be
zeroed.
This patch also fixes a sparse warning related to lack of indentation when
CONFIG_PPC_TRANSACTIONAL_MEM is set.
Fixes: 2b0a576d15 ("powerpc: Add new transactional memory state to the signal context")
CC: Stable <stable@vger.kernel.org> # 3.10+
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Usually a TM Bad Thing exception is raised due to three different problems.
a) touching SPRs in an active transaction; b) using TM instruction with the
facility disabled and c) setting a wrong MSR/SRR1 at RFID.
The two initial cases are easy to identify by looking at the instructions.
The latter case is harder, because the MSR is masked after RFID, so, it is
very useful to look at the previous MSR (SRR1) before RFID as also the
current and masked MSR.
Since MSR is saved at paca just before RFID, this patch prints it if a TM
Bad thing happen, helping to understand what is the invalid TM transition
that is causing the exception.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As other exit points, move SRR1 (MSR) into paca->tm_scratch, so, if
there is a TM Bad Thing in RFID, it is easy to understand what was the
SRR1 value being used.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without TEXASR[FS] bit set also.
Since TEXASR might not have the FS bit set, and when the process is
scheduled back, it will try to reclaim, which will be aborted because of
the CPU is not in the suspended state, and, then, recheckpoint. This
recheckpoint will restore thread->texasr into TEXASR SPR, which might be
zero, hitting a BUG_ON().
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays the MSR[TS] set, so, if there is any page fault in
the __get_user() section, it does not have regs->msr[TS] set, since the TM
structures are still invalid, thus avoiding doing TM operations for
in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(), thus, this block must
be atomic from a preemption perspective, thus, calling
preempt_disable/enable() on this code.
It is not possible to move tm_recheckpoint to happen earlier, because it is
required to get the checkpointed registers from userspace, with
__get_user(), thus, the only way to avoid this undesired behavior is
delaying the MSR[TS] set.
The 32-bits signal handler seems to be safe this current issue, but, it
might be exposed to the preemption issue, thus, disabling preemption in
this chunk of code.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393a ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For fadump to work successfully there should not be any holes in reserved
memory ranges where kernel has asked firmware to move the content of old
kernel memory in event of crash. Now that fadump uses CMA for reserved
area, this memory area is now not protected from hot-remove operations
unless it is cma allocated. Hence, fadump service can fail to re-register
after the hot-remove operation, if hot-removed memory belongs to fadump
reserved region. To avoid this make sure that memory from fadump reserved
area is not hot-removable if fadump is registered.
However, if user still wants to remove that memory, he can do so by
manually stopping fadump service before hot-remove operation.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
fadump fails to register when there are holes in reserved memory area.
This can happen if user has hot-removed a memory that falls in the
fadump reserved memory area. Throw a meaningful error message to the
user in such case.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: is_reserved_memory_area_contiguous() returns bool, unsplit string]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with TeraBytes of memory, this reservation can be quite
significant.
In some cases, fadump fails if the memory reserved is insufficient, or
if the reserved memory was DLPAR hot-removed.
In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".
We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump.
Instead of setting aside a significant chunk of memory nobody can use,
this patch uses CMA instead, to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but
applications are free to use it. With this fadump will still be able
to capture all of the kernel memory and most of the user space memory
except the user pages that were present in CMA region.
Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
[root@zzxx-yy10 ~]# free -m
total used free shared buff/cache available
Mem: 7557 193 6822 12 541 6725
Swap: 4095 0 4095
With this patch:
[root@zzxx-yy10 ~]# free -m
total used free shared buff/cache available
Mem: 8133 194 7464 12 475 7338
Swap: 4095 0 4095
Changes made here are completely transparent to how fadump has
traditionally worked.
Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.
TODO:
- Handle case where CMA reservation spans nodes.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The main new feature this time is support in HV nested KVM for passing
a device that is emulated by a level 0 hypervisor and presented to
level 1 as a PCI device through to a level 2 guest using VFIO.
Apart from that there are improvements for migration of radix guests
under HV KVM and some other fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJcGFzEAAoJEJ2a6ncsY3GfKjoH/Azcf8QIO5ftyHrjazFZOSUh
5Lr24HZTYHheowp6obzuZWRAIyckHmflRmOkv8RVGuA8+Sp+m5pBxN3WTVPOwDUh
WanOWVGJsuhl6qATmkm7xIxmYhQEyLxVNbnWva7WXuZ92rgGCNfHtByHWAx/7vTe
q5Shr4fLIQ8HRzor8Xqqph1I0hQNTE9VsaK1hW/PxI0gsO8qjDwOR8SDpT/aaJrS
Sir+lM0TwCbJREuObDxYAXn1OWy8rMYjlb9fEBv5tmPCQKiB9vJz4tV+ahR9eJ14
PEF57MoBOGwzQXo4geFLuo/Bu8fDygKsKQX1eYGcn6tRGA4pnTxzYl0+dHLBkOM=
=3WkD
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
PPC KVM update for 4.21 from Paul Mackerras
The main new feature this time is support in HV nested KVM for passing
a device that is emulated by a level 0 hypervisor and presented to
level 1 as a PCI device through to a level 2 guest using VFIO.
Apart from that there are improvements for migration of radix guests
under HV KVM and some other fixes and cleanups.
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
for debugfs files.
Semantic patch information:
Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
imposes some significant overhead as compared to
DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Report branch predictor state flush as a mitigation for
Spectre variant 2.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the user choses not to use the mitigations, replace
the code sequence with nops.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to protect against speculation attacks on
indirect branches, the branch predictor is flushed at
kernel entry to protect for the following situations:
- userspace process attacking another userspace process
- userspace process attacking the kernel
Basically when the privillege level change (i.e.the kernel
is entered), the branch predictor state is flushed.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to protect against speculation attacks on
indirect branches, the branch predictor is flushed at
kernel entry to protect for the following situations:
- userspace process attacking another userspace process
- userspace process attacking the kernel
Basically when the privillege level change (i.e. the
kernel is entered), the branch predictor state is flushed.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the command line argument is present, the Spectre variant 2
mitigations are disabled.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently for CONFIG_PPC_FSL_BOOK3E the spectre_v2 file is incorrect:
$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
"Mitigation: Software count cache flush"
Which is wrong. Fix it to report vulnerable for now.
Fixes: ee13cb249f ("powerpc/64s: Add support for software count cache flush")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to protect against speculation attacks (Spectre
variant 2) on NXP PowerPC platforms, the branch predictor
should be flushed when the privillege level is changed.
This patch is adding the infrastructure to fixup at runtime
the code sections that are performing the branch predictor flush
depending on a boot arg parameter which is added later in a
separate patch.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the device tree doesn't reside in the memory which is declared
inside it, it has to be moved as well as this memory will not be
mapped by the kernel.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For this use case, completions and semaphores are equivalent,
but semaphores are an awkward interface that should generally
be avoided, so use the completion instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Combine the SYSCALL_EMU and SYSCALL_TRACE handling so that we only
call tracehook_report_syscall_entry() in one place.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
[mpe: Flesh out change log, s/cached_flags/flags/, reflow comments]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Powerpc has somewhat odd usage where ZONE_DMA is used for all memory on
common 64-bit configfs, and ZONE_DMA32 is used for 31-bit schemes.
Move to a scheme closer to what other architectures use (and I dare to
say the intent of the system):
- ZONE_DMA: optionally for memory < 31-bit (64-bit embedded only)
- ZONE_NORMAL: everything addressable by the kernel
- ZONE_HIGHMEM: memory > 32-bit for 32-bit kernels
Also provide information on how ZONE_DMA is used by defining
ARCH_ZONE_DMA_BITS.
Contains various fixes from Benjamin Herrenschmidt.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The implemementation for the CONFIG_NOT_COHERENT_CACHE case doesn't share
any code with the one for systems with coherent caches. Split it off
and merge it with the helpers in dma-noncoherent.c that have no other
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The unmap methods need to transfer memory ownership back from the
device to the cpu by identical means as dma_sync_*_to_cpu. I'm not
sure powerpc needs to do any work in this transfer direction, but
given that it does invalidate the caches in dma_sync_*_to_cpu already
we should make sure we also do so on unmapping.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[mpe: s/dir/direction in dma_nommu_unmap_page()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch fixes early DEBUG messages in prom.c:
- Use %px instead of %p to see the addresses
- Cast memblock_phys_mem_size() with (unsigned long long) to
avoid build failure when phys_addr_t is not 64 bits.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 603 doesn't have a HASH table, TLB misses are handled by
software. It is then possible to generate page fault when
_PAGE_EXEC is not set like in nohash/32.
There is one "reserved" PTE bit available, this patch uses
it for _PAGE_EXEC.
In order to support it, set_pte_filter() and
set_access_flags_filter() are made common, and the handling
is made dependent on MMU_FTR_HPTE_TABLE
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
commit f21f49ea63 ("[POWERPC] Remove the dregs of APUS support from
arch/powerpc") removed CONFIG_APUS, but forgot to remove the logic
which adapts tophys() and tovirt() for it.
This patch removes the last stale pieces.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use patch sites and associated helpers to manage TLB handlers
patching instead of hardcoding.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of manually patching a blr at hash_page() entry in
MMU_init_hw(), this patch adds a features section in head_32.S
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use patch_site_addr() instead of hardcoding the
address calculation in machine_init()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge our fixes branch again, this has a couple of build fixes and also
a change to do_syscall_trace_enter() that will conflict with a patch we
want to apply in next.
Use the new function to replace the open-coded iommu check.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell Currey <ruscur@russell.cc>
Cc: Sam Bobroff <sbobroff@linux.ibm.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.
When accessing these quadrants the fully qualified address is obtained
as follows:
Quadrant | Hypervisor | Guest
--------------------------------------------------------------------------
| EA[0:1] = 0b00 | EA[0:1] = 0b00
0 | effLPID = 0 | effLPID = LPIDR
| effPID = PIDR | effPID = PIDR
--------------------------------------------------------------------------
| EA[0:1] = 0b01 |
1 | effLPID = LPIDR | Invalid Access
| effPID = PIDR |
--------------------------------------------------------------------------
| EA[0:1] = 0b10 |
2 | effLPID = LPIDR | Invalid Access
| effPID = 0 |
--------------------------------------------------------------------------
| EA[0:1] = 0b11 | EA[0:1] = 0b11
3 | effLPID = 0 | effLPID = LPIDR
| effPID = 0 | effPID = 0
--------------------------------------------------------------------------
In the Guest;
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.
In the Host;
Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
address the host.
Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.
This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.
Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
One notable fix for our change to split pt_regs between user/kernel, we forgot
to update BPF to use the user-visible type which was an ABI break for BPF
programs.
A slightly ugly but minimal fix to do_syscall_trace_enter() so that we use
tracehook_report_syscall_entry() properly. We'll rework the code in next to
avoid the empty if body.
Seven commits fixing bugs in the new papr_scm (Storage Class Memory) driver.
The driver was finally able to be tested on the other hypervisor which exposed
several bugs. The fixes are all fairly minimal at least.
Fix a crash in our MSI code if an MSI-capable device is plugged into a non-MSI
capable PHB, only seen on older hardware (MPC8378).
Fix our legacy serial code to look for "stdout-path" since the device trees were
updated to use that instead of "linux,stdout-path".
A change to the COFF zImage code to fix booting old powermacs.
A couple of minor build fixes.
Thanks to:
Benjamin Herrenschmidt, Daniel Axtens, Dmitry V. Levin, Elvira Khabirova,
Oliver O'Halloran, Paul Mackerras, Radu Rendec, Rob Herring, Sandipan Das.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcE7YGAAoJEFHr6jzI4aWA/rAP/2k8NqgbkfKMdY/nHQ/+Tvwu
6EdKZ5dMi875380TomrP80mIyAAwwtN/c1MmJT7plZRCwYcuhGk4UnQjTRcNzrfK
Q7SokdMAzSjSDaj7VPiVdpJ6yVaedZYmsRe9m+uP7frmcHr9KL4vjgDn3z8IL8bV
I46dvjuSCvgWhFvhXkxf9PHIsC+sP4Z6JRNVhyBjlzQHZiGparq238H8jJVsNHRs
f/fsze0z6m3S6dVKEZKZyDlCb3TkP+DjuXpv4hbT9nvnOY132kqO53L++7rQ2YvV
TNkabwFfj+p/DnrXXH7OHNkvnDW4cy3KjhyStOnTH5lxYYVYNd5vnjj1AnXa37xE
GCBFiHExEHbdG6vifwBmQjvoNRPf6Kh4RGYuRMm8ci7W0WUmq00LKZhxeLbwoou9
1K7lNp0JwLHgqOxCSy9liwNV7YQp1upvC/caQ1qE/ZISnUOXMLxomVfmGQYW51Xc
0/OXCdmacViA0RGatrGSScMO+CTNG0Wa3pm+fo/ufLahxspTDMVdJgs/kEfCAGeN
6BruUoEvm6wAOsPF8+zDumYOpUN9cD7tOQ573B2pMbJYIv3HJxRMDHYGBSijEgRK
xy8BbQ0ZtxgfILjIuA95QfGQ5V35ZjugQfM4mt33GPrQ/qIV9SJ+/zZJNISSqaKI
Y1xgZdJeVbECA9YYswRK
=wI/G
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.20-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One notable fix for our change to split pt_regs between user/kernel,
we forgot to update BPF to use the user-visible type which was an ABI
break for BPF programs.
A slightly ugly but minimal fix to do_syscall_trace_enter() so that we
use tracehook_report_syscall_entry() properly. We'll rework the code
in next to avoid the empty if body.
Seven commits fixing bugs in the new papr_scm (Storage Class Memory)
driver. The driver was finally able to be tested on the other
hypervisor which exposed several bugs. The fixes are all fairly
minimal at least.
Fix a crash in our MSI code if an MSI-capable device is plugged into a
non-MSI capable PHB, only seen on older hardware (MPC8378).
Fix our legacy serial code to look for "stdout-path" since the device
trees were updated to use that instead of "linux,stdout-path".
A change to the COFF zImage code to fix booting old powermacs.
A couple of minor build fixes.
Thanks to: Benjamin Herrenschmidt, Daniel Axtens, Dmitry V. Levin,
Elvira Khabirova, Oliver O'Halloran, Paul Mackerras, Radu Rendec, Rob
Herring, Sandipan Das"
* tag 'powerpc-4.20-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call
powerpc/mm: Fallback to RAM if the altmap is unusable
powerpc/papr_scm: Use ibm,unit-guid as the iset cookie
powerpc/papr_scm: Fix DIMM device registration race
powerpc/papr_scm: Remove endian conversions
powerpc/papr_scm: Update DT properties
powerpc/papr_scm: Fix resource end address
powerpc/papr_scm: Use depend instead of select
powerpc/bpf: Fix broken uapi for BPF_PROG_TYPE_PERF_EVENT
powerpc/boot: Fix build failures with -j 1
powerpc: Look for "stdout-path" when setting up legacy consoles
powerpc/msi: Fix NULL pointer access in teardown code
powerpc/mm: Fix linux page tables build with some configs
powerpc: Fix COFF zImage booting on old powermacs
While the dma-direct code is (relatively) clean and simple we actually
have to use the swiotlb ops for the mapping on many architectures due
to devices with addressing limits. Instead of keeping two
implementations around this commit allows the dma-direct
implementation to call the swiotlb bounce buffering functions and
thus share the guts of the mapping implementation. This also
simplified the dma-mapping setup on a few architectures where we
don't have to differenciate which implementation to use.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Arch code should use tracehook_*() helpers, as documented in
include/linux/tracehook.h, ptrace_report_syscall() is not expected to
be used outside that file.
The patch does not look very nice, but at least it is correct
and opens the way for PTRACE_GET_SYSCALL_INFO API.
Co-authored-by: Dmitry V. Levin <ldv@altlinux.org>
Fixes: 5521eb4bca ("powerpc/ptrace: Add support for PTRACE_SYSEMU")
Signed-off-by: Elvira Khabirova <lineprinter@altlinux.org>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
[mpe: Take this as a minimal fix for 4.20, we'll rework it later]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powerpc iommu code already returns (~(dma_addr_t)0x0) on mapping
failures, so we can switch over to returning DMA_MAPPING_ERROR and let
the core dma-mapping code handle the rest.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
The dma-direct code already returns (~(dma_addr_t)0x0) on mapping
failures, so we can switch over to returning DMA_MAPPING_ERROR and let
the core dma-mapping code handle the rest.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Memblock list is another source for usable system memory layout.
So move powerpc's arch_kexec_walk_mem() to common code so that other
memblock-based architectures, particularly arm64, can also utilise it.
A moved function is now renamed to kexec_walk_memblock() and integrated
into kexec_locate_mem_hole(), which will now be usable for all
architectures with no need for overriding arch_kexec_walk_mem().
With this change, arch_kexec_walk_mem() need no longer be a weak function,
and was now renamed to kexec_walk_resources().
Since powerpc doesn't support kdump in its kexec_file_load(), the current
kexec_walk_memblock() won't work for kdump either in this form, this will
be fixed in the next patch.
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Acked-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
As this is running with MMU off, the CPU only does speculative
fetch for code in the same page.
Following the significant size reduction of TLB handler routines,
the side handlers can be brought back close to the main part,
ie in the same page.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch reworks the TLB Miss handler in order to not use r12
register, hence avoiding having to save it into SPRN_SPRG_SCRATCH2.
In the DAR Fixup code we can now use SPRN_M_TW, freeing
SPRN_SPRG_SCRATCH2.
Then SPRN_SPRG_SCRATCH2 may be used for something else in the future.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Today, on the 8xx the TLB handlers do SW tablewalk by doing all
the calculation in ASM, in order to match with the Linux page
table structure.
The 8xx offers hardware assistance which allows significant size
reduction of the TLB handlers, hence also reduces the time spent
in the handlers.
However, using this HW assistance implies some constraints on the
page table structure:
- Regardless of the main page size used (4k or 16k), the
level 1 table (PGD) contains 1024 entries and each PGD entry covers
a 4Mbytes area which is managed by a level 2 table (PTE) containing
also 1024 entries each describing a 4k page.
- 16k pages require 4 identifical entries in the L2 table
- 512k pages PTE have to be spread every 128 bytes in the L2 table
- 8M pages PTE are at the address pointed by the L1 entry and each
8M page require 2 identical entries in the PGD.
This patch modifies the TLB handlers to use HW assistance for 4K PAGES.
Before that patch, the mean time spent in TLB miss handlers is:
- ITLB miss: 80 ticks
- DTLB miss: 62 ticks
After that patch, the mean time spent in TLB miss handlers is:
- ITLB miss: 72 ticks
- DTLB miss: 54 ticks
So the improvement is 10% for ITLB and 13% for DTLB misses
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In preparation of making use of hardware assistance in TLB handlers,
this patch temporarily disables 16K pages and hugepages. The reason
is that when using HW assistance in 4K pages mode, the linux model
fit with the HW model for 4K pages and 8M pages.
However for 16K pages and 512K mode some additional work is needed
to get linux model fit with HW model.
For the 8M pages, they will naturaly come back when we switch to
HW assistance, without any additional handling.
In order to keep the following patch smaller, the removal of the
current special handling for 8M pages gets removed here as well.
Therefore the 4K pages mode will be implemented first and without
support for 512k hugepages. Then the 512k hugepages will be brought
back. And the 16K pages will be implemented in the following step.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to simplify time critical exceptions handling 8xx
specific SW perf counters, this patch moves the counters into
the beginning of memory. This is possible because .text is readable
and the counters are never modified outside of the handlers.
By doing this, we avoid having to set a second register with
the upper part of the address of the counters.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The purpose of this patch is to move platform specific
mmu-xxx.h files in platform directories like pte-xxx.h files.
In the meantime this patch creates common nohash and
nohash/32 + nohash/64 mmu.h files for future common parts.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a plan to build the kernel with -Wimplicit-fallthrough and these
places in the code produced warnings, but because we build arch/powerpc
with -Werror, they became errors. Fix them up.
This patch produces no change in behaviour, but should be reviewed in
case these are actually bugs not intentional fallthoughs.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of running with interrupts disabled, use a semaphore. This should
make it easier for backends that may need to sleep (e.g. EFI) when
performing a write:
|BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
|in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
|Preemption disabled at:
|[<ffffffff99d60512>] pstore_dump+0x72/0x330
|CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G D 4.20.0-rc3 #45
|Call Trace:
| dump_stack+0x4f/0x6a
| ___might_sleep.cold.91+0xd3/0xe4
| __might_sleep+0x50/0x90
| wait_for_completion+0x32/0x130
| virt_efi_query_variable_info+0x14e/0x160
| efi_query_variable_store+0x51/0x1a0
| efivar_entry_set_safe+0xa3/0x1b0
| efi_pstore_write+0x109/0x140
| pstore_dump+0x11c/0x330
| kmsg_dump+0xa4/0xd0
| oops_exit+0x22/0x30
...
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 21b3ddd39f ("efi: Don't use spinlocks for efi vars")
Signed-off-by: Kees Cook <keescook@chromium.org>
Commit 78e5dfea84 ("powerpc: dts: replace 'linux,stdout-path' with
'stdout-path'") broke the default console on a number of embedded
PowerPC systems, because it failed to also update the code in
arch/powerpc/kernel/legacy_serial.c to look for that property in
addition to the old one.
This fixes it.
Fixes: 78e5dfea84 ("powerpc: dts: replace 'linux,stdout-path' with 'stdout-path'")
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The arch_teardown_msi_irqs() function assumes that controller ops
pointers were already checked in arch_setup_msi_irqs(), but this
assumption is wrong: arch_teardown_msi_irqs() can be called even when
arch_setup_msi_irqs() returns an error (-ENOSYS).
This can happen in the following scenario:
- msi_capability_init() calls pci_msi_setup_msi_irqs()
- pci_msi_setup_msi_irqs() returns -ENOSYS
- msi_capability_init() notices the error and calls free_msi_irqs()
- free_msi_irqs() calls pci_msi_teardown_msi_irqs()
This is easier to see when CONFIG_PCI_MSI_IRQ_DOMAIN is not set and
pci_msi_setup_msi_irqs() and pci_msi_teardown_msi_irqs() are just
aliases to arch_setup_msi_irqs() and arch_teardown_msi_irqs().
The call to free_msi_irqs() upon pci_msi_setup_msi_irqs() failure
seems legit, as it does additional cleanup; e.g.
list_del(&entry->list) and kfree(entry) inside free_msi_irqs() do
happen (MSI descriptors are allocated before pci_msi_setup_msi_irqs()
is called and need to be cleaned up if that fails).
Fixes: 6b2fd7efeb ("PCI/MSI/PPC: Remove arch_msi_check_device()")
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Radu Rendec <radu.rendec@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function_graph_enter() function does the work of calling the function
graph hook function and the management of the shadow stack, simplifying the
work done in the architecture dependent prepare_ftrace_return().
Have powerpc use the new code, and remove the shadow stack management as well as
having to set up the trace structure.
This is needed to prepare for a fix of a design bug on how the curr_ret_stack
is used.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Today we have:
config PPC_BOOK3S_32
bool "512x/52xx/6xx/7xx/74xx/82xx/83xx/86xx"
[depends on PPC32 within a choice]
config PPC_BOOK3S
def_bool y
depends on PPC_BOOK3S_32 || PPC_BOOK3S_64
config PPC_STD_MMU
def_bool y
depends on PPC_BOOK3S
config PPC_STD_MMU_32
def_bool y
depends on PPC_STD_MMU && PPC32
PPC_STD_MMU_32 is therefore redundant with PPC_BOOK3S_32.
In order to make the code clearer, lets use preferably PPC_BOOK3S_32.
This will allow to remove CONFIG_PPC_STD_MMU_32 in a later patch.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Today we have:
config PPC_BOOK3S_32
bool "512x/52xx/6xx/7xx/74xx/82xx/83xx/86xx"
[depends on PPC32 within a choice]
config PPC_BOOK3S
def_bool y
depends on PPC_BOOK3S_32 || PPC_BOOK3S_64
config 6xx
def_bool y
depends on PPC32 && PPC_BOOK3S
6xx is therefore redundant with PPC_BOOK3S_32.
In order to make the code clearer, lets use preferably PPC_BOOK3S_32.
This will allow to remove CONFIG_6xx in a later patch.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove directly accessing device_node.type pointer and use the
accessors instead. This will eventually allow removing the type
pointer.
Replace the open coded iterating over child nodes with
for_each_child_of_node() while we're here.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove directly accessing device_node.type pointer and use the
accessors instead. This will eventually allow removing the type
pointer.
In the process, the of_stdout pointer can be used instead of finding
the stdout node again.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is no need to have the 'intoffset' variable static since new value
always be assigned before use it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When both `CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y` and `CONFIG_UBSAN=y`
are set, link step typically produce numberous warnings about orphan
section:
+ powerpc-linux-gnu-ld -EB -m elf32ppc -Bstatic --orphan-handling=warn --build-id --gc-sections -X -o .tmp_vmlinux1 -T ./arch/powerpc/kernel/vmlinux.lds --who
le-archive built-in.a --no-whole-archive --start-group lib/lib.a --end-group
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data393' from `init/main.o' being placed in section `.data..Lubsan_data393'.
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data394' from `init/main.o' being placed in section `.data..Lubsan_data394'.
...
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_type11' from `init/main.o' being placed in section `.data..Lubsan_type11'.
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_type12' from `init/main.o' being placed in section `.data..Lubsan_type12'.
...
This commit remove those warnings produced at W=1.
Link: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg135407.html
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Function pci_ers_result_name() is a static function, although not declared
as such. This was detected by sparse in the following warning
arch/powerpc/kernel/eeh_driver.c:63:12: warning: symbol 'pci_ers_result_name' was not declared. Should it be static?
This patch simply declares the function a static.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Current powerpc security.c file is defining functions, as
cpu_show_meltdown(), cpu_show_spectre_v{1,2} and others, that are being
declared at linux/cpu.h header without including the header file that
contains these declarations.
This is being reported by sparse, which thinks that these functions are
static, due to the lack of declaration:
arch/powerpc/kernel/security.c:105:9: warning: symbol 'cpu_show_meltdown' was not declared. Should it be static?
arch/powerpc/kernel/security.c:139:9: warning: symbol 'cpu_show_spectre_v1' was not declared. Should it be static?
arch/powerpc/kernel/security.c:161:9: warning: symbol 'cpu_show_spectre_v2' was not declared. Should it be static?
arch/powerpc/kernel/security.c:209:6: warning: symbol 'stf_barrier' was not declared. Should it be static?
arch/powerpc/kernel/security.c:289:9: warning: symbol 'cpu_show_spec_store_bypass' was not declared. Should it be static?
This patch simply includes the proper header (linux/cpu.h) to match
function definition and declaration.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 4c2de74cc8 ("powerpc/64: Interrupts save PPR on stack rather
than thread_struct") changed sizeof(struct pt_regs) % 16 from 0 to 8,
which causes the interrupt frame allocation on kernel entry to put the
kernel stack out of alignment.
Quadword (16-byte) alignment for the stack is required by both the
64-bit v1 ABI (v1.9 § 3.2.2) and the 64-bit v2 ABI (v1.1 § 2.2.2.1).
Add a pad field to fix alignment, and add a BUILD_BUG_ON to catch this
in future.
Fixes: 4c2de74cc8 ("powerpc/64: Interrupts save PPR on stack rather than thread_struct")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some things that I missed due to travel, or that came in late.
Two fixes also going to stable:
- A revert of a buggy change to the 8xx TLB miss handlers.
- Our flushing of SPE (Signal Processing Engine) registers on fork was broken.
Other changes:
- A change to the KVM decrementer emulation to use proper APIs.
- Some cleanups to the way we do code patching in the 8xx code.
- Expose the maximum possible memory for the system in /proc/powerpc/lparcfg.
- Merge some updates from Scott: "a couple device tree updates, and a fix for a
missing prototype warning."
A few other minor fixes and a handful of fixes for our selftests.
Thanks to:
Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe Leroy, Felipe Rechia,
Joel Stanley, Naveen N. Rao, Paul Mackerras, Scott Wood, Tyrel Datwyler.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb3C+uAAoJEFHr6jzI4aWAPJgQAIX0aD/PiYfEUI/rm/Q0vnJI
HO3FCKroi+LVF/URU24+NLA/1KGCBfO9by9m6D/nnmHl+vi2P69fFgokywO0Ajru
nf9a+9Gx53IbO7EEUf1fZVwxCMobBqU8eWq1hIBndd5HTz9QEftc/RXgpgqZcQ/x
x8xN1FNMSUT9NwMk750QDO7CFBrSfSjFC+/WrkViBaMiRWx2rwle+2tDipQ/fegY
Tsu4wg0qEzWMT//MFP4yOlkcLV8M6d2Sw65Km59rWHA1I2wqsTek1yQ68Epo9Co0
RdJh9Nt1kjLC5XXteneFhe18UUPKRmrXbYDFByw5CUhs5VI99Dq4w5kamh197XLr
+jA3XHAeAyaXf21I9zmmZXbhHanowCPZGyzZqZXWJ86bVJp5v328wXmnxtKrb0Nz
pH7fjQ6zjzsZgIcN9i2CFpIvuDQ/z1A/QyHdBnRvJ8HoXlTerZCn22JTgY7d2VJu
XJn1n+VABG2BrzJexW/7quY3Z7V6tvdkloWwOA3PdAwkcoImd4BfYq9K2DU//diN
BnXPnDs2K7JwDG9s+cgUEHnrP6DOKsxT+mmYpqXf0Ta0wtyoZJ1zgSdfZAUOmjnb
MhcK46l+F4E891qjnsuuVNnspqI4yPMLmAGmife5OUrfcoFdm/vFbM75FaXameBx
cMOmidrJJhO6z5eWSKvO
=VCkW
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some things that I missed due to travel, or that came in late.
Two fixes also going to stable:
- A revert of a buggy change to the 8xx TLB miss handlers.
- Our flushing of SPE (Signal Processing Engine) registers on fork
was broken.
Other changes:
- A change to the KVM decrementer emulation to use proper APIs.
- Some cleanups to the way we do code patching in the 8xx code.
- Expose the maximum possible memory for the system in
/proc/powerpc/lparcfg.
- Merge some updates from Scott: "a couple device tree updates, and a
fix for a missing prototype warning"
A few other minor fixes and a handful of fixes for our selftests.
Thanks to: Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe
Leroy, Felipe Rechia, Joel Stanley, Naveen N. Rao, Paul Mackerras,
Scott Wood, Tyrel Datwyler"
* tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (21 commits)
selftests/powerpc: Fix compilation issue due to asm label
selftests/powerpc/cache_shape: Fix out-of-tree build
selftests/powerpc/switch_endian: Fix out-of-tree build
selftests/powerpc/pmu: Link ebb tests with -no-pie
selftests/powerpc/signal: Fix out-of-tree build
selftests/powerpc/ptrace: Fix out-of-tree build
powerpc/xmon: Relax frame size for clang
selftests: powerpc: Fix warning for security subdir
selftests/powerpc: Relax L1d miss targets for rfi_flush test
powerpc/process: Fix flush_all_to_thread for SPE
powerpc/pseries: add missing cpumask.h include file
selftests/powerpc: Fix ptrace tm failure
KVM: PPC: Use exported tb_to_ns() function in decrementer emulation
powerpc/pseries: Export maximum memory value
powerpc/8xx: Use patch_site for perf counters setup
powerpc/8xx: Use patch_site for memory setup patching
powerpc/code-patching: Add a helper to get the address of a patch_site
Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"
powerpc/8xx: add missing header in 8xx_mmu.c
powerpc/8xx: Add DT node for using the SEC engine of the MPC885
...
When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.
Implicit alignment is done deep in the memblock allocator and it can
come as a surprise. Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.
Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.
For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.
The direct memblock APIs users were updated using the semantic patch below:
@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)
[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.
The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>
@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>
[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it explicit that the caller gets a physical address rather than a
virtual one.
This will also allow using meblock_alloc prefix for memblock allocations
returning virtual address, which is done in the following patches.
The conversion is done using the following semantic patch:
@@
expression e1, e2, e3;
@@
(
- memblock_alloc(e1, e2)
+ memblock_phys_alloc(e1, e2)
|
- memblock_alloc_nid(e1, e2, e3)
+ memblock_phys_alloc_nid(e1, e2, e3)
|
- memblock_alloc_try_nid(e1, e2, e3)
+ memblock_phys_alloc_try_nid(e1, e2, e3)
)
Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Notable changes:
- A large series to rewrite our SLB miss handling, replacing a lot of fairly
complicated asm with much fewer lines of C.
- Following on from that, we now maintain a cache of SLB entries for each
process and preload them on context switch. Leading to a 27% speedup for our
context switch benchmark on Power9.
- Improvements to our handling of SLB multi-hit errors. We now print more debug
information when they occur, and try to continue running by flushing the SLB
and reloading, rather than treating them as fatal.
- Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9).
- Add support for physical memory up to 2PB in the linear mapping on 64-bit
Book3S. We only support up to 512TB as regular system memory, otherwise the
percpu allocator runs out of vmalloc space.
- Add stack protector support for 32 and 64-bit, with a per-task canary.
- Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP.
- Support recognising "big cores" on Power9, where two SMT4 cores are presented
to us as a single SMT8 core.
- A large series to cleanup some of our ioremap handling and PTE flags.
- Add a driver for the PAPR SCM (storage class memory) interface, allowing
guests to operate on SCM devices (acked by Dan).
- Changes to our ftrace code to handle very large kernels, where we need to use
a trampoline to get to ftrace_caller().
Many other smaller enhancements and cleanups.
Thanks to:
Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton Blanchard, Aravinda
Prasad, Bartlomiej Zolnierkiewicz, Benjamin Herrenschmidt, Breno Leitao,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Dan Carpenter, Daniel
Axtens, Finn Thain, Gautham R. Shenoy, Gustavo Romero, Haren Myneni, Hari
Bathini, Jia Hongtao, Joel Stanley, John Allen, Laurent Dufour, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael
Bringmann, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran,
Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab, Rob Herring, Sam
Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan Johnson, Stephen Rothwell,
Stewart Smith, Suraj Jitindar Singh, Tyrel Datwyler, Vaibhav Jain, Vasant
Hegde, YueHaibing, zhong jiang,
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb01vTAAoJEFHr6jzI4aWADsEP/jqL3+2qxs098ra80tpXCpXJ
tgXCosEs4b35sGtyHeUWZZZfWXeisaPAIlP8zTx1n50HACZduDYRAl0Ew9XB7Xdw
enDHRVccD21FsmHBOx/Ii1rVJlovWlj6EQCWHKeZmNjeRoFuClVZ7CYmf+mBifKR
sw2Db2fKA/59wMTq2zIMy5pqYgqlAs4jTWS6uN5hKPoBmO/82ARnNG+qgLuloD3Z
O8zSDM9QQ7PpuyDgTjO9SAo2YjmEfXlEG6cOCCejsU3DMctaEAK5PUZ+blsHYHBH
BYZYKs/x4pcw0SO41GtTh0M2YqDYBVuBIpRw8lLZap97Xo9ucSkAm5WD3rGxk4CY
YeZKEPUql6MHN3+DKl8mx2F0V+Et/tio2HNqc9KReR1tfoolZAbe+SFZHfgmc/Rq
RD9nnG8KRd4K2K1BTqpkTmI1EtE7jPtPJPSV8gMGhgL/N5vPmH3mql/qyOtYx48E
6/hPzWESgs16VRZ/opLh8VvjlY1HBDODQhehhhl+o23/Vb8qEgRf8Uqhq50rQW1H
EeOqyyYQ90txSU31Sgy1kQkvOgIFAsBObWT1ZCJ3RbfGbB4/tdEAvZqTZRlXo2OY
7P0Sqcw/9Le5eJkHIlLtBv0TF7y1OYemCbLgRQzFlcRP+UKtYyg8eFnFjqbPEEmP
ulwhn/BfFVSgaYKQ503u
=I0pj
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A large series to rewrite our SLB miss handling, replacing a lot of
fairly complicated asm with much fewer lines of C.
- Following on from that, we now maintain a cache of SLB entries for
each process and preload them on context switch. Leading to a 27%
speedup for our context switch benchmark on Power9.
- Improvements to our handling of SLB multi-hit errors. We now print
more debug information when they occur, and try to continue running
by flushing the SLB and reloading, rather than treating them as
fatal.
- Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9).
- Add support for physical memory up to 2PB in the linear mapping on
64-bit Book3S. We only support up to 512TB as regular system
memory, otherwise the percpu allocator runs out of vmalloc space.
- Add stack protector support for 32 and 64-bit, with a per-task
canary.
- Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP.
- Support recognising "big cores" on Power9, where two SMT4 cores are
presented to us as a single SMT8 core.
- A large series to cleanup some of our ioremap handling and PTE
flags.
- Add a driver for the PAPR SCM (storage class memory) interface,
allowing guests to operate on SCM devices (acked by Dan).
- Changes to our ftrace code to handle very large kernels, where we
need to use a trampoline to get to ftrace_caller().
And many other smaller enhancements and cleanups.
Thanks to: Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton
Blanchard, Aravinda Prasad, Bartlomiej Zolnierkiewicz, Benjamin
Herrenschmidt, Breno Leitao, Cédric Le Goater, Christophe Leroy,
Christophe Lombard, Dan Carpenter, Daniel Axtens, Finn Thain, Gautham
R. Shenoy, Gustavo Romero, Haren Myneni, Hari Bathini, Jia Hongtao,
Joel Stanley, John Allen, Laurent Dufour, Madhavan Srinivasan, Mahesh
Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael Bringmann,
Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver
O'Halloran, Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab,
Rob Herring, Sam Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan
Johnson, Stephen Rothwell, Stewart Smith, Suraj Jitindar Singh, Tyrel
Datwyler, Vaibhav Jain, Vasant Hegde, YueHaibing, zhong jiang"
* tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (221 commits)
Revert "selftests/powerpc: Fix out-of-tree build errors"
powerpc/msi: Fix compile error on mpc83xx
powerpc: Fix stack protector crashes on CPU hotplug
powerpc/traps: restore recoverability of machine_check interrupts
powerpc/64/module: REL32 relocation range check
powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd
selftests/powerpc: Add a test of wild bctr
powerpc/mm: Fix page table dump to work on Radix
powerpc/mm/radix: Display if mappings are exec or not
powerpc/mm/radix: Simplify split mapping logic
powerpc/mm/radix: Remove the retry in the split mapping logic
powerpc/mm/radix: Fix small page at boundary when splitting
powerpc/mm/radix: Fix overuse of small pages in splitting logic
powerpc/mm/radix: Fix off-by-one in split mapping logic
powerpc/ftrace: Handle large kernel configs
powerpc/mm: Fix WARN_ON with THP NUMA migration
selftests/powerpc: Fix out-of-tree build errors
powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected
powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64
powerpc/time: isolate scaled cputime accounting in dedicated functions.
...
- various swiotlb cleanups
- do not dip into the ѕwiotlb pool for dma coherent allocations
- add support for not cache coherent DMA to swiotlb
- switch ARM64 to use the generic swiotlb_dma_ops
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlvTDNELHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOfdg//a4nGw5bTwj2uol/oS/vOdpE0otR3XpnNC3oopHd3
3xbb0LUR12A0YJmV5Ao9iIRFc60boLNblZglho8Diqks3P9+nYWCQ2rQ+o7ZsfK9
xGBNP0HrdGKPBzs9JodvhP/WBNWEReNbb6KwMe63LtDqUqNDNFmlYfxUgQ+fbgST
ZscR7iHJT1Nl0PoJ0mVziPFDA5DFB6JN7z0nU9uddVXLM7MPOFVX/WsFdLZLg/8W
/jgQ1gQ+gRMBSyctR8u+LG5KTxtYofiPo46lbyuSa5sx+cnZmjo7Uk5OZgs3soHn
gnobSjxgNbPgt8ttaKoFtOpyuO/3cLQeObS8i2MQuzvYpUD/nBT5q6l8XZBAFQCc
BkOe0qaBa2BrSG2O7BRgagHuSUjpgkxR3vyVT5e6eyr7uFk+LWslJKx4Uem3uIRf
yfNQzx91Xaw7ypaF7//jVPeWucLFZuwYsy13ZAg5D71vYpZBkfgHoMSCrmZQT9o/
V1ijmJy2ZDlGIKH95A0Utr6QKrci1drigg8Zce5A2E2EPAzTiro90ZeGT2tcJmVs
LOrbxEN1Dctkkk5l6PJE3k4zYCP7rW4lbW7wPcLZrJoptsm3sQprE10zaAFGKks6
vPLY615FK/JydJ9AIjx6PhFiY6M6Zb2er6CbRo18BjpqjKHgmK0A/uCbLhN+Qnac
IoE=
=pI5p
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.20-1' of git://git.infradead.org/users/hch/dma-mapping
Pull more dma-mapping updates from Christoph Hellwig:
- various swiotlb cleanups
- do not dip into the ѕwiotlb pool for dma coherent allocations
- add support for not cache coherent DMA to swiotlb
- switch ARM64 to use the generic swiotlb_dma_ops
* tag 'dma-mapping-4.20-1' of git://git.infradead.org/users/hch/dma-mapping:
arm64: use the generic swiotlb_dma_ops
swiotlb: add support for non-coherent DMA
swiotlb: don't dip into swiotlb pool for coherent allocations
swiotlb: refactor swiotlb_map_page
swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs
swiotlb: merge swiotlb_unmap_page and unmap_single
swiotlb: remove the overflow buffer
swiotlb: do not panic on mapping failures
swiotlb: mark is_swiotlb_buffer static
swiotlb: remove a pointless comment
Fix a bug introduced by the creation of flush_all_to_thread() for
processors that have SPE (Signal Processing Engine) and use it to
compute floating-point operations.
>From userspace perspective, the problem was seen in attempts of
computing floating-point operations which should generate exceptions.
For example:
fork();
float x = 0.0 / 0.0;
isnan(x); // forked process returns False (should be True)
The operation above also should always cause the SPEFSCR FINV bit to
be set. However, the SPE floating-point exceptions were turned off
after a fork().
Kernel versions prior to the bug used flush_spe_to_thread(), which
first saves SPEFSCR register values in tsk->thread and then calls
giveup_spe(tsk).
After commit 579e633e76, the save_all() function was called first
to giveup_spe(), and then the SPEFSCR register values were saved in
tsk->thread. This would save the SPEFSCR register values after
disabling SPE for that thread, causing the bug described above.
Fixes 579e633e76 ("powerpc: create flush_all_to_thread()")
Signed-off-by: Felipe Rechia <felipe.rechia@datacom.com.br>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 8xx TLB miss routines are patched when (de)activating
perf counters.
This patch uses the new patch_site functionality in order
to get a better code readability and avoid a label mess when
dumping the code with 'objdump -d'
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 8xx TLB miss routines are patched at startup at several places.
This patch uses the new patch_site functionality in order
to get a better code readability and avoid a label mess when
dumping the code with 'objdump -d'
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commit 4f94b2c746.
That commit was buggy, as it used rlwinm instead of rlwimi.
Instead of fixing that bug, we revert the previous commit in order to
reduce the dependency between L1 entries and L2 entries
Fixes: 4f94b2c746 ("powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
ARM:
- Improved guest IPA space support (32 to 52 bits)
- RAS event delivery for 32bit
- PMU fixes
- Guest entry hardening
- Various cleanups
- Port of dirty_log_test selftest
PPC:
- Nested HV KVM support for radix guests on POWER9. The performance is
much better than with PR KVM. Migration and arbitrary level of
nesting is supported.
- Disable nested HV-KVM on early POWER9 chips that need a particular hardware
bug workaround
- One VM per core mode to prevent potential data leaks
- PCI pass-through optimization
- merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base
s390:
- Initial version of AP crypto virtualization via vfio-mdev
- Improvement for vfio-ap
- Set the host program identifier
- Optimize page table locking
x86:
- Enable nested virtualization by default
- Implement Hyper-V IPI hypercalls
- Improve #PF and #DB handling
- Allow guests to use Enlightened VMCS
- Add migration selftests for VMCS and Enlightened VMCS
- Allow coalesced PIO accesses
- Add an option to perform nested VMCS host state consistency check
through hardware
- Automatic tuning of lapic_timer_advance_ns
- Many fixes, minor improvements, and cleanups
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJb0FINAAoJEED/6hsPKofoI60IAJRS3vOAQ9Fav8cJsO1oBHcX
3+NexfnBke1bzrjIR3SUcHKGZbdnVPNZc+Q4JjIbPpPmmOMU5jc9BC1dmd5f4Vzh
BMnQ0yCvgFv3A3fy/Icx1Z8NJppxosdmqdQLrQrNo8aD3cjnqY2yQixdXrAfzLzw
XEgKdIFCCz8oVN/C9TT4wwJn6l9OE7BM5bMKGFy5VNXzMu7t64UDOLbbjZxNgi1g
teYvfVGdt5mH0N7b2GPPWRbJmgnz5ygVVpVNQUEFrdKZoCm6r5u9d19N+RRXAwan
ZYFj10W2T8pJOUf3tryev4V33X7MRQitfJBo4tP5hZfi9uRX89np5zP1CFE7AtY=
=yEPW
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- Improved guest IPA space support (32 to 52 bits)
- RAS event delivery for 32bit
- PMU fixes
- Guest entry hardening
- Various cleanups
- Port of dirty_log_test selftest
PPC:
- Nested HV KVM support for radix guests on POWER9. The performance
is much better than with PR KVM. Migration and arbitrary level of
nesting is supported.
- Disable nested HV-KVM on early POWER9 chips that need a particular
hardware bug workaround
- One VM per core mode to prevent potential data leaks
- PCI pass-through optimization
- merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base
s390:
- Initial version of AP crypto virtualization via vfio-mdev
- Improvement for vfio-ap
- Set the host program identifier
- Optimize page table locking
x86:
- Enable nested virtualization by default
- Implement Hyper-V IPI hypercalls
- Improve #PF and #DB handling
- Allow guests to use Enlightened VMCS
- Add migration selftests for VMCS and Enlightened VMCS
- Allow coalesced PIO accesses
- Add an option to perform nested VMCS host state consistency check
through hardware
- Automatic tuning of lapic_timer_advance_ns
- Many fixes, minor improvements, and cleanups"
* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
Revert "kvm: x86: optimize dr6 restore"
KVM: PPC: Optimize clearing TCEs for sparse tables
x86/kvm/nVMX: tweak shadow fields
selftests/kvm: add missing executables to .gitignore
KVM: arm64: Safety check PSTATE when entering guest and handle IL
KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
arm/arm64: KVM: Enable 32 bits kvm vcpu events support
arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
KVM: arm64: Fix caching of host MDCR_EL2 value
KVM: VMX: enable nested virtualization by default
KVM/x86: Use 32bit xor to clear registers in svm.c
kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
kvm: vmx: Defer setting of DR6 until #DB delivery
kvm: x86: Defer setting of CR2 until #PF delivery
kvm: x86: Add payload operands to kvm_multiple_exception
kvm: x86: Add exception payload fields to kvm_vcpu_events
kvm: x86: Add has_payload and payload to kvm_queued_exception
KVM: Documentation: Fix omission in struct kvm_vcpu_events
KVM: selftests: add Enlightened VMCS test
...
Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:
- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.
- An overhaul of the SHCMT clocksource driver
- SPDX license identifier updates
- Small cleanups and fixes all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...
Pull security subsystem updates from James Morris:
"In this patchset, there are a couple of minor updates, as well as some
reworking of the LSM initialization code from Kees Cook (these prepare
the way for ordered stackable LSMs, but are a valuable cleanup on
their own)"
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
LSM: Don't ignore initialization failures
LSM: Provide init debugging infrastructure
LSM: Record LSM name in struct lsm_info
LSM: Convert security_initcall() into DEFINE_LSM()
vmlinux.lds.h: Move LSM_TABLE into INIT_DATA
LSM: Convert from initcall to struct lsm_info
LSM: Remove initcall tracing
LSM: Rename .security_initcall section to .lsm_info
vmlinux.lds.h: Avoid copy/paste of security_init section
LSM: Correctly announce start of LSM initialization
security: fix LSM description location
keys: Fix the use of the C++ keyword "private" in uapi/linux/keyctl.h
seccomp: remove unnecessary unlikely()
security: tomoyo: Fix obsolete function
security/capabilities: remove check for -EINVAL
Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
Recently in commit 7241d26e81 ("powerpc/64: properly initialise
the stackprotector canary on SMP.") we fixed a crash with stack
protector on SMP by initialising the stack canary in
cpu_idle_thread_init().
But this can also causes crashes, when a CPU comes back online after
being offline:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pnv_smp_cpu_kill_self+0x2a0/0x2b0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-rc3-gcc-7.3.1-00168-g4ffe713b7587 #94
Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
panic+0x144/0x328
__stack_chk_fail+0x2c/0x30
pnv_smp_cpu_kill_self+0x2a0/0x2b0
cpu_die+0x48/0x70
arch_cpu_idle_dead+0x20/0x40
do_idle+0x274/0x390
cpu_startup_entry+0x38/0x50
start_secondary+0x5e4/0x600
start_secondary_prolog+0x10/0x14
Looking at the stack we see that the canary value in the stack frame
doesn't match the canary in the task/paca. That is because we have
reinitialised the task/paca value, but then the CPU coming online has
returned into a function using the old canary value. That causes the
comparison to fail.
Instead we can call boot_init_stack_canary() from start_secondary()
which never returns. This is essentially what the generic code does in
cpu_startup_entry() under #ifdef X86, we should make that non-x86
specific in a future patch.
Fixes: 7241d26e81 ("powerpc/64: properly initialise the stackprotector canary on SMP.")
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
commit b96672dd84 ("powerpc: Machine check interrupt is a non-
maskable interrupt") added a call to nmi_enter() at the beginning of
machine check restart exception handler. Due to that, in_interrupt()
always returns true regardless of the state before entering the
exception, and die() panics even when the system was not already in
interrupt.
This patch calls nmi_exit() before calling die() in order to restore
the interrupt state we had before calling nmi_enter()
Fixes: b96672dd84 ("powerpc: Machine check interrupt is a non-maskable interrupt")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The recent module relocation overflow crash demonstrated that we
have no range checking on REL32 relative relocations. This patch
implements a basic check, the same kernel that previously oopsed
and rebooted now continues with some of these errors when loading
the module:
module_64: x_tables: REL32 527703503449812 out of range!
Possibly other relocations (ADDR32, REL16, TOC16, etc.) should also have
overflow checks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, we expect to be able to reach ftrace_caller() from all
ftrace-enabled functions through a single relative branch. With large
kernel configs, we see functions outside of 32MB of ftrace_caller()
causing ftrace_init() to bail.
In such configurations, gcc/ld emits two types of trampolines for mcount():
1. A long_branch, which has a single branch to mcount() for functions that
are one hop away from mcount():
c0000000019e8544 <00031b56.long_branch._mcount>:
c0000000019e8544: 4a 69 3f ac b c00000000007c4f0 <._mcount>
2. A plt_branch, for functions that are farther away from mcount():
c0000000051f33f8 <0008ba04.plt_branch._mcount>:
c0000000051f33f8: 3d 82 ff a4 addis r12,r2,-92
c0000000051f33fc: e9 8c 04 20 ld r12,1056(r12)
c0000000051f3400: 7d 89 03 a6 mtctr r12
c0000000051f3404: 4e 80 04 20 bctr
We can reuse those trampolines for ftrace if we can have those
trampolines go to ftrace_caller() instead. However, with ABIv2, we
cannot depend on r2 being valid. As such, we use only the long_branch
trampolines by patching those to instead branch to ftrace_caller or
ftrace_regs_caller.
In addition, we add additional trampolines around .text and .init.text
to catch locations that are covered by the plt branches. This allows
ftrace to work with most large kernel configurations.
For now, we always patch the trampolines to go to ftrace_regs_caller,
which is slightly inefficient. This can be optimized further at a later
point.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If CONFIG_PPC_SPLPAR is not selected, steal_time will always
be NUL, so accounting it is pointless
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.
Removing it on PPC32 significantly reduces the size of
vtime_account_system() and vtime_account_idle() on an 8xx:
Before:
00000000 l F .text 000000a8 vtime_delta
00000280 g F .text 0000010c vtime_account_system
0000038c g F .text 00000048 vtime_account_idle
After:
(vtime_delta gets inlined inside the two functions)
000001d8 g F .text 000000a0 vtime_account_system
00000278 g F .text 00000038 vtime_account_idle
In terms of performance, we also get approximatly 7% improvement on
task switch. The following small benchmark app is run with perf stat:
void *thread(void *arg)
{
int i;
for (i = 0; i < atoi((char*)arg); i++)
pthread_yield();
}
int main(int argc, char **argv)
{
pthread_t th1, th2;
pthread_create(&th1, NULL, thread, argv[1]);
pthread_create(&th2, NULL, thread, argv[1]);
pthread_join(th1, NULL);
pthread_join(th2, NULL);
return 0;
}
Before the patch:
Performance counter stats for 'chrt -f 98 ./sched 100000' (50 runs):
8228.476465 task-clock (msec) # 0.954 CPUs utilized ( +- 0.23% )
200004 context-switches # 0.024 M/sec ( +- 0.00% )
After the patch:
Performance counter stats for 'chrt -f 98 ./sched 100000' (50 runs):
7649.070444 task-clock (msec) # 0.955 CPUs utilized ( +- 0.27% )
200004 context-switches # 0.026 M/sec ( +- 0.00% )
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.
In preparation of the following patch that will remove
CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC32, this patch moves
all scaled cputing accounting logic into dedicated functions.
This patch doesn't change any functionality. It's only code
reorganisation.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
module_frob_arch_sections() is called before the module is moved to its
final location. The function descriptor section addresses we are setting
here are thus invalid. Fix this by processing opd section during
module_finalize()
Fixes: 5633e85b2c ("powerpc64: Add .opd based function descriptor dereference")
Cc: stable@vger.kernel.org # v4.16
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Like all other dma mapping drivers just return an error code instead
of an actual memory buffer. The reason for the overflow buffer was
that at the time swiotlb was invented there was no way to check for
dma mapping errors, but this has long been fixed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In the recent commit 8b78fdb045 ("powerpc/time: Use
clockevents_register_device(), fixing an issue with large
decrementer") we changed the way we initialise the decrementer
clockevent(s).
We no longer initialise the mult & shift values of
decrementer_clockevent itself.
This has the effect of breaking PR KVM, because it uses those values
in kvmppc_emulate_dec(). The symptom is guest kernels spin forever
mid-way through boot.
For now fix it by assigning back to decrementer_clockevent the mult
and shift values.
Fixes: 8b78fdb045 ("powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back when I added -Werror in commit ba55bd7436 ("powerpc: Add
configurable -Werror for arch/powerpc") I did it by adding it to most
of the arch Makefiles.
At the time we excluded math-emu, because apparently it didn't build
cleanly. But that seems to have been fixed somewhere in the interim.
So move the -Werror addition to the top-level of the arch, this saves
us from repeating it in every Makefile and means we won't forget to
add it to any new sub-dirs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
do_exit() already includes a test to panic() is in_interrupt()
This patch removes powerpc one which is redundant.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When creating the boot-time FDT from an actual Open Firmware live
tree, let's generate "phandle" properties for the phandles instead
of the old deprecated "linux,phandle".
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Unsplit warning printf()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
prom_init.c must not modify the kernel image outside
of the .bss.prominit section. Thus make sure that
prom_init.o doesn't have anything in any of these:
.data
.bss
.init.data
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This makes __prombss its own section, and for now store
it in .bss.
This will give us the ability later to store it elsewhere
and/or free it after boot (it's about 8KB).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As they are no longer used past the end of prom_init
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Make the existing initialized definition constant and copy
it to a __prombss copy
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Initialize it dynamically instead of statically
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We removed support for running under any OPAL version
earlier than v3 in 2015 (they never saw the light of day
anyway), but we kept some leftovers of this support in
prom_init.c, so let's take it out.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This replaces all occurrences of __initdata for uninitialized
data with a new __prombss
Currently __promdata is defined to be __initdata but we'll
eventually change that.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements support for discovering storage class memory
devices at boot and for handling hotplug of new regions via RTAS
hotplug events.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Fix CONFIG_MEMORY_HOTPLUG=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When printing the machine check cause, the cause appears on the
following line due to bad use of printk without \n:
[ 33.663993] Machine check in kernel mode.
[ 33.664011] Caused by (from SRR1=9032):
[ 33.664036] Data access error at address c90c8000
This patch fixes it by using pr_cont() for the second part:
[ 133.258131] Machine check in kernel mode.
[ 133.258146] Caused by (from SRR1=9032): Data access error at address c90c8000
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
slb_flush_and_rebolt() is misleading, it is called in virtual mode, so
it can not possibly change the stack, so it should not be touching the
shadow area. And since vmalloc is no longer bolted, it should not
change any bolted mappings at all.
Change the name to slb_flush_and_restore_bolted(), and have it just
load the kernel stack from what's currently in the shadow SLB area.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security masures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory must not be accessed when handling kernel
space SLB misses, so care should be taken there. However user SLB
misses can access any kernel memory, which can be used to move some
fields out of the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to
bad address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Disallow tracing for all of slb.c for now.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPR is the odd register out when it comes to interrupt handling, it is
saved in current->thread.ppr while all others are saved on the stack.
The difficulty with this is that accessing thread.ppr can cause a SLB
fault, but the SLB fault handler implementation in C change had
assumed the normal exception entry handlers would not cause an SLB
fault.
Fix this by allocating room in the interrupt stack to save PPR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we've split the user & kernel versions of pt_regs we need to
be more careful in the ptrace code.
For now we've ensured the location of the fields in both structs is
the same, so most of the ptrace code doesn't need updating.
But there are a few places where we use sizeof(pt_regs), and these
will be wrong as soon as we increase the size of the kernel structure.
So flip them all to use sizeof(user_pt_regs).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We use a shared definition for struct pt_regs in uapi/asm/ptrace.h.
That means the layout of the structure is ABI, ie. we can't change it.
That would be fine if it was only used to describe the user-visible
register state of a process, but it's also the struct we use in the
kernel to describe the registers saved in an interrupt frame.
We'd like more flexibility in the content (and possibly layout) of the
kernel version of the struct, but currently that's not possible.
So split the definition into a user-visible definition which remains
unchanged, and a kernel internal one.
At the moment they're still identical, and we check that at build
time. That's because we have code (in ptrace etc.) that assumes that
they are the same. We will fix that code in future patches, and then
we can break the strict symmetry between the two structs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_PAGE_PRIVILEGED corresponds to the SH bit which doesn't protect
against user access but only disables ASID verification on kernel
accesses. User access is controlled with _PMD_USER flag.
Name it _PAGE_SH instead of _PAGE_PRIVILEGED
_PAGE_HUGE corresponds to the SPS bit which doesn't really tells
that's it is a huge page but only that it is not a 4k page.
Name it _PAGE_SPS instead of _PAGE_HUGE
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to avoid multiple conversions, handover directly a
pgprot_t to map_kernel_page() as already done for radix.
Do the same for __ioremap_caller() and __ioremap_at().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Set PAGE_KERNEL directly in the caller and do not rely on a
hack adding PAGE_KERNEL flags when _PAGE_PRESENT is not set.
As already done for PPC64, use pgprot_cache() helpers instead of
_PAGE_XXX flags in PPC32 ioremap() derived functions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In many places, ioremap_prot() and __ioremap() can be replaced with
higher level functions like ioremap(), ioremap_coherent(),
ioremap_cache(), ioremap_wc() ...
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Live Partition Migrations require all the present CPUs to execute the
H_JOIN call, and hence rtas_ibm_suspend_me() onlines any offline CPUs
before initiating the migration for this purpose.
The commit 85a88cabad
("powerpc/pseries: Disable CPU hotplug across migrations")
disables any CPU-hotplug operations once all the offline CPUs are
brought online to prevent any further state change. Once the
CPU-Hotplug operation is disabled, the code assumes that all the CPUs
are online.
However, there is a minor window in rtas_ibm_suspend_me() between
onlining the offline CPUs and disabling CPU-Hotplug when a concurrent
CPU-offline operations initiated by the userspace can succeed thereby
nullifying the the aformentioned assumption. In this unlikely case
these offlined CPUs will not call H_JOIN, resulting in a system hang.
Fix this by verifying that all the present CPUs are actually online
after CPU-Hotplug has been disabled, failing which we restore the
state of the offline CPUs in rtas_ibm_suspend_me() and return an
-EBUSY.
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently on POWER9 SMT8 cores systems, in sysfs, we report the
shared_cache_map for L1 caches (both data and instruction) to be the
cpu-ids of the threads in SMT8 cores. This is incorrect since on
POWER9 SMT8 cores there are two groups of threads, each of which
shares its own L1 cache.
This patch addresses this by reporting the shared_cpu_map correctly in
sysfs for L1 caches.
Before the patch
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 000000ff
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 000000ff
/sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000ff
/sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000ff
After the patch
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 00000055
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 00000055
/sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000aa
/sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000aa
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 SMT8 cores consist of two groups of threads, where threads in
each group shares L1-cache. The scheduler is not aware of this
distinction as the current sched-domain hierarchy has all the threads
of the core defined at the SMT domain.
SMT [Thread siblings of the SMT8 core]
DIE [CPUs in the same die]
NUMA [All the CPUs in the system]
Due to this, we can observe run-to-run variance when we run a
multi-threaded benchmark bound to a single core based on how the
scheduler spreads the software threads across the two groups in the
core.
We fix this in this patch by defining each group of threads which
share L1-cache to be the SMT level. The group of threads in the SMT8
core is defined to be the CACHE level. The sched-domain hierarchy
after this patch will be :
SMT [Thread siblings in the core that share L1 cache]
CACHE [Thread siblings that are in the SMT8 core]
DIE [CPUs in the same die]
NUMA [All the CPUs in the system]
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On IBM POWER9, the device tree exposes a property array identifed by
"ibm,thread-groups" which will indicate which groups of threads share
a particular set of resources.
As of today we only have one form of grouping identifying the group of
threads in the core that share the L1 cache, translation cache and
instruction data flow.
This patch adds helper functions to parse the contents of
"ibm,thread-groups" and populate a per-cpu variable to cache
information about siblings of each CPU that share the L1, traslation
cache and instruction data-flow.
It also defines a new global variable named "has_big_cores" which
indicates if the cores on this configuration have multiple groups of
threads that share L1 cache.
For each online CPU, it maintains a cpu_smallcore_mask, which
indicates the online siblings which share the L1-cache with it.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 6c1719942e ("powerpc/of: Remove useless register save/restore
when calling OF back") removed the saving of srr0 and srr1 when calling
into OpenFirmware. Commit e31aa453bb ("powerpc: Use LOAD_REG_IMMEDIATE
only for constants on 64-bit") did the same for rtas.
This means we don't need to save the extra stack space and can use
the common SWITCH_FRAME_SIZE.
There were already no users of _SRR0 and _SRR1 so we can remove them
too.
Link: https://github.com/linuxppc/linux/issues/83
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powerpc mobility code may receive RTAS requests to perform PRRN
(Platform Resource Reassignment Notification) topology changes at any
time, including during LPAR migration operations.
In some configurations where the affinity of CPUs or memory is being
changed on that platform, the PRRN requests may apply or refer to
outdated information prior to the complete update of the device-tree.
This patch changes the duration for which topology updates are
suppressed during LPAR migrations from just the rtas_ibm_suspend_me()
/ 'ibm,suspend-me' call(s) to cover the entire migration_store()
operation to allow all changes to the device-tree to be applied prior
to accepting and applying any PRRN requests.
For tracking purposes, pr_info notices are added to the functions
start_topology_update() and stop_topology_update() of 'numa.c'.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rather than mixing "if (state)" blocks and gotos, convert entirely to
"if (state)" blocks to make the state machine behaviour clearer.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The wait_state member of eeh_ops does not need to be platform
dependent; it's just logic around eeh_ops.get_state(). Therefore,
merge the two (slightly different!) platform versions into a new
function, eeh_wait_state() and remove the eeh_ops member.
While doing this, also correct:
* The wait logic, so that it never waits longer than max_wait.
* The wait logic, so that it never waits less than
EEH_STATE_MIN_WAIT_TIME.
* One call site where the result is treated like a bit field before
it's checked for negative error values.
* In pseries_eeh_get_state(), rename the "state" parameter to "delay"
because that's what it is.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, eeh_pe_state_mark() marks a PE (and it's children) with a
state and then performs additional processing if that state included
EEH_PE_ISOLATED.
The state parameter is always a constant at the call site, so
rearrange eeh_pe_state_mark() into two functions and just call the
appropriate one at each site.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function eeh_pe_state_mark_with_cfg() just performs the work of
eeh_pe_state_mark() and then, conditionally, the work of
eeh_pe_state_clear(). However it is only ever called with a constant
state such that the condition is always true, so replace it by direct
calls.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the call to eeh_dev_to_pe() up, so that later it's clear that
"pe" isn't NULL.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Change the name of the fields in eeh_rmv_data to clarify their usage.
Change "edev_list" to "removed_vf_list" because it does not contain
generic edevs, but rather only edevs that contain virtual functions
(which need to be removed during recovery).
Similarly, change "removed" to "removed_dev_count" because it is a
count of any removed devices, not just those in the above list.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instances of struct eeh_pe are placed in a tree structure using the
fields "child_list" and "child", so place these next to each other
in the definition.
The field "child" is a list entry, so remove the unnecessary and
misleading use of the list initializer, LIST_HEAD(), on it.
The eeh_dev struct contains two list entry fields, called "list" and
"rmv_list". Rename them to "entry" and "rmv_entry" and, as above, stop
initializing them with LIST_HEAD().
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove the unnecessary cast through void * on the first parameter and
remove the unused second parameter (always NULL).
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 'bus' member of struct eeh_dev is assigned to once but never used,
so remove it.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently a flag, EEH_POSTPONED_PROBE, is used to prevent an incorrect
message "EEH: No capable adapters found" from being displayed during
the boot of powernv systems.
It is necessary because, on powernv, the call to eeh_probe_devices()
made from eeh_init() is too early and EEH can't yet be enabled. A
second call is made later from eeh_pnv_post_init(), which succeeds.
(On pseries, the first call succeeds because PCI devices are set up
early enough and no second call is made.)
This can be simplified by moving the early call to eeh_probe_devices()
from eeh_init() (where it's seen by both platforms) to
pSeries_final_fixup(), so that each platform only calls
eeh_probe_devices() once, at a point where it can succeed.
This is slightly later in the boot sequence, but but still early
enough and it is now in the same place in the sequence for both
platforms (the pcibios_fixup hook).
The display of the message can be cleaned up as well, by moving it
into eeh_probe_devices().
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_add_to_parent_pe() sometimes removes the EEH_PE_KEEP flag, but it
incorrectly removes it from pe->type, instead of pe->state.
However, rather than clearing it from the correct field, remove it.
Inspection of the code shows that it can't ever have had any effect
(even if it had been cleared from the correct field), because the
field is never tested after it is cleared by the statement in
question.
The clear statement was added by commit 807a827d4e ("powerpc/eeh:
Keep PE during hotplug"), but it didn't explain why it was necessary.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If a device is removed during EEH processing (either by a driver's
handler or as part of recovery), it can lead to a null dereference
in eeh_pe_report_edev().
To handle this, skip devices that have been removed.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If an error occurs during an unplug operation, it's possible for
eeh_dump_dev_log() to be called when edev->pdn is null, which
currently leads to dereferencing a null pointer.
Handle this by skipping the error log for those devices.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently when we get an unknown RTAS event it prints the type as
"Unknown" and no other useful information. Add the raw type code to the
log message so that we have something to work off.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powerpc kernel uses setjmp which causes a warning when building
with clang:
In file included from arch/powerpc/xmon/xmon.c:51:
./arch/powerpc/include/asm/setjmp.h:15:13: error: declaration of
built-in function 'setjmp' requires inclusion of the header <setjmp.h>
[-Werror,-Wbuiltin-requires-header]
extern long setjmp(long *);
^
./arch/powerpc/include/asm/setjmp.h:16:13: error: declaration of
built-in function 'longjmp' requires inclusion of the header <setjmp.h>
[-Werror,-Wbuiltin-requires-header]
extern void longjmp(long *, long);
^
This *is* the header and we're not using the built-in setjump but
rather the one in arch/powerpc/kernel/misc.S. As the compiler warning
does not make sense, it for the files where setjmp is used.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
[mpe: Move subdir-ccflags in xmon/Makefile to not clobber -Werror]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
instructions_to_print var is assigned value 16 and there is no
way to change it.
This patch replaces it by a constant.
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When two processes crash at the same time, we sometimes encounter
interleaving in the middle of a line:
init[1]: segfault (11) at 0 nip 0 lr 0 code 1
init[1]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
init[74]: segfault (11) at 10a74 nip 1000c198 lr 100078c8 code 1 in sh[10000000+14000]
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
init[1]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
init[74]: code: 90010024 bf61000c 91490a7c 3fa01002 3be00000 7d3e4b78 3bbd0c20 3b600000
init[74]: code: 3b9d0040 7c7fe02e 2f830000 419e0028 <89230000> 2f890000 41be001c 4b7f6e79
This patch fixes it by preparing complete lines in a buffer and
printing it at once.
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Use seq_buf_printf() not seq_buf_puts() which doesn't NULL terminate]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As spotted by sparse:
arch/powerpc/kernel/process.c:1302:6: warning: symbol 'show_user_instructions' was not declared. Should it be static?
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch fixes the following warnings, which are leftovers
from when __get_user() was replaced by probe_kernel_address().
arch/powerpc/kernel/process.c:1287:22: warning: incorrect type in argument 2 (different address spaces)
arch/powerpc/kernel/process.c:1287:22: expected void const *src
arch/powerpc/kernel/process.c:1287:22: got unsigned int [noderef] <asn:1>*<noident>
arch/powerpc/kernel/process.c:1319:21: warning: incorrect type in argument 2 (different address spaces)
arch/powerpc/kernel/process.c:1319:21: expected void const *src
arch/powerpc/kernel/process.c:1319:21: got unsigned int [noderef] <asn:1>*<noident>
Fixes: 7b051f665c ("powerpc: Use probe_kernel_address in show_instructions")
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since the struct lsm_info table is not an initcall, we can just move it
into INIT_DATA like all the other tables.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: James Morris <james.morris@microsoft.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
hypervisor to enter one of its nested guests. The hypercall supplies
register values in two structs. Those values are copied by the level 0
(L0) hypervisor (the one which is running in hypervisor mode) into the
vcpu struct of the L1 guest, and then the guest is run until an
interrupt or error occurs which needs to be reported to L1 via the
hypercall return value.
Currently this assumes that the L0 and L1 hypervisors are the same
endianness, and the structs passed as arguments are in native
endianness. If they are of different endianness, the version number
check will fail and the hcall will be rejected.
Nested hypervisors do not support indep_threads_mode=N, so this adds
code to print a warning message if the administrator has set
indep_threads_mode=N, and treat it as Y.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits. This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When doing nested virtualization, it is only necessary to do the
transactional memory hypervisor assist at level 0, that is, when
we are in hypervisor mode. Nested hypervisors can just use the TM
facilities as architected. Therefore we should clear the
CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode,
along with the CPU_FTR_HVMODE bit.
Doing this will not change anything at this stage because the only
code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently
can only be used when when CPU_FTR_HVMODE is set.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Four regression fixes.
A fix for a change to lib/xz which broke our zImage loader when building with XZ
compression. OK'ed by Herbert who merged the original patch.
The recent fix we did to avoid patching __init text broke some 32-bit machines,
fix that.
Our show_user_instructions() could be tricked into printing kernel memory, add a
check to avoid that.
And a fix for a change to our NUMA initialisation logic, which causes crashes in
some kdump configurations.
Thanks to:
Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis Roos, Murilo
Opsfelder Araujo, Srikar Dronamraju.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlu4n8QTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgGJ+D/4nJ4ij6tb6LaIfMLxX7qCk43XFcXei
R0wMzliZp0sC3KqVgoeGZxpXL6vAI3G46Ix+4lTUEw2LxpDpkQBRXKVnaium2Cs7
qDSdycoWaqy5e40DRxUSno8oPtjMmDS43YnjY+/8ByA4hhF9XqzDGtF5EuCvc5Fr
Lsts+jT3fvxVEAQM/10KnCFXDAO3gh8HBbDX3AwMRwCyxuQXjLSqDfW0+gX5ZuOC
6LVZ2jGw1yen0mwFKPVWiJCMYeuaBn0u1LvzGb7XWHbITUBNGdx8YuOiLzAuOPYN
OfP7+TPDsp5lbyB5KWNOVUB2ajxXUCMeT4CWZ++0HpUi1rcbJIz8ksJ9ZjA+Y4Ij
FWOdauPdozDs6GUsvp1w9FHaD4B4KFDE7d5sKoIg73pKUPS+M/2JOIpNn/DMtg0a
RgArGaxcswIGaJLV+SW9K8laZb085HIAFJCKuHgRPixtz2t35y+f+lTyYVg/BIpF
lI/h60BSf7PMX0Jp1I1D8fUhmgFWcgz/rPN3J1BIG3jNjMmIU6bB+muGBlYZFDeC
tICbYa4FAKkc8KHCALb7iF2k3BFa28HxzHU6JbDm66JNaMRD0tSpTEgX7spMr7XS
RFySP+ofTBuzxwunsGWLjQy2rnY/DnFtbQgesDVByzTfCYhjQYqEdS38/LmWDZ6w
pLqBpNfSgvHd0g==
=FxNJ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Michael writes:
"powerpc fixes for 4.19 #4
Four regression fixes.
A fix for a change to lib/xz which broke our zImage loader when
building with XZ compression. OK'ed by Herbert who merged the
original patch.
The recent fix we did to avoid patching __init text broke some 32-bit
machines, fix that.
Our show_user_instructions() could be tricked into printing kernel
memory, add a check to avoid that.
And a fix for a change to our NUMA initialisation logic, which causes
crashes in some kdump configurations.
Thanks to:
Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis
Roos, Murilo Opsfelder Araujo, Srikar Dronamraju."
* tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/numa: Skip onlining a offline node in kdump path
powerpc: Don't print kernel instructions in show_user_instructions()
powerpc/lib: fix book3s/32 boot failure due to code patching
lib/xz: Put CRC32_POLY_LE in xz_private.h
Recently we implemented show_user_instructions() which dumps the code
around the NIP when a user space process dies with an unhandled
signal. This was modelled on the x86 code, and we even went so far as
to implement the exact same bug, namely that if the user process
crashed with its NIP pointing into the kernel we will dump kernel text
to dmesg. eg:
bad-bctr[2996]: segfault (11) at c000000000010000 nip c000000000010000 lr 12d0b0894 code 1
bad-bctr[2996]: code: fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
bad-bctr[2996]: code: 3860001c f8010080 48242371 60000000 <7c7b1b79> 4082002c e8010080 eb610048
This was discovered on x86 by Jann Horn and fixed in commit
342db04ae7 ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").
Fix it by checking the adjusted NIP value (pc) and number of
instructions against USER_DS, and bail if we fail the check, eg:
bad-bctr[2969]: segfault (11) at c000000000010000 nip c000000000010000 lr 107930894 code 1
bad-bctr[2969]: Bad NIP, not dumping instructions.
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.
Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidatig the SLB, but before POWER9 that
requires a flush and rebolt.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
0x7fffffff:
if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
set_dec(0x7fffffff);
else
set_dec(decrementer_max);
If there are no future events, we don't reprogram the decrementer
after this and we end up with 0x7fffffff even on a large decrementer
capable system.
As suggested by Nick, add a set_state_oneshot_stopped callback
so we program the decrementer with decrementer_max if there are
no future events.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add call to early_memtest() so that kernel compiled with
CONFIG_MEMTEST really perform memtest at startup when requested
via 'memtest' boot parameter.
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The comments in this file don't conform to the coding style so take
them to "Comment Formatting Re-Education Camp".
Suggested-by: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Reflow some comments and add full stops, fix spelling of Sergeant.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The code in machine_check_exception excludes 64s hvmode when
incrementing the MCE counter only to call opal_machine_check to
increment it specifically for this case.
Remove the exclusion and special case.
Fixes: a43c159042 ("powerpc/pseries: Flush SLB contents on SLB MCE
errors.")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.
This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
This function is not necessary, since MSR_TM_ACTIVE() just do the same and
could be used, removing the dualism and simplifying the code.
This patchset remove every instance of msr_tm_active() and replaced it
by MSR_TM_ACTIVE().
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a patch that adds support for PTRACE_SYSEMU ptrace request in
PowerPC architecture.
When ptrace(PTRACE_SYSEMU, ...) request is called, it will be handled by
the arch independent function ptrace_resume(), which will tag the task with
the TIF_SYSCALL_EMU flag. This flag needs to be handled from a platform
dependent point of view, which is what this patch does.
This patch adds this task's flag as part of the _TIF_SYSCALL_DOTRACE, which
is the MACRO that is used to trace syscalls at entrance/exit.
Since TIF_SYSCALL_EMU is now part of _TIF_SYSCALL_DOTRACE, if the task has
_TIF_SYSCALL_DOTRACE set, it will hit do_syscall_trace_enter() at syscall
entrance and do_syscall_trace_leave() at syscall leave.
do_syscall_trace_enter() needs to handle the TIF_SYSCALL_EMU flag properly,
which will interrupt the syscall executing if TIF_SYSCALL_EMU is set. The
output values should not be changed, i.e. the return value (r3) should
contain the original syscall argument on exit.
With this flag set, the syscall is not executed fundamentally, because
do_syscall_trace_enter() is returning -1 which is bigger than NR_syscall,
thus, skipping the syscall execution and exiting userspace.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Moving TIF_32BIT to use bit 20 instead of 4 in the task flag field.
This change is making room for an upcoming new task macro
(_TIF_SYSCALL_EMU) which is preferred to set a bit in the lower 16-bits
part of the word.
This upcoming flag macro will take part in a composed macro
(_TIF_SYSCALL_DOTRACE) which will contain other flags as well, and it is
preferred that the whole _TIF_SYSCALL_DOTRACE macro only sets the lower 16
bits of a word, so, it could be handled using immediate operations (as load
immediate, add immediate, ...) where the immediate operand (SI) is limited
to 16-bits.
Another possible solution would be using the LOAD_REG_IMMEDIATE() macro
to load a full 64-bits word immediate, but it takes 5 operations instead of
one.
Having TIF_32BITS being redefined to use an upper bit is not a problem
since there is only one place in the assembly code where TIF_32BIT is being
used, and it could be replaced with an operation with right shift (addis),
since it is used alone, i.e. not being part of a composed macro, which has
different bits set, and would require LOAD_REG_IMMEDIATE().
Tested on a 64 bits Big Endian machine running a 32 bits task.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PPC64, as register r13 points to the paca_struct at all time,
this patch adds a copy of the canary there, which is copied at
task_switch.
That new canary is then used by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r13
-mstack-protector-guard-offset=offsetof(struct paca_struct, canary))
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This functionality was tentatively added in the past
(commit 6533b7c16e ("powerpc: Initial stack protector
(-fstack-protector) support")) but had to be reverted
(commit f2574030b0 ("powerpc: Revert the initial stack
protector support") because of GCC implementing it differently
whether it had been built with libc support or not.
Now, GCC offers the possibility to manually set the
stack-protector mode (global or tls) regardless of libc support.
This time, the patch selects HAVE_STACKPROTECTOR only if
-mstack-protector-guard=tls is supported by GCC.
On PPC32, as register r2 points to current task_struct at
all time, the stack_canary located inside task_struct can be
used directly by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r2
-mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary))
The protector is disabled for prom_init and bootx_init as
it is too early to handle it properly.
$ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT
[ 134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64
[ 134.943666]
[ 134.955414] CPU: 0 PID: 283 Comm: sh Not tainted 4.18.0-s3k-dev-12143-ga3272be41209 #835
[ 134.963380] Call Trace:
[ 134.965860] [c6615d60] [c001f76c] panic+0x118/0x260 (unreliable)
[ 134.971775] [c6615dc0] [c001f654] panic+0x0/0x260
[ 134.976435] [c6615dd0] [c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64
[ 134.982769] [c6615e00] [ffffffff] 0xffffffff
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC32 uses nonrecoverable_exception() while PPC64 uses
unrecoverable_exception().
Both functions are doing almost the same thing.
This patch removes nonrecoverable_exception()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commits:
5e46e29e6a ("powerpc/64s/hash: convert SLB miss handlers to C")
8fed04d0f6 ("powerpc/64s/hash: remove user SLB data from the paca")
655deecf67 ("powerpc/64s/hash: SLB allocation status bitmaps")
2e1626744e ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup")
89ca4e126a ("powerpc/64s/hash: Add a SLB preload cache")
This series had a few bugs, and the fixes are not all trivial. So
revert most of it for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A reasonably big batch of fixes due to me being away for a few weeks.
A fix for the TM emulation support on Power9, which could result in corrupting
the guest r11 when running under KVM.
Two fixes to the TM code which could lead to userspace GPR corruption if we take
an SLB miss at exactly the wrong time.
Our dynamic patching code had a bug that meant we could patch freed __init text,
which could lead to corrupting userspace memory.
csum_ipv6_magic() didn't work on little endian platforms since we optimised it
recently.
A fix for an endian bug when reading a device tree property telling us how many
storage keys the machine has available.
Fix a crash seen on some configurations of PowerVM when migrating the partition
from one machine to another.
A fix for a regression in the setup of our CPU to NUMA node mapping in KVM
guests.
A fix to our selftest Makefiles to make them work since a recent change to the
shared Makefile logic.
Thanks to:
Alexey Kardashevskiy, Breno Leitao, Christophe Leroy, Michael Bringmann,
Michael Neuling, Nicholas Piggin, Paul Mackerras,, Srikar Dronamraju, Thiago
Jung Bauermann, Xin Long.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAluuCTsTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDKjD/sHWK1MFXM4baYrdju2tW9tBBiIq18n
UQ4byhJ6wgXzNDzxSq3v4ofW/5A8HoF+2wTqIXFY5l9hMQRw+TfB3ZymloZZ/0Nf
3sxFfL0sv7or4V8Ben/vGEdalg9PzFInwAXhBCGV0nnHiE3D7GkO+MU3zsZUjzum
EbWN/SKMLhjGByeqGbujJYfLC2IdvAdpTzrYd0t0X/Pqi8EUuGjJa0E+cJqD6mjI
GR4eGagImuDuTH6F9eLwcQt2mlDvmv6KJjtgoMbVeh99pqkG9tSnfOlZ+KrWWVps
1+/CXHlaJiHsmX3wqWKjw31x1Ag4v+BEtAjXpwnr7MFroumZ18xRLaDfy11eJ+GB
x3fPbdylittMQy1XYKRrntppDsr0dTAMCjBmE6KSMRyi+F5/agW8bG3sx7D26/AO
JoWGw/XTxislvdr/PIPkjISXRSKHa5jwT3rH45Id8W1uEt0TQ7hJOmWIOdC1WXU8
IbYy3imksXLCcxIJmSzc+cUkVo1wHf4S4NxcJsVs4N5pE/zih5mEKi9jaIpxocF/
YX43n4hGa/6oJYBBMeQZpsO3aDiP/Ynjk4j/I2l92WAIXzQXjVXhreLfHzci+fGg
Q6KdxNWPbwe3UKUTnQ8TOY4ncshU716HrFsp/B4HWvJTfvY6tc7nKDupieetzyKp
G5NA6Bh+VOVDxQ==
=G6wq
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Michael writes:
"powerpc fixes for 4.19 #3
A reasonably big batch of fixes due to me being away for a few weeks.
A fix for the TM emulation support on Power9, which could result in
corrupting the guest r11 when running under KVM.
Two fixes to the TM code which could lead to userspace GPR corruption
if we take an SLB miss at exactly the wrong time.
Our dynamic patching code had a bug that meant we could patch freed
__init text, which could lead to corrupting userspace memory.
csum_ipv6_magic() didn't work on little endian platforms since we
optimised it recently.
A fix for an endian bug when reading a device tree property telling
us how many storage keys the machine has available.
Fix a crash seen on some configurations of PowerVM when migrating the
partition from one machine to another.
A fix for a regression in the setup of our CPU to NUMA node mapping
in KVM guests.
A fix to our selftest Makefiles to make them work since a recent
change to the shared Makefile logic."
* tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix Makefiles for headers_install change
powerpc/numa: Use associativity if VPHN hcall is successful
powerpc/tm: Avoid possible userspace r1 corruption on reclaim
powerpc/tm: Fix userspace r13 corruption
powerpc/pseries: Fix unitialized timer reset on migration
powerpc/pkeys: Fix reading of ibm, processor-storage-keys property
powerpc: fix csum_ipv6_magic() on little endian platforms
powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)
powerpc: Avoid code patching freed init sections
KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds
Current we store the userspace r1 to PACATMSCRATCH before finally
saving it to the thread struct.
In theory an exception could be taken here (like a machine check or
SLB miss) that could write PACATMSCRATCH and hence corrupt the
userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
others do.
We've never actually seen this happen but it's theoretically
possible. Either way, the code is fragile as it is.
This patch saves r1 to the kernel stack (which can't fault) before we
turn MSR[RI] back on. PACATMSCRATCH is still used but only with
MSR[RI] off. We then copy r1 from the kernel stack to the thread
struct once we have MSR[RI] back on.
Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we treclaim we store the userspace checkpointed r13 to a scratch
SPR and then later save the scratch SPR to the user thread struct.
Unfortunately, this doesn't work as accessing the user thread struct
can take an SLB fault and the SLB fault handler will write the same
scratch SPRG that now contains the userspace r13.
To fix this, we store r13 to the kernel stack (which can't fault)
before we access the user thread struct.
Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
as a random userspace segfault with r13 looking like a kernel address.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Call force_sig_pkuerr directly instead of rolling it by hand
in _exception_pkey.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that _exception no longer calls _exception_pkey it is no longer
necessary to handle any signal with any si_code. All pkey exceptions
are SIGSEGV with paired with SEGV_PKUERR. So just handle
that case and remove the now unnecessary parameters from _exception_pkey.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The callers of _exception don't need the pkey exception logic because
they are not processing a pkey exception. So just call exception_common
directly and then call force_sig_fault to generate the appropriate siginfo
and deliver the appropriate signal.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It is brittle and wrong to populate si_pkey when there was not a pkey
exception. The field does not exist for all si_codes and in some
cases another field exists in the same memory location.
So factor out the code that all exceptions handlers must run
into exception_common, leaving the individual exception handlers
to generate the signals themselves.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace user_single_step_siginfo with user_single_step_report
that allocates siginfo structure on the stack and sends it.
This allows tracehook_report_syscall_exit to become a simple
if statement that calls user_single_step_report or ptrace_report_syscall
depending on the value of step.
Update the default helper function now called user_single_step_report
to explicitly set si_code to SI_USER and to set si_uid and si_pid to 0.
The default helper has always been doing this (using memset) but it
was far from obvious.
The powerpc helper can now just call force_sig_fault.
The x86 helper can now just call send_sigtrap.
Unfortunately the default implementation of user_single_step_report
can not use force_sig_fault as it does not use a SIGTRAP si_code.
So it has to carefully setup the siginfo and use use force_sig_info.
The net result is code that is easier to understand and simpler
to maintain.
Ref: 85ec7fd9f8 ("ptrace: introduce user_single_step_siginfo() helper")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Firmware-Assisted Dump (FADump) needs to be registered again after any
memory hot add/remove operation to update the crash memory ranges. But
currently, the kernel returns '-EEXIST' if we try to register without
uregistering it first. This could expose the system to racing issues
while unregistering and registering FADump from userspace during udev
events. Spare the userspace of this and let it be taken care of in the
kernel space for a simpler interface.
Since this change, running 'echo 1 > /sys/kernel/fadump_registered'
would result in re-regisering (unregistering and registering) FADump,
if it was already registered.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In prom_check_platform_support() we retrieve and parse the
"ibm,arch-vec-5-platform-support" property of the chosen node.
Currently we use a variable length array however to avoid this use an
array of constant length 8.
This property is used to indicate the supported options of vector 5
bytes 23-26 of the ibm,architecture.vec node. Each of these options
is a pair of bytes, thus for 4 options we have a max length of 8 bytes.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When performing partition migrations all present CPUs must be online
as all present CPUs must make the H_JOIN call as part of the migration
process. Once all present CPUs make the H_JOIN call, one CPU is returned
to make the rtas call to perform the migration to the destination system.
During testing of migration and changing the SMT state we have found
instances where CPUs are offlined, as part of the SMT state change,
before they make the H_JOIN call. This results in a hung system where
every CPU is either in H_JOIN or offline.
To prevent this this patch disables CPU hotplug during the migration
process.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security masures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
User SLB mappig data is copied into the PACA from the mm->context so
it can be accessed by the SLB miss handlers.
After the C conversion, SLB miss handlers now run with relocation on,
and user SLB misses are able to take recursive kernel SLB misses, so
the user SLB mapping data can be removed from the paca and accessed
directly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory may not be accessed when handling kernel space
SLB misses, so care should be taken there. However user SLB misses can
access any kernel memory, which can be used to move some fields out of
the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to bad
address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Since RFC:
- Added MSR[RI] handling
- Fixed up a register loss bug exposed by irq tracing (Aneesh)
- Reject misses outside the defined kernel regions (Aneesh)
- Added several more sanity checks and error handling (Aneesh), we may
look at consolidating these tests and tightenig up the code but for
a first pass we decided it's better to check carefully.
Since v1:
- Fixed SLB cache corruption (Aneesh)
- Fixed untidy SLBE allocation "leak" in get_vsid error case
- Now survives some stress testing on real hardware
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
I only have POWER8/9 to test, so just remove it for those.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that other platforms also implements real mode mce handler,
lets consolidate the code by sharing existing powernv machine check
early code. Rename machine_check_powernv_early to
machine_check_common_early and reuse the code.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On pseries, as of today system crashes if we get a machine check
exceptions due to SLB errors. These are soft errors and can be fixed
by flushing the SLBs so the kernel can continue to function instead of
system crash. We do this in real mode before turning on MMU. Otherwise
we would run into nested machine checks. This patch now fetches the
rtas error log in real mode and flushes the SLBs on SLB/ERAT errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The tbl pointer is being derefenced by IOMMU_PAGE_SIZE prior the check
if it is not NULL.
Just moving the dereference code to after the check, where there will
be guarantee that 'tbl' will not be NULL.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch simply fix part of the documentation on the HTM code.
This fixes reference to old fields that were renamed in commit
000ec280e3 ("powerpc: tm: Rename transct_(*) to ck(\1)_state")
It also documents better the flow after commit eb5c3f1c86 ("powerpc:
Always save/restore checkpointed regs during treclaim/trecheckpoint"),
where tm_recheckpoint can recheckpoint what is in ck{fp,vr}_state
blindly.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we come into the softpatch handler (0x1500), we use r11 to store
the HSRR0 for later use by the denorm handler.
We also use the softpatch handler for the TM workarounds for
POWER9. Unfortunately, in kvmppc_interrupt_hv we later store r11 out
to the vcpu assuming it's still what we got from userspace.
This causes r11 to be corrupted in the VCPU and hence when we restore
the guest, we get a corrupted r11. We've seen this when running TM
tests inside guests on P9.
This fixes the problem by only touching r11 in the denorm case.
Fixes: 4bb3c7a020 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")
Cc: <stable@vger.kernel.org> # 4.17+
Test-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Call Frame Information is used by gdb for back-traces and inserting
breakpoints on function return for the "finish" command. This failed
when inside __kernel_clock_gettime. More concerning than difficulty
debugging is that CFI is also used by stack frame unwinding code to
implement exceptions. If you have an app that needs to handle
asynchronous exceptions for some reason, and you are unlucky enough to
get one inside the VDSO time functions, your app will crash.
What's wrong: There is control flow in __kernel_clock_gettime that
reaches label 99 without saving lr in r12. CFI info however is
interpreted by the unwinder without reference to control flow: It's a
simple matter of "Execute all the CFI opcodes up to the current
address". That means the unwinder thinks r12 contains the return
address at label 99. Disabuse it of that notion by resetting CFI for
the return address at label 99.
Note that the ".cfi_restore lr" could have gone anywhere from the
"mtlr r12" a few instructions earlier to the instruction at label 99.
I put the CFI as late as possible, because in general that's best
practice (and if possible grouped with other CFI in order to reduce
the number of CFI opcodes executed when unwinding). Using r12 as the
return address is perfectly fine after the "mtlr r12" since r12 on
that code path still contains the return address.
__get_datapage also has a CFI error. That function temporarily saves
lr in r0, and reflects that fact with ".cfi_register lr,r0". A later
use of r0 means the CFI at that point isn't correct, as r0 no longer
contains the return address. Fix that too.
Signed-off-by: Alan Modra <amodra@gmail.com>
Tested-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently on P9N DD2.1 we end up taking infinite TM facility
unavailable exceptions on the first TM usage by userspace.
In the special case of TM no suspend (P9N DD2.1), Linux is told TM is
off via CPU dt-ftrs but told to (partially) use it via
OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED. So HFSCR[TM] will be off from
dt-ftrs but we need to turn it on for the no suspend case.
This patch fixes this by enabling HFSCR TM in this case.
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm()
which in turn reads the old TCE and if it was a valid entry, marks
the physical page dirty if it was mapped for writing. Since it is in
real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
to get the page struct. However SetPageDirty() itself reads the compound
page head and returns a virtual address for the head page struct and
setting dirty bit for that kills the system.
This adds additional dirty bit tracking into the MM/IOMMU API for use
in the real mode. Note that this does not change how VFIO and
KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
- use the lowest bit of the cached host phys address to carry
the dirty bit;
- mark pages dirty when they are unpinned which happens when
the preregistered memory is released which always happens in virtual
mode;
- add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit;
- change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
in the new mm_iommu_ua_mark_dirty_rm() helper;
- move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
caller anyway) to reduce the real mode KVM and IOMMU knowledge
across different subsystems.
This removes realmode_pfn_to_page() as it is not used anymore.
While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for
the real mode only and modules cannot call it anyway.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
- An implementation for the newly added hv_ops->flush() for the OPAL hvc
console driver backends, I forgot to apply this after merging the hvc driver
changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when multiple
children of a bridge try to enable it simultaneously. This is a workaround
until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause broken
scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9 due to
a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some bugs
which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to:
Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy, Hari Bathini, Luke
Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul Mackerras, Srikar Dronamraju.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt//10THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLUGD/9y/MPrs+V2lNFhxP+l1jO4Ro0v8DPI
vARjqq06WfppgQgSS1dvWfzLaMFbGe8wPRfL0T1xMOXCZg8Ts/HrgxVBVFYcv6/Q
xBaU5bKztg6HKQbwwO+8B/gdTA3hO7yFVux6JGwsGO5Ebl8Q3UDOdbvYX6XTj3H8
rdiicle9LaI7qodC8bxlBo1Be0YKEW0O/ag179sxXzozzvoPIyFpeX6FL1sAft++
XlQS1MHu4hErSM2rbmyoFCm+SmyRt3CD0NTVmNd2cgw5XexPOBFlnsdgpaK1jJFc
CYu1chP83E91ol1/8NAPkcmPWvP6MGyoqOl75RghooY2D1IJ2GtLKoz2Dvc53ay2
ZlIpMyc2CYIa55Mj18tOV/NbGbh0Lf0Ta++BxqxbcCDt5fq0VxHkoDPqBgWh/tdp
Po7oQc7U2VTKKC3wiLr//nSHpgtSTAWtucDt7oT7GdP8+EMxUZ8teBFIfTkwfuD2
plroEmcYuRD3beI4FAG/iCp5POOCsnHLkKVDl7tyQPl3Yvu8hvyLY9gBS9RN4Unt
/z94YFJtz7UD+VP7jDaPiQS26y0WEJfW8ml0tNxMZVBdksZLPIPN0/EU/04V0jWf
GxynzKMhElxC67tK/Eb43EGiLEZAlbEnJxoOtiCnL1MK+OIJPjg6e7o6qlsmaa+Y
zkXVpxLtkq0OoQ==
=1AUa
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- An implementation for the newly added hv_ops->flush() for the OPAL
hvc console driver backends, I forgot to apply this after merging the
hvc driver changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when
multiple children of a bridge try to enable it simultaneously. This
is a workaround until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause
broken scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9
due to a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some
bugs which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul
Mackerras, Srikar Dronamraju.
* tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mce: Fix SLB rebolting during MCE recovery path.
KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
powerpc/nohash: fix pte_access_permitted()
powerpc/topology: Get topology for shared processors at boot
powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
powerpc/powernv/pci: Work around races in PCI bridge enabling
powerpc/fadump: cleanup crash memory ranges support
powerpc/powernv: provide a console flush operation for opal hvc driver
powerpc/traps: Avoid rate limit messages from show unhandled signals
powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()
On a shared LPAR, Phyp will not update the CPU associativity at boot
time. Just after the boot system does recognize itself as a shared
LPAR and trigger a request for correct CPU associativity. But by then
the scheduler would have already created/destroyed its sched domains.
This causes
- Broken load balance across Nodes causing islands of cores.
- Performance degradation esp if the system is lightly loaded
- dmesg to wrongly report all CPUs to be in Node 0.
- Messages in dmesg saying borken topology.
- With commit 051f3ca02e ("sched/topology: Introduce NUMA identity
node sched domain"), can cause rcu stalls at boot up.
The sched_domains_numa_masks table which is used to generate cpumasks
is only created at boot time just before creating sched domains and
never updated. Hence, its better to get the topology correct before
the sched domains are created.
For example on 64 core Power 8 shared LPAR, dmesg reports
Brought up 512 CPUs
Node 0 CPUs: 0-511
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs:
Node 5 CPUs:
Node 6 CPUs:
Node 7 CPUs:
Node 8 CPUs:
Node 9 CPUs:
Node 10 CPUs:
Node 11 CPUs:
...
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
numactl/lscpu output will still be correct with cores spreading across
all nodes:
Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455
Currently on this LPAR, the scheduler detects 2 levels of Numa and
created numa sched domains for all CPUs, but it finds a single DIE
domain consisting of all CPUs. Hence it deletes all numa sched
domains.
To address this, detect the shared processor and update topology soon
after CPUs are setup so that correct topology is updated just before
scheduler creates sched domain.
With the fix, dmesg reports:
numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
and lscpu also reports:
Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455
Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Trim / format change log]
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 1bd6a1c4b8 ("powerpc/fadump: handle crash memory ranges array
index overflow") changed crash memory ranges to a dynamic array that
is reallocated on-demand with krealloc(). The relevant header for this
call was not included. The kernel compiles though. But be cautious and
add the header anyway.
Also, memory allocation logic in fadump_add_crash_memory() takes care
of memory allocation for crash memory ranges in all scenarios. Drop
unnecessary memory allocation in fadump_setup_crash_memory_ranges().
Fixes: 1bd6a1c4b8 ("powerpc/fadump: handle crash memory ranges array index overflow")
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the recent commit to add an explicit ratelimit state when showing
unhandled signals, commit 35a52a10c3 ("powerpc/traps: Use an
explicit ratelimit state for show_signal_msg()"), I put the check of
show_unhandled_signals and the ratelimit state before the call to
unhandled_signal() so as to avoid unnecessarily calling the latter
when show_unhandled_signals is false.
However that causes us to check the ratelimit state on every call, so
if we take a lot of *handled* signals that has the effect of making
the ratelimit code print warnings that callbacks have been suppressed
when they haven't.
So rearrange the code so that we check show_unhandled_signals first,
then call unhandled_signal() and finally check the ratelimit state.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Notable changes:
- A fix for a bug in our page table fragment allocator, where a page table page
could be freed and reallocated for something else while still in use, leading
to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for
a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but bring us in
to line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer
for reporting many of these.
- A series to improve the hvc driver & related OPAL console code, which have
been seen to cause hardlockups at times. The hvc driver changes in particular
have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in use
anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM CPUs
(controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals to bring it
into line with other arches, including showing the offending VMA and dumping
the instructions around the fault.
Thanks to:
Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey
Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan,
Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza,
Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt,
Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian
Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley,
Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus
Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael
Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas
Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap,
Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff,
Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson,
Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao
B, zhong jiang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt2O6cTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgC7hD/4+cj796Df7GsVsIMxzQm7SS9dklIdO
JuKj2Nr5HRzTH59jWlXukLG9mfTNCFgFJB4gEpK1ArDOTcHTCI9RRsLZTZ/kum66
7Pd+7T40dLYXB5uecuUs0vMXa2fI3syKh1VLzACSXv3Dh9BBIKQBwW/aD2eww4YI
1fS5LnXZ2PSxfr6KNAC6ogZnuaiD0sHXOYrtGHq+S/TFC7+Z6ySa6+AnPS+hPVoo
/rHDE1Khr66aj7uk+PP2IgUrCFj6Sbj6hTVlS/iAuwbMjUl9ty6712PmvX9x6wMZ
13hJQI+g6Ci+lqLKqmqVUpXGSr6y4NJGPS/Hko4IivBTJApI+qV/tF2H9nxU+6X0
0RqzsMHPHy13n2torA1gC7ttzOuXPI4hTvm6JWMSsfmfjTxLANJng3Dq3ejh6Bqw
76EMowpDLexwpy7/glPpqNdsP4ySf2Qm8yq3mR7qpL4m3zJVRGs11x+s5DW8NKBL
Fl5SqZvd01abH+sHwv6NLaLkEtayUyohxvyqu2RU3zu5M5vi7DhqstybTPjKPGu0
icSPh7b2y10WpOUpC6lxpdi8Me8qH47mVc/trZ+SpgBrsuEmtJhGKszEnzRCOqos
o2IhYHQv3lQv86kpaAFQlg/RO+Lv+Lo5qbJ209V+hfU5nYzXpEulZs4dx1fbA+ze
fK8GEh+u0L4uJg==
=PzRz
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A fix for a bug in our page table fragment allocator, where a page
table page could be freed and reallocated for something else while
still in use, leading to memory corruption etc. The fix reuses
pt_mm in struct page (x86 only) for a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but
bring us in to line with x86 behaviour and/or fix outright bugs.
Thanks to Florian Weimer for reporting many of these.
- A series to improve the hvc driver & related OPAL console code,
which have been seen to cause hardlockups at times. The hvc driver
changes in particular have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
use anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX
instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM
CPUs (controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals
to bring it into line with other arches, including showing the
offending VMA and dumping the instructions around the fault.
Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
Rao, zhong jiang"
* tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
powerpc/mm/book3s/radix: Add mapping statistics
powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
powerpc/mm/hash: Remove unnecessary do { } while(0) loop
powerpc/64s: move machine check SLB flushing to mm/slb.c
powerpc/powernv/idle: Fix build error
powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
powerpc/mm: remove warning about ‘type’ being set
powerpc/32: Include setup.h header file to fix warnings
powerpc: Move `path` variable inside DEBUG_PROM
powerpc/powermac: Make some functions static
powerpc/powermac: Remove variable x that's never read
cxl: remove a dead branch
powerpc/powermac: Add missing include of header pmac.h
powerpc/kexec: Use common error handling code in setup_new_fdt()
powerpc/xmon: Add address lookup for percpu symbols
powerpc/mm: remove huge_pte_offset_and_shift() prototype
powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
powerpc/fadump: handle crash memory ranges array index overflow
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlt1f9AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxbdhAArnhRvkwOk4m4/LCuKF6HpmlxbBNC
TjnBCenNf+lFXzWskfDFGFl/Wif4UzGbRTSCNQrwMzj3Ww3f/6R2QIq9rEJvyNC4
VdxQnaBEZSUgN87q5UGqgdjMTo3zFvlFH6fpb5XDiQ5IX/QZeXeYqoB64w+HvKPU
M+IsoOvnA5gb7pMcpchrGUnSfS1e6AqQbbTt6tZflore6YCEA4cH5OnpGx8qiZIp
ut+CMBvQjQB01fHeBc/wGrVte4NwXdONrXqpUb4sHF7HqRNfEh0QVyPhvebBi+k1
kquqoBQfPFTqgcab31VOcQhg70dEx+1qGm5/YBAwmhCpHR/g2gioFXoROsr+iUOe
BtF6LZr+Y8cySuhJnkCrJBqWvvBaKbJLg0KMbI+7p4o9MZpod2u7LS5LFrlRDyKW
3nz3o+b1+v3tCCKVKIhKo0ljolgkweQtR1f6KIHvq93wBODHVQnAOt9NlPfHVyks
ryGBnOhMjoU5hvfexgIWFk9Ph9MEVQSffkI+TeFPO/tyGBfGfQyGtESiXuEaMQaH
FGdZHX2RLkY3pWHOtWeMzRHzOnr2XjpDFcAqL3HBGPdJ30K3Umv3WOgoFe2SaocG
0gaddPjKSwwM4Sa/VP+O5cjGuzi7QnczSDdpYjxIGZzBav32hqx4/rsnLw7bHH8y
XkEme7cYJc8MGsA=
=2Dmn
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
- Decode AER errors with names similar to "lspci" (Tyler Baicar)
- Expose AER statistics in sysfs (Rajat Jain)
- Clear AER status bits selectively based on the type of recovery (Oza
Pawandeep)
- Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
Gagniuc)
- Don't clear AER status bits if we're using the "Firmware-First"
strategy where firmware owns the registers (Alexandru Gagniuc)
- Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
Shevchenko)
- Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas)
- Defer DPC event handling to work queue (Keith Busch)
- Use threaded IRQ for DPC bottom half (Keith Busch)
- Print AER status while handling DPC events (Keith Busch)
- Work around IDT switch ACS Source Validation erratum (James
Puthukattukaran)
- Emit diagnostics for all cases of PCIe Link downtraining (Links
operating slower than they're capable of) (Alexandru Gagniuc)
- Skip VFs when configuring Max Payload Size (Myron Stowe)
- Reduce Root Port Max Payload Size if necessary when hot-adding a
device below it (Myron Stowe)
- Simplify SHPC existence/permission checks (Bjorn Helgaas)
- Remove hotplug sample skeleton driver (Lukas Wunner)
- Convert pciehp to threaded IRQ handling (Lukas Wunner)
- Improve pciehp tolerance of missed events and initially unstable
links (Lukas Wunner)
- Clear spurious pciehp events on resume (Lukas Wunner)
- Add pciehp runtime PM support, including for Thunderbolt controllers
(Lukas Wunner)
- Support interrupts from pciehp bridges in D3hot (Lukas Wunner)
- Mark fall-through switch cases before enabling -Wimplicit-fallthrough
(Gustavo A. R. Silva)
- Move DMA-debug PCI init from arch code to PCI core (Christoph
Hellwig)
- Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
supplied (Heiner Kallweit)
- Unify PCI and DMA direction #defines (Shunyong Yang)
- Add PCI_DEVICE_DATA() macro (Andy Shevchenko)
- Check for VPD completion before checking for timeout (Bert Kenward)
- Limit Netronome NFP5000 config space size to work around erratum
(Jakub Kicinski)
- Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)
- Document ACPI description of PCI host bridges (Bjorn Helgaas)
- Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
peer-to-peer DMA support (we don't have the peer-to-peer support yet;
this is just one piece) (Logan Gunthorpe)
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
- Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)
- Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)
- To avoid bus errors, enable PASID only if entire path supports
End-End TLP prefixes (Sinan Kaya)
- Unify slot and bus reset functions and remove hotplug knowledge from
callers (Sinan Kaya)
- Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
fix guest reboot issues (Alex Williamson)
- Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
Controller (Bjorn Helgaas)
- Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)
- Remove Aardvark outbound window configuration (Evan Wang)
- Fix Aardvark bridge window sizing issue (Zachary Zhang)
- Convert Aardvark to use pci_host_probe() to reduce code duplication
(Thomas Petazzoni)
- Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)
- Add Cadence support for optional generic PHYs (Alan Douglas)
- Add Cadence power management ops (Alan Douglas)
- Remove redundant variable from Cadence driver (Colin Ian King)
- Add Kirin MSI support (Xiaowei Song)
- Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
Guo)
- Move link notification settings from DesignWare core to individual
drivers (Gustavo Pimentel)
- Add endpoint library MSI-X interfaces (Gustavo Pimentel)
- Correct signature of endpoint library IRQ interfaces (Gustavo
Pimentel)
- Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)
- Add endpoint library MSI-X test support (Gustavo Pimentel)
- Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
(Jia-Ju Bai)
- Add more devices to Broadcom PAXC quirk (Ray Jui)
- Work around corrupted Broadcom PAXC config space to enable SMMU and
GICv3 ITS (Ray Jui)
- Disable MSI parsing to work around broken Broadcom PAXC logic in some
devices (Ray Jui)
- Hide unconfigured functions to work around a Broadcom PAXC defect
(Ray Jui)
- Lower iproc log level to reduce console output during boot (Ray Jui)
- Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)
- Fix mobiveil missing include file (Lorenzo Pieralisi)
- Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)
- Fix mvebu I/O space remapping issues (Thomas Petazzoni)
- Use generic pci_host_bridge in mvebu instead of ARM-specific API
(Thomas Petazzoni)
- Whitelist VMD devices with fast interrupt handlers to avoid sharing
vectors with slow handlers (Keith Busch)
* tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
PCI/AER: Don't clear AER bits if error handling is Firmware-First
PCI: Limit config space size for Netronome NFP5000
PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
PCI/VPD: Check for VPD access completion before checking for timeout
PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
PCI: Match Root Port's MPS to endpoint's MPSS as necessary
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
PCI: Check for PCIe Link downtraining
PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
PCI: Add device-specific ACS Redirect disable infrastructure
PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
PCI: Allow specifying devices using a base bus and path of devfns
PCI: Make specifying PCI devices in kernel parameters reusable
PCI: Hide ACS quirk declarations inside PCI core
PCI: Delay after FLR of Intel DC P3700 NVMe
PCI: Disable Samsung SM961/PM961 NVMe before FLR
PCI: Export pcie_has_flr()
PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
...
- Mark fall-through switch cases before enabling -Wimplicit-fallthrough
(Gustavo A. R. Silva)
- Move DMA-debug PCI init from arch code to PCI core (Christoph Hellwig)
- Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is supplied
(Heiner Kallweit)
- Unify PCI and DMA direction #defines (Shunyong Yang)
- Add PCI_DEVICE_DATA() macro (Andy Shevchenko)
- Check for VPD completion before checking for timeout (Bert Kenward)
- Limit Netronome NFP5000 config space size to work around erratum (Jakub
Kicinski)
* pci/misc:
PCI: Limit config space size for Netronome NFP5000
PCI/VPD: Check for VPD access completion before checking for timeout
PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
PCI: Unify PCI and normal DMA direction definitions
PCI: Use IRQF_ONESHOT if pci_request_irq() called with no handler
PCI: Call dma_debug_add_bus() for pci_bus_type from PCI core
PCI: Mark fall-through switch cases before enabling -Wimplicit-fallthrough
# Conflicts:
# drivers/pci/hotplug/pciehp_ctrl.c
- verify depmod is installed before modules_install
- support build salt in case build ids must be unique between builds
- allow users to specify additional host compiler flags via HOST*FLAGS,
and rename internal variables to KBUILD_HOST*FLAGS
- update buildtar script to drop vax support, add arm64 support
- update builddeb script for better debarch support
- document the pit-fall of if_changed usage
- fix parallel build of UML with O= option
- make 'samples' target depend on headers_install to fix build errors
- remove deprecated host-progs variable
- add a new coccinelle script for refcount_t vs atomic_t check
- improve double-test coccinelle script
- misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFZ0AAoJED2LAQed4NsGcHYP/23txxk3GRP7O4UkfPw9Rtky
MHiXTgcoy2vbG+l12BgzWX+qFii8XTUe3dQtK4HnGQFUIBtEBV/hpZPJtxfgGSev
Zou5cv1kr5rNzTkCn//TG3O6/WIkTBCe2hahDCtmGDI3kd/cPK4dHbU/q6KpaqIJ
qzZYBXIvCeu2GM8idQoCRrwdMpgu1pBz1gz2sDje1yHH2toI7T6cXHRLQDgx+HPq
LIP7W9GUsoDdXjecvPD51LiW89E6BUxETBh5Ft9r9uzwB5ylQQMcw6Qyu2DiYDUX
PPsHCMiolYV+Ttcy+vj/67KOvKmEaFotssck+RD/xDCF17zKhRkup+YM8kPLHTVZ
TcAUZadbnT6U/s2W6GFwvVbN/P7cc3aif+aNCC/Pl23yagp3pydlSCocYxQgiVR7
/rx48haYDEgu/MJ1X0dOpSO0ErY7zu2OoAlNerW+D9QizwbP+WtZO/CJH8SxQRuN
dQ1xmyNrie+ODgi9tbc4eBrsb+1rioX927TP5MbJcfXt5CTsxDmIqop5XwyYIoQN
ZWWlzC8Ii3P2trAVpBgM2IEbngSxwr6T9Wbf1ScJnPKr/o1rq+pBk49cYstTz3kQ
OwJ8gPwUrkW4R+hlD7L6mL/WcrKzZBQS0Ij1QW2kVSEhRrsKo99psE1/rGehnHu9
KGB0LYYCqGSOHR4zOjg0
=VjfG
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- verify depmod is installed before modules_install
- support build salt in case build ids must be unique between builds
- allow users to specify additional host compiler flags via HOST*FLAGS,
and rename internal variables to KBUILD_HOST*FLAGS
- update buildtar script to drop vax support, add arm64 support
- update builddeb script for better debarch support
- document the pit-fall of if_changed usage
- fix parallel build of UML with O= option
- make 'samples' target depend on headers_install to fix build errors
- remove deprecated host-progs variable
- add a new coccinelle script for refcount_t vs atomic_t check
- improve double-test coccinelle script
- misc cleanups and fixes
* tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
coccicheck: return proper error code on fail
Coccinelle: doubletest: reduce side effect false positives
kbuild: remove deprecated host-progs variable
kbuild: make samples really depend on headers_install
um: clean up archheaders recipe
kbuild: add %asm-generic to no-dot-config-targets
um: fix parallel building with O= option
scripts: Add Python 3 support to tracing/draw_functrace.py
builddeb: Add automatic support for sh{3,4}{,eb} architectures
builddeb: Add automatic support for riscv* architectures
builddeb: Add automatic support for m68k architecture
builddeb: Add automatic support for or1k architecture
builddeb: Add automatic support for sparc64 architecture
builddeb: Add automatic support for mips{,64}r6{,el} architectures
builddeb: Add automatic support for mips64el architecture
builddeb: Add automatic support for ppc64 and powerpcspe architectures
builddeb: Introduce functions to simplify kconfig tests in set_debarch
builddeb: Drop check for 32-bit s390
builddeb: Change architecture detection fallback to use dpkg-architecture
builddeb: Skip architecture detection when KBUILD_DEBARCH is set
...
When idle_power4() hard disables interrupts then finds a soft pending
interrupt, it returns with interrupts hard disabled but without
PACA_IRQ_HARD_DIS set. Commit 9b81c0211c ("powerpc/64s: make
PACA_IRQ_HARD_DIS track MSR[EE] closely") added a warning for that
condition (since disabled).
Fix this by adding the PACA_IRQ_HARD_DIS for that case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull perf update from Thomas Gleixner:
"The perf crowd presents:
Kernel updates:
- Removal of jprobes
- Cleanup and consolidatation the handling of kprobes
- Cleanup and consolidation of hardware breakpoints
- The usual pile of fixes and updates to PMUs and event descriptors
Tooling updates:
- Updates and improvements all over the place. Nothing outstanding,
just the (good) boring incremental grump work"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
perf trace: Do not require --no-syscalls to suppress strace like output
perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
perf tools: Allow overriding MAX_NR_CPUS at compile time
perf bpf: Show better message when failing to load an object
perf list: Unify metric group description format with PMU event description
perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
perf cs-etm: Fix start tracing packet handling
perf build: Fix installation directory for eBPF
perf c2c report: Fix crash for empty browser
perf tests: Fix indexing when invoking subtests
perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
perf trace beauty: Do not print NULL strarray entries
perf beauty: Add a generator for IPPROTO_ socket's protocol constants
tools include uapi: Grab a copy of linux/in.h
perf tests: Fix complex event name parsing
perf evlist: Fix error out while applying initial delay and LBR
...
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will also be used by pseries
machine check and idle code in future changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Make sure to include setup.h to provide the following prototypes:
- irqstack_early_init
- setup_power_save
- initialize_cache_info
Fix the following warnings (treated as error in W=1):
arch/powerpc/kernel/setup_32.c:198:13: error: no previous prototype for ‘irqstack_early_init’
arch/powerpc/kernel/setup_32.c:238:13: error: no previous prototype for ‘setup_power_save’
arch/powerpc/kernel/setup_32.c:253:13: error: no previous prototype for ‘initialize_cache_info’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add gcc attribute unused for two variables. Fix warnings treated as errors
with W=1:
arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable]
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a jump target so that a bit of exception handling can be better
reused at the end of this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The symbol memcpy_nocache_branch defined in order to allow patching
of memset function once cache is enabled leads to confusing reports
by perf tool.
Using the new patch_site functionality solves this issue.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With dynamic memory allocation support for crash memory ranges array,
there is no hard limit on the no. of crash memory ranges kernel could
export, but program headers count could overflow in the /proc/vmcore
ELF file while exporting each memory range as PT_LOAD segment. Reduce
the likelihood of a such scenario, by folding adjacent crash memory
ranges which minimizes the total number of PT_LOAD segments.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Crash memory ranges is an array of memory ranges of the crashing kernel
to be exported as a dump via /proc/vmcore file. The size of the array
is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
commit 142b45a72e ("memblock: Add array resizing support").
On large memory systems with a few DLPAR operations, the memblock memory
regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
systems, registering fadump results in crash or other system failures
like below:
task: c00007f39a290010 ti: c00000000b738000 task.ti: c00000000b738000
NIP: c000000000047df4 LR: c0000000000f9e58 CTR: c00000000010f180
REGS: c00000000b73b570 TRAP: 0300 Tainted: G L X (4.4.140+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22004484 XER: 20000000
CFAR: c000000000008500 DAR: 000007a450000000 DSISR: 40000000 SOFTE: 0
...
NIP [c000000000047df4] smp_send_reschedule+0x24/0x80
LR [c0000000000f9e58] resched_curr+0x138/0x160
Call Trace:
resched_curr+0x138/0x160 (unreliable)
check_preempt_curr+0xc8/0xf0
ttwu_do_wakeup+0x38/0x150
try_to_wake_up+0x224/0x4d0
__wake_up_common+0x94/0x100
ep_poll_callback+0xac/0x1c0
__wake_up_common+0x94/0x100
__wake_up_sync_key+0x70/0xa0
sock_def_readable+0x58/0xa0
unix_stream_sendmsg+0x2dc/0x4c0
sock_sendmsg+0x68/0xa0
___sys_sendmsg+0x2cc/0x2e0
__sys_sendmsg+0x5c/0xc0
SyS_socketcall+0x36c/0x3f0
system_call+0x3c/0x100
as array index overflow is not checked for while setting up crash memory
ranges causing memory corruption. To resolve this issue, dynamically
allocate memory for crash memory ranges and resize it incrementally,
in units of pagesize, on hitting array size limit.
Fixes: 2df173d9e8 ("fadump: Initialize elfcore header and add PT_LOAD program headers.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: Just use PAGE_SIZE directly, fixup variable placement]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In Makefiles if we're testing a CONFIG_FOO symbol for equality with 'y'
we can instead just use ifdef. The latter reads easily, so convert to
it where possible.
Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
Reviewed-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Call show_user_instructions() in arch/powerpc/kernel/traps.c to dump
instructions at faulty location, useful to debugging.
Before this patch, an unhandled signal message looked like:
pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]
After this patch, it looks like:
pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]
pandafault[10524]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
pandafault[10524]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show_user_instructions() is a slightly modified version of
show_instructions() that allows userspace instruction dump.
This will be useful within show_signal_msg() to dump userspace
instructions of the faulty location.
Here is a sample of what show_user_instructions() outputs:
pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
The current->comm and current->pid printed can serve as a glue that
links the instructions dump to its originator, allowing messages to be
interleaved in the logs.
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds VMA address in the message printed for unhandled signals,
similarly to what other architectures, like x86, print.
Before this patch, a page fault looked like:
pandafault[61470]: unhandled signal 11 at 100007d0 nip 1000061c lr 7fff8d185100 code 2
After this patch, a page fault looks like:
pandafault[6303]: segfault 11 at 100007d0 nip 1000061c lr 7fff93c55100 code 2 in pandafault[10000000+10000]
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use %lx format to print registers. This avoids having two different
formats and avoids checking for MSR_64BIT, improving readability of the
function.
Even though we could have used %px, which is functionally equivalent to %lx
as per Documentation/core-api/printk-formats.rst, it is not semantically
correct because the data printed are not pointers. And using %px requires
casting data to (void *).
Besides that, %lx matches the format used in show_regs().
Before this patch:
pandafault[4808]: unhandled signal 11 at 0000000010000718 nip 0000000010000574 lr 00007fff935e7a6c code 2
After this patch:
pandafault[4732]: unhandled signal 11 at 10000718 nip 10000574 lr 7fff86697a6c code 2
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Replace printk_ratelimited() by printk() and a default rate limit
burst to limit displaying unhandled signals messages.
This will allow us to call print_vma_addr() in a future patch, which
does not work with printk_ratelimited().
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Isolate the logic of printing unhandled signals out of _exception_pkey().
No functional change, only code rearrangement.
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Because rfi_flush_fallback runs immediately before the return to
userspace it currently runs with the user r1 (stack pointer). This
means if we oops in there we will report a bad kernel stack pointer in
the exception entry path, eg:
Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
...
NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
LR [0000000010053e00] 0x10053e00
Although the NIP tells us where we were, and the TRAP number tells us
what happened, it would still be nicer if we could report the actual
exception rather than barfing about the stack pointer.
We an do that fairly simply by loading the kernel stack pointer on
entry and restoring the user value before returning. That way we see a
regular oops such as:
Unrecoverable exception 4100 at c00000000000239c
Oops: Unrecoverable exception, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: 0
...
NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
LR [0000000010053e00] 0x10053e00
Call Trace:
[c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)
Note this shouldn't make the kernel stack pointer vulnerable to a
meltdown attack, because it should be flushed from the cache before we
return to userspace. The user r1 value will be in the cache, because
we load it in the return path, but that is harmless.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Some CPU revisions support a mode where the count cache needs to be
flushed by software on context switch. Additionally some revisions may
have a hardware accelerated flush, in which case the software flush
sequence can be shortened.
If we detect the appropriate flag from firmware we patch a branch
into _switch() which takes us to a count cache flush sequence.
That sequence in turn may be patched to return early if we detect that
the CPU supports accelerating the flush sequence in hardware.
Add debugfs support for reporting the state of the flush, as well as
runtime disabling it.
And modify the spectre_v2 sysfs file to report the state of the
software flush.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Used barrier_nospec to sanitize the syscall table.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In a subsequent patch we will enable building security.c for Book3E.
However the NXP platforms are not vulnerable to Meltdown, so make the
Meltdown vulnerability reporting PPC_BOOK3S_64 specific.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we require platform code to call setup_barrier_nospec(). But
if we add an empty definition for the !CONFIG_PPC_BARRIER_NOSPEC case
then we can call it in setup_arch().
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a config symbol to encode which platforms support the
barrier_nospec speculation barrier. Currently this is just Book3S 64
but we will add Book3E in a future patch.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NXP Book3E platforms are not vulnerable to speculative store
bypass, so make the mitigations PPC_BOOK3S_64 specific.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The speculation barrier can be disabled from the command line
with the parameter: "nospectre_v1".
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We only need to use __MASKABLE_EXCEPTION in one of the four cases for
hardware interrupt, so use the helper macros in the other cases.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_MASKABLE_RELON_EXCEPTION_PSERIES() does nothing useful, update all
callers to use __MASKABLE_RELON_EXCEPTION_PSERIES() directly.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_MASKABLE_EXCEPTION_PSERIES() does nothing useful, update all callers
to use __MASKABLE_EXCEPTION_PSERIES() directly.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As with the other patches in this series, we are removing the
"PSERIES" from the name as it's no longer meaningful.
In this case it's not simply a case of removing the "PSERIES" as that
would result in a clash with the existing EXCEPTION_PROLOG_1.
Instead we name this one EXCEPTION_PROLOG_2, as it's usually used in
sequence after 0 and 1.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We recently added a warning in arch_local_irq_restore() to check that
the soft masking state matches reality.
Unfortunately it trips in a few places, which are not entirely trivial
to fix. The key problem is if we're doing function_graph tracing of
restore_math(), the warning pops and then seems to recurse. It's not
entirely clear because the system continuously oopses on all CPUs,
with the output interleaved and unreadable.
It's also been observed on a G5 coming out of idle.
Until we can fix those cases disable the warning for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When a PCI device is detected, pdev->is_added is set to 1 and proc and
sysfs entries are created.
When the device is removed, pdev->is_added is checked for one and then
device is detached with clearing of proc and sys entries and at end,
pdev->is_added is set to 0.
is_added and is_busmaster are bit fields in pci_dev structure sharing same
memory location.
A strange issue was observed with multiple removal and rescan of a PCIe
NVMe device using sysfs commands where is_added flag was observed as zero
instead of one while removing device and proc,sys entries are not cleared.
This causes issue in later device addition with warning message
"proc_dir_entry" already registered.
Debugging revealed a race condition between the PCI core setting the
is_added bit in pci_bus_add_device() and the NVMe driver reset work-queue
setting the is_busmaster bit in pci_set_master(). As these fields are not
handled atomically, that clears the is_added bit.
Move the is_added bit to a separate private flag variable and use atomic
functions to set and retrieve the device addition state. This avoids the
race because is_added no longer shares a memory location with is_busmaster.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283
Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
There is nothing arch-specific about PCI or dma-debug, so call
dma_debug_add_bus() from the PCI core just after registering the bus type.
Most of dma-debug is already generic; this just adds reporting of pending
dma-allocations on driver unload for arches other than powerpc, sh, and
x86.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
set_breakpoint() is only used in process.c so make it static
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Files not using fixmap consts or functions don't need asm/fixmap.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
files not using feature fixup don't need asm/feature-fixups.h
files using feature fixup need asm/feature-fixups.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Only include linux/stringify.h is files using __stringify()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves ASM_CONST() and stringify_in_c() into
dedicated asm-const.h, then cleans all related inclusions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: asm-compat.h should include asm-const.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Files not using cpu_has_feature() don't need cpu_has_feature.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit dc3106690b ("powerpc: tm: Always use fp_state and
vr_state to store live registers") tm_reclaim_thread() doesn't use the
parameter anymore, both callers have to bother getting it as they have
no need for a struct thread_info either.
Just remove it and adjust the callers.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit eb5c3f1c86 ("powerpc: Always save/restore checkpointed regs
during treclaim/trecheckpoint") __tm_recheckpoint was modified to no
longer take the second parameter 'unsigned long orig_msr' as part of a
TM rewrite to simplify the reclaiming/recheckpointing process.
There is a comment in the asm file where the function is delcared which
has an incorrect prototype with the 'orig_msr' parameter.
This patch corrects the comment.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With SPARSEMEM config enabled, we make sure that we don't add sections beyond
MAX_PHYSMEM_BITS range. This results in not building vmemmap mapping for
range beyond max range. But our memblock layer looks the device tree and create
mapping for the full memory range. Prevent this by checking against
MAX_PHSYSMEM_BITS when doing memblock_add.
We don't do similar check for memeblock_reserve_range. If reserve range is beyond
MAX_PHYSMEM_BITS we expect that to be configured with 'nomap'. Any other
reserved range should come from existing memblock ranges which we already
filtered while adding.
This avoids crash as below when running on a system with system ram config above
MAX_PHSYSMEM_BITS
Unable to handle kernel paging request for data at address 0xc00a001000000440
Faulting instruction address: 0xc000000001034118
cpu 0x0: Vector: 300 (Data Access) at [c00000000124fb30]
pc: c000000001034118: __free_pages_bootmem+0xc0/0x1c0
lr: c00000000103b258: free_all_bootmem+0x19c/0x22c
sp: c00000000124fdb0
msr: 9000000002001033
dar: c00a001000000440
dsisr: 40000000
current = 0xc00000000120dd00
paca = 0xc000000001f60000^I irqmask: 0x03^I irq_happened: 0x01
pid = 0, comm = swapper
[c00000000124fe20] c00000000103b258 free_all_bootmem+0x19c/0x22c
[c00000000124fee0] c000000001010a68 mem_init+0x3c/0x5c
[c00000000124ff00] c00000000100401c start_kernel+0x298/0x5e4
[c00000000124ff90] c00000000000b57c start_here_common+0x1c/0x520
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When I added the spectre_v2 information in sysfs, I included the
availability of the ori31 speculation barrier.
Although the ori31 barrier can be used to mitigate v2, it's primarily
intended as a spectre v1 mitigation. Spectre v2 is mitigated by
hardware changes.
So rework the sysfs files to show the ori31 information in the
spectre_v1 file, rather than v2.
Currently we display eg:
$ grep . spectre_v*
spectre_v1:Mitigation: __user pointer sanitization
spectre_v2:Mitigation: Indirect branch cache disabled, ori31 speculation barrier enabled
After:
$ grep . spectre_v*
spectre_v1:Mitigation: __user pointer sanitization, ori31 speculation barrier enabled
spectre_v2:Mitigation: Indirect branch cache disabled
Fixes: d6fbe1c55c ("powerpc/64s: Wire up cpu_show_spectre_v2()")
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
for all CPUs to call in to the handler, but it does not wait for
completion of the handler. This is a needless complication, so remove
it and always wait synchronously.
The synchronous wait allows the caller to easily time out and clear
the wait for completion (zero nmi_ipi_busy_count) in the case of badly
behaved handlers. This would have prevented the recent smp_send_stop
NMI IPI bug from causing the system to hang.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the masked interrupt handler clears MSR[EE] for an interrupt in
the PACA_IRQ_MUST_HARD_MASK set, it does not set PACA_IRQ_HARD_DIS.
This makes them get out of synch.
With that taken into account, it's only low level irq manipulation
(and interrupt entry before reconcile) where they can be out of synch.
This makes the code less surprising.
It also allows the IRQ replay code to rely on the IRQ_HARD_DIS value
and not have to mtmsrd again in this case (e.g., for an external
interrupt that has been masked). The bigger benefit might just be
that there is not such an element of surprise in these two bits of
state.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When a thread forks the contents of AMR, IAMR, UAMOR registers in the
newly forked thread are not inherited.
Save the registers before forking, for content of those
registers to be automatically copied into the new thread.
Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This property was added in 2004 and the only use of it, which was
already inside `#if 0`, was removed a month later.
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NULL pointers are pointers to user memory space. So user pagetable
has to be set in order to avoid random behaviour in case of NULL
pointer dereference, otherwise we may encounter random memory
access hence Machine Check Exception from TLB Miss handlers.
Set user pagetable as early as possible in order to properly
catch early kernel NULL pointer dereference.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge in some commits we're sharing with the KVM tree.
I manually propagated the change from commit d3d4ffaae4
("powerpc/powernv/ioda2: Reduce upper limit for DMA window size") into
pci-ioda-tce.c.
Conflicts:
arch/powerpc/include/asm/cputable.h
arch/powerpc/platforms/powernv/pci-ioda.c
arch/powerpc/platforms/powernv/pci.h
On 64-bit servers, SPRN_SPRG3 and its userspace read-only mirror
SPRN_USPRG3 are used as userspace VDSO write and read registers
respectively.
SPRN_SPRG3 is lost when we enter stop4 and above, and is currently not
restored. As a result, any read from SPRN_USPRG3 returns zero on an
exit from stop4 (Power9 only) and above.
Thus in this situation, on POWER9, any call from sched_getcpu() always
returns zero, as on powerpc, we call __kernel_getcpu() which relies
upon SPRN_USPRG3 to report the CPU and NUMA node information.
Fix this by restoring SPRN_SPRG3 on wake up from a deep stop state
with the sprg_vdso value that is cached in PACA.
Fixes: e1c1cfed54 ("powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle")
Cc: stable@vger.kernel.org # v4.14+
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The vDSO needs to have a unique build id in a similar manner
to the kernel and modules. Use the build salt macro.
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
POWER9 DD1 was never a product. It is no longer supported by upstream
firmware, and it is not effectively supported in Linux due to lack of
testing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Remove arch_make_huge_pte() entirely]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we take an SLB multi-hit on bare metal, we see both the multi-hit
and parity error bits set in DSISR. The user manuals indicates this is
expected to always happen on Power8, whereas on Power9 it says a
multi-hit will "usually" also cause a parity error.
We decide what to do based on the various error tables in mce_power.c,
and because we process them in order and only report the first, we
currently always report a parity error but not the multi-hit, eg:
Severe Machine check interrupt [Recovered]
Initiator: CPU
Error type: SLB [Parity]
Effective address: c000000ffffd4300
Although this is correct, it leaves the user wondering why they got a
parity error. It would be clearer instead if we reported the
multi-hit because that is more likely to be simply a software bug,
whereas a true parity error is possibly an indication of a bad core.
We can do that simply by reordering the error tables so that multi-hit
appears before parity. That doesn't affect the error recovery at all,
because we flush the SLB either way.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This was added to support an early version of Power8 that did not have
working doorbells. These machines were not publicly available, and all of
the internal users have long since upgraded.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Memory reservation for crashkernel could fail if there are holes around
kdump kernel offset (128M). Fail gracefully in such cases and print an
error message.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: David Gibson <dgibson@redhat.com>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 59f47eff03 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
removed the 'oirq' variable, but kept memsetting it when the DEBUG macro is
defined.
When setting DEBUG macro for debugging purpose, the kernel fails to build since
'oirq' is not defined anymore.
This patch simply remove the debug block, since it does not seem to sense
now.
Fixes: 59f47eff03 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Due to recent refactoring in EEH in:
commit b9fde58db7 ("powerpc/powernv: Rework EEH initialization on
powernv")
a misleading message was seen in the kernel message buffer:
[ 0.108431] EEH: PowerNV platform initialized
[ 0.589979] EEH: No capable adapters found
This happened due to the removal of the initialization delay for powernv
platform.
Even though the EEH infrastructure for the devices is eventually
initialized and still works just fine the eeh device probe step is
postponed in order to assure the PEs are created. Later
pnv_eeh_post_init does the probe devices job but at that point the
message was already shown right after eeh_init flow.
This patch introduces a new flag EEH_POSTPONED_PROBE to represent that
temporary state and avoid the message mentioned above and showing the
follow one instead:
[ 0.107724] EEH: PowerNV platform initialized
[ 4.844825] EEH: PCI Enhanced I/O Error Handling Enabled
Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Tested-by:Venkat Rao B <vrbagal1@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- introduce __diag_* macros and suppress -Wattribute-alias warnings from GCC 8
- fix stack protector test script for x86_64
- fix line number handling in Kconfig
- document that '#' starts a comment in Kconfig
- handle P_SYMBOL property in dump debugging of Kconfig
- correct help message of LD_DEAD_CODE_DATA_ELIMINATION
- fix occasional segmentation faults in Kconfig
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbN450AAoJED2LAQed4NsGFdAP/0fc2NhkzQMvz1EBEc2n93LC
FUXew75tsX2ZewssoLzb4Iepkb/mHU+fjhBaE65S+Xu2/6mNfId9a7HAtywvFyO2
ZUQPXHjMHnLEPRKuzQy34uCy9/wWCiqi8rpWUsOEohmNIcLaF0vMZf5Ifod7wIr7
pnix3b9Q+dY+l49TSsSv4MX7F9qs5fXRhEarcQ3jYEb3yRUEXgmli3hV1wRita/n
tJhFDiIdJDeISDkgmHUuOhjFnv5Yf3WJTXi/ILZ2zvpGjjqNDAwxtyzGnPMShQEc
fxk3/1nkg9h/ScVAaGavrYYmiiH8XsqWY2q6p52jTK3kD+yTXaVakPSmxw8UHImh
aNWQutzMF8GYEsb+ld1ncsNrwfgd40mA25mEyb/ZPSw2IdNBrXtIVbw7XiBLi8eH
recAlRN0MouzD7+sXafgtoKopqanQbB/rMqDO4ULfnVvZLWDmZVbfreCc+qrJtiJ
mqydBMUVxrvB+qf5SHQ7WlDmXWHY1xQuxXzS0gRVGT14EsyD6yhC2D62pEHnB7uG
zE1pGemOCzOlGY6nDAbtQVR1n5AAWEZYveZXUuFn+vuqR7ZtYxCFUFOS0u621zFI
HMI9B81ifdNV2efT2VTVi6Tnnvn44sAXOYjaULX6566EyX0/mOL5CWZlTqn5SKOn
PwNxc7ZeCylTbkZww2c4
=oABr
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- introduce __diag_* macros and suppress -Wattribute-alias warnings
from GCC 8
- fix stack protector test script for x86_64
- fix line number handling in Kconfig
- document that '#' starts a comment in Kconfig
- handle P_SYMBOL property in dump debugging of Kconfig
- correct help message of LD_DEAD_CODE_DATA_ELIMINATION
- fix occasional segmentation faults in Kconfig
* tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: loop boundary condition fix
kbuild: reword help of LD_DEAD_CODE_DATA_ELIMINATION
kconfig: handle P_SYMBOL in print_symbol()
kconfig: document Kconfig source file comments
kconfig: fix line numbers for if-entries in menu tree
stack-protector: Fix test with 32-bit userland and CONFIG_64BIT=y
powerpc: Remove -Wattribute-alias pragmas
disable -Wattribute-alias warning for SYSCALL_DEFINEx()
kbuild: add macro for controlling warnings to linux/compiler.h
Migrate to the new API in order to remove arch_validate_hwbkpt_settings()
that clumsily mixes up architecture validation and commit
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can't pass the breakpoint directly on arch_check_bp_in_kernelspace()
anymore because its architecture internal datas (struct arch_hw_breakpoint)
are not yet filled by the time we call the function, and most
implementation need this backend to be up to date. So arrange the
function to take the probing struct instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With SYSCALL_DEFINEx() disabling -Wattribute-alias generically, there's
no need to duplicate that for PowerPC syscalls.
This reverts commit 4155203739 ("powerpc: fix build failure by
disabling attribute-alias warning in pci_32") and commit 2479bfc9bc
("powerpc: Fix build by disabling attribute-alias warning for
SYSCALL_DEFINEx").
Signed-off-by: Paul Burton <paul.burton@mips.com>
Acked-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Pull rseq fixes from Thomas Gleixer:
"A pile of rseq related fixups:
- Prevent infinite recursion when delivering SIGSEGV
- Remove the abort of rseq critical section on fork() as syscalls
inside rseq critical sections are explicitely forbidden. So no
point in doing the abort on the child.
- Align the rseq structure on 32 bytes in the ARM selftest code.
- Fix file permissions of the test script"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Avoid infinite recursion when delivering SIGSEGV
rseq/cleanup: Do not abort rseq c.s. in child on fork()
rseq/selftests/arm: Align 'struct rseq_cs' on 32 bytes
rseq/selftests: Make run_param_test.sh executable
When delivering a signal to a task that is using rseq, we call into
__rseq_handle_notify_resume() so that the registers pushed in the
sigframe are updated to reflect the state of the restartable sequence
(for example, ensuring that the signal returns to the abort handler if
necessary).
However, if the rseq management fails due to an unrecoverable fault when
accessing userspace or certain combinations of RSEQ_CS_* flags, then we
will attempt to deliver a SIGSEGV. This has the potential for infinite
recursion if the rseq code continuously fails on signal delivery.
Avoid this problem by using force_sigsegv() instead of force_sig(), which
is explicitly designed to reset the SEGV handler to SIG_DFL in the case
of a recursive fault. In doing so, remove rseq_signal_deliver() from the
internal rseq API and have an optional struct ksignal * parameter to
rseq_handle_notify_resume() instead.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: peterz@infradead.org
Cc: paulmck@linux.vnet.ibm.com
Cc: boqun.feng@gmail.com
Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
Clear current_kprobe and enable preemption in kprobe
even if pre_handler returns !0.
This simplifies function override using kprobes.
Jprobe used to require to keep the preemption disabled and
keep current_kprobe until it returned to original function
entry. For this reason kprobe_int3_handler() and similar
arch dependent kprobe handers checks pre_handler result
and exit without enabling preemption if the result is !0.
After removing the jprobe, Kprobes does not need to
keep preempt disabled even if user handler returns !0
anymore.
But since the function override handler in error-inject
and bpf is also returns !0 if it overrides a function,
to balancing the preempt count, it enables preemption
and reset current kprobe by itself.
That is a bad design that is very buggy. This fixes
such unbalanced preempt-count and current_kprobes setting
in kprobes, bpf and error-inject.
Note: for powerpc and x86, this removes all preempt_disable
from kprobe_ftrace_handler because ftrace callbacks are
called under preempt disabled.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: https://lore.kernel.org/lkml/152942494574.15209.12323837825873032258.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't call the ->break_handler() from the powerpc kprobes code,
because it was only used by jprobes which got removed.
This also removes skip_singlestep() and embeds it in the
caller, kprobe_ftrace_handler(), which simplifies regs->nip
operation around there.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/lkml/152942477127.15209.8982613703787878618.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove arch dependent setjump/longjump functions
and unused fields in kprobe_ctlblk for jprobes
from arch/powerpc. This also reverts commits
related __is_active_jprobe() function.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/lkml/152942445234.15209.12868722778364739753.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I broke the build when CONFIG_NMI_IPI=n with my recent commit to add
arch_trigger_cpumask_backtrace(), eg:
stacktrace.c:(.text+0x1b0): undefined reference to `.smp_send_safe_nmi_ipi'
We should rework the CONFIG symbols here in future to avoid these
double barrelled ifdefs but for now they fix the build.
Fixes: 5cc05910f2 ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
Reported-by: Christophe LEROY <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Similar to previous patches, hard disable interrupts when a CPU is
in panic. This reduces the chance the watchdog has to interfere with
the panic, and avoids any other type of masked interrupt being
executed when crashing which minimises the length of the crash path.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Marking CPUs stopped by smp_send_stop as offline can cause warnings
due to cross-CPU wakeups. This trace was noticed on a busy system
running a sysrq+c crash test, after the injected crash:
WARNING: CPU: 51 PID: 1546 at kernel/sched/core.c:1179 set_task_cpu+0x22c/0x240
CPU: 51 PID: 1546 Comm: kworker/u352:1 Tainted: G D
Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core]
[...]
NIP [c00000000017c21c] set_task_cpu+0x22c/0x240
LR [c00000000017d580] try_to_wake_up+0x230/0x720
Call Trace:
[c000000001017700] runqueues+0x0/0xb00 (unreliable)
[c00000000017d580] try_to_wake_up+0x230/0x720
[c00000000015a214] insert_work+0x104/0x140
[c00000000015adb0] __queue_work+0x230/0x690
[c000003fc5007910] [c00000000015b26c] queue_work_on+0x5c/0x90
[c0080000135fc8f8] mlx5_cmd_exec+0x538/0xcb0 [mlx5_core]
[c008000013608fd0] mlx5_core_access_reg+0x140/0x1d0 [mlx5_core]
[c00800001362777c] mlx5e_update_pport_counters.constprop.59+0x6c/0x90 [mlx5_core]
[c008000013628868] mlx5e_update_ndo_stats+0x28/0x90 [mlx5_core]
[c008000013625558] mlx5e_update_stats_work+0x68/0xb0 [mlx5_core]
[c00000000015bcec] process_one_work+0x1bc/0x5f0
[c00000000015ecac] worker_thread+0xac/0x6b0
[c000000000168338] kthread+0x168/0x1b0
[c00000000000b628] ret_from_kernel_thread+0x5c/0xb4
This happens because firstly the CPU is not really offline in the
usual sense, processes and interrupts have not been migrated away.
Secondly smp_send_stop does not happen atomically on all CPUs, so
one CPU can have marked itself offline, while another CPU is still
running processes or interrupts which can affect the first CPU.
Fix this by just not marking the CPU as offline. It's more like
frozen in time, so offline does not really reflect its state properly
anyway. There should be nothing in the crash/panic path that walks
online CPUs and synchronously waits for them, so this change should
not introduce new hangs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Similarly to commit 855bfe0de1 ("powerpc: hard disable irqs in
smp_send_stop loop"), irqs should be hard disabled by
panic_smp_self_stop.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the device tree CPU features quirk code we want to set
CPU_FTR_POWER9_DD2_1 on all Power9s that aren't DD2.0 or earlier. But
we got the logic wrong and instead set it on all CPUs that aren't
Power9 DD2.0 or earlier, ie. including Power8.
Fix it by making sure we're on a Power9. This isn't a bug in practice
because the only code that checks the feature is Power9 only to begin
with. But we'll backport it anyway to avoid confusion.
Fixes: 9e9626ed3a ("powerpc/64s: Fix POWER9 DD2.2 and above in DT CPU features")
Cc: stable@vger.kernel.org # v4.17+
Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- fix some bugs introduced by the recent Kconfig syntax extension
- add some symbols about compiler information in Kconfig, such as
CC_IS_GCC, CC_IS_CLANG, GCC_VERSION, etc.
- test compiler capability for the stack protector in Kconfig, and
clean-up Makefile
- test compiler capability for GCC-plugins in Kconfig, and clean-up
Makefile
- allow to enable GCC-plugins for COMPILE_TEST
- test compiler capability for KCOV in Kconfig and correct dependency
- remove auto-detect mode of the GCOV format, which is now more nicely
handled in Kconfig
- test compiler capability for mprofile-kernel on PowerPC, and
clean-up Makefile
- misc cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbISvEAAoJED2LAQed4NsGEsoQAKBHMqUM9yQo0LdVMnDMCLQI
Xsjyqzr0ySp6YiuF+cobwDs49sggt7/8EX+OnrP/sLlAhY0QrNGI1ulhwpFx1Ewa
xFxz5kF/1jDwC+AjngXcK5Dr9nGSSMfT3wQhLGKjMkKSypbz2QyTrfMOfHGYSzU1
gD8RMWYXxKoJFmIaqmpLz7PDfWKPzhSOZo7BflPjAGXdlpfSV9cQvu+TkJ12qvSp
KZ2uHUgLz95NnltSuGtN71X8so7w4eTYAvkJ5bOeOpYsZSVYRq4Exvwe0Y0dbwie
WDpcRC5KrQOlIFxRUUSGn5cDsaW9yYJJAwMG6Dr8qJ66QlgY5GqOKXxXX+ARa7WU
7GkeAZ11n5dArjjdSjfClh8CwDiZNpJmAUbahm+feQfUfq9nbs+0JX6bOG5ZE+nt
3iE0ZoSGDjxD5Pjy4u+NtQM0JCpieuz3JNxqVbAVm0Ua5q8niwSEneixyrNmjkBF
1tV+qsMYus7AFwdGuDRXzBhVY7hd931H34czA3FUZZqwcClFVoJiygI++s62mVXx
w9kYi8Ades/W6dt7c7XGjmqYTDgnTolLaYY5vggpEeLOzc1QPW6iKt9tpREi6Zzm
n+y586YsIo0vjTMfRcfmGZUPG3CJeqL2UDslYmG8PgMQ6/eaAHBDXECLrAkGGPlG
aIPZcMam5BQxhmSJc19c
=VABv
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix some bugs introduced by the recent Kconfig syntax extension
- add some symbols about compiler information in Kconfig, such as
CC_IS_GCC, CC_IS_CLANG, GCC_VERSION, etc.
- test compiler capability for the stack protector in Kconfig, and
clean-up Makefile
- test compiler capability for GCC-plugins in Kconfig, and clean-up
Makefile
- allow to enable GCC-plugins for COMPILE_TEST
- test compiler capability for KCOV in Kconfig and correct dependency
- remove auto-detect mode of the GCOV format, which is now more nicely
handled in Kconfig
- test compiler capability for mprofile-kernel on PowerPC, and clean-up
Makefile
- misc cleanups
* tag 'kbuild-v4.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
linux/linkage.h: replace VMLINUX_SYMBOL_STR() with __stringify()
kconfig: fix localmodconfig
sh: remove no-op macro VMLINUX_SYMBOL()
powerpc/kbuild: move -mprofile-kernel check to Kconfig
Documentation: kconfig: add recommended way to describe compiler support
gcc-plugins: disable GCC_PLUGIN_STRUCTLEAK_BYREF_ALL for COMPILE_TEST
gcc-plugins: allow to enable GCC_PLUGINS for COMPILE_TEST
gcc-plugins: test plugin support in Kconfig and clean up Makefile
gcc-plugins: move GCC version check for PowerPC to Kconfig
kcov: test compiler capability in Kconfig and correct dependency
gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT
arm64: move GCC version check for ARCH_SUPPORTS_INT128 to Kconfig
kconfig: add CC_IS_CLANG and CLANG_VERSION
kconfig: add CC_IS_GCC and GCC_VERSION
stack-protector: test compiler capability in Kconfig and drop AUTO mode
kbuild: fix endless syncconfig in case arch Makefile sets CROSS_COMPILE
This eliminates the workaround that requires disabling
-mprofile-kernel by default in Kconfig.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Pull restartable sequence support from Thomas Gleixner:
"The restartable sequences syscall (finally):
After a lot of back and forth discussion and massive delays caused by
the speculative distraction of maintainers, the core set of
restartable sequences has finally reached a consensus.
It comes with the basic non disputed core implementation along with
support for arm, powerpc and x86 and a full set of selftests
It was exposed to linux-next earlier this week, so it does not fully
comply with the merge window requirements, but there is really no
point to drag it out for yet another cycle"
* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq/selftests: Provide Makefile, scripts, gitignore
rseq/selftests: Provide parametrized tests
rseq/selftests: Provide basic percpu ops test
rseq/selftests: Provide basic test
rseq/selftests: Provide rseq library
selftests/lib.mk: Introduce OVERRIDE_TARGETS
powerpc: Wire up restartable sequences system call
powerpc: Add syscall detection for restartable sequences
powerpc: Add support for restartable sequences
x86: Wire up restartable sequence system call
x86: Add support for restartable sequences
arm: Wire up restartable sequences system call
arm: Add syscall detection for restartable sequences
arm: Add restartable sequences support
rseq: Introduce restartable sequences system call
uapi/headers: Provide types_32_64.h
Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
patching again.
- Add support for patching barrier_nospec in copy_from_user() and syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
- Several series optimising various parts of the 32-bit code from Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
Thanks to:
Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
Yisheng Xie, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
=ZlBf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
live patching again.
- Add support for patching barrier_nospec in copy_from_user() and
syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu
Malaterre.
- Several series optimising various parts of the 32-bit code from
Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K"
("GEFanuc,C2K"), which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by
Steve, and a fix to our powernv cpuidle driver. Then there's a series
touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
been in next for several weeks.
Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"
* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
cpuidle: powernv: Fix promotion from snooze if next state disabled
powerpc: fix build failure by disabling attribute-alias warning in pci_32
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
powerpc/pkeys: Detach execute_only key on !PROT_EXEC
powerpc/powernv: copy/paste - Mask SO bit in CR
powerpc: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove support for Marvell mv64x60 i2c controller
powerpc/boot: Remove support for Marvell MPSC serial controller
powerpc/embedded6xx: Remove C2K board support
powerpc/lib: optimise PPC32 memcmp
powerpc/lib: optimise 32 bits __clear_user()
powerpc/time: inline arch_vtime_task_switch()
powerpc/Makefile: set -mcpu=860 flag for the 8xx
powerpc: Implement csum_ipv6_magic in assembly
powerpc/32: Optimise __csum_partial()
powerpc/lib: Adjust .balign inside string functions for PPC32
...
- improve fixdep to coalesce consecutive slashes in dep-files
- fix some issues of the maintainer string generation in deb-pkg script
- remove unused CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX and clean-up
several tools and linker scripts
- clean-up modpost
- allow to enable the dead code/data elimination for PowerPC in EXPERT mode
- improve two coccinelle scripts for better performance
- pass endianness and machine size flags to sparse for all architecture
- misc fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbF/yvAAoJED2LAQed4NsGEPgP/2qBg7w4raGvQtblqGY1qo6j
3xGKYUKdg3GhIRf1zB9lPwkAmQcyLKzKlet/gYoTUTLKbfRUX8wDzJf/3TV0kpLW
QQ2HM1/jsqrD1HSO21OPJ1rzMSNn1NcOSLWSeOLWUBorHkkvAHlenJcJSOo6szJr
tTgEN78T/9id/artkFqdG+1Q3JhnI5FfH3u0lE20Eqxk5AAxrUKArHYsgRjgOg9o
8DlHDTRsnTiUd4TtmC+VYSZK1BHz1ORlANaRiL69T+BGFZGNCvRSV09QkaD+ObxT
dB4TTJne32Qg6g5qYX0bzLqfRdfJ8tpmJGQkycf3OT1rLgmDbWFaaOEDQTAe3mSw
nT6ZbpQB1OoTgMD2An9ApWfUQRfsMnujm/pRP+BkRdKKkMJvXJCH7PvFw8rjqTt3
PjK6DGbpG6H0G+DePtthMHrz/TU6wi5MFf7kQxl0AtFmpa3R0q67VhdM04BEYNCq
Dbs1YaXWKKi101k14oSQ0kmRasZ9Jz5tvyfZ7wvy1LpGONXxtEbc6JQyBJ6tmf4f
fCAxvHLSb/TQSmJhk9Rch7uPYT9B9hC16dseMrF9Pab8yR346fz70L1UdFE10j3q
iKFbYkueq8uJCJDxNktsgHzbOF6Le5vaWauOafRN26K7p7+CRpVOy0O2bknX3yDa
hKOGzCfQjT8sfdMmtyIH
=2LYT
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- improve fixdep to coalesce consecutive slashes in dep-files
- fix some issues of the maintainer string generation in deb-pkg script
- remove unused CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX and clean-up
several tools and linker scripts
- clean-up modpost
- allow to enable the dead code/data elimination for PowerPC in EXPERT
mode
- improve two coccinelle scripts for better performance
- pass endianness and machine size flags to sparse for all architecture
- misc fixes
* tag 'kbuild-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (25 commits)
kbuild: add machine size to CHECKFLAGS
kbuild: add endianness flag to CHEKCFLAGS
kbuild: $(CHECK) doesnt need NOSTDINC_FLAGS twice
scripts: Fixed printf format mismatch
scripts/tags.sh: use `find` for $ALLSOURCE_ARCHS generation
coccinelle: deref_null: improve performance
coccinelle: mini_lock: improve performance
powerpc: Allow LD_DEAD_CODE_DATA_ELIMINATION to be selected
kbuild: Allow LD_DEAD_CODE_DATA_ELIMINATION to be selectable if enabled
kbuild: LD_DEAD_CODE_DATA_ELIMINATION no -ffunction-sections/-fdata-sections for module build
kbuild: Fix asm-generic/vmlinux.lds.h for LD_DEAD_CODE_DATA_ELIMINATION
modpost: constify *modname function argument where possible
modpost: remove redundant is_vmlinux() test
modpost: use strstarts() helper more widely
modpost: pass struct elf_info pointer to get_modinfo()
checkpatch: remove VMLINUX_SYMBOL() check
vmlinux.lds.h: remove no-op macro VMLINUX_SYMBOL()
kbuild: remove CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX
export.h: remove code for prefixing symbols with underscore
depmod.sh: remove symbol prefix support
...
Syscalls are not allowed inside restartable sequences, so add a call to
rseq_syscall() at the very beginning of system call exiting path for
CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
is a syscall issued inside restartable sequences.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-10-mathieu.desnoyers@efficios.com
Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.
Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-9-mathieu.desnoyers@efficios.com
Commit 2479bfc9bc ("powerpc: Fix build by disabling attribute-alias
warning for SYSCALL_DEFINEx") forgot arch/powerpc/kernel/pci_32.c
Latest GCC version emit the following warnings
As arch/powerpc code is built with -Werror, this breaks build with
GCC 8.1
This patch inhibits this warning
In file included from arch/powerpc/kernel/pci_32.c:14:
./include/linux/syscalls.h:233:18: error: 'sys_pciconfig_iobase' alias between functions of incompatible types 'long int(long int, long unsigned int, long unsigned int)' and 'long int(long int, long int, long int)' [-Werror=attribute-alias]
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
^~~
./include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
^~~~~~~~~~~~~~~~~
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull timers and timekeeping updates from Thomas Gleixner:
- Core infrastucture work for Y2038 to address the COMPAT interfaces:
+ Add a new Y2038 safe __kernel_timespec and use it in the core
code
+ Introduce config switches which allow to control the various
compat mechanisms
+ Use the new config switch in the posix timer code to control the
32bit compat syscall implementation.
- Prevent bogus selection of CPU local clocksources which causes an
endless reselection loop
- Remove the extra kthread in the clocksource code which has no value
and just adds another level of indirection
- The usual bunch of trivial updates, cleanups and fixlets all over the
place
- More SPDX conversions
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
clocksource/drivers/mxs_timer: Switch to SPDX identifier
clocksource/drivers/timer-imx-tpm: Switch to SPDX identifier
clocksource/drivers/timer-imx-gpt: Switch to SPDX identifier
clocksource/drivers/timer-imx-gpt: Remove outdated file path
clocksource/drivers/arc_timer: Add comments about locking while read GFRC
clocksource/drivers/mips-gic-timer: Add pr_fmt and reword pr_* messages
clocksource/drivers/sprd: Fix Kconfig dependency
clocksource: Move inline keyword to the beginning of function declarations
timer_list: Remove unused function pointer typedef
timers: Adjust a kernel-doc comment
tick: Prefer a lower rating device only if it's CPU local device
clocksource: Remove kthread
time: Change nanosleep to safe __kernel_* types
time: Change types to new y2038 safe __kernel_* types
time: Fix get_timespec64() for y2038 safe compat interfaces
time: Add new y2038 safe __kernel_timespec
posix-timers: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME
time: Introduce CONFIG_COMPAT_32BIT_TIME
time: Introduce CONFIG_64BIT_TIME in architectures
compat: Enable compat_get/put_timespec64 always
...
Pull siginfo updates from Eric Biederman:
"This set of changes close the known issues with setting si_code to an
invalid value, and with not fully initializing struct siginfo. There
remains work to do on nds32, arc, unicore32, powerpc, arm, arm64, ia64
and x86 to get the code that generates siginfo into a simpler and more
maintainable state. Most of that work involves refactoring the signal
handling code and thus careful code review.
Also not included is the work to shrink the in kernel version of
struct siginfo. That depends on getting the number of places that
directly manipulate struct siginfo under control, as it requires the
introduction of struct kernel_siginfo for the in kernel things.
Overall this set of changes looks like it is making good progress, and
with a little luck I will be wrapping up the siginfo work next
development cycle"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
signal/sh: Stop gcc warning about an impossible case in do_divide_error
signal/mips: Report FPE_FLTUNK for undiagnosed floating point exceptions
signal/um: More carefully relay signals in relay_signal.
signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR}
signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code
signal/signalfd: Add support for SIGSYS
signal/signalfd: Remove __put_user from signalfd_copyinfo
signal/xtensa: Use force_sig_fault where appropriate
signal/xtensa: Consistenly use SIGBUS in do_unaligned_user
signal/um: Use force_sig_fault where appropriate
signal/sparc: Use force_sig_fault where appropriate
signal/sparc: Use send_sig_fault where appropriate
signal/sh: Use force_sig_fault where appropriate
signal/s390: Use force_sig_fault where appropriate
signal/riscv: Replace do_trap_siginfo with force_sig_fault
signal/riscv: Use force_sig_fault where appropriate
signal/parisc: Use force_sig_fault where appropriate
signal/parisc: Use force_sig_mceerr where appropriate
signal/openrisc: Use force_sig_fault where appropriate
signal/nios2: Use force_sig_fault where appropriate
...
- replaceme the force_dma flag with a dma_configure bus method.
(Nipun Gupta, although one patch is іncorrectly attributed to me
due to a git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few cleanups
to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlsU1hwLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPraxAAocC7JiFKW133/VugCtGA1x9uE8DPHealtsWTAeEq
KOOB3GxWMU2hKqQ4km5tcfdWoGJvvab6hmDXcitzZGi2JajO7Ae0FwIy3yvxSIKm
iH/ON7c4sJt8gKrXYsLVylmwDaimNs4a6xfODoCRgnWuovI2QrrZzupnlzPNsiOC
lv8ezzcW+Ay/gvDD/r72psO+w3QELETif/OzR/qTOtvLrVabM06eHmPQ8Wb98smu
/UPMMv6/3XwQnxpxpdyqN+p/gUdneXithzT261wTeZ+8gDXmcWBwHGcMBCimcoBi
FklW52moazIPIsTysqoNlVFsLGJTeS4p2D3BLAp5NwWYsLv+zHUVZsI1JY/8u5Ox
mM11LIfvu9JtUzaqD9SvxlxIeLhhYZZGnUoV3bQAkpHSQhN/xp2YXd5NWSo5ac2O
dch83+laZkZgd6ryw6USpt/YTPM/UHBYy7IeGGHX/PbmAke0ZlvA6Rae7kA5DG59
7GaLdwQyrHp8uGFgwze8P+R4POSk1ly73HHLBT/pFKnDD7niWCPAnBzuuEQGJs00
0zuyWLQyzOj1l6HCAcMNyGnYSsMp8Fx0fvEmKR/EYs8O83eJKXi6L9aizMZx4v1J
0wTolUWH6SIIdz474YmewhG5YOLY7mfe9E8aNr8zJFdwRZqwaALKoteRGUxa3f6e
zUE=
=6Acj
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- replace the force_dma flag with a dma_configure bus method. (Nipun
Gupta, although one patch is іncorrectly attributed to me due to a
git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few
cleanups to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
* tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping: (48 commits)
dma-direct: don't crash on device without dma_mask
nds32: use generic dma_noncoherent_ops
nds32: implement the unmap_sg DMA operation
nds32: consolidate DMA cache maintainance routines
x86/pci-dma: switch the VIA 32-bit DMA quirk to use the struct device flag
x86/pci-dma: remove the explicit nodac and allowdac option
x86/pci-dma: remove the experimental forcesac boot option
Documentation/x86: remove a stray reference to pci-nommu.c
core, dma-direct: add a flag 32-bit dma limits
dma-mapping: remove unused gfp_t parameter to arch_dma_alloc_attrs
dma-debug: check scatterlist segments
c6x: use generic dma_noncoherent_ops
arc: use generic dma_noncoherent_ops
arc: fix arc_dma_{map,unmap}_page
arc: fix arc_dma_sync_sg_for_{cpu,device}
arc: simplify arc_dma_sync_single_for_{cpu,device}
dma-mapping: provide a generic dma-noncoherent implementation
dma-mapping: simplify Kconfig dependencies
riscv: add swiotlb support
riscv: only enable ZONE_DMA32 for 64-bit
...
arch_vtime_task_switch() is a small function which is called
only from vtime_common_task_switch(), so it is worth inlining
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use fault_in_pages_readable() to prefault user context
instead of open coding
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
reloc_offset() is the same as add_reloc_offset(0)
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Direction is already checked in all calling functions in
include/linux/dma-mapping.h and also in called function __dma_sync()
So really no need to check it once more here.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We now have barrier_nospec as mitigation so print it in
cpu_show_spectre_v1() when enabled.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Our syscall entry is done in assembly so patch in an explicit
barrier_nospec.
Based on a patch by Michal Suchanek.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Check what firmware told us and enable/disable the barrier_nospec as
appropriate.
We err on the side of enabling the barrier, as it's no-op on older
systems, see the comment for more detail.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Note that unlike RFI which is patched only in kernel the nospec state
reflects settings at the time the module was loaded.
Iterating all modules and re-patching every time the settings change
is not implemented.
Based on lwsync patching.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Based on the RFI patching. This is required to be able to disable the
speculation barrier.
Only one barrier type is supported and it does nothing when the
firmware does not enable it. Also re-patching modules is not supported
So the only meaningful thing that can be done is patching out the
speculation barrier at boot when the user says it is not wanted.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This now has new code in it written by Nick and I, and switch to a
SPDX tag.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
This allows eg. the RCU stall detector, or the soft/hardlockup
detectors to trigger a backtrace on all CPUs.
We implement this by sending a "safe" NMI, which will actually only
send an IPI. Unfortunately the generic code prints "NMI", so that's a
little confusing but we can probably live with it.
If one of the CPUs doesn't respond to the IPI, we then print some info
from it's paca and do a backtrace based on its saved_r1.
Example output:
INFO: rcu_sched detected stalls on CPUs/tasks:
2-...0: (0 ticks this GP) idle=1be/1/4611686018427387904 softirq=1055/1055 fqs=25735
(detected by 4, t=58847 jiffies, g=58, c=57, q=1258)
Sending NMI from CPU 4 to CPUs 2:
CPU 2 didn't respond to backtrace IPI, inspecting paca.
irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 3623 (bash)
Back trace of paca->saved_r1 (0xc0000000e1c83ba0) (possibly stale):
Call Trace:
[c0000000e1c83ba0] [0000000000000014] 0x14 (unreliable)
[c0000000e1c83bc0] [c000000000765798] lkdtm_do_action+0x48/0x80
[c0000000e1c83bf0] [c000000000765a40] direct_entry+0x110/0x1b0
[c0000000e1c83c90] [c00000000058e650] full_proxy_write+0x90/0xe0
[c0000000e1c83ce0] [c0000000003aae3c] __vfs_write+0x6c/0x1f0
[c0000000e1c83d80] [c0000000003ab214] vfs_write+0xd4/0x240
[c0000000e1c83dd0] [c0000000003ab5cc] ksys_write+0x6c/0x110
[c0000000e1c83e30] [c00000000000b860] system_call+0x58/0x6c
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Currently the options we have for sending NMIs are not necessarily
safe, that is they can potentially interrupt a CPU in a
non-recoverable region of code, meaning the kernel must then panic().
But we'd like to use smp_send_nmi_ipi() to do cross-CPU calls in
situations where we don't want to risk a panic(), because it doesn't
have the requirement that interrupts must be enabled like
smp_call_function().
So add an API for the caller to indicate that it wants to use the NMI
infrastructure, but doesn't want to do anything "unsafe".
Currently that is implemented by not actually calling cause_nmi_ipi(),
instead falling back to an IPI. In future we can pass the safe
parameter down to cause_nmi_ipi() and the individual backends can
potentially take it into account before deciding what to do.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
A CPU that gets stuck with interrupts hard disable can be difficult to
debug, as on some platforms we have no way to interrupt the CPU to
find out what it's doing.
A stop-gap is to have the CPU save it's stack pointer (r1) in its paca
when it hard disables interrupts. That way if we can't interrupt it,
we can at least trace the stack based on where it last disabled
interrupts.
In some cases that will be total junk, but the stack trace code should
handle that. In the simple case of a CPU that disable interrupts and
then gets stuck in a loop, the stack trace should be informative.
We could clear the saved stack pointer when we enable interrupts, but
that loses information which could be useful if we have nothing else
to go on.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
set_fs() sets the addr_limit, which is used in access_ok() to
determine if an address is a user or kernel address.
Some code paths use set_fs() to temporarily elevate the addr_limit so
that kernel code can read/write kernel memory as if it were user
memory. That is fine as long as the code can't ever return to
userspace with the addr_limit still elevated.
If that did happen, then userspace can read/write kernel memory as if
it were user memory, eg. just with write(2). In case it's not clear,
that is very bad. It has also happened in the past due to bugs.
Commit 5ea0727b16 ("x86/syscalls: Check address limit on user-mode
return") added a mechanism to check the addr_limit value before
returning to userspace. Any call to set_fs() sets a thread flag,
TIF_FSCHECK, and if we see that on the return to userspace we go out
of line to check that the addr_limit value is not elevated.
For further info see the above commit, as well as:
https://lwn.net/Articles/722267/https://bugs.chromium.org/p/project-zero/issues/detail?id=990
Verified to work on 64-bit Book3S using a POC that objdumps the system
call handler, and a modified lkdtm_CORRUPT_USER_DS() that doesn't kill
the caller.
Before:
$ sudo ./test-tif-fscheck
...
0000000000000000 <.data>:
0: e1 f7 8a 79 rldicl. r10,r12,30,63
4: 80 03 82 40 bne 0x384
8: 00 40 8a 71 andi. r10,r12,16384
c: 78 0b 2a 7c mr r10,r1
10: 10 fd 21 38 addi r1,r1,-752
14: 08 00 c2 41 beq- 0x1c
18: 58 09 2d e8 ld r1,2392(r13)
1c: 00 00 41 f9 std r10,0(r1)
20: 70 01 61 f9 std r11,368(r1)
24: 78 01 81 f9 std r12,376(r1)
28: 70 00 01 f8 std r0,112(r1)
2c: 78 00 41 f9 std r10,120(r1)
30: 20 00 82 41 beq 0x50
34: a6 42 4c 7d mftb r10
After:
$ sudo ./test-tif-fscheck
Killed
And in dmesg:
Invalid address limit on user-mode return
WARNING: CPU: 1 PID: 3689 at ../include/linux/syscalls.h:260 do_notify_resume+0x140/0x170
...
NIP [c00000000001ee50] do_notify_resume+0x140/0x170
LR [c00000000001ee4c] do_notify_resume+0x13c/0x170
Call Trace:
do_notify_resume+0x13c/0x170 (unreliable)
ret_from_except_lite+0x70/0x74
Performance overhead is essentially zero in the usual case, because
the bit is checked as part of the existing _TIF_USER_WORK_MASK check.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In PPC_PTRACE_GETHWDBGINFO and PPC_PTRACE_SETHWDEBUG we do an
access_ok() check and then __copy_{from,to}_user().
Instead we should just use copy_{from,to}_user() which does all that
for us and is less error prone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The EEH report functions now share a fair bit of code around the start
and end of each function.
So factor out as much as possible, and move the traversal into a
custom function. This also allows accurate debug to be generated more
easily.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
[mpe: Format with clang-format]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If a device without a driver is recovered via EEH, the flag
EEH_DEV_NO_HANDLER is incorrectly left set on the device after
recovery, because the test in eeh_report_resume() for the existence of
a bound driver is done before the flag is cleared. If a driver is
later bound, and EEH experienced again, some of the drivers EEH
handers are not called.
To correct this, clear the flag unconditionally after EEH processing
is complete.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To ease future refactoring, extract calls to eeh_enable_irq() and
eeh_disable_irq() from the various report functions. This makes
the report functions initial sequences more similar, as well as making
the IRQ changes visible when reading eeh_handle_normal_event().
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To ease future refactoring, extract setting of the channel state
from the report functions out into their own functions. This increases
the amount of code that is identical across all of the report
functions.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The same test is done in every EEH report function, so factor it out.
Since eeh_dev_removed() needs to be moved higher up in the file,
simplify it a little while we're at it.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a for_each-style macro for iterating through PEs without the
boilerplate required by a traversal function. eeh_pe_next() is now
exported, as it is now used directly in place.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As EEH event handling progresses, a cumulative result of type
pci_ers_result is built up by (some of) the eeh_report_*() functions
using either:
if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
if (*res == PCI_ERS_RESULT_NONE) *res = rc;
or:
if ((*res == PCI_ERS_RESULT_NONE) ||
(*res == PCI_ERS_RESULT_RECOVERED)) *res = rc;
if (*res == PCI_ERS_RESULT_DISCONNECT &&
rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
(Where *res is the accumulator.)
However, the intent is not immediately clear and the result in some
situations is order dependent.
Address this by assigning a priority to each result value, and always
merging to the highest priority. This renders the intent clear, and
provides a stable value for all orderings.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
[mpe: Minor formatting (clang-format)]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To aid debugging, add a message to show when EEH processing for a PE
will be done at the device's parent, rather than directly at the
device.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The traversal functions eeh_pe_traverse() and eeh_pe_dev_traverse()
both provide their first argument as void * but every single user casts
it to the expected type.
Change the type of the first parameter from void * to the appropriate
type, and clean up all uses.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Correct two cases where eeh_pcid_get() is used to reference the driver's
module but the reference is dropped before the driver pointer is used.
In eeh_rmv_device() also refactor a little so that only two calls to
eeh_pcid_put() are needed, rather than three and the reference isn't
taken at all if it wasn't needed.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a single log line at the end of successful EEH recovery, so that
it's clear that event processing has finished.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
to_tm() is now completely unused, the only reference being in the
_dump_time() helper that is also unused. This removes both, leaving
the rest of the powerpc RTC code y2038 safe to as far as the hardware
supports.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
update_persistent_clock() is deprecated because it suffers from overflow
in 2038 on 32-bit architectures. This changes powerpc to use the
update_persistent_clock64() replacement, and to pass down 64-bit
timestamps consistently.
This is now simpler, as we no longer have to worry about the offset
numbers in tm_year and tm_mon that are different between the Linux
conventions and RTAS.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Looking through the remaining users of the deprecated mktime()
function, I found the powerpc rtc handlers, which use it in
place of rtc_tm_to_time64().
To clean this up, I'm changing over the read_persistent_clock()
function to the read_persistent_clock64() variant, and change
all the platform specific handlers along with it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The to_tm() helper function operates on a signed integer for the time,
so it will suffer from overflow in 2038, even on 64-bit kernels.
Rather than fix that function, this replaces its use in the rtas
procfs implementation with the standard rtc_time64_to_tm() helper
that is very similar but is not affected by the overflow.
In order to actually support long times, the parser function gets
changed to 64-bit user input and output as well. Note that the tm_mon
and tm_year representation is slightly different, so we have to manually
add an offset here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we do not have an isync, or any other context synchronizing
instruction prior to the slbie/slbmte in _switch() that updates the
SLB entry for the kernel stack.
However that is not correct as outlined in the ISA.
From Power ISA Version 3.0B, Book III, Chapter 11, page 1133:
"Changing the contents of ... the contents of SLB entries ... can
have the side effect of altering the context in which data
addresses and instruction addresses are interpreted, and in which
instructions are executed and data accesses are performed.
...
These side effects need not occur in program order, and therefore
may require explicit synchronization by software.
...
The synchronizing instruction before the context-altering
instruction ensures that all instructions up to and including that
synchronizing instruction are fetched and executed in the context
that existed before the alteration."
And page 1136:
"For data accesses, the context synchronizing instruction before the
slbie, slbieg, slbia, slbmte, tlbie, or tlbiel instruction ensures
that all preceding instructions that access data storage have
completed to a point at which they have reported all exceptions
they will cause."
We're not aware of any bugs caused by this, but it should be fixed
regardless.
Add the missing isync when updating kernel stack SLB entry.
Cc: stable@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Flesh out change log with more ISA text & explanation]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current implementation of TID allocation, using a global IDR, may
result in an errant process starving the system of available TIDs.
Instead, use task_pid_nr(), as mentioned by the original author. The
scenario described which prevented it's use is not applicable, as
set_thread_tidr can only be called after the task struct has been
populated.
In the unlikely event that 2 threads share the TID and are waiting,
all potential outcomes have been determined safe.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Switch the use of TIDR on it's CPU feature, rather than assuming it
is available based on architecture.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds a CPU feature bit to show whether the CPU has
the TIDR register available, enabling as_notify/wait in userspace.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the soft enabled flag was changed to a soft disable mask, xmon
and register dump code was not updated to reflect that, which is
confusing ('SOFTE: 1' previously meant interrupts were soft enabled,
currently it means the opposite, the general interrupt type has been
disabled).
Fix this by using the name irqmask, and printing it in hex.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These are not local timer interrupts but IPIs. It's good to be able
to see how timer offloading is behaving, so split these out into
their own category.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Large decrementers (e.g., POWER9) can take a very long time to wrap,
so when the timer iterrupt handler sets the decrementer to max so as
to avoid taking another decrementer interrupt when hard enabling
interrupts before running timers, it effectively disables the soft
NMI coverage for timer interrupts.
Fix this by using the traditional 31-bit value instead, which wraps
after a few seconds. masked interrupt code does the same thing, and
in normal operation neither of these paths would ever wrap even the
31 bit value.
Note: the SMP watchdog should catch timer interrupt lockups, but it
is preferable for the local soft-NMI to catch them, mainly to avoid
the IPI.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The broadcast tick recipient can call tick_receive_broadcast rather
than re-running the full timer interrupt.
It does not have to check for the next event time, because the sender
already determined the timer has expired. It does not have to test
irq_work_pending, because that's a direct decrementer interrupt and
does not go through the clock events subsystem. And it does not have
to read PURR because that was removed with the previous patch.
This results in no code size change, but both the decrementer and
broadcast path lengths are reduced.
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For SPLPAR, lparcfg provides a sum of PURR registers for all CPUs.
Currently this is done by reading PURR in context switch and timer
interrupt, and storing that into a per-CPU variable. These are summed
to provide the value.
This does not work with all timer schemes (e.g., NO_HZ_FULL), and it
is sub-optimal for performance because it reads the PURR register on
every context switch, although that's been difficult to distinguish
from noise in the contxt_switch microbenchmark.
This patch implements the sum by calling a function on each CPU, to
read and add PURR values of each CPU.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
irq_work_raise should not cause a decrementer exception unless it is
called from NMI context. Doing so often just results in an immediate
masked decrementer interrupt:
<...>-550 90d... 4us : update_curr_rt <-dequeue_task_rt
<...>-550 90d... 5us : dbs_update_util_handler <-update_curr_rt
<...>-550 90d... 6us : arch_irq_work_raise <-irq_work_queue
<...>-550 90d... 7us : soft_nmi_interrupt <-soft_nmi_common
<...>-550 90d... 7us : printk_nmi_enter <-soft_nmi_interrupt
<...>-550 90d.Z. 8us : rcu_nmi_enter <-soft_nmi_interrupt
<...>-550 90d.Z. 9us : rcu_nmi_exit <-soft_nmi_interrupt
<...>-550 90d... 9us : printk_nmi_exit <-soft_nmi_interrupt
<...>-550 90d... 10us : cpuacct_charge <-update_curr_rt
The soft_nmi_interrupt here is the call into the watchdog, due to the
decrementer interrupt firing with irqs soft-disabled. This is
harmless, but sub-optimal.
When it's not called from NMI context or with interrupts enabled, mark
the decrementer pending in the irq_happened mask directly, rather than
having the masked decrementer interupt handler do it. This will be
replayed at the next local_irq_enable. See the comment for details.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GCC 8.1 emits warnings such as the following. As arch/powerpc code is
built with -Werror, this breaks the build with GCC 8.1.
In file included from arch/powerpc/kernel/pci_64.c:23:
./include/linux/syscalls.h:233:18: error: 'sys_pciconfig_iobase' alias
between functions of incompatible types 'long int(long int, long
unsigned int, long unsigned int)' and 'long int(long int, long int,
long int)' [-Werror=attribute-alias]
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
^~~
./include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
This patch inhibits those warnings.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Trim change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GCC 8.1 warns about possible string truncation:
arch/powerpc/kernel/nvram_64.c:1042:2: error: 'strncpy' specified
bound 12 equals destination size [-Werror=stringop-truncation]
strncpy(new_part->header.name, name, 12);
arch/powerpc/platforms/ps3/repository.c:106:2: error: 'strncpy'
output truncated before terminating nul copying 8 bytes from a
string of the same length [-Werror=stringop-truncation]
strncpy((char *)&n, text, 8);
Fix it by using memcpy(). To make that safe we need to ensure the
destination is pre-zeroed. Use kzalloc() in the nvram code and
initialise the u64 to zero in the ps3 code.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Use kzalloc() in the nvram code, flesh out change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We ended up with an ugly conflict between fixes and next in ftrace.h
involving multiple nested ifdefs, and the automatic resolution is
wrong. So merge fixes into next so we can fix it up.
In commit eae5f709a4 ("powerpc: Add __printf verification to
prom_printf") __printf attribute was added to prom_printf(), which
means GCC started warning about type/format mismatches. As part of
that commit we changed some "%lx" formats to "%llx" where the type is
actually unsigned long long.
Unfortunately prom_printf() doesn't know how to print "%llx", it just
prints a literal "lx", eg:
reserved memory map:
lx - lx
lx - lx
prom_printf() also doesn't know how to print "%u" (only "%lu"), it
just prints a literal "u", eg:
Max number of cores passed to firmware: u (NR_CPUS = 2048)
Instead of:
Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
This commit adds support for the missing formatters.
Fixes: eae5f709a4 ("powerpc: Add __printf verification to prom_printf")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Switch VDSO32 build over to use CROSS32_COMPILE directly, and have
it pass in -m32 after the standard c_flags. This allows endianness
overrides to be removed and the endian and bitness flags moved into
standard flags variables.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This merges in the ppc-kvm topic branch of the powerpc repository
to get some changes on which future patches will depend, in particular
some new exports and TEXASR bit definitions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The toc field in the mod_arch_specific struct isn't actually used
anywhere, so remove it.
Also the ftrace-specific fields are now common between 32-bit and
64-bit, so simplify the struct definition a bit by moving them out of
the __powerpc64__ #ifdef.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC:
- Close a hole which could possibly lead to the host timebase getting
out of sync.
- Three fixes relating to PTEs and TLB entries for radix guests.
- Fix a bug which could lead to an interrupt never getting delivered
to the guest, if it is pending for a guest vCPU when the vCPU gets
offlined.
s390:
- Fix false negatives in VSIE validity check (Cc stable)
x86:
- Fix time drift of VMX preemption timer when a guest uses LAPIC timer
in periodic mode (Cc stable)
- Unconditionally expose CPUID.IA32_ARCH_CAPABILITIES to allow
migration from hosts that don't need retpoline mitigation (Cc stable)
- Fix guest crashes on reboot by properly coupling CR4.OSXSAVE and
CPUID.OSXSAVE (Cc stable)
- Report correct RIP after Hyper-V hypercall #UD (introduced in -rc6)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJbCXxHAAoJEED/6hsPKofon5oIAKTwpbpBi0UKIyYcHQ2pwIoP
+qITTZUGGhEaIfe+aDkzE4vxVIA2ywYCbaC2+OSy4gNVThnytRL8WuhLyV8WLmlC
sDVSQ87RWaN8mW6hEJ95qXMS7FS0TsDJdytaw+c8OpODrsykw1XMSyV2rMLb0sMT
SmfioO2kuDx5JQGyiAPKFFXKHjAnnkH+OtffNemAEHGoPpenJ4qLRuXvrjQU8XT6
tVARIBZsutee5ITIsBKVDmI2n98mUoIe9na21M7N2QaJ98IF+qRz5CxZyL1CgvFk
tHqG8PZ/bqhnmuIIR5Di919UmhamOC3MODsKUVeciBLDS6LHlhado+HEpj6B8mI=
=ygB7
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"PPC:
- Close a hole which could possibly lead to the host timebase getting
out of sync.
- Three fixes relating to PTEs and TLB entries for radix guests.
- Fix a bug which could lead to an interrupt never getting delivered
to the guest, if it is pending for a guest vCPU when the vCPU gets
offlined.
s390:
- Fix false negatives in VSIE validity check (Cc stable)
x86:
- Fix time drift of VMX preemption timer when a guest uses LAPIC
timer in periodic mode (Cc stable)
- Unconditionally expose CPUID.IA32_ARCH_CAPABILITIES to allow
migration from hosts that don't need retpoline mitigation (Cc
stable)
- Fix guest crashes on reboot by properly coupling CR4.OSXSAVE and
CPUID.OSXSAVE (Cc stable)
- Report correct RIP after Hyper-V hypercall #UD (introduced in
-rc6)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: fix #UD address of failed Hyper-V hypercalls
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
x86/kvm: fix LAPIC timer drift when guest uses periodic mode
KVM: s390: vsie: fix < 8k check for the itdba
KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path
KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change
KVM: PPC: Book3S HV: Make radix clear pte when unmapping
KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page
KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
Just one fix, to make sure the PCR (Processor Compatibility Register) is reset
on boot. Otherwise if we're running in compat mode in a guest (eg. pretending a
Power9 is a Power8) and the host kernel oopses and kdumps then the kdump
kernel's userspace will be running in Power8 mode, and will SIGILL if it uses
Power9-only instructions.
Thanks to:
Michael Neuling.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbB/PjExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYCK
rhAAh2zSxZfpvDkjvWwqD26yDiDRcVV+rU2CtDiFQk+tAWVFhFqEMJvPWaUU4Ub9
EQStTS4gLiYsY282M+TyFGoUX76aAlMJenaCak8YQmYA4DQ8Sv2LB/jTJBhvbVnC
5ig9aNhSef5aDS7i5r5XzITf+SQ+1U2wddMxzy5Oklku+ypFX3cj9MtNvcD+fJIK
D35vvv5kI2a3wZsE/R7TB91MmfLgWY/4fb4da+dXMVTm+E1gHgOOU7JcnmzEJ/fH
Ds3d8JiIvsFXPCSwCruLiTXtmzQjZUYK3p81ffD7f3qCLBb54PbNBvsFgMYq4h8k
JGNeTqo+gGq3rqqbsftSudHVfG9jleKiJk7Xng3m5iqADHioroq0WTxulZczFd7t
DIl3NcmPnl7kFG/OjO39EDg50+Y/o0+uurjiy3EB9xTtivnOQ0Os+7tvmeqnzL8y
RpXVtZ0Uvf1aBrfXaobz/uNGy4Zy0ZamzdyxfLup7gqtPdAnDVnRak2Cn5bKuqK0
Xsi7/liq6Y7mwtys+iFuCMthh4/wza43VyH+ZYleEXOe6c0hfnuAvQdGcRSDtLMh
arSXYKbJCzwaQXKiTL5VRf/ta51MVi22ghV9/72Cf0wYUMJe0SNDYkhn34C8KlxB
Z5xYHljqRZE/uD8lYeeSdZKMvNEcinEJSpaC645uJVcvkKA=
=lmT9
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Just one fix, to make sure the PCR (Processor Compatibility Register)
is reset on boot.
Otherwise if we're running in compat mode in a guest (eg. pretending a
Power9 is a Power8) and the host kernel oopses and kdumps then the
kdump kernel's userspace will be running in Power8 mode, and will
SIGILL if it uses Power9-only instructions.
Thanks to Michael Neuling"
* tag 'powerpc-4.17-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Clear PCR on boot
The header file <asm/switch_to.h> was missing from the includes. Fix the
following warning, treated as error with W=1:
arch/powerpc/kernel/vecemu.c:260:5: error: no previous prototype for ‘emulate_altivec’ [-Werror=missing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The header file <linux/syscalls.h> was missing from the includes. Fix the
following warning, treated as error with W=1:
arch/powerpc/kernel/pci_32.c:286:6: error: no previous prototype for ‘sys_pciconfig_iobase’ [-Werror=missing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions can all be static, make it so. Fix warnings treated as
errors with W=1:
arch/powerpc/kernel/tau_6xx.c:53:6: error: no previous prototype for ‘set_thresholds’ [-Werror=missing-prototypes]
arch/powerpc/kernel/tau_6xx.c:73:6: error: no previous prototype for ‘TAUupdate’ [-Werror=missing-prototypes]
arch/powerpc/kernel/tau_6xx.c:208:13: error: no previous prototype for ‘TAU_init_smp’ [-Werror=missing-prototypes]
arch/powerpc/kernel/tau_6xx.c:220:12: error: no previous prototype for ‘TAU_init’ [-Werror=missing-prototypes]
arch/powerpc/kernel/tau_6xx.c:126:6: error: no previous prototype for ‘TAUException’ [-Werror=missing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This function can be static, make it so, this fix a warning treated as
error with W=1:
arch/powerpc/kernel/btext.c:173:5: error: no previous prototype for ‘btext_initialize’ [-Werror=missing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some function prototypes and body for Thermal Assist Units were not in
sync. Update the function definition to match the existing function
declaration found in `setup-common.c`, changing an `int` return type to a
`u32` return type. Move the prototypes to a header file. Fix the following
warnings, treated as error with W=1:
arch/powerpc/kernel/tau_6xx.c:257:5: error: no previous prototype for ‘cpu_temp_both’ [-Werror=missing-prototypes]
arch/powerpc/kernel/tau_6xx.c:262:5: error: no previous prototype for ‘cpu_temp’ [-Werror=missing-prototypes]
arch/powerpc/kernel/tau_6xx.c:267:5: error: no previous prototype for ‘tau_interrupts’ [-Werror=missing-prototypes]
Compile tested with CONFIG_TAU_INT.
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit 7a22d6321c ("powerpc/mm/radix: Update command line parsing for
disable_radix") an `if` statement was added for a possible empty body
(prom_debug).
Fix the following warning, treated as error with W=1:
arch/powerpc/kernel/prom_init.c:656:46: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body]
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Trivial fix to remove the following sparse warnings:
arch/powerpc/kernel/module_32.c:112:74: warning: Using plain integer as NULL pointer
arch/powerpc/kernel/module_32.c:117:74: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:1155:28: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:1230:20: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:1385:36: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:1752:23: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:2084:19: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:2110:32: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:2167:19: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:2183:19: warning: Using plain integer as NULL pointer
drivers/macintosh/via-pmu.c:277:20: warning: Using plain integer as NULL pointer
arch/powerpc/platforms/powermac/setup.c:155:67: warning: Using plain integer as NULL pointer
arch/powerpc/platforms/powermac/setup.c:247:27: warning: Using plain integer as NULL pointer
arch/powerpc/platforms/powermac/setup.c:249:27: warning: Using plain integer as NULL pointer
arch/powerpc/platforms/powermac/setup.c:252:37: warning: Using plain integer as NULL pointer
arch/powerpc/mm/tlb_hash32.c:127:21: warning: Using plain integer as NULL pointer
arch/powerpc/mm/tlb_hash32.c:148:21: warning: Using plain integer as NULL pointer
arch/powerpc/mm/tlb_hash32.c:44:21: warning: Using plain integer as NULL pointer
arch/powerpc/mm/tlb_hash32.c:57:21: warning: Using plain integer as NULL pointer
arch/powerpc/mm/tlb_hash32.c:87:21: warning: Using plain integer as NULL pointer
arch/powerpc/kernel/btext.c:160:31: warning: Using plain integer as NULL pointer
arch/powerpc/kernel/btext.c:167:22: warning: Using plain integer as NULL pointer
arch/powerpc/kernel/btext.c:274:21: warning: Using plain integer as NULL pointer
arch/powerpc/kernel/btext.c:285:31: warning: Using plain integer as NULL pointer
arch/powerpc/include/asm/hugetlb.h:204:16: warning: Using plain integer as NULL pointer
arch/powerpc/mm/ppc_mmu_32.c:170:21: warning: Using plain integer as NULL pointer
arch/powerpc/platforms/powermac/pci.c:1227:23: warning: Using plain integer as NULL pointer
arch/powerpc/platforms/powermac/pci.c:65:24: warning: Using plain integer as NULL pointer
Also use `--fix` command line option from `script/checkpatch --strict` to
remove the following:
CHECK: Comparison to NULL could be written "!dispDeviceBase"
#72: FILE: arch/powerpc/kernel/btext.c:160:
+ if (dispDeviceBase == NULL)
CHECK: Comparison to NULL could be written "!vbase"
#80: FILE: arch/powerpc/kernel/btext.c:167:
+ if (vbase == NULL)
CHECK: Comparison to NULL could be written "!base"
#89: FILE: arch/powerpc/kernel/btext.c:274:
+ if (base == NULL)
CHECK: Comparison to NULL could be written "!dispDeviceBase"
#98: FILE: arch/powerpc/kernel/btext.c:285:
+ if (dispDeviceBase == NULL)
CHECK: Comparison to NULL could be written "strstr"
#117: FILE: arch/powerpc/kernel/module_32.c:117:
+ if (strstr(secstrings + sechdrs[i].sh_name, ".debug") != NULL)
CHECK: Comparison to NULL could be written "!Hash"
#130: FILE: arch/powerpc/mm/ppc_mmu_32.c:170:
+ if (Hash == NULL)
CHECK: Comparison to NULL could be written "Hash"
#143: FILE: arch/powerpc/mm/tlb_hash32.c:44:
+ if (Hash != NULL) {
CHECK: Comparison to NULL could be written "!Hash"
#152: FILE: arch/powerpc/mm/tlb_hash32.c:57:
+ if (Hash == NULL) {
CHECK: Comparison to NULL could be written "!Hash"
#161: FILE: arch/powerpc/mm/tlb_hash32.c:87:
+ if (Hash == NULL) {
CHECK: Comparison to NULL could be written "!Hash"
#170: FILE: arch/powerpc/mm/tlb_hash32.c:127:
+ if (Hash == NULL) {
CHECK: Comparison to NULL could be written "!Hash"
#179: FILE: arch/powerpc/mm/tlb_hash32.c:148:
+ if (Hash == NULL) {
ERROR: space required after that ';' (ctx:VxV)
#192: FILE: arch/powerpc/platforms/powermac/pci.c:65:
+ for (; node != NULL;node = node->sibling) {
CHECK: Comparison to NULL could be written "node"
#192: FILE: arch/powerpc/platforms/powermac/pci.c:65:
+ for (; node != NULL;node = node->sibling) {
CHECK: Comparison to NULL could be written "!region"
#201: FILE: arch/powerpc/platforms/powermac/pci.c:1227:
+ if (region == NULL)
CHECK: Comparison to NULL could be written "of_get_property"
#214: FILE: arch/powerpc/platforms/powermac/setup.c:155:
+ if (of_get_property(np, "cache-unified", NULL) != NULL && dc) {
CHECK: Comparison to NULL could be written "!np"
#223: FILE: arch/powerpc/platforms/powermac/setup.c:247:
+ if (np == NULL)
CHECK: Comparison to NULL could be written "np"
#226: FILE: arch/powerpc/platforms/powermac/setup.c:249:
+ if (np != NULL) {
CHECK: Comparison to NULL could be written "l2cr"
#230: FILE: arch/powerpc/platforms/powermac/setup.c:252:
+ if (l2cr != NULL) {
CHECK: Comparison to NULL could be written "via"
#243: FILE: drivers/macintosh/via-pmu.c:277:
+ if (via != NULL)
CHECK: Comparison to NULL could be written "current_req"
#252: FILE: drivers/macintosh/via-pmu.c:1155:
+ if (current_req != NULL) {
CHECK: Comparison to NULL could be written "!req"
#261: FILE: drivers/macintosh/via-pmu.c:1230:
+ if (req == NULL || pmu_state != idle
CHECK: Comparison to NULL could be written "!req"
#270: FILE: drivers/macintosh/via-pmu.c:1385:
+ if (req == NULL) {
CHECK: Comparison to NULL could be written "!pp"
#288: FILE: drivers/macintosh/via-pmu.c:2084:
+ if (pp == NULL)
CHECK: Comparison to NULL could be written "!pp"
#297: FILE: drivers/macintosh/via-pmu.c:2110:
+ if (count < 1 || pp == NULL)
CHECK: Comparison to NULL could be written "!pp"
#306: FILE: drivers/macintosh/via-pmu.c:2167:
+ if (pp == NULL)
CHECK: Comparison to NULL could be written "pp"
#315: FILE: drivers/macintosh/via-pmu.c:2183:
+ if (pp != NULL) {
Link: https://github.com/linuxppc/linux/issues/37
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
__printf is useful to verify format and arguments. Fix arg mismatch
reported by gcc, remove the following warnings (with W=1):
arch/powerpc/kernel/prom_init.c:1467:31: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1471:31: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1504:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1505:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1506:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1507:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1508:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1509:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1975:39: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘unsigned int’
arch/powerpc/kernel/prom_init.c:1986:27: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2567:38: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2567:46: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 3 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2569:38: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2569:46: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 3 has type ‘long unsigned int’
The patch also include arg mismatch fix for case with #define DEBUG_PROM
(warning not listed here).
This patch fix also the following warnings revealed by checkpatch:
WARNING: Prefer using '"%s...", __func__' to using 'alloc_up', this function's name, in a string
#101: FILE: arch/powerpc/kernel/prom_init.c:1235:
+ prom_debug("alloc_up(%lx, %lx)\n", size, align);
and
WARNING: Prefer using '"%s...", __func__' to using 'alloc_down', this function's name, in a string
#138: FILE: arch/powerpc/kernel/prom_init.c:1278:
+ prom_debug("alloc_down(%lx, %lx, %s)\n", size, align,
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
New binutils generate the following warning
AS arch/powerpc/kernel/head_8xx.o
arch/powerpc/kernel/head_8xx.S: Assembler messages:
arch/powerpc/kernel/head_8xx.S:916: Warning: invalid register expression
This patch fixes it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch exports tm_enable()/tm_disable/tm_abort() APIs, which
will be used for PR KVM transactional memory logic.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PR KVM will need to reuse msr_check_and_set().
This patch exports this API for reuse.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On some CPUs we can prevent a vulnerability related to store-to-load
forwarding by preventing store forwarding between privilege domains,
by inserting a barrier in kernel entry and exit paths.
This is known to be the case on at least Power7, Power8 and Power9
powerpc CPUs.
Barriers must be inserted generally before the first load after moving
to a higher privilege, and after the last store before moving to a
lower privilege, HV and PR privilege transitions must be protected.
Barriers are added as patch sections, with all kernel/hypervisor entry
points patched, and the exit points to lower privilge levels patched
similarly to the RFI flush patching.
Firmware advertisement is not implemented yet, so CPU flush types
are hard coded.
Thanks to Michal Suchánek for bug fixes and review.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e2a800beac ("powerpc/hw_brk: Fix off by one error when
validating DAWR region end") we fixed setting the DAWR end point to
its max value via PPC_PTRACE_SETHWDEBUG. Unfortunately we broke
PTRACE_SET_DEBUGREG when setting a 512 byte aligned breakpoint.
PTRACE_SET_DEBUGREG currently sets the length of the breakpoint to
zero (memset() in hw_breakpoint_init()). This worked with
arch_validate_hwbkpt_settings() before the above patch was applied but
is now broken if the breakpoint is 512byte aligned.
This sets the length of the breakpoint to 8 bytes when using
PTRACE_SET_DEBUGREG.
Fixes: e2a800beac ("powerpc/hw_brk: Fix off by one error when validating DAWR region end")
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back when we first introduced the DAWR, in commit 4ae7ebe952
("powerpc: Change hardware breakpoint to allow longer ranges"), we
screwed up the constraint making it a 1024 byte boundary rather than a
512. This makes the check overly permissive. Fortunately GDB is the
only real user and it always did they right thing, so we never
noticed.
This fixes the constraint to 512 bytes.
Fixes: 4ae7ebe952 ("powerpc: Change hardware breakpoint to allow longer ranges")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Trivial fix to spelling mistake in battery_charging array.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Clear the PCR (Processor Compatibility Register) on boot to ensure we
are not running in a compatibility mode.
We've seen this cause problems when a crash (and kdump) occurs while
running compat mode guests. The kdump kernel then runs with the PCR
set and causes problems. The symptom in the kdump kernel (also seen in
petitboot after fast-reboot) is early userspace programs taking
sigills on newer instructions (seen in libc).
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves nip/ctr/lr/xer registers from scattered places in
kvm_vcpu_arch to pt_regs structure.
cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
It will need more consideration and may move in later patches.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Current regs are scattered at kvm_vcpu_arch structure and it will
be more neat to organize them into pt_regs structure.
Also it will enable reimplementation of MMIO emulation code with
analyse_instr() later.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the ppc-kvm topic branch of the powerpc repository
to get some changes on which future patches will depend, in particular
the definitions of various new TLB flushing functions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
in order to use multiple word instructions in functions entry/exit.
The patch does the same for the asm parts, for consistency.
On processors like the 8xx on which insn fetching is pretty slow,
this speeds up registers save/restore.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: PPC32 is BE only, so drop the endian checks]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Doing the test at exit of the function avoids an unnecessary
test and branch inside longjmp().
Semantics are unchanged.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This requires further changes to linker script to KEEP some tables
and wildcard compiler generated sections into the right place. This
includes pp32 modifications from Christophe Leroy.
When compiling powernv_defconfig with this option, the resulting
kernel is almost 400kB smaller (and still boots):
text data bss dec filename
11827621 4810490 1341080 17979191 vmlinux
11752437 4598858 1338776 17690071 vmlinux.dcde
Mathieu's numbers for custom Mac Mini G4 config has almost 200kB
saving. It also had some increase in vmlinux size for as-yet
unknown reasons.
text data bss dec filename
7461457 2475122 1428064 11364643 vmlinux
7386425 2364370 1425432 11176227 vmlinux.dcde
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> [8xx]
Tested-by: Mathieu Malaterre <malat@debian.org> [32-bit powermac]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Currently, the HV KVM guest entry/exit code adds the timebase offset
from the vcore struct to the timebase on guest entry, and subtracts
it on guest exit. Which is fine, except that it is possible for
userspace to change the offset using the SET_ONE_REG interface while
the vcore is running, as there is only one timebase offset per vcore
but potentially multiple VCPUs in the vcore. If that were to happen,
KVM would subtract a different offset on guest exit from that which
it had added on guest entry, leading to the timebase being out of sync
between cores in the host, which then leads to bad things happening
such as hangs and spurious watchdog timeouts.
To fix this, we add a new field 'tb_offset_applied' to the vcore struct
which stores the offset that is currently applied to the timebase.
This value is set from the vcore tb_offset field on guest entry, and
is what is subtracted from the timebase on guest exit. Since it is
zero when the timebase offset is not applied, we can simplify the
logic in kvmhv_start_timing and kvmhv_accumulate_time.
In addition, we had secondary threads reading the timebase while
running concurrently with code on the primary thread which would
eventually add or subtract the timebase offset from the timebase.
This occurred while saving or restoring the DEC register value on
the secondary threads. Although no specific incorrect behaviour has
been observed, this is a race which should be fixed. To fix it, we
move the DEC saving code to just before we call kvmhv_commence_exit,
and the DEC restoring code to after the point where we have waited
for the primary thread to switch the MMU context and add the timebase
offset. That way we are sure that the timebase contains the guest
timebase value in both cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Directly use fault_in_pages_readable instead of manual __get_user code. Fix
warning treated as error with W=1:
arch/powerpc/kernel/kvm.c:675:6: error: variable ‘tmp’ set but not used [-Werror=unused-but-set-variable]
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
In commit e6a6928c3e ("of/fdt: Convert FDT functions to use
libfdt") (Apr 2014), the generic flat device tree code dropped support
for flat device tree's older than version 0x10 (16).
We still have code in our CPU scanning to cope with flat device tree
versions earlier than 2, which can now never trigger, so drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the systbl_chk.sh checks fail we print a message, but with no
indication that it's an error. That makes it hard to find in build
logs with eg. grep.
So prefix any output with "Error:".
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
it had always been pointless - compat_sys_select() sign-extends
the first argument just fine on its own.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mpe: Use COMPAT_SPU_NEW() to keep systbl_chk.sh happy]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the select system call is wired up with the SYSX_SPU()
macro. The SYSX_SPU() is not handled by systbl_chk.c, which means the
syscall number for select is not checked.
That hides the fact that the syscall number for select is actually
__NR__newselect not __NR_select.
In a following patch we'd like to drop ppc32_select() which means
select will become a regular COMPAT_SYS_SPU() syscall. But
COMPAT_SYS_SPU() can't deal with the fact that the syscall number is
actually __NR__newselect. We also can't just redefine __NR_select
because that's still used for the old select call.
So add a new COMPAT_NEW_SPU() that does the same thing as
COMPAT_SYS_SPU() except it encodes that we're using the new number.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mpe: Fix sys_debug_setcontext() prototype to return long]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "Power Architecture 64-Bit ELF V2 ABI" says in section 2.3.2.3:
[...] There are several rules that must be adhered to in order to ensure
reliable and consistent call chain backtracing:
* Before a function calls any other function, it shall establish its
own stack frame, whose size shall be a multiple of 16 bytes.
– In instances where a function’s prologue creates a stack frame, the
back-chain word of the stack frame shall be updated atomically with
the value of the stack pointer (r1) when a back chain is implemented.
(This must be supported as default by all ELF V2 ABI-compliant
environments.)
[...]
– The function shall save the link register that contains its return
address in the LR save doubleword of its caller’s stack frame before
calling another function.
To me this sounds like the equivalent of HAVE_RELIABLE_STACKTRACE.
This patch may be unneccessarily limited to ppc64le, but OTOH the only
user of this flag so far is livepatching, which is only implemented on
PPCs with 64-LE, a.k.a. ELF ABI v2.
Feel free to add other ppc variants, but so far only ppc64le got tested.
This change also implements save_stack_trace_tsk_reliable() for ppc64le
that checks for the above conditions, where possible.
Signed-off-by: Torsten Duwe <duwe@suse.de>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Provide timebase and timebase of last heartbeat in watchdog lockup
messages. Also provide a stack trace of when a CPU becomes un-stuck,
which can be useful -- it could be where irqs are re-enabled, so it
may be the end of the critical section which is responsible for the
latency which is useful information.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The watchdog heartbeat timestamp is updated when the local heartbeat
timer fires (or touch_nmi_watchdog() is called).
This is an interesting data point, so don't overwrite it when the
soft-NMI interrupt detects a hard lockup. That code came from a pre-
merge version to prevent hard lockup messages flood, but that's taken
care of with the stuck CPU logic now, so there is no reason to
update the heartbeat timestamp here.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The kexec_state KEXEC_STATE_IRQS_OFF barrier is reached by all
secondary CPUs before the kexec_cpu_down() operation is called on
secondaries. This can raise conflicts and provoque errors in the XIVE
hcalls when XIVE is shutdown with H_INT_RESET on the primary CPU.
To synchronize the kexec_cpu_down() operations and make sure the
secondaries have completed their task before the primary starts doing
the same, let's move the primary kexec_cpu_down() after the
KEXEC_STATE_REAL_MODE barrier.
This change of the ending sequence of kexec is mostly useful on the
pseries platform but it impacts also the powernv, ps3 and 85xx
platforms. powernv can be easily tested and fixed but some caution is
required for the other two.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Most mainstream architectures are using 65536 entries, so lets stick to
that. If someone is really desperate to override it that can still be
done through <asm/dma-mapping.h>, but I'd rather see a really good
rationale for that.
dma_debug_init is now called as a core_initcall, which for many
architectures means much earlier, and provides dma-debug functionality
earlier in the boot process. This should be safe as it only relies
on the memory allocator already being available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Unregister fadump on kexec down path otherwise the fadump registration
in new kexec-ed kernel complains that fadump is already registered.
This makes new kernel to continue using fadump registered by previous
kernel which may lead to invalid vmcore generation. Hence this patch
fixes this issue by un-registering fadump in fadump_cleanup() which is
called during kexec path so that new kernel can register fadump with
new valid values.
Fixes: b500afff11 ("fadump: Invalidate registration and release reserved memory for general use.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
FADump capture kernel boots in restricted memory environment preserving
the context of previous kernel to save vmcore. Supporting hugepages in
such environment makes things unnecessarily complicated, as hugepages
need memory set aside for them. This means most of the capture kernel's
memory is used in supporting hugepages. In most cases, this results in
out-of-memory issues while booting FADump capture kernel. But hugepages
are not of much use in capture kernel whose only job is to save vmcore.
So, disabling hugepages support, when fadump is active, is a reliable
solution for the out of memory issues. Introducing a flag variable to
disable HugeTLB support when fadump is active.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The second kernel, during early boot after the crash, reserves rest of
the memory above boot memory size to make sure it does not touch any of the
dump memory area. It uses memblock_reserve() that reserves the specified
memory region irrespective of memory holes present within that region.
There are chances where previous kernel would have hot removed some of
its memory leaving memory holes behind. In such cases fadump kernel reports
incorrect number of reserved pages through arch_reserved_kernel_pages()
hook causing kernel to hang or panic.
Fix this by excluding memory holes while reserving rest of the memory
above boot memory size during second kernel boot after crash.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We've had dynamic ftrace support for over 9 years since Steve first
wrote it, all the distros use dynamic, and static is basically
untested these days, so drop support for static ftrace.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With -mprofile-kernel, we always save the full register state in
ftrace_caller(). While this works, this is inefficient if we're not
interested in the register state, such as when we're using the function
tracer.
Rename the existing ftrace_caller() as ftrace_regs_caller() and provide
a simpler implementation for ftrace_caller() that is used when registers
are not required to be saved.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Our implementation matches that of the generic version, which also
handles FTRACE_UPDATE_MODIFY_CALL. So, remove our implementation in
favor of the generic version.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For R_PPC64_REL24 relocations, we suppress emitting instructions for TOC
load/restore in the relocation stub if the relocation is for _mcount()
call when using -mprofile-kernel ABI.
To detect this, we check if the preceding instructions are per the
standard set of instructions emitted by gcc: either the two instruction
sequence of 'mflr r0; std r0,16(r1)', or the more optimized variant of a
single 'mflr r0'. This is not sufficient since nothing prevents users
from hand coding sequences involving a 'mflr r0' followed by a 'bl'.
For removing the toc save instruction from the stub, we additionally
check if the symbol is "_mcount". Add the same check here as well.
Also rename is_early_mcount_callsite() to is_mprofile_mcount_callsite()
since that is what is being checked. The use of "early" is misleading
since there is nothing involving this function that qualifies as early.
Fixes: 153086644f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If function_graph tracer is enabled during kexec, we see the below
exception in the simulator:
root@(none):/# kexec -e
kvm: exiting hardware virtualization
kexec_core: Starting new kernel
[ 19.262020070,5] OPAL: Switch to big-endian OS
kexec: Starting switchover sequence.
Interrupt to 0xC000000000004380 from 0xC000000000004380
** Execution stopped: Continuous Interrupt, Instruction caused exception, **
Now that we have a more effective way to completely disable ftrace on
ppc64, let's also use that before switching to a new kernel during
kexec.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Disable ftrace when a cpu is about to go offline. When the cpu is woken
up, ftrace will get enabled in start_secondary().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On the boot cpu, though we enable paca->ftrace_enabled in early_setup()
(via cpu_ready_for_interrupts()), we don't start tracing until much
later since ftrace is not initialized yet and since we only support
DYNAMIC_FTRACE on powerpc. However, it is possible that ftrace has been
initialized by the time some of the secondary cpus start up. In this
case, we will try to trace some of the early boot code which can cause
problems.
To address this, move setting paca->ftrace_enabled from
cpu_ready_for_interrupts() to early_setup() for the boot cpu, and towards
the end of start_secondary() for secondary cpus.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have some C code that we call into from real mode where we cannot
take any exceptions. Though the C functions themselves are mostly safe,
if these functions are traced, there is a possibility that we may take
an exception. For instance, in certain conditions, the ftrace code uses
WARN(), which uses a 'trap' to do its job.
For such scenarios, introduce a new field in paca 'ftrace_enabled',
which is checked on ftrace entry before continuing. This field can then
be set to zero to disable/pause ftrace, and set to a non-zero value to
resume ftrace.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
smp_send_stop can lock up the IPI path for any subsequent calls,
because the receiving CPUs spin in their handler function. This
started becoming a problem with the addition of an smp_send_stop
call in the reboot path, because panics can reboot after doing
their own smp_send_stop.
The NMI IPI variant was fixed with ac61c11566 ("powerpc: Fix
smp_send_stop NMI IPI handling"), which leaves the smp_call_function
variant.
This is fixed by having smp_send_stop only ever do the
smp_call_function once. This is a bit less robust than the NMI IPI
fix, because any other call to smp_call_function after smp_send_stop
could deadlock, but that has always been the case, and it was not
been a problem before.
Fixes: f2748bdfe1 ("powerpc/powernv: Always stop secondaries before reboot/shutdown")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Using an si_code of 0 that aliases with SI_USER is clearly the wrong
thing todo, and causes problems in interesting ways.
For use in unknown_exception the recently defined TRAP_UNK
semantically is a perfect fit. For use in RunModeException it looks
like something more specific than TRAP_UNK could be used. No one has
bothered to find a better fit than the broken si_code of 0 in all of
these years and I don't see an obvious better fit so TRAP_UNK is
switching RunModeException to return TRAP_UNK is clearly an
improvement.
Recent history suggests no actually cares about crazy corner
cases of the kernel behavior like this so I don't expect any
regressions from changing this. However if something does
happen this change is easy to revert.
Though I wonder if SIGKILL might not be a better fit.
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kumar Gala <kumar.gala@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Fixes: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Using an si_code of 0 that aliases with SI_USER is clearly the
wrong thing todo, and causes problems in interesting ways.
The newly defined FPE_FLTUNK semantically appears to fit the
bill so use it instead.
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kumar Gala <kumar.gala@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Fixes: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.
Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.
The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.
In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo siginfo is not used on other paths in the function
in which it is declared.
Instances of using memset to initialize siginfo have been replaced
with calls clear_siginfo for clarity.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
over the handler function call, which causes later smp_send_nmi_ipi()
callers to spin until the call is finished.
The stop_this_cpu() function never returns, so the busy count is never
decremeted, which can cause the system to hang in some cases. For
example panic() will call smp_send_stop() early on which calls
stop_this_cpu() on other CPUs, then later in the reboot path,
pnv_restart() will call smp_send_stop() again, which hangs.
Fix this by adding a special case to the stop_this_cpu() handler to
decrement the busy count, because it will never return.
Now that the NMI/non-NMI versions of stop_this_cpu() are different,
split them out into separate functions rather than doing #ifdef tricks
to share the body between the two functions.
Fixes: 6bed323762 ("powerpc: use NMI IPI for smp_send_stop")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out the functions, tweak change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful
extraction of physical address it wrongly sets "handled = 1" which
means this UE error has been recovered. Since MCE handler gets return
value as handled = 1, it assumes that error has been recovered and
goes back to same NIP. This causes MCE interrupt again and again in a
loop leading to hard lockup.
Also, initialize phys_addr to ULONG_MAX so that we don't end up
queuing undesired page to hwpoison.
Without this patch we see:
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
...
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
...
Watchdog CPU:38 Hard LOCKUP
After this patch we see:
Severe Machine check interrupt [Not recovered]
NIP: [00007fffaae585f4] PID: 7168 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffaafe28ac
Physical address: 00002017c0bd0000
find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
Fixes: 01eaac2b05 ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1cc ("powerpc/mce: Hookup derror (load/store) UE errors")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When running KVM guests on Power8 we can see a lockup where one CPU
stops responding. This often leads to a message such as:
watchdog: CPU 136 detected hard LOCKUP on other CPUs 72
Task dump for CPU 72:
qemu-system-ppc R running task 10560 20917 20908 0x00040004
And then backtraces on other CPUs, such as:
Task dump for CPU 48:
ksmd R running task 10032 1519 2 0x00000804
Call Trace:
...
--- interrupt: 901 at smp_call_function_many+0x3c8/0x460
LR = smp_call_function_many+0x37c/0x460
pmdp_invalidate+0x100/0x1b0
__split_huge_pmd+0x52c/0xdb0
try_to_unmap_one+0x764/0x8b0
rmap_walk_anon+0x15c/0x370
try_to_unmap+0xb4/0x170
split_huge_page_to_list+0x148/0xa30
try_to_merge_one_page+0xc8/0x990
try_to_merge_with_ksm_page+0x74/0xf0
ksm_scan_thread+0x10ec/0x1ac0
kthread+0x160/0x1a0
ret_from_kernel_thread+0x5c/0x78
This is caused by commit 8c1c7fb0b5 ("powerpc/64s/idle: avoid sync
for KVM state when waking from idle"), which added a check in
pnv_powersave_wakeup() to see if the kvm_hstate.hwthread_state is
already set to KVM_HWTHREAD_IN_KERNEL, and if so to skip the store and
test of kvm_hstate.hwthread_req.
The problem is that the primary does not set KVM_HWTHREAD_IN_KVM when
entering the guest, so it can then come out to cede with
KVM_HWTHREAD_IN_KERNEL set. It can then go idle in kvm_do_nap after
setting hwthread_req to 1, but because hwthread_state is still
KVM_HWTHREAD_IN_KERNEL we will skip the test of hwthread_req when we
wake up from idle and won't go to kvm_start_guest. From there the
thread will return somewhere garbage and crash.
Fix it by skipping the store of hwthread_state, but not the test of
hwthread_req, when coming out of idle. It's OK to skip the sync in
that case because hwthread_req will have been set on the same thread,
so there is no synchronisation required.
Fixes: 8c1c7fb0b5 ("powerpc/64s/idle: avoid sync for KVM state when waking from idle")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On boot we save the configuration space of PCIe bridges. We do this so
when we get an EEH event and everything gets reset that we can restore
them.
Unfortunately we save this state before we've enabled the MMIO space
on the bridges. Hence if we have to reset the bridge when we come back
MMIO is not enabled and we end up taking an PE freeze when the driver
starts accessing again.
This patch forces the memory/MMIO and bus mastering on when restoring
bridges on EEH. Ideally we'd do this correctly by saving the
configuration space writes later, but that will have to come later in
a larger EEH rewrite. For now we have this simple fix.
The original bug can be triggered on a boston machine by doing:
echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_outbound
On boston, this PHB has a PCIe switch on it. Without this patch,
you'll see two EEH events, 1 expected and 1 the failure we are fixing
here. The second EEH event causes the anything under the PHB to
disappear (i.e. the i40e eth).
With this patch, only 1 EEH event occurs and devices properly recover.
Fixes: 652defed48 ("powerpc/eeh: Check PCIe link after reset")
Cc: stable@vger.kernel.org # v3.11+
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If there is no d-cache-size property in the device tree, l1d_size could
be zero. We don't actually expect that to happen, it's only been seen
on mambo (simulator) in some configurations.
A zero-size l1d_size leads to the loop in the asm wrapping around to
2^64-1, and then walking off the end of the fallback area and
eventually causing a page fault which is fatal.
Just default to 64K which is correct on some CPUs, and sane enough to
not cause a crash on others.
Fixes: aa8a5e0062 ('powerpc/64s: Add support for RFI flush of L1-D cache')
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Rewrite comment and change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Fix crashes when loading modules built with a different CONFIG_RELOCATABLE
value by adding CONFIG_RELOCATABLE to vermagic.
- Fix busy loops in the OPAL NVRAM driver if we get certain error conditions
from firmware.
- Remove tlbie trace points from KVM code that's called in real mode, because
it causes crashes.
- Fix checkstops caused by invalid tlbiel on Power9 Radix.
- Ensure the set of CPU features we "know" are always enabled is actually the
minimal set when we build with support for firmware supplied CPU features.
Thanks to:
Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJa0oWBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAV
EA//UQB7n33HlroMBL3841VswUrIgY36gSi9+QZPjxHiTYGVStI+u0FHq5hm8OMm
1FkronD0sfbN7t8LJ9FS6vCoAlW15vMBj95pWJXAL7NuQneO7cM5JvRbowVKbNbq
GESn4EkUj9bAXl7aTzX2yA7jruzMJIwE/2bgl1J6YkpByHx5x/fgzKTuoCRggQqY
Xd656ZZyWR/uIuWhAjCPFdrPnekVvo7/Gy5Yo5kcR6LbX4MO0JFNOHpiT3wPGGv/
LJedJtdpH1biMk7SrqLfWD7mpMsVJeMelpkftEbe8zUXf17wBd1rl4JkybLTDwZD
HznZ0nbnh0j5Q/KRHwOh0sLDSisJF6pQHH/Bnc4Xqrsn4OERwEk40/Ym/O0kDsjH
n2F3X5a5QLWoBWRsdV3BMyEzc0bgnb++tXeBWOV4jJBEOhoqxwimCU+3fmLiy95A
urJYf8Sljwve9I5kJyXKpruopgeoJ1aumRwopE8YCG9lfBgzvU/90u6+uKqAdkOv
kpZxS5FDT2lNIwbCP9WjEelAOGrtA6JQNWU0iPGwsVAvAxX/p1RwwbQeWVlWs3Ek
UdVF8eGezClpLPa/FQFdDyXfGE6WIf1vK9YnG7k3rUwwkSqoYxKgVQ3o1Vetp0Lg
fzPsDdDi2V2I3KDYNav8s6kohAIIS/a9XYk4BGvT/htRnUg=
=0e7c
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix crashes when loading modules built with a different
CONFIG_RELOCATABLE value by adding CONFIG_RELOCATABLE to vermagic.
- Fix busy loops in the OPAL NVRAM driver if we get certain error
conditions from firmware.
- Remove tlbie trace points from KVM code that's called in real mode,
because it causes crashes.
- Fix checkstops caused by invalid tlbiel on Power9 Radix.
- Ensure the set of CPU features we "know" are always enabled is
actually the minimal set when we build with support for firmware
supplied CPU features.
Thanks to: Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.
* tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features
powerpc/mm/radix: Fix checkstops caused by invalid tlbiel
KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
powerpc/8xx: Fix build with hugetlbfs enabled
powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
powerpc/fscr: Enable interrupts earlier before calling get_user()
powerpc/64s: Fix section mismatch warnings from setup_rfi_flush()
powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic
For s390 new kernels are loaded to fixed addresses in memory before they
are booted. With the current code this is a problem as it assumes the
kernel will be loaded to an 'arbitrary' address. In particular,
kexec_locate_mem_hole searches for a large enough memory region and sets
the load address (kexec_bufer->mem) to it.
Luckily there is a simple workaround for this problem. By returning 1
in arch_kexec_walk_mem, kexec_locate_mem_hole is turned off. This
allows the architecture to set kbuf->mem by hand. While the trick works
fine for the kernel it does not for the purgatory as here the
architectures don't have access to its kexec_buffer.
Give architectures access to the purgatories kexec_buffer by changing
kexec_load_purgatory to take a pointer to it. With this change
architectures have access to the buffer and can edit it as they need.
A nice side effect of this change is that we can get rid of the
purgatory_info->purgatory_load_address field. As now the information
stored there can directly be accessed from kbuf->mem.
Link: http://lkml.kernel.org/r/20180321112751.22196-11-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As arch_kexec_kernel_image_{probe,load}(),
arch_kimage_file_post_load_cleanup() and arch_kexec_kernel_verify_sig()
are almost duplicated among architectures, they can be commonalized with
an architecture-defined kexec_file_ops array. So let's factor them out.
Link: http://lkml.kernel.org/r/20180306102303.9063-3-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cpu_has_feature() mechanism has an optimisation where at build
time we construct a mask of the CPU feature bits that will always be
true for the given .config, based on the platform/bitness/etc. that we
are building for.
That is incompatible with DT CPU features, where the set of CPU
features is dependent on feature flags that are given to us by
firmware.
The result is that some feature bits can not be *disabled* by DT CPU
features. Or more accurately, they can be disabled but they will still
appear in the ALWAYS mask, meaning cpu_has_feature() will always
return true for them.
In the past this hasn't really been a problem because on Book3S
64 (where we support DT CPU features), the set of ALWAYS bits has been
very small. That was because we always built for POWER4 and later,
meaning the set of common bits was small.
The only bit that could be cleared by DT CPU features that was also in
the ALWAYS mask was CPU_FTR_NODSISRALIGN, and that was only used in
the alignment handler to create a fake DSISR. That code was itself
deleted in 31bfdb036f ("powerpc: Use instruction emulation
infrastructure to handle alignment faults") (Sep 2017).
However the set of ALWAYS features changed with the recent commit
db5ae1c155 ("powerpc/64s: Refine feature sets for little endian
builds") which restricted the set of feature flags when building
little endian to Power7 or later. That caused the ALWAYS mask to
become much larger for little endian builds.
The result is that the following feature bits can currently not
be *disabled* by DT CPU features:
CPU_FTR_REAL_LE, CPU_FTR_MMCRA, CPU_FTR_CTRL, CPU_FTR_SMT,
CPU_FTR_PURR, CPU_FTR_SPURR, CPU_FTR_DSCR, CPU_FTR_PKEY,
CPU_FTR_VMX_COPY, CPU_FTR_CFAR, CPU_FTR_HAS_PPR.
To fix it we need to mask the set of ALWAYS features with the base set
of DT CPU features, ie. the features that are always enabled by DT CPU
features. That way there are no bits in the ALWAYS mask that are not
also always set by DT CPU features.
Fixes: db5ae1c155 ("powerpc/64s: Refine feature sets for little endian builds")
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function get_user() can sleep while trying to fetch instruction
from user address space and causes the following warning from the
scheduler.
BUG: sleeping function called from invalid context
Though interrupts get enabled back but it happens bit later after
get_user() is called. This change moves enabling these interrupts
earlier covering the function get_user(). While at this, lets check
for kernel mode and crash as this interrupt should not have been
triggered from the kernel context.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The recent LPM changes to setup_rfi_flush() are causing some section
mismatch warnings because we removed the __init annotation on
setup_rfi_flush():
The function setup_rfi_flush() references
the function __init ppc64_bolted_size().
the function __init memblock_alloc_base().
The references are actually in init_fallback_flush(), but that is
inlined into setup_rfi_flush().
These references are safe because:
- only pseries calls setup_rfi_flush() at runtime
- pseries always passes L1D_FLUSH_FALLBACK at boot
- so the fallback flush area will always be allocated
- so the check in init_fallback_flush() will always return early:
/* Only allocate the fallback flush area once (at boot time). */
if (l1d_flush_fallback_area)
return;
- and therefore we won't actually call the freed init routines.
We should rework the code to make it safer by default rather than
relying on the above, but for now as a quick-fix just add a __ref
annotation to squash the warning.
Fixes: abf110f3e1 ("powerpc/rfi-flush: Make it possible to call setup_rfi_flush() again")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Notable changes:
- Support for 4PB user address space on 64-bit, opt-in via mmap().
- Removal of POWER4 support, which was accidentally broken in 2016 and no one
noticed, and blocked use of some modern instructions.
- Workarounds so that the hypervisor can enable Transactional Memory on Power9.
- A series to disable the DAWR (Data Address Watchpoint Register) on Power9.
- More information displayed in the meltdown/spectre_v1/v2 sysfs files.
- A vpermxor (Power8 Altivec) implementation for the raid6 Q Syndrome.
- A big series to make the allocation of our pacas (per cpu area), kernel page
tables, and per-cpu stacks NUMA aware when using the Radix MMU on Power9.
And as usual many fixes, reworks and cleanups.
Thanks to:
Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy, Alistair Popple, Andy
Shevchenko, Aneesh Kumar K.V, Anshuman Khandual, Balbir Singh, Benjamin
Herrenschmidt, Christophe Leroy, Christophe Lombard, Cyril Bur, Daniel Axtens,
Dave Young, Finn Thain, Frederic Barrat, Gustavo Romero, Horia Geantă,
Jonathan Neuschäfer, Kees Cook, Larry Finger, Laurent Dufour, Laurent Vivier,
Logan Gunthorpe, Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus
Elfring, Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de
Oliveira, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher Boessenkool,
Simon Guo, Simon Horman, Stewart Smith, Sukadev Bhattiprolu, Suraj Jitindar
Singh, Thiago Jung Bauermann, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant
Hegde, Wei Yongjun.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJayKxDExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAr
JQ/6A9Xs4zHDn9OeT9esEIxciETqUlrP0Wp64c4JVC7EkG1E7xRDZ4Xb4m8R2nNt
9sPhtNO1yCtEk6kFQtPNB0N8v6pud4I6+aMcYnn+tP8mJRYQ4x9bYaF3Hw98IKmE
Kd6TglmsUQvh2GpwPiF93KpzzWu1HB2kZzzqJcAMTMh7C79Qz00BjrTJltzXB2jx
tJ+B4lVy8BeU8G5nDAzJEEwb5Ypkn8O40rS/lpAwVTYOBJ8Rbyq8Fj82FeREK9YO
4EGaEKPkC/FdzX7OJV3v2/nldCd8pzV471fAoGuBUhJiJBMBoBybcTHIdDex7LlL
zMLV1mUtGo8iolRPhL8iCH+GGifZz2WzstYCozz7hgIraWtc/frq9rZp6q0LdH/K
trk7UbPGlVb92ecWZVpZyEcsMzKrCgZqnAe9wRNh1uEKScEdzd/bmRaMhENUObRh
Hili6AVvmSKExpy7k2sZP/oUMaeC15/xz8Lk7l8a/iCkYhNmPYh5iSXM5+UKpcRT
FYOcO0o3DwXsN46Whow3nJ7TqAsDy9/ecPUG71JQi3ZrHnRrm8jxkn8MCG5pZ1Fi
KvKDxlg6RiJo3DF9/fSOpJUokvMwqBS5dJo4eh5eiDy94aBTqmBKFecvPxQm7a0L
l3uXCF/6JuXEvMukFjGBO4RiYhw8i+B2uKsh81XUh7HKrgE=
=HAB1
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for 4PB user address space on 64-bit, opt-in via mmap().
- Removal of POWER4 support, which was accidentally broken in 2016
and no one noticed, and blocked use of some modern instructions.
- Workarounds so that the hypervisor can enable Transactional Memory
on Power9.
- A series to disable the DAWR (Data Address Watchpoint Register) on
Power9.
- More information displayed in the meltdown/spectre_v1/v2 sysfs
files.
- A vpermxor (Power8 Altivec) implementation for the raid6 Q
Syndrome.
- A big series to make the allocation of our pacas (per cpu area),
kernel page tables, and per-cpu stacks NUMA aware when using the
Radix MMU on Power9.
And as usual many fixes, reworks and cleanups.
Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy,
Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual,
Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic
Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook,
Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe,
Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring,
Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira,
Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher
Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev
Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav
Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun"
* tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits)
powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep
powerpc/64s: Fix POWER9 DD2.2 and above in cputable features
powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits
Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead"
powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo}
powerpc: io.h: move iomap.h include so that it can use readq/writeq defs
cxl: Fix possible deadlock when processing page faults from cxllib
powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
powerpc/mm/radix: Update command line parsing for disable_radix
powerpc/mm/radix: Parse disable_radix commandline correctly.
powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
powerpc/mm/keys: Update documentation and remove unnecessary check
powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead
powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop()
powerpc/powernv: Always stop secondaries before reboot/shutdown
powerpc: hard disable irqs in smp_send_stop loop
powerpc: use NMI IPI for smp_send_stop
powerpc/powernv: Fix SMT4 forcing idle code
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlrHeY8UHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxhLRAAndV/0NDyWZU0eZNM6twri2SEFnF7
E4ar+YthxDxxJG4TLJbIA12jc5NgHZy4WuttDa6Jb99KreBXIHJFlNi/V/tme6zf
+yXUuxWae7wJzBiaay57VqLGSc80gt/LTgjLa1siwQqjTbO3wSXR6JJXNaE9FtQ4
/jL61t8bD1Peb5cWTpt9p0hrnKI0/pHwASdReyFS4F/HDKdvpof7BxE/OU3HSxxA
XKC2v6RjY4S93vkzvApDXQ+vhKquVRK7/ojyTXQUO/GIzcARprO7H4k62N4ar0x/
qbXLkR8IMkwA8ecsNmcL92ftb/cXoHfd+wdK8WpijqzF4kW4SdteVWbIhUzI0gbr
0gjDYIzjplvH3pZGv/qvx+8sFtAP95OdPjuAAW2qJ9TCVfmiS8naNFCvcxg87RhD
gjyQD3If1X7F8wy309lhq7VNyRexTHgIMgTXHyFvuZMzn/Qe1huL2XCwDcEAg/OX
AvU2iuSE5tWAh7gIUMF/aWi3uoeJUyyoru5ZR//gqdFfx9YxpSimO1UDXnpPi8SR
Iz/jzHJc0aWGYdQ9l6HiSbJF3P/QQcWYs9igt0A7BRGB05SPdWCh7sSO70FJa8ME
f4WID5/qEiaH26kiSRX4cUqpc8Amk8bT0DXw2OT57qy3JM0ZdV5ENQX11pSpr9hv
uLEf0DU7AEmdvzQ=
=T++R
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- move pci_uevent_ers() out of pci.h (Michael Ellerman)
- skip ASPM common clock warning if BIOS already configured it (Sinan
Kaya)
- fix ASPM Coverity warning about threshold_ns (Gustavo A. R. Silva)
- remove last user of pci_get_bus_and_slot() and the function itself
(Sinan Kaya)
- add decoding for 16 GT/s link speed (Jay Fang)
- add interfaces to get max link speed and width (Tal Gilboa)
- add pcie_bandwidth_capable() to compute max supported link bandwidth
(Tal Gilboa)
- add pcie_bandwidth_available() to compute bandwidth available to
device (Tal Gilboa)
- add pcie_print_link_status() to log link speed and whether it's
limited (Tal Gilboa)
- use PCI core interfaces to report when device performance may be
limited by its slot instead of doing it in each driver (Tal Gilboa)
- fix possible cpqphp NULL pointer dereference (Shawn Lin)
- rescan more of the hierarchy on ACPI hotplug to fix Thunderbolt/xHCI
hotplug (Mika Westerberg)
- add support for PCI I/O port space that's neither directly accessible
via CPU in/out instructions nor directly mapped into CPU physical
memory space. This is fairly intrusive and includes minor changes to
interfaces used for I/O space on most platforms (Zhichang Yuan, John
Garry)
- add support for HiSilicon Hip06/Hip07 LPC I/O space (Zhichang Yuan,
John Garry)
- use PCI_EXP_DEVCTL2_COMP_TIMEOUT in rapidio/tsi721 (Bjorn Helgaas)
- remove possible NULL pointer dereference in of_pci_bus_find_domain_nr()
(Shawn Lin)
- report quirk timings with dev_info (Bjorn Helgaas)
- report quirks that take longer than 10ms (Bjorn Helgaas)
- add and use Altera Vendor ID (Johannes Thumshirn)
- tidy Makefiles and comments (Bjorn Helgaas)
- don't set up INTx if MSI or MSI-X is enabled to align cris, frv,
ia64, and mn10300 with x86 (Bjorn Helgaas)
- move pcieport_if.h to drivers/pci/pcie/ to encapsulate it (Frederick
Lawler)
- merge pcieport_if.h into portdrv.h (Bjorn Helgaas)
- move workaround for BIOS PME issue from portdrv to PCI core (Bjorn
Helgaas)
- completely disable portdrv with "pcie_ports=compat" (Bjorn Helgaas)
- remove portdrv link order dependency (Bjorn Helgaas)
- remove support for unused VC portdrv service (Bjorn Helgaas)
- simplify portdrv feature permission checking (Bjorn Helgaas)
- remove "pcie_hp=nomsi" parameter (use "pci=nomsi" instead) (Bjorn
Helgaas)
- remove unnecessary "pcie_ports=auto" parameter (Bjorn Helgaas)
- use cached AER capability offset (Frederick Lawler)
- don't enable DPC if BIOS hasn't granted AER control (Mika Westerberg)
- rename pcie-dpc.c to dpc.c (Bjorn Helgaas)
- use generic pci_mmap_resource_range() instead of powerpc and xtensa
arch-specific versions (David Woodhouse)
- support arbitrary PCI host bridge offsets on sparc (Yinghai Lu)
- remove System and Video ROM reservations on sparc (Bjorn Helgaas)
- probe for device reset support during enumeration instead of runtime
(Bjorn Helgaas)
- add ACS quirk for Ampere (née APM) root ports (Feng Kan)
- add function 1 DMA alias quirk for Marvell 88SE9220 (Thomas
Vincent-Cross)
- protect device restore with device lock (Sinan Kaya)
- handle failure of FLR gracefully (Sinan Kaya)
- handle CRS (config retry status) after device resets (Sinan Kaya)
- skip various config reads for SR-IOV VFs as an optimization
(KarimAllah Ahmed)
- consolidate VPD code in vpd.c (Bjorn Helgaas)
- add Tegra dependency on PCI_MSI_IRQ_DOMAIN (Arnd Bergmann)
- add DT support for R-Car r8a7743 (Biju Das)
- fix a PCI_EJECT vs PCI_BUS_RELATIONS race condition in Hyper-V host
bridge driver that causes a general protection fault (Dexuan Cui)
- fix Hyper-V host bridge hang in MSI setup on 1-vCPU VMs with SR-IOV
(Dexuan Cui)
- fix Hyper-V host bridge hang when ejecting a VF before setting up MSI
(Dexuan Cui)
- make several structures static (Fengguang Wu)
- increase number of MSI IRQs supported by Synopsys DesignWare bridges
from 32 to 256 (Gustavo Pimentel)
- implemented multiplexed IRQ domain API and remove obsolete MSI IRQ
API from DesignWare drivers (Gustavo Pimentel)
- add Tegra power management support (Manikanta Maddireddy)
- add Tegra loadable module support (Manikanta Maddireddy)
- handle 64-bit BARs correctly in endpoint support (Niklas Cassel)
- support optional regulator for HiSilicon STB (Shawn Guo)
- use regulator bulk API for Qualcomm apq8064 (Srinivas Kandagatla)
- support power supplies for Qualcomm msm8996 (Srinivas Kandagatla)
* tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (123 commits)
MAINTAINERS: Add John Garry as maintainer for HiSilicon LPC driver
HISI LPC: Add ACPI support
ACPI / scan: Do not enumerate Indirect IO host children
ACPI / scan: Rename acpi_is_serial_bus_slave() for more general use
HISI LPC: Support the LPC host on Hip06/Hip07 with DT bindings
of: Add missing I/O range exception for indirect-IO devices
PCI: Apply the new generic I/O management on PCI IO hosts
PCI: Add fwnode handler as input param of pci_register_io_range()
PCI: Remove __weak tag from pci_register_io_range()
MAINTAINERS: Add missing /drivers/pci/cadence directory entry
fm10k: Report PCIe link properties with pcie_print_link_status()
net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth
net/mlx5: Report PCIe link properties with pcie_print_link_status()
net/mlx4_core: Report PCIe link properties with pcie_print_link_status()
PCI: Add pcie_print_link_status() to log link speed and whether it's limited
PCI: Add pcie_bandwidth_available() to compute bandwidth available to device
misc: pci_endpoint_test: Handle 64-bit BARs properly
PCI: designware-ep: Make dw_pcie_ep_reset_bar() handle 64-bit BARs properly
PCI: endpoint: Make sure that BAR_5 does not have 64-bit flag set when clearing
PCI: endpoint: Make epc->ops->clear_bar()/pci_epc_clear_bar() take struct *epf_bar
...
POWER8 restores AMOR when waking from deep sleep, but POWER9 does not,
because it does not go through the subcore restore.
Have POWER9 restore it in core restore.
Fixes: ee97b6b99f ("powerpc/mm/radix: Setup AMOR in HV mode to allow key 0")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pkey code added a CPU_FTR_PKEY bit, but did not add it to the
dt_cpu_ftrs feature set. Although capability is supported by all
processors in the base dt_cpu_ftrs set for 64s, it's a significant
and sufficiently well defined feature to make it optional. So add
it as a quirk for now, which can be versioned out then controlled
by the firmware (once dt_cpu_ftrs gains versioning support).
Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Presently the dt_cpu_ftrs restore_cpu will only add bits to the LPCR
for secondaries, but some bits must be removed (e.g., UPRT for HPT).
Not clearing these bits on secondaries causes checkstops when booting
with disable_radix.
restore_cpu can not just set LPCR, because it is also called by the
idle wakeup code which relies on opal_slw_set_reg to restore the value
of LPCR, at least on P8 which does not save LPCR to stack in the idle
code.
Fix this by including a mask of bits to clear from LPCR as well, which
is used by restore_cpu.
This is a little messy now, but it's a minimal fix that can be
backported. Longer term, the idle SPR save/restore code can be
reworked to completely avoid calls to restore_cpu, then restore_cpu
would be able to unconditionally set LPCR to match boot processor
environment.
Fixes: 5a61ef74f2 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As described in that commit:
When stop is executed with EC=ESL=0, it appears to execute like a
normal instruction (resuming from NIP when woken by interrupt). So
all the save/restore handling can be avoided completely.
This is true, except in the case of an NMI interrupt (sreset or
machine check) interrupting the instruction. In that case, the NMI
gets an "interrupt occurred while the processor was in power-saving
mode" indication. The power-save wakeup code uses that bit to decide
whether to restore some registers (e.g., LR). Because these are no
longer saved, this causes random register corruption.
It may be possible to restore this optimisation by detecting the case
of no register loss on the wakeup side, and avoid restoring in that
case, but that's not a minor fix because the wakeup code itself uses
some registers that would be live (e.g., LR).
Fixes: b9ee31e100 ("powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions will be introduced into the generic iomap.c so they
can deal with PIO accesses in hi-lo/lo-hi variants. Thus, the powerpc
version of iomap.c will need to provide the same functions even
though, in this arch, they are identical to the regular
io{read|write}64 functions.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Horia Geantă <horia.geanta@nxp.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kernel parameter disable_radix takes different options
disable_radix=yes|no|1|0 or just disable_radix.
prom_init parsing is not supporting these options.
Fixes: 1fd6c02207 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When stop is executed with EC=ESL=0, it appears to execute like a
normal instruction (resuming from NIP when woken by interrupt). So all
the save/restore handling can be avoided completely. In particular NV
GPRs do not have to be saved, and MSR does not have to be switched
back to kernel MSR.
So move the test for EC=ESL=0 sleep states out to power9_idle_stop,
and return directly to the caller after stop in that case.
This improves performance for ping-pong benchmark with the stop0_lite
idle state by 2.54% for 2 threads in the same core, and 2.57% for
different cores. Performance increase with HV_POSSIBLE defined will be
improved further by avoiding the hwsync.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 3d4fbffdd7 ("powerpc/64s/idle: POWER9 implement a separate
idle stop function for hotplug") that added power9_offline_stop() was
written before commit 7672691a08 ("powerpc/powernv: Provide a way to
force a core into SMT4 mode").
When merging the former I failed to notice that it caused us to skip
the force-SMT4 logic for offline CPUs. The result is that offlined
CPUs will not correctly participate in the force-SMT4 logic, which
presumably will result in badness (not tested).
Reconcile the two commits by making power9_offline_stop() a pre-cursor
to power9_idle_stop(), so that they share the force-SMT4 logic.
This is based on an original commit from Nick, all breakage is my own.
Fixes: 3d4fbffdd7 ("powerpc/64s/idle: POWER9 implement a separate idle stop function for hotplug")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
- add a shell script to get Clang version
- improve portability of build scripts
- drop always-enabled CONFIG_THIN_ARCHIVE and remove unused code
- rename built-in.o which is now thin archive to built-in.a
- process clean/build targets one by one to get along with -j option
- simplify ld-option
- improve building with CONFIG_TRIM_UNUSED_KSYMS
- define KBUILD_MODNAME even for objects shared among multiple modules
- avoid linking multiple instances of same objects from composite objects
- move <linux/compiler_types.h> to c_flags to include it only for C files
- clean-up various Makefiles
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaw6eWAAoJED2LAQed4NsGrK8QAJmbYg83TTNoOgQRK/7Lg+sj
KL1+RGFxmdHRVOqG5n18L7Q4LmTD19tUFNQImrQTTrKrbH2vbMSTF2PfzdmDRwMz
R5vW5+wsagfhSttOce/GR4p9+bM9XEclzEa3liqNVQxijOFXmkV14pn0x5anYfeB
ABthxFFHcVn3exP/q3lmq048x1yNE71wUU5WQIWf6V/ZKf+++wQU8r7HpnATWYeO
vtf8gZq+xyLLjhxoJF6n6olSZXI7Yhz4jV2G68/VroS312AUFWPogK+cSshWGlSw
zGixM1q55oj9CXjZ37nR6pTzQhSZLf/uHX5beatlpeoJ4Hho6HlIABvx2oEnat7b
o5RW64RB0gVJqlYZdKxL29HNrovr9tlWPTaIPGFRvWDpl3c1w+rMKXE+5hwu8jMJ
2jgxd5FZCgBaDsAKojmeQR7PAo2ffAdUO0Dj/SuAaMOpuHWHcnJk9kIN2PUrC+Sf
d/H2soT9x+60KbQmtCEo5VfEN8bvNP3+ZSnadEG/gRN2IIa5FZAUQykU+i50gAvj
tuKAokdRGZHvXM+buYFBfN6RbhVCXzBF/fAG7r37QVR2u1zaUszmgFOUqERhTQfm
RNnyeAs9G9rljtna/AD7cIOdKTg8oETPISxt8Y6EzNMpI8PhF0aGoxso3yD19oH1
M+fq55RigsR48Kic40hY
=N5BL
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- add a shell script to get Clang version
- improve portability of build scripts
- drop always-enabled CONFIG_THIN_ARCHIVE and remove unused code
- rename built-in.o which is now thin archive to built-in.a
- process clean/build targets one by one to get along with -j option
- simplify ld-option
- improve building with CONFIG_TRIM_UNUSED_KSYMS
- define KBUILD_MODNAME even for objects shared among multiple modules
- avoid linking multiple instances of same objects from composite
objects
- move <linux/compiler_types.h> to c_flags to include it only for C
files
- clean-up various Makefiles
* tag 'kbuild-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (29 commits)
kbuild: get <linux/compiler_types.h> out of <linux/kconfig.h>
kbuild: clean up link rule of composite modules
kbuild: clean up archive rule of built-in.a
kbuild: remove partial section mismatch detection for built-in.a
net: liquidio: clean up Makefile for simpler composite object handling
lib: zstd: clean up Makefile for simpler composite object handling
kbuild: link $(real-obj-y) instead of $(obj-y) into built-in.a
kbuild: rename real-objs-y/m to real-obj-y/m
kbuild: move modname and modname-multi close to modname_flags
kbuild: simplify modname calculation
kbuild: fix modname for composite modules
kbuild: define KBUILD_MODNAME even if multiple modules share objects
kbuild: remove unnecessary $(subst $(obj)/, , ...) in modname-multi
kbuild: Use ls(1) instead of stat(1) to obtain file size
kbuild: link vmlinux only once for CONFIG_TRIM_UNUSED_KSYMS
kbuild: move include/config/ksym/* to include/ksym/*
kbuild: move CONFIG_TRIM_UNUSED_KSYMS code unneeded for external module
kbuild: restore autoksyms.h touch to the top Makefile
kbuild: move 'scripts' target below
kbuild: remove wrong 'touch' in adjust_autoksyms.sh
...
Pull sparc updates from David Miller:
1) Add support for ADI (Application Data Integrity) found in more
recent sparc64 cpus. Essentially this is keyed based access to
virtual memory, and if the key encoded in the virual address is
wrong you get a trap.
The mm changes were reviewed by Andrew Morton and others.
Work by Khalid Aziz.
2) Validate DAX completion index range properly, from Rob Gardner.
3) Add proper Kconfig deps for DAX driver. From Guenter Roeck.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
sparc64: Make atomic_xchg() an inline function rather than a macro.
sparc64: Properly range check DAX completion index
sparc: Make auxiliary vectors for ADI available on 32-bit as well
sparc64: Oracle DAX driver depends on SPARC64
sparc64: Update signal delivery to use new helper functions
sparc64: Add support for ADI (Application Data Integrity)
mm: Allow arch code to override copy_highpage()
mm: Clear arch specific VM flags on protection change
mm: Add address parameter to arch_validate_prot()
sparc64: Add auxiliary vectors to report platform ADI properties
sparc64: Add handler for "Memory Corruption Detected" trap
sparc64: Add HV fault type handlers for ADI related faults
sparc64: Add support for ADI register fields, ASIs and traps
mm, swap: Add infrastructure for saving page metadata on swap
signals, sparc: Add signal codes for ADI violations
The hard lockup watchdog can fire under local_irq_disable
on platforms with irq soft masking.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the NMI IPI rather than smp_call_function for smp_send_stop.
Have stopped CPUs hard disable interrupts rather than just soft
disable.
This function is used in crash/panic/shutdown paths to bring other
CPUs down as quickly and reliably as possible, and minimizing their
potential to cause trouble.
Avoiding the Linux smp_call_function infrastructure and (if supported)
using true NMI IPIs makes this more robust.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PSSCR value is not stored to PACA_REQ_PSSCR if the CPU does not
have the XER[SO] bug.
Fix this by storing up-front, outside the workaround code. The initial
test is not required because it is a slow path.
The workaround is made to depend on CONFIG_KVM_BOOK3S_HV_POSSIBLE, to
match pnv_power9_force_smt4_catch() where it is used. Drop the comment
on pnv_power9_force_smt4_catch() as it's no longer true.
Fixes: 7672691a08 ("powerpc/powernv: Provide a way to force a core into SMT4 mode")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the definition of the default security feature flags
(i.e., enabled by default) closer to the security feature flags.
This can be used to restore current flags to the default flags.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
flush_thread() calls __set_breakpoint() via set_debug_reg_defaults()
without checking ppc_breakpoint_available(). On Power8 or later CPUs
which have the DAWR feature disabled that will cause a write to the
DABR which is incorrect as those CPUs don't have a DABR.
Fix it two ways, by checking ppc_breakpoint_available() in
set_debug_reg_defaults(), and also by reworking __set_breakpoint() to
only write to DABR on Power7 or earlier.
Fixes: 9654153158 ("powerpc: Disable DAWR in the base POWER9 CPU features")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rework the logic in __set_breakpoint()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 8e0b634b13 ("powerpc/64s: Do not allocate lppaca if we are
not virtualized") removed allocation of lppaca on bare metal
platforms. But with CONFIG_PPC_SPLPAR enabled, we still access the
lppaca on bare metal in some code paths.
Fix this but adding runtime checks for SPLPAR (shared processor LPAR).
Fixes: 8e0b634b13 ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.
Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...
Using this helper allows us to avoid the in-kernel calls to the
sys_readahead() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_readahead().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_mmap_pgoff() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_mmap_pgoff().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_fadvise64_64() helper allows us to avoid the in-kernel
calls to the sys_fadvise64_64() syscall. The ksys_ prefix denotes that
this function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as ksys_fadvise64_64().
Some compat stubs called sys_fadvise64(), which then just passed through
the arguments to sys_fadvise64_64(). Get rid of this indirection, and call
ksys_fadvise64_64() directly.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_fallocate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_p{read,write}64() wrappers allows us to get rid of
in-kernel calls to the sys_pread64() and sys_pwrite64() syscalls.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_p{read,write}64().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_truncate() wrapper allows us to get rid of in-kernel
calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_truncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_sync_file_range() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_sync_file_range().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_ftruncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
When using SIG_DBG_BRANCH_TRACING, MSR.BE is left enabled in the
user context when single_step_exception() prepares the SIGTRAP
delivery. The resulting branch-trap-within-the-SIGTRAP-handler
isn't healthy.
Commit 2538c2d08f broke this, by
replacing an MSR mask operation of ~(MSR_SE | MSR_BE) with a call
to clear_single_step() which only clears MSR_SE.
This patch adds a new helper, clear_br_trace(), which clears the
debug trap before invoking the signal handler. This helper is a
NOP for BookE as SIG_DBG_BRANCH_TRACING isn't supported on BookE.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER4 has been broken since at least the change 49d09bf2a6
("powerpc/64s: Optimise MSR handling in exception handling"), which
requires mtmsrd L=1 support. This was introduced in ISA v2.01, and
POWER4 supports ISA v2.00.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The CPU_FTR_POWER9_DD2_1 flag is intended to be set for DD2.1 and
above (which is what the cputable setup does). Fix DT CPU features
quirk setup to match.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Merge with upstream changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rather than override the machine type in .S code (which can hide wrong
or ambiguous code generation for the target), set the type to power4
for all assembly.
This also means we need to be careful not to build power4-only code
when we're not building for Book3S, such as the "power7" versions of
copyuser/page/memcpy.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix Book3E build, don't build the "power7" variants for non-Book3S]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When waking from a CPU idle instruction (e.g., nap or stop), the sync
for ordering the KVM secondary thread state can be avoided if there
wakeup is coming from a kernel context rather than KVM context.
This improves performance for ping-pong benchmark with the stop0 idle
state by 0.46% for 2 threads in the same core, and 1.02% for different
cores.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement a new function to invoke stop, power9_offline_stop, which is
like power9_idle_stop but used by the cpu hotplug code.
Move KVM secondary state manipulation code to the offline case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
system_reset_exception does most of its own crash handling now,
invoking the debugger or crash dumps if they are registered. If not,
then it goes through to die() to print stack traces, and then is
supposed to panic (according to comments).
However after die() prints oopses, it does its own handling which
doesn't allow system_reset_exception to panic (e.g., it may just
kill the current process). This patch causes sreset exceptions to
return from die after it prints messages but before acting.
This also stops die from invoking the debugger on 0x100 crashes.
system_reset_exception similarly calls the debugger. It had been
thought this was harmless (because if the debugger was disabled,
neither call would fire, and if it was enabled the first call
would return). However in some cases like xmon 'X' command, the
debugger returns 0, which currently causes it to be entered
again (first in system_reset_exception, then in die), which is
confusing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
System Reset, being an NMI, must return more carefully than other
interrupts. It has traditionally returned via the nromal return
from exception path, but that has a number of problems.
- r13 does not get restored if returning to kernel. This is for
interrupts which may cause a context switch, which sreset will
never do. Interrupting OPAL (which uses a different r13) is one
place where this causes breakage.
- It may cause several other problems returning to kernel with
preempt or TIF_EMULATE_STACK_STORE if it hits at the wrong time.
It's safer just to have a simple restore and return, like machine
check which is the other NMI.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current EEH callbacks can race with a driver unbind. This can
result in a backtraces like this:
EEH: Frozen PHB#0-PE#1fc detected
EEH: PE location: S000009, PHB location: N/A
CPU: 2 PID: 2312 Comm: kworker/u258:3 Not tainted 4.15.6-openpower1 #2
Workqueue: nvme-wq nvme_reset_work [nvme]
Call Trace:
dump_stack+0x9c/0xd0 (unreliable)
eeh_dev_check_failure+0x420/0x470
eeh_check_failure+0xa0/0xa4
nvme_reset_work+0x138/0x1414 [nvme]
process_one_work+0x1ec/0x328
worker_thread+0x2e4/0x3a8
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
nvme nvme1: Removing after probe failure status: -19
<snip>
cpu 0x23: Vector: 300 (Data Access) at [c000000ff50f3800]
pc: c0080000089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
lr: c000000000026564: eeh_report_error+0xe0/0x110
sp: c000000ff50f3a80
msr: 9000000000009033
dar: 400
dsisr: 40000000
current = 0xc000000ff507c000
paca = 0xc00000000fdc9d80 softe: 0 irq_happened: 0x01
pid = 782, comm = eehd
Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 2017.11.2-00008-g4b6188e)) #2 SM P Tue Feb 27 12:33:27 PST 2018
enter ? for help
eeh_report_error+0xe0/0x110
eeh_pe_dev_traverse+0xc0/0xdc
eeh_handle_normal_event+0x184/0x4c4
eeh_handle_event+0x30/0x288
eeh_event_handler+0x124/0x170
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
The first part is an EEH (on boot), the second half is the resulting
crash. nvme probe starts the nvme_reset_work() worker thread. This
worker thread starts touching the device which see a device error
(EEH) and hence queues up an event in the powerpc EEH worker
thread. nvme_reset_work() then continues and runs
nvme_remove_dead_ctrl_work() which results in unbinding the driver
from the device and hence releases all resources. At the same time,
the EEH worker thread starts doing the EEH .error_detected() driver
callback, which no longer works since the resources have been freed.
This fixes the problem in the same way the generic PCIe AER code (in
drivers/pci/pcie/aer/aerdrv_core.c) does. It makes the EEH code hold
the device_lock() while performing the driver EEH callbacks and
associated code. This ensures either the callbacks are no longer
register, or if they are registered the driver will not be removed
from underneath us.
This has been broken forever. The EEH call backs were first introduced
in 2005 (in 77bd741561) but it's not clear if a lock was needed back
then.
Fixes: 77bd741561 ("[PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines")
Cc: stable@vger.kernel.org # v2.6.16+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file_load() on powerpc doesn't support kdump kernels yet, so it
returns -ENOTSUPP in that case.
I've recently learned that this errno is internal to the kernel and
isn't supposed to be exposed to userspace. Therefore, change to
-EOPNOTSUPP which is defined in an uapi header.
This does indeed make kexec-tools happier. Before the patch, on
ppc64le:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Unknown error 524
After the patch:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Operation not supported
Fixes: a0458284f0 ("powerpc: Add support code for kexec_file_load()")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On 64-bit Book3E systems, in setup_tlb_core_data() we reference other
CPUs pacas. But in commit 59f577743d ("powerpc/64: Defer paca
allocation until memory topology is discovered") the allocation of
non-boot-CPU pacas was deferred until later in boot.
This leads to an oops:
CPU maps initialized for 1 thread per core
Unable to handle kernel paging request for data at address 0x8888888888888918
Faulting instruction address: 0xc000000000e2f0d0
Oops: Kernel access of bad area, sig: 11 [#1]
NIP .setup_tlb_core_data+0xdc/0x160
Call Trace:
.setup_tlb_core_data+0x5c/0x160 (unreliable)
.setup_arch+0x80/0x348
.start_kernel+0x7c/0x598
start_here_common+0x1c/0x40
Luckily setup_tlb_core_data() is called immediately prior to
smp_setup_pacas(). So simply switching their order is sufficient to
fix the oops and seems unlikely to have any other unwanted side
effects.
Fixes: 59f577743d ("powerpc/64: Defer paca allocation until memory topology is discovered")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.
This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becomming an array of pointers.
For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.
The mmu_context_t is also updated to track the new extended_ids. To
support upto 4PB we need a total 8 contexts.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
in get_ea_context().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman reported the following call trace when running
ftracetest:
BUG: using __this_cpu_write() in preemptible [00000000] code: ftracetest/6178
caller is opt_pre_handler+0xc4/0x110
CPU: 1 PID: 6178 Comm: ftracetest Not tainted 4.15.0-rc7-gcc6x-gb2cd1df #1
Call Trace:
[c0000000f9ec39c0] [c000000000ac4304] dump_stack+0xb4/0x100 (unreliable)
[c0000000f9ec3a00] [c00000000061159c] check_preemption_disabled+0x15c/0x170
[c0000000f9ec3a90] [c000000000217e84] opt_pre_handler+0xc4/0x110
[c0000000f9ec3af0] [c00000000004cf68] optimized_callback+0x148/0x170
[c0000000f9ec3b40] [c00000000004d954] optinsn_slot+0xec/0x10000
[c0000000f9ec3e30] [c00000000004bae0] kretprobe_trampoline+0x0/0x10
This is showing up since OPTPROBES is now enabled with CONFIG_PREEMPT.
trampoline_probe_handler() considers itself to be a special kprobe
handler for kretprobes. In doing so, it expects to be called from
kprobe_handler() on a trap, and re-enables preemption before returning a
non-zero return value so as to suppress any subsequent processing of the
trap by the kprobe_handler().
However, with optprobes, we don't deal with special handlers (we ignore
the return code) and just try to re-enable preemption causing the above
trace.
To address this, modify trampoline_probe_handler() to not be special.
The only additional processing done in kprobe_handler() is to emulate
the instruction (in this case, a 'nop'). We adjust the value of
regs->nip for the purpose and delegate the job of re-enabling
preemption and resetting current kprobe to the probe handlers
(kprobe_handler() or optimized_callback()).
Fixes: 8a2d71a3f2 ("powerpc/kprobes: Disable preemption before invoking probe handler for optprobes")
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Per-node allocations are possible on 64s with radix that does
not have the bolted SLB limitation.
Hash would be able to do the same if all CPUs had the bottom of
their node-local memory bolted as well. This is left as an
exercise for the reader.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add dummy definition of boot_cpuid for !SMP]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rename the dummy allocate_pacas() to fix 32-bit build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Build an array that finds hardware CPU number from logical CPU
number in firmware CPU discovery. Use that rather than setting
paca of other CPUs directly, to begin with. Subsequent patch will
not have pacas allocated at this point.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix SMP=n build by adding #ifdef in arch_match_cpu_phys_id()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move this into the early setup code, and don't iterate over CPU masks.
We don't want to call into sysfs so early from setup, and a future patch
won't initialize CPU masks by the time this is called.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold in incremental fix from Nick for DSCR handling]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Split sparsemem initialisation from basic numa topology discovery.
Move the parsing earlier in boot, before pacas are allocated.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
slb_shadow structures are avoided for radix environment.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We no longer allocate lppacas in an array, so this patch removes the
1kB static alignment for the structure, and enforces the PAPR
alignment requirements at allocation time. We can not reduce the 1kB
allocation size however, due to existing KVM hypervisors.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Change the paca array into an array of pointers to pacas. Allocate
pacas individually.
This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocate all possible ones up-front then freeing unused.
This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "lppaca" is a structure registered with the hypervisor. This is
unnecessary when running on non-virtualised platforms. One field from
the lppaca (pmcregs_in_use) is also used by the host, so move the host
part out into the paca (lppaca field is still updated in
guest mode).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix non-pseries build with some #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge our fixes branch from the 4.16 cycle.
There were a number of important fixes merged, in particular some Power9
workarounds that we want in next for testing purposes. There's also been
some conflicting changes in the CPU features code which are best merged
and tested before going upstream.
This disables the DAWR on all POWER9 CPUs via cpu feature quirk.
Using the DAWR on POWER9 can cause xstops, hence we need to disable
it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This updates the ptrace code to use ppc_breakpoint_available().
We now advertise via PPC_PTRACE_GETHWDBGINFO zero breakpoints when the
DAWR is missing (ie. POWER9). This results in GDB falling back to
software emulation of the breakpoint (which is slow).
For the features advertised by PPC_PTRACE_GETHWDBGINFO, we keep
advertising DAWR as if we don't GDB assumes 1 breakpoint irrespective
of the number of breakpoints advertised. GDB then fails later when
trying to set this one breakpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add ppc_breakpoint_available() to determine if a breakpoint is
available currently via the DAWR or DABR.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Checking for a "fully active" device state requires testing two flag
bits, which is open coded in several places, so add a function to do
it.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The caller will always pass NULL for 'rmv_data' when
'eeh_aware_driver' is true, so the first two calls to
eeh_pe_dev_traverse() can be combined without changing behaviour as
can the two arms of the final 'if' block.
This should not change behaviour.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_reset_device() tests the value of 'bus' more than once but the
only caller, eeh_handle_normal_device() does this test itself and will
never pass NULL.
So, remove the dead tests.
This should not change behaviour.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is currently difficult to understand the behaviour of
eeh_reset_device() due to the way it's parameters are used. In
particular, when 'bus' is NULL, it's value is still necessary so the
same value is looked up again locally under a different name
('frozen_bus') but behaviour is changed.
To clarify this, add a new parameter 'driver_eeh_aware', and have the
caller set it when it would have passed NULL for 'bus' and always pass
a value for 'bus'. Then change any test that was on 'bus' to one on
'!driver_eeh_aware' and replace uses of 'frozen_bus' with 'bus'.
Also update the function's comment.
This should not change behaviour.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The name "frozen_bus" is misleading: it's not necessarily frozen, it's
just the PE's PCI bus.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove a test that checks if "frozen_bus" is NULL, because it cannot
have changed since it was tested at the start of the function and so
must be true here.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit "0ba178888b05 powerpc/eeh: Remove reference to PCI device"
removed a call to pci_dev_get() from __eeh_addr_cache_get_device() but
did not update the comment to match.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the EEH_PE_RECOVERING flag for a PE is managed by both the
caller and callee of eeh_handle_normal_event() (among other places not
considered here). This is complicated by the fact that the PE may
or may not have been invalidated by the call.
So move the callee's handling into eeh_handle_normal_event(), which
clarifies it and allows the return type to be changed to void (because
it no longer needs to indicate at the PE has been invalidated).
This should not change behaviour except in eeh_event_handler() where
it was previously possible to cause eeh_pe_state_clear() to be called
on an invalid PE, which is now avoided.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function eeh_handle_event(pe) does nothing other than switching
between calling eeh_handle_normal_event(pe) and
eeh_handle_special_event(). However it is only called in two places,
one where pe can't be NULL and the other where it must be NULL (see
eeh_event_handler()) so it does nothing but obscure the flow of
control.
So, remove it.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the pseries kernel advertises radix MMU support even if
the actual support is disabled via the CONFIG_PPC_RADIX_MMU option.
This adds a check for CONFIG_PPC_RADIX_MMU to avoid advertising radix
to the hypervisor.
Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a definition for cpu_show_spectre_v2() to override the generic
version. This has several permuations, though in practice some may not
occur we cater for any combination.
The most verbose is:
Mitigation: Indirect branch serialisation (kernel only), Indirect
branch cache disabled, ori31 speculation barrier enabled
We don't treat the ori31 speculation barrier as a mitigation on its
own, because it has to be *used* by code in order to be a mitigation
and we don't know if userspace is doing that. So if that's all we see
we say:
Vulnerable, ori31 speculation barrier enabled
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a definition for cpu_show_spectre_v1() to override the generic
version. Currently this just prints "Not affected" or "Vulnerable"
based on the firmware flag.
Although the kernel does have array_index_nospec() in a few places, we
haven't yet audited all the powerpc code to see where it's necessary,
so for now we don't list that as a mitigation.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we have the security feature flags we can make the
information displayed in the "meltdown" file more informative.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This landed in setup_64.c for no good reason other than we had nowhere
else to put it. Now that we have a security-related file, that is a
better place for it so move it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This commit adds security feature flags to reflect the settings we
receive from firmware regarding Spectre/Meltdown mitigations.
The feature names reflect the names we are given by firmware on bare
metal machines. See the hostboot source for details.
Arguably these could be firmware features, but that then requires them
to be read early in boot so they're available prior to asm feature
patching, but we don't actually want to use them for patching. We may
also want to dynamically update them in future, which would be
incompatible with the way firmware features work (at the moment at
least). So for now just make them separate flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the rfi-flush messages print 'Using <type> flush' for all
enabled_flush_types, but that is not necessarily true -- as now the
fallback flush is always enabled on pseries, but the fixup function
overwrites its nop/branch slot with other flush types, if available.
So, replace the 'Using <type> flush' messages with '<type> flush is
available'.
Also, print the patched flush types in the fixup function, so users
can know what is (not) being used (e.g., the slower, fallback flush,
or no flush type at all if flush is disabled via the debugfs switch).
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For PowerVM migration we want to be able to call setup_rfi_flush()
again after we've migrated the partition.
To support that we need to check that we're not trying to allocate the
fallback flush area after memblock has gone away (i.e., boot-time only).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
rfi_flush_enable() includes a check to see if we're already
enabled (or disabled), and in that case does nothing.
But that means calling setup_rfi_flush() a 2nd time doesn't actually
work, which is a bit confusing.
Move that check into the debugfs code, where it really belongs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The SLB bad address handler's trap number fixup does not preserve the
low bit that indicates nonvolatile GPRs have not been saved. This
leads save_nvgprs to skip saving them, and subsequent functions and
return from interrupt will think they are saved.
This causes kernel branch-to-garbage debugging to not have correct
registers, can also cause userspace to have its registers clobbered
after a segfault.
Fixes: f0f558b131 ("powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Incremental linking is gone, so rename built-in.o to built-in.a, which
is the usual extension for archive files.
This patch does two things, first is a simple search/replace:
git grep -l 'built-in\.o' | xargs sed -i 's/built-in\.o/built-in\.a/g'
The second is to invert nesting of nested text manipulations to avoid
filtering built-in.a out from libs-y2:
-libs-y2 := $(filter-out %.a, $(patsubst %/, %/built-in.a, $(libs-y)))
+libs-y2 := $(patsubst %/, %/built-in.a, $(filter-out %.a, $(libs-y)))
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 processors up to and including "Nimbus" v2.2 have hardware
bugs relating to transactional memory and thread reconfiguration.
One of these bugs has a workaround which is to get the core into
SMT4 state temporarily. This workaround is only needed when
running bare-metal.
This patch provides a function which gets the core into SMT4 mode
by preventing threads from going to a stop state, and waking up
those which are already in a stop state. Once at least 3 threads
are not in a stop state, the core will be in SMT4 and we can
continue.
To do this, we add a "dont_stop" flag to the paca to tell the
thread not to go into a stop state. If this flag is set,
power9_idle_stop() just returns immediately with a return value
of 0. The pnv_power9_force_smt4_catch() function does the following:
1. Set the dont_stop flag for each thread in the core, except
ourselves (in fact we use an atomic_inc() in case more than
one thread is calling this function concurrently).
2. See how many threads are awake, indicated by their
requested_psscr field in the paca being 0. If this is at
least 3, skip to step 5.
3. Send a doorbell interrupt to each thread that was seen as
being in a stop state in step 2.
4. Until at least 3 threads are awake, scan the threads to which
we sent a doorbell interrupt and check if they are awake now.
This relies on the following properties:
- Once dont_stop is non-zero, requested_psccr can't go from zero to
non-zero, except transiently (and without the thread doing stop).
- requested_psscr being zero guarantees that the thread isn't in
a state-losing stop state where thread reconfiguration could occur.
- Doing stop with a PSSCR value of 0 won't be a state-losing stop
and thus won't allow thread reconfiguration.
- Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core
must be in SMT4 mode, since SMT modes are powers of 2.
This does add a sync to power9_idle_stop(), which is necessary to
provide the correct ordering between setting requested_psscr and
checking dont_stop. The overhead of the sync should be unnoticeable
compared to the latency of going into and out of a stop state.
Because some objected to incurring this extra latency on systems where
the XER[SO] bug is not relevant, I have put the test in
power9_idle_stop inside a feature section. This means that
pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems
without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will
probably hang the system.
In order to cater for uses where the caller has an operation that
has to be done while the core is in SMT4, the core continues to be
kept in SMT4 after pnv_power9_force_smt4_catch() function returns,
until the pnv_power9_force_smt4_release() function is called.
It undoes the effect of step 1 above and allows the other threads
to go into a stop state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a CPU feature bit which is set for POWER9 "Nimbus" DD2.2
processors which will be used to enable the hypervisor to assist
hardware with the handling of checkpointed register values while the
CPU is in suspend state, in order to work around hardware bugs. The
hardware assistance for these workarounds introduced a new hardware
bug relating to the XER[SO] bit. We add a separate feature bit for
this bug in case future chips fix it while still requiring the
hypervisor assistance with suspend state.
When the dt_cpu_ftrs subsystem is in use, the software assistance can
be enabled using a "tm-suspend-hypervisor-assist" node in the device
tree, and a "tm-suspend-xer-so-bug" node enables the workarounds for
the XER[SO] bug. In the absence of such nodes, a quirk enables both
for POWER9 "Nimbus" DD2.2 processors.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves all the CPU feature bits that are only used on 32-bit
machines to the top 20 bits of the CPU feature word and arranges
for them to be defined only in 32-bit builds. The features that
are common to 32-bit and 64-bit machines are moved to bits 0-11
of the CPU feature word. This means that for 64-bit platforms,
bits 44-63 can now be used for new features that only exist on
64-bit machines. (These bit numbers are counting from the right,
i.e. the LSB is bit 0.)
Because CPU_FTR_L3_DISABLE_NAP moved from the low 16 bits to the high
16 bits, we have to adjust some assembly code. Also, CPU_FTR_EMB_HV
moved from the high 16 bits to the low 16 bits.
Note that CPU_FTR_REAL_LE only applies to 64-bit chips, because only
64-bit chips (POWER6, 7, 8, 9) have a true little-endian mode that is
a CPU execution mode as opposed to being a page attribute.
With this we now have 20 free CPU feature bits on 64-bit machines.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All PowerPC CPUs other than the original PPC601 have a timebase
register rather than the "real-time clock" (RTC) register that the
PPC601 (and the original POWER and POWER2 CPUs) had. Currently
we have a CPU feature bit to indicate the presence of the timebase,
but it makes more sense to use a bit to indicate the unusual
situation rather than the common situation. This therefore defines
a CPU_FTR_USE_RTC bit in place of the CPU_FTR_USE_TB bit, and
arranges for it to be set on PPC601 systems.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9, under some circumstances, a broadcast TLB invalidation
might complete before all previous stores have drained, potentially
allowing stale stores from becoming visible after the invalidation.
This works around it by doubling up those TLB invalidations which was
verified by HW to be sufficient to close the risk window.
This will be documented in a yet-to-be-published errata.
Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Enable the feature in the DT CPU features code for all Power9,
rename the feature to CPU_FTR_P9_TLBIE_BUG per benh.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
force_external_irq_replay() can be called in the do_IRQ path with
interrupts hard enabled and soft disabled if may_hard_irq_enable() set
MSR[EE]=1. It updates local_paca->irq_happened with a load, modify,
store sequence. If a maskable interrupt hits during this sequence, it
will go to the masked handler to be marked pending in irq_happened.
This update will be lost when the interrupt returns and the store
instruction executes. This can result in unpredictable latencies,
timeouts, lockups, etc.
Fix this by ensuring hard interrupts are disabled before modifying
irq_happened.
This could cause any maskable asynchronous interrupt to get lost, but
it was noticed on P9 SMP system doing RDMA NVMe target over 100GbE,
so very high external interrupt rate and high IPI rate. The hang was
bisected down to enabling doorbell interrupts for IPIs. These provided
an interrupt type that could run at high rates in the do_IRQ path,
stressing the race.
Fixes: 1d607bb3bd ("powerpc/irq: Add mechanism to force a replay of interrupts")
Cc: stable@vger.kernel.org # v4.8+
Reported-by: Carol L. Soto <clsoto@us.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It's slightly less error prone to use sizeof(*foo) rather than
specifying the type.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[mpe: Consolidate into one patch, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The flush_dcache_phys_range() function is no longer used in the
kernel. The last usage was removed in c40785ad30 ("powerpc/dart: Use
a cachable DART").
This patch removes the function and declaration.
Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
[mpe: Munge change log, include commit that removed last user]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A protection flag may not be valid across entire address space and
hence arch_validate_prot() might need the address a protection bit is
being set on to ensure it is a valid protection flag. For example, sparc
processors support memory corruption detection (as part of ADI feature)
flag on memory addresses mapped on to physical RAM but not on PFN mapped
pages or addresses mapped on to devices. This patch adds address to the
parameters being passed to arch_validate_prot() so protection bits can
be validated in the relevant context.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RTC core is always calling rtc_valid_tm after the read_time callback.
It is not necessary to call it just before returning from the callback.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When running virtualised the powerpc kernel is able to run the system
in "compat mode" - which means the kernel and hardware are pretending
to userspace that the CPU is an older version than it actually is.
AT_BASE_PLATFORM is an AUXV entry that we export to userspace for use
when we're running in that mode, which tells userspace the "platform"
string for the real CPU version, as opposed to the faked version.
Although we don't support compat mode when using DT CPU features, and
arguably don't need to set AT_BASE_PLATFORM, the existing cputable
based code always sets it even when we're running bare metal. That
means the lack of AT_BASE_PLATFORM is a user-visible artifact of the
fact that the kernel is using DT CPU features, which we don't want.
So set it in the DT CPU features code also.
This results in eg:
$ LD_SHOW_AUXV=1 /bin/true | grep "AT_.*PLATFORM"
AT_PLATFORM: power9
AT_BASE_PLATFORM:power9
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
early_init() and machine_init() have no prototype, add one in
asm-prototypes.h.
Fixes the following warnings (treated as error in W=1):
arch/powerpc/kernel/setup_32.c:68:30: error: no previous prototype for ‘early_init’
arch/powerpc/kernel/setup_32.c:99:21: error: no previous prototype for ‘machine_init’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
[mpe: Move them to asm-prototypes.h, drop other functions]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions can all be static, make it so.
Signed-off-by: Mathieu Malaterre <malat@debian.org>
[mpe: Combine a patch of Mathieu's with some other static conversions]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When neither CONFIG_ALTIVEC, nor CONFIG_VSX or CONFIG_PPC64 is
defined, the array feature_properties is defined as an empty array,
which in turn triggers the following warning (treated as error on
W=1):
arch/powerpc/kernel/prom.c: In function ‘check_cpu_feature_properties’:
arch/powerpc/kernel/prom.c:298:16: error: comparison of unsigned expression < 0 is always false
for (i = 0; i < ARRAY_SIZE(feature_properties); ++i, ++fp) {
^
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Two functions did not have a prototype defined in signal.h header. Fix
the following two warnings (treated as errors in W=1):
arch/powerpc/kernel/signal_32.c:1135:6: error: no previous prototype for ‘sys_rt_sigreturn’
arch/powerpc/kernel/signal_32.c:1422:6: error: no previous prototype for ‘sys_sigreturn’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
__giveup_fpu() is never called outside process.c, so it can be static.
That also means we don't need an empty definition in switch_to.h
Signed-off-by: Mathieu Malaterre <malat@debian.org>
[mpe: Also drop the empty version, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since the value of `tmp` is never intended to be read, declare both `tmp`
variables as unused. Fix warning (treated as error in W=1):
arch/powerpc/kernel/signal_32.c: In function ‘sys_swapcontext’:
arch/powerpc/kernel/signal_32.c:1048:16: error: variable ‘tmp’ set but not used
arch/powerpc/kernel/signal_32.c: In function ‘sys_debug_setcontext’:
arch/powerpc/kernel/signal_32.c🔢16: error: variable ‘tmp’ set but not used
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With ibm,dynamic-memory-v2 and ibm,drc-info coming around the same
time, byte22 in vector5 of ibm architecture vector table got set twice
separately. The end result is that guest kernel isn't advertising
support for ibm,dynamic-memory-v2.
Fix this by removing the duplicate assignment of byte22.
Fixes: 02ef6dd810 ("powerpc: Enable support for ibm,drc-info devtree property")
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
While the implementation of the "slices" address space allows
a significant amount of high slices, it limits the number of
low slices to 16 due to the use of a single u64 low_slices_psize
element in struct mm_context_t
On the 8xx, the minimum slice size is the size of the area
covered by a single PMD entry, ie 4M in 4K pages mode and 64M in
16K pages mode. This means we could have at least 64 slices.
In order to override this limitation, this patch switches the
handling of low_slices_psize to char array as done already for
high_slices_psize.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On the 8xx, the page size is set in the PMD entry and applies to
all pages of the page table pointed by the said PMD entry.
When an app has some regular pages allocated (e.g. see below) and tries
to mmap() a huge page at a hint address covered by the same PMD entry,
the kernel accepts the hint allthough the 8xx cannot handle different
page sizes in the same PMD entry.
10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc
mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000
This results the app remaining forever in do_page_fault()/hugetlb_fault()
and when interrupting that app, we get the following warning:
[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W 4.14.6 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP: c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700 Tainted: G W (4.14.6)
[162980.035854] MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24044224 XER: 20000000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
[162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
[162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
[162980.038014] 2f890000 419e0010 712a0ff0 408200e0 <0fe00000> 54a9000a 7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1
In order to fix this, this patch uses the address space "slices"
implemented for BOOK3S/64 and enhanced to support PPC32 by the
preceding patch.
This patch modifies the context.id on the 8xx to be in the range
[1:16] instead of [0:15] in order to identify context.id == 0 as
not initialised contexts as done on BOOK3S
This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
selected for the 8xx
Alltough we could in theory have as many slices as PMD entries, the
current slices implementation limits the number of low slices to 16.
This limitation is not preventing us to fix the initial issue allthough
it is suboptimal. It will be cured in a subsequent patch.
Fixes: 4b91428699 ("powerpc/8xx: Implement support of hugepages")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit f719582435 ("PCI: Add pci_mmap_resource_range() and use it for
ARM64") added this generic function with the intent of using it everywhere
and ultimately killing the old arch-specific implementations.
Remove the powerpc-specific pci_mmap_page_range() and use the generic
pci_mmap_resource_range() instead.
Powerpc can mmap I/O port space, so supply the powerpc-specific
pci_iobar_pfn() required to make that work.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
This reverts commit 02ef6dd810.
The earlier patch tried to enable support for a new property
"ibm,drc-info" on powerpc systems.
Unfortunately, some errors in the associated patch set break things
in some of the DLPAR operations. In particular when attempting to
hot-add a new CPU or set of CPUs, the original patch failed to
properly calculate the available resources, and aborted the operation.
In addition, the original set missed several opportunities to compress
and reuse common code.
As the associated patch set was meant to provide an optimization of
storage and performance of a set of device-tree properties for future
systems with large amounts of resources, reverting just restores
the previous behavior for existing systems. It seems unnecessary
to enable this feature and introduce the consequent problems in the
field that it will cause at this time, so please revert it for now
until testing of the corrections are finished properly.
Fixes: 02ef6dd810 ("powerpc: Enable support for ibm,drc-info devtree property")
Signed-off-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The notify_resume() callback in eeh_ops is NULL on powernv, leading to
crashes:
NIP (null)
LR eeh_report_resume+0x218/0x220
Call Trace:
eeh_report_resume+0x1f0/0x220 (unreliable)
eeh_pe_dev_traverse+0x98/0x170
eeh_handle_normal_event+0x3f4/0x650
eeh_handle_event+0x54/0x380
eeh_event_handler+0x14c/0x210
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0xb4
Fix it by adding a check before calling it.
Fixes: 856e1eb9bd ("PCI/AER: Add uevents in AER and EEH error/resume")
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Reviewed-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Tested-by: Carol L. Soto <clsoto@us.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Tested-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
[mpe: Rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The TSCR can only be accessed in hypervisor mode.
Fixes: 88b5e12eeb11 ("powerpc: Expose TSCR via sysfs")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A larger batch of fixes than we'd like. Roughly 1/3 fixes for new code, 1/3
fixes for stable and 1/3 minor things.
There's four commits fixing bugs when using 16GB huge pages on hash, caused by
some of the preparatory changes for pkeys.
Two fixes for bugs in the enhanced IRQ soft masking for local_t, one of which
broke KVM in some circumstances.
Four fixes for Power9. The most bizarre being a bug where futexes stopped
working because a NULL pointer dereference didn't trap during early boot (it
aliased the kernel mapping). A fix for memory hotplug when using the Radix MMU,
and a fix for live migration of guests using the Radix MMU.
Two fixes for hotplug on pseries machines. One where we weren't correctly
updating NUMA info when CPUs are added and removed. And the other fixes
crashes/hangs seen when doing memory hot remove during boot, which is apparently
a thing people do.
Finally a handful of build fixes for obscure configs and other minor fixes.
Thanks to:
Alexey Kardashevskiy, Aneesh Kumar K.V, Balbir Singh, Colin Ian King, Daniel
Henrique Barboza, Florian Weimer, Guenter Roeck, Harish, Laurent Vivier,
Madhavan Srinivasan, Mauricio Faria de Oliveira, Nathan Fontenot, Nicholas
Piggin, Sam Bobroff.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJahDTmExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAd
chAAtVe8hmkEJefTbU63GBeqva0JHSiTu2DENZAlN/epWtbtyl05PLETMdTcwGCv
nK2zzR+xbSFN1DzZK8KQfDBW33McKZE+YkHwYOC8Kff/N0SKdHK4zvxYr7FTZGzG
9uSG5vrxVEsPLT/yANabl0d0vKWMsJ1jZquvJAU0eLNUbA/skGjEPADtXqYQUXiA
EnW4xeczsMLjuzTleoRqrBx74Gulovuq9LVAjfDvkydWlCU9MQkrodCgP0V2hQtw
RAJ/QLY+NS/vMCBnvVOGBaKzIqrfeQTHF3P0j4pyBeBq/2kNuidM5n25uoc31wUq
DE4Ebe2FJA6CHP5KEyf7dr9y7gsks/ak3/CKs+l6Yz3/0BqenEMhu6WKJ1tgf9cC
qAmi1dIjtpw6JZ6baCbkloUdAGNjKVfLWB9ld9VIfg0C+C3y4L7+TKJukxrCBGI6
hopfT/3p8xUdla3euiRXRLZzajyKDGrqk71hk5J/J0ChXfWB0B51X0F6NIfH41Mn
YsVUQ95p3zS79Pl942ijGScFX/bNVLfEEGzlI/nwU/wbTxF5g/XNXm5PjBsGSr/W
zFcCwCpFV2b/kypQoxQA5CbrKRCLOleDA/lLOxW/1NMYOQsNj05DM9wYAw5Bl+lX
AVj2c5jM9heNN4scxDiufRNfqZbyjZ4fFUpXLNqs7N5vcks=
=BmuL
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A larger batch of fixes than we'd like. Roughly 1/3 fixes for new
code, 1/3 fixes for stable and 1/3 minor things.
There's four commits fixing bugs when using 16GB huge pages on hash,
caused by some of the preparatory changes for pkeys.
Two fixes for bugs in the enhanced IRQ soft masking for local_t, one
of which broke KVM in some circumstances.
Four fixes for Power9. The most bizarre being a bug where futexes
stopped working because a NULL pointer dereference didn't trap during
early boot (it aliased the kernel mapping). A fix for memory hotplug
when using the Radix MMU, and a fix for live migration of guests using
the Radix MMU.
Two fixes for hotplug on pseries machines. One where we weren't
correctly updating NUMA info when CPUs are added and removed. And the
other fixes crashes/hangs seen when doing memory hot remove during
boot, which is apparently a thing people do.
Finally a handful of build fixes for obscure configs and other minor
fixes.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Balbir Singh, Colin
Ian King, Daniel Henrique Barboza, Florian Weimer, Guenter Roeck,
Harish, Laurent Vivier, Madhavan Srinivasan, Mauricio Faria de
Oliveira, Nathan Fontenot, Nicholas Piggin, Sam Bobroff"
* tag 'powerpc-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix to use ucontext_t instead of struct ucontext
powerpc/kdump: Fix powernv build break when KEXEC_CORE=n
powerpc/pseries: Fix build break for SPLPAR=n and CPU hotplug
powerpc/mm/hash64: Zero PGD pages on allocation
powerpc/mm/hash64: Store the slot information at the right offset for hugetlb
powerpc/mm/hash64: Allocate larger PMD table if hugetlb config is enabled
powerpc/mm: Fix crashes with 16G huge pages
powerpc/mm: Flush radix process translations when setting MMU type
powerpc/vas: Don't set uses_vas for kernel windows
powerpc/pseries: Enable RAS hotplug events later
powerpc/mm/radix: Split linear mapping on hot-unplug
powerpc/64s/radix: Boot-time NULL pointer protection using a guard-PID
ocxl: fix signed comparison with less than zero
powerpc/64s: Fix may_hard_irq_enable() for PMI soft masking
powerpc/64s: Fix MASKABLE_RELON_EXCEPTION_HV_OOL macro
powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ARM:
- Include icache invalidation optimizations, improving VM startup time
- Support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- A small fix for power-management notifiers, and some cosmetic changes
PPC:
- Add MMIO emulation for vector loads and stores
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- Improve the handling of escalation interrupts with the XIVE interrupt
controller
- Support decrement register migration
- Various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- Exitless interrupts for emulated devices
- Cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- Hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
features
- Show vcpu id in its anonymous inode name
- Many fixes and cleanups
- Per-VCPU MSR bitmaps (already merged through x86/pti branch)
- Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJafvMtAAoJEED/6hsPKofo6YcH/Rzf2RmshrWaC3q82yfIV0Qz
Z8N8yJHSaSdc3Jo6cmiVj0zelwAxdQcyjwlT7vxt5SL2yML+/Q0st9Hc3EgGGXPm
Il99eJEl+2MYpZgYZqV8ff3mHS5s5Jms+7BITAeh6Rgt+DyNbykEAvzt+MCHK9cP
xtsIZQlvRF7HIrpOlaRzOPp3sK2/MDZJ1RBE7wYItK3CUAmsHim/LVYKzZkRTij3
/9b4LP1yMMbziG+Yxt1o682EwJB5YIat6fmDG9uFeEVI5rWWN7WFubqs8gCjYy/p
FX+BjpOdgTRnX+1m9GIj0Jlc/HKMXryDfSZS07Zy4FbGEwSiI5SfKECub4mDhuE=
=C/uD
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- icache invalidation optimizations, improving VM startup time
- support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- a small fix for power-management notifiers, and some cosmetic
changes
PPC:
- add MMIO emulation for vector loads and stores
- allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- improve the handling of escalation interrupts with the XIVE
interrupt controller
- support decrement register migration
- various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- exitless interrupts for emulated devices
- cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
AVX512 features
- show vcpu id in its anonymous inode name
- many fixes and cleanups
- per-VCPU MSR bitmaps (already merged through x86/pti branch)
- stable KVM clock when nesting on Hyper-V (merged through
x86/hyperv)"
* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
KVM: PPC: Book3S HV: Branch inside feature section
KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
KVM: PPC: Book3S PR: Fix broken select due to misspelling
KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
KVM: PPC: Book3S HV: Drop locks before reading guest memory
kvm: x86: remove efer_reload entry in kvm_vcpu_stat
KVM: x86: AMD Processor Topology Information
x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
kvm: embed vcpu id to dentry of vcpu anon inode
kvm: Map PFN-type memory regions as writable (if possible)
x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
KVM: arm/arm64: Fixup userspace irqchip static key optimization
KVM: arm/arm64: Fix userspace_irqchip_in_use counting
KVM: arm/arm64: Fix incorrect timer_is_pending logic
MAINTAINERS: update KVM/s390 maintainers
MAINTAINERS: add Halil as additional vfio-ccw maintainer
MAINTAINERS: add David as a reviewer for KVM/s390
...
59f47eff03 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
replaced of_irq_parse_pci() + irq_create_of_mapping() with
of_irq_parse_and_map_pci(), but neglected to capture the virq
returned by irq_create_of_mapping(), so virq remained zero, which
caused INTx configuration to fail.
Save the virq value returned by of_irq_parse_and_map_pci() and correct
the virq declaration to match the of_irq_parse_and_map_pci() signature.
Fixes: 59f47eff03 "powerpc/pci: Use of_irq_parse_and_map_pci() helper"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The soft IRQ masking code has to hard-disable interrupts in cases
where the exception is not cleared by the masked handler. External
interrupts used this approach for soft masking. Now recently PMU
interrupts do the same thing.
The soft IRQ masking code additionally allowed for interrupt handlers
to hard-enable interrupts after soft-disabling them. The idea is to
allow PMU interrupts through to profile interrupt handlers.
So when interrupts are being replayed when there is a pending
interrupt that requires hard-disabling, there is a test to prevent
those handlers from hard-enabling them if there is a pending external
interrupt. may_hard_irq_enable() handles this.
After f442d00480 ("powerpc/64s: Add support to mask perf interrupts
and replay them"), may_hard_irq_enable() could prematurely enable
MSR[EE] when a PMU exception exists, which would result in the
interrupt firing again while masked, and MSR[EE] being disabled again.
I haven't seen that this could cause a serious problem, but it's
more consistent to handle these soft-masked interrupts in the same
way. So introduce a define for all types of interrupts that require
MSR[EE] masking in their soft-disable handlers, and use that in
may_hard_irq_enable().
Fixes: f442d00480 ("powerpc/64s: Add support to mask perf interrupts and replay them")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJad5lgAAoJEFmIoMA60/r8s2kQAI3PztawDpaCP9Z12pkbBHSt
Ho0xTyk9rCZi9kQJbNjc+a+QrlA3QmTHXIXerB3LSWoh7M+XhsECjem92eHpgLNS
JvYPhTfOrCr0vdiAmOz6hD0AqN/psrbfzgiJhSwomsGEFS77k7kERSJckRv81sxb
Aj5F/WjucAgLorwm4auveAJEQ7atE7/6pkXzoqYm4G6NLOb46jUcRGndrnvXZBlz
fws8fBM4BHyi7i25CYQl24tFq1CGax1rIPgLg+4KnH76bQk/N6Ju0sGVSzfh+hG8
SIerK9bJbzGRAuNKoxB3aO1dyzsK3x9WztE2mG98w5trOISPIR1FqnvC/225FWAU
d6eIXiC7wKnEx+DElNTzCjzfHc7SAJoupO32H7CoiTe5zPUlWlxJ1zLYkK1gt50q
m8PRBiYTglxyznzrO0drtcdjEzvbdZNRrsYnul4wi1vSHzjk6F6XLtzT10XWM1M1
1pXLB8384FTj0Hu4bq6Y3Aivkmz0Sf+eQM2NaOwe+Zj7/1VV0d3lvi4LUXkqzLCA
FoXPJSMxG2Qu+iflCeYRQBJjExaZH3eNLZ3dT6QpcJrjaFVedd9u5DeeFqNL27zV
bhr8TdqrR4p4rc8EBAGoCapw96IxLZROKB3gxbrZVOpfIZpzthwHbElHX6aqUgF4
w/EV1JWs36WXWaxFk8wd
=ttq9
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- skip AER driver error recovery callbacks for correctable errors
reported via ACPI APEI, as we already do for errors reported via the
native path (Tyler Baicar)
- fix DPC shared interrupt handling (Alex Williamson)
- print full DPC interrupt number (Keith Busch)
- enable DPC only if AER is available (Keith Busch)
- simplify DPC code (Bjorn Helgaas)
- calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
Helgaas)
- enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
Helgaas)
- move ASPM internal interfaces out of public header (Bjorn Helgaas)
- allow hot-removal of VGA devices (Mika Westerberg)
- speed up unplug and shutdown by assuming Thunderbolt controllers
don't support Command Completed events (Lukas Wunner)
- add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
Jay Cornwall)
- expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)
- clean up PCI DMA interface usage (Christoph Hellwig)
- remove PCI pool API (replaced with DMA pool) (Romain Perier)
- deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
Kaya)
- move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)
- add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)
- remove warnings on sysfs mmap failure (Bjorn Helgaas)
- quiet ROM validation messages (Alex Deucher)
- remove redundant memory alloc failure messages (Markus Elfring)
- fill in types for compile-time VGA and other I/O port resources
(Bjorn Helgaas)
- make "pci=pcie_scan_all" work for Root Ports as well as Downstream
Ports to help AmigaOne X1000 (Bjorn Helgaas)
- add SPDX tags to all PCI files (Bjorn Helgaas)
- quirk Marvell 9128 DMA aliases (Alex Williamson)
- quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)
- fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
Cassel)
- use DMA API to get MSI address for DesignWare IP (Niklas Cassel)
- fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)
- fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)
- add support for ARTPEC-7 SoC (Niklas Cassel)
- add endpoint-mode support for ARTPEC (Niklas Cassel)
- add Cadence PCIe host and endpoint controller driver (Cyrille
Pitchen)
- handle multiple INTx status bits being set in dra7xx (Vignesh R)
- translate dra7xx hwirq range to fix INTD handling (Vignesh R)
- remove deprecated Exynos PHY initialization code (Jaehoon Chung)
- fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)
- fix NULL pointer dereference in iProc BCMA driver (Ray Jui)
- fix Keystone interrupt-controller-node lookup (Johan Hovold)
- constify qcom driver structures (Julia Lawall)
- rework Tegra config space mapping to increase space available for
endpoints (Vidya Sagar)
- simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)
- remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)
- add support for Global Fabric Manager Server (GFMS) event to
Microsemi Switchtec switch driver (Logan Gunthorpe)
- add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)
* tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
PCI: endpoint: Fix EPF device name to support multi-function devices
PCI: endpoint: Add the function number as argument to EPC ops
PCI: cadence: Add host driver for Cadence PCIe controller
dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
PCI: Add vendor ID for Cadence
PCI: Add generic function to probe PCI host controllers
PCI: generic: fix missing call of pci_free_resource_list()
PCI: OF: Add generic function to parse and allocate PCI resources
PCI: Regroup all PCI related entries into drivers/pci/Makefile
PCI/DPC: Reformat DPC register definitions
PCI/DPC: Add and use DPC Status register field definitions
PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
PCI/DPC: Remove unnecessary RP PIO register structs
PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
PCI/DPC: Make RP PIO log size check more generic
PCI/DPC: Rename local "status" to "dpc_status"
PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
...
Highlights:
- Enable support for memory protection keys aka "pkeys" on Power7/8/9 when
using the hash table MMU.
- Extend our interrupt soft masking to support masking PMU interrupts as well
as "normal" interrupts, and then use that to implement local_t for a ~4x
speedup vs the current atomics-based implementation.
- A new driver "ocxl" for "Open Coherent Accelerator Processor Interface
(OpenCAPI)" devices.
- Support for new device tree properties on PowerVM to describe hotpluggable
memory and devices.
- Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit VDSO.
- Freescale updates from Scott:
"Contains fixes for CPM GPIO and an FSL PCI erratum workaround, plus a
minor cleanup patch."
As well as quite a lot of other changes all over the place, and small fixes and
cleanups as always.
Thanks to:
Alan Modra, Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andreas
Schwab, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman
Khandual, Anton Blanchard, Arnd Bergmann, Balbir Singh, Benjamin
Herrenschmidt, Bhaktipriya Shridhar, Bryant G. Ly, Cédric Le Goater,
Christophe Leroy, Christophe Lombard, Cyril Bur, David Gibson, Desnes A. Nunes
do Rosario, Dmitry Torokhov, Frederic Barrat, Geert Uytterhoeven, Guilherme G.
Piccoli, Gustavo A. R. Silva, Gustavo Romero, Ivan Mikhaylov, Joakim
Tjernlund, Joe Perches, Josh Poimboeuf, Juan J. Alvarez, Julia Cartwright,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre,
Michael Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud, Ram Pai,
Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee, Simon Guo, Stewart
Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann, Vaibhav Jain, Vasyl
Gomonovych.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJadF6wExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYA2
nBAAnguCEyAIYpc+ffE3WU9xJEWxa6bKuVufHcUFVntGiGD+igmMS+SHp4ay3Aos
HcA4WFrpzNb2KZ++kmFWtAKWnMfCiW9xuYJNicjr7X5ZiVBEhLWN/mQCwBKs3p6L
5+HhvytcdkKVbEcyVjEGvRL40AyxXNOI02o6Co9X8vanHsmWB4q0eWe4PHstZqlg
6K6kazMp+NTvEFYwKNXDOvuHouKSL57l14SLROH7CpJkNTOQ9s+W59/LmnuCjRlu
o70b7iWOAEbF9tvMma1ksDZVNj7mSyaymLYCyOXu4CkuuleJacZYJ9oQGNddoIbC
wk7l93vPT/yze7DYg8x3uXpKcaDEvEepPuQ/ubz+UXFQWuJtl5ej6Cv+0eOmyZIs
+bjWhGHKdNttnsiPlTRCX/gWD13RE1dB6xXJlfOJ7Oz9OnXXK8ZKc1NTREbQXRWM
8tClAwf9upWpm86GHPVnyrgYbgZo5b1os4SoS8e3kESzakrQVQP7J376u2DtccRq
2AGqjJ+tl5tYPnhm8zG1cNrpqHHpgkNGqLS7DvWRg3EPmEKVQcltN1b/0aKaAjHA
aTRofjrVo+jJ4MX1uyEo59yNCEQPfjkmHRQdLwm+xjWTzEPfIMzpWyXm14tawDQf
OjcAe90W/qQ18brw4z+2BI14J76XziOSX/QcunOn1u/sqaM=
=3rYn
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Enable support for memory protection keys aka "pkeys" on Power7/8/9
when using the hash table MMU.
- Extend our interrupt soft masking to support masking PMU interrupts
as well as "normal" interrupts, and then use that to implement
local_t for a ~4x speedup vs the current atomics-based
implementation.
- A new driver "ocxl" for "Open Coherent Accelerator Processor
Interface (OpenCAPI)" devices.
- Support for new device tree properties on PowerVM to describe
hotpluggable memory and devices.
- Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit
VDSO.
- Freescale updates from Scott: fixes for CPM GPIO and an FSL PCI
erratum workaround, plus a minor cleanup patch.
As well as quite a lot of other changes all over the place, and small
fixes and cleanups as always.
Thanks to: Alan Modra, Alastair D'Silva, Alexey Kardashevskiy,
Alistair Popple, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V,
Anju T Sudhakar, Anshuman Khandual, Anton Blanchard, Arnd Bergmann,
Balbir Singh, Benjamin Herrenschmidt, Bhaktipriya Shridhar, Bryant G.
Ly, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Cyril Bur,
David Gibson, Desnes A. Nunes do Rosario, Dmitry Torokhov, Frederic
Barrat, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo A. R. Silva,
Gustavo Romero, Ivan Mikhaylov, Joakim Tjernlund, Joe Perches, Josh
Poimboeuf, Juan J. Alvarez, Julia Cartwright, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael
Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud,
Ram Pai, Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee,
Simon Guo, Stewart Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann,
Vaibhav Jain, Vasyl Gomonovych"
* tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (199 commits)
powerpc/mm/radix: Fix build error when RADIX_MMU=n
macintosh/ams-input: Use true and false for boolean values
macintosh: change some data types from int to bool
powerpc/watchdog: Print the NIP in soft_nmi_interrupt()
powerpc/watchdog: regs can't be null in soft_nmi_interrupt()
powerpc/watchdog: Tweak watchdog printks
powerpc/cell: Remove axonram driver
rtc-opal: Fix handling of firmware error codes, prevent busy loops
powerpc/mpc52xx_gpt: make use of raw_spinlock variants
macintosh/adb: Properly mark continued kernel messages
powerpc/pseries: Fix cpu hotplug crash with memoryless nodes
powerpc/numa: Ensure nodes initialized for hotplug
powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
powerpc/kernel: Block interrupts when updating TIDR
powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
powerpc/mm/nohash: do not flush the entire mm when range is a single page
powerpc/pseries: Add Initialization of VF Bars
powerpc/pseries/pci: Associate PEs to VFs in configure SR-IOV
powerpc/eeh: Add EEH notify resume sysfs
powerpc/eeh: Add EEH operations to notify resume
...
Pull printk updates from Petr Mladek:
- Add a console_msg_format command line option:
The value "default" keeps the old "[time stamp] text\n" format. The
value "syslog" allows to see the syslog-like "<log
level>[timestamp] text" format.
This feature was requested by people doing regression tests, for
example, 0day robot. They want to have both filtered and full logs
at hands.
- Reduce the risk of softlockup:
Pass the console owner in a busy loop.
This is a new approach to the old problem. It was first proposed by
Steven Rostedt on Kernel Summit 2017. It marks a context in which
the console_lock owner calls console drivers and could not sleep.
On the other side, printk() callers could detect this state and use
a busy wait instead of a simple console_trylock(). Finally, the
console_lock owner checks if there is a busy waiter at the end of
the special context and eventually passes the console_lock to the
waiter.
The hand-off works surprisingly well and helps in many situations.
Well, there is still a possibility of the softlockup, for example,
when the flood of messages stops and the last owner still has too
much to flush.
There is increasing number of people having problems with
printk-related softlockups. We might eventually need to get better
solution. Anyway, this looks like a good start and promising
direction.
- Do not allow to schedule in console_unlock() called from printk():
This reverts an older controversial commit. The reschedule helped
to avoid softlockups. But it also slowed down the console output.
This patch is obsoleted by the new console waiter logic described
above. In fact, the reschedule made the hand-off less effective.
- Deprecate "%pf" and "%pF" format specifier:
It was needed on ia64, ppc64 and parisc64 to dereference function
descriptors and show the real function address. It is done
transparently by "%ps" and "pS" format specifier now.
Sergey Senozhatsky found that all the function descriptors were in
a special elf section and could be easily detected.
- Remove printk_symbol() API:
It has been obsoleted by "%pS" format specifier, and this change
helped to remove few continuous lines and a less intuitive old API.
- Remove redundant memsets:
Sergey removed unnecessary memset when processing printk.devkmsg
command line option.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits)
printk: drop redundant devkmsg_log_str memsets
printk: Never set console_may_schedule in console_trylock()
printk: Hide console waiter logic into helpers
printk: Add console owner and waiter logic to load balance console writes
kallsyms: remove print_symbol() function
checkpatch: add pF/pf deprecation warning
symbol lookup: introduce dereference_symbol_descriptor()
parisc64: Add .opd based function descriptor dereference
powerpc64: Add .opd based function descriptor dereference
ia64: Add .opd based function descriptor dereference
sections: split dereference_function_descriptor()
openrisc: Fix conflicting types for _exext and _stext
lib: do not use print_symbol()
irq debug: do not use print_symbol()
sysfs: do not use print_symbol()
drivers: do not use print_symbol()
x86: do not use print_symbol()
unicore32: do not use print_symbol()
sh: do not use print_symbol()
mn10300: do not use print_symbol()
...
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
without requiring the complex thread synchronization that earlier
CPU versions required.
- A series from Ben Herrenschmidt to improve the handling of
escalation interrupts with the XIVE interrupt controller.
- Provide for the decrementer register to be copied across on
migration.
- Various minor cleanups and bugfixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJaYXViAAoJEJ2a6ncsY3GfDhgIAIDVBZH/Ftq7eJiUSxDpqyCQ
DF/x7fNKzK/J33pu+3ntOI2gZsldExAy7vH2M27I4qLIkbI5y3vu4v8l3CDlS1LK
9dKi72zg7baozoVF5mGUNm0B1sSvZiIQlC/kaami2aPTF1GcrJ561GthzfZwxENX
TSLqOA4LkeUZh2tUsvbcUrPi6v+E4Em2lgacQcx2ioMblWz56sZu79VsUbSSw/a3
P8+pIv7EbHw+TrOZMehjCbZkOdBeZ3IRLJsdlIAfe7y4vWME/5b9uVnQS/+XQj/B
6f3rQrduGvF2P6GMjsm8gDkgE5oZ1zbKlgO4i5WApnu80MMLFlfEUN+GWuGJ95Q=
=OjGs
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
PPC KVM update for 4.16
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
without requiring the complex thread synchronization that earlier
CPU versions required.
- A series from Ben Herrenschmidt to improve the handling of
escalation interrupts with the XIVE interrupt controller.
- Provide for the decrementer register to be copied across on
migration.
- Various minor cleanups and bugfixes.
Pull livepatching updates from Jiri Kosina:
- handle 'infinitely'-long sleeping tasks, from Miroslav Benes
- remove 'immediate' feature, as it turns out it doesn't provide the
originally expected semantics, and brings more issues than value
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add locking to force and signal functions
livepatch: Remove immediate feature
livepatch: force transition to finish
livepatch: send a fake signal to all blocking tasks
This pull requests contains a consolidation of the generic no-IOMMU code,
a well as the glue code for swiotlb. All the code is based on the x86
implementation with hooks to allow all architectures that aren't cache
coherent to use it. The x86 conversion itself has been deferred because
the x86 maintainers were a little busy in the last months.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlpxcVoLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYN/Lw/+Je9teM4NPQ8lU/ncbJN/bUzCFGJ6dFt2eVX/6xs3
sfl8vBdeHt6CBM02rRNecEr31z3+orjQes5JnlEJFYeG3jumV0zCPw/zbxqjzbJ1
3n6cckLxbxzy8Ca1G/BVjHLAUX5eWp1ujn/Q4d03VKVQZhJvFYlqDbP3TrNVx7xn
k86u37p/o+ngjwX66UdZ3C4iIBF8zqy6n2kkpv4HUQtHHzPwEvliN39eNilovb56
iGOzjDX1UWHAu4xCTVnPHSG4fA4XU41NWzIN3DIVPE25lYSISSl9TFAdR8GeZA0G
0Yj6sW53pRSoUwco1ocoS44/FgrPOB5/vHIL06pABvicXBiomje1QylqcK7zAczk
esjkfPEZrmZuu99GtqFyDNKEvKKdy+aBGaTZ3y+NxsuBs+0xS2Owz1IE4Tk28xaw
xh7zn+CVdk2fJh6ZIdw5Eu9b9VN08UriqDmDzO/ylDlcNGcDi7wcxiSTEkHJ1ON/
g9nletV6f3egL0wljDcOnhCJCHTvmWEeq3z8lE55QzPzSH0hHpnGQ2WD0tKrroxz
kjOZp0TdXa4F5iysOHe2xl2sftOH0zIkBQJ+oBcK12mTaLu21+yeuCggQXJ/CBdk
1Ol7l9g9T0TDuZPfiTHt5+6jmECQs92LElWA8x7uF7Fpix3BpnafWaaSMSsosF3F
D1Y=
=Nrl9
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.16' of git://git.infradead.org/users/hch/dma-mapping
Pull dma mapping updates from Christoph Hellwig:
"Except for a runtime warning fix from Christian this is all about
consolidation of the generic no-IOMMU code, a well as the glue code
for swiotlb.
All the code is based on the x86 implementation with hooks to allow
all architectures that aren't cache coherent to use it.
The x86 conversion itself has been deferred because the x86
maintainers were a little busy in the last months"
* tag 'dma-mapping-4.16' of git://git.infradead.org/users/hch/dma-mapping: (57 commits)
MAINTAINERS: add the iommu list for swiotlb and xen-swiotlb
arm64: use swiotlb_alloc and swiotlb_free
arm64: replace ZONE_DMA with ZONE_DMA32
mips: use swiotlb_{alloc,free}
mips/netlogic: remove swiotlb support
tile: use generic swiotlb_ops
tile: replace ZONE_DMA with ZONE_DMA32
unicore32: use generic swiotlb_ops
ia64: remove an ifdef around the content of pci-dma.c
ia64: clean up swiotlb support
ia64: use generic swiotlb_ops
ia64: replace ZONE_DMA with ZONE_DMA32
swiotlb: remove various exports
swiotlb: refactor coherent buffer allocation
swiotlb: refactor coherent buffer freeing
swiotlb: wire up ->dma_supported in swiotlb_dma_ops
swiotlb: add common swiotlb_map_ops
swiotlb: rename swiotlb_free to swiotlb_exit
x86: rename swiotlb_dma_ops
powerpc: rename swiotlb_dma_ops
...
* pci/misc:
PCI: Add dummy pci_irqd_intx_xlate() for CONFIG_PCI=n build
PCI: Add wrappers for dev_printk()
PCI: Remove unnecessary messages for memory allocation failures
PCI: Add #defines for Completion Timeout Disable feature
hinic: Replace PCI pool old API
net: e100: Replace PCI pool old API
block: DAC960: Replace PCI pool old API
MAINTAINERS: Include more PCI files
PCI: Remove unneeded kallsyms include
powerpc/pci: Unroll two pass loop when scanning bridges
powerpc/pci: Use for_each_pci_bridge() helper
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
Pull siginfo cleanups from Eric Biederman:
"Long ago when 2.4 was just a testing release copy_siginfo_to_user was
made to copy individual fields to userspace, possibly for efficiency
and to ensure initialized values were not copied to userspace.
Unfortunately the design was complex, it's assumptions unstated, and
humans are fallible and so while it worked much of the time that
design failed to ensure unitialized memory is not copied to userspace.
This set of changes is part of a new design to clean up siginfo and
simplify things, and hopefully make the siginfo handling robust enough
that a simple inspection of the code can be made to ensure we don't
copy any unitializied fields to userspace.
The design is to unify struct siginfo and struct compat_siginfo into a
single definition that is shared between all architectures so that
anyone adding to the set of information shared with struct siginfo can
see the whole picture. Hopefully ensuring all future si_code
assignments are arch independent.
The design is to unify copy_siginfo_to_user32 and
copy_siginfo_from_user32 so that those function are complete and cope
with all of the different cases documented in signinfo_layout. I don't
think there was a single implementation of either of those functions
that was complete and correct before my changes unified them.
The design is to introduce a series of helpers including
force_siginfo_fault that take the values that are needed in struct
siginfo and build the siginfo structure for their callers. Ensuring
struct siginfo is built correctly.
The remaining work for 4.17 (unless someone thinks it is post -rc1
material) is to push usage of those helpers down into the
architectures so that architecture specific code will not need to deal
with the fiddly work of intializing struct siginfo, and then when
struct siginfo is guaranteed to be fully initialized change copy
siginfo_to_user into a simple wrapper around copy_to_user.
Further there is work in progress on the issues that have been
documented requires arch specific knowledge to sort out.
The changes below fix or at least document all of the issues that have
been found with siginfo generation. Then proceed to unify struct
siginfo the 32 bit helpers that copy siginfo to and from userspace,
and generally clean up anything that is not arch specific with regards
to siginfo generation.
It is a lot but with the unification you can of siginfo you can
already see the code reduction in the kernel"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (45 commits)
signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr
mm/memory_failure: Remove unused trapno from memory_failure
signal/ptrace: Add force_sig_ptrace_errno_trap and use it where needed
signal/powerpc: Remove unnecessary signal_code parameter of do_send_trap
signal: Helpers for faults with specialized siginfo layouts
signal: Add send_sig_fault and force_sig_fault
signal: Replace memset(info,...) with clear_siginfo for clarity
signal: Don't use structure initializers for struct siginfo
signal/arm64: Better isolate the COMPAT_TASK portion of ptrace_hbptriggered
ptrace: Use copy_siginfo in setsiginfo and getsiginfo
signal: Unify and correct copy_siginfo_to_user32
signal: Remove the code to clear siginfo before calling copy_siginfo_from_user32
signal: Unify and correct copy_siginfo_from_user32
signal/blackfin: Remove pointless UID16_SIGINFO_COMPAT_NEEDED
signal/blackfin: Move the blackfin specific si_codes to asm-generic/siginfo.h
signal/tile: Move the tile specific si_codes to asm-generic/siginfo.h
signal/frv: Move the frv specific si_codes to asm-generic/siginfo.h
signal/ia64: Move the ia64 specific si_codes to asm-generic/siginfo.h
signal/powerpc: Remove redefinition of NSIGTRAP on powerpc
signal: Move addr_lsb into the _sigfault union for clarity
...
When a CPU detects its locked up via soft_nmi_interrupt() we have
pt_regs, so print the regs->nip, which points to where we took the
soft-NMI.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
soft_nmi_interrupt() is called directly from the asm exception
handling code, which passes regs as a pointer to the stack. So regs
can't be NULL, it may be full of junk, but that's a separate problem.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use pr_fmt() in the watchdog code, so we don't have to say "Watchdog"
so many times.
Rather than "CPU:%d" just spell it "CPU %d", "Hard" doesn't need a
capital in the middle of a sentence, and "LOCKUP other CPUS" should be
"LOCKUP on other CPUS".
Also make it clear when a CPU self detects a lockup by spelling it
out.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
clear_thread_tidr() is called in interrupt context as a part of delayed
put of the task structure (i.e as a part of timer interrupt). To prevent
a deadlock, block interrupts when holding vas_thread_id_lock to set/
clear TIDR for a task.
Fixes: ec233ede4c ("powerpc: Add support for setting SPRN_TIDR")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When enabling SR-IOV in pseries platform, the VF bar properties for a
PF are reported on the device node in the device tree.
This patch adds the IOV Bar resources to Linux structures from the
device tree for later use when configuring SR-IOV by PF driver.
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Introduce a method for notify resume to be called from sysfs. In this
patch one can now call notify resume from sysfs when is supported by
platform.
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
[mpe: Add NULL check, add empty versions to avoid #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Devices can go offline when erors reported. This patch adds a change
to the kernel object and lets udev know of error. When device resumes,
a change is also set reporting device as online. Therefore, EEH and
AER events are better propagated to user space for PCI devices in all
arches.
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add EEH platform operations for pseries to update VF config space.
With this change after EEH, the VF will have updated config space for
pseries platform.
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The fallback RFI flush is used when firmware does not provide a way
to flush the cache. It's a "displacement flush" that evicts useful
data by displacing it with an uninteresting buffer.
The flush has to take care to work with implementation specific cache
replacment policies, so the recipe has been in flux. The initial
slow but conservative approach is to touch all lines of a congruence
class, with dependencies between each load. It has since been
determined that a linear pattern of loads without dependencies is
sufficient, and is significantly faster.
Measuring the speed of a null syscall with RFI fallback flush enabled
gives the relative improvement:
P8 - 1.83x
P9 - 1.75x
The flush also becomes simpler and more adaptable to different cache
geometries.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are so many places that build struct siginfo by hand that at
least one of them is bound to get it wrong. A handful of cases in the
kernel arguably did just that when using the errno field of siginfo to
pass no errno values to userspace. The usage is limited to a single
si_code so at least does not mess up anything else.
Encapsulate this questionable pattern in a helper function so
that the userspace ABI is preserved.
Update all of the places that use this pattern to use the new helper
function.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Platforms with a panic handler that halts the system can have problems
getting kernel messages out, because the panic notifiers are called
before kernel/panic.c does its flushing of printk buffers an console
etc.
This was attempted to be solved with commit a3b2cb30f2 ("powerpc: Do
not call ppc_md.panic in fadump panic notifier"), but that wasn't the
right approach and caused other problems, and was reverted by commit
ab9dbf771f.
Instead, the powernv shutdown paths have already had a similar
problem, fixed by taking the message flushing sequence from
kernel/panic.c. That's a little bit ugly, but while we have the code
duplicated, it will work for this case as well. So have ppc panic
handlers do the same flushing before they terminate.
Without this patch, a qemu pseries_le_defconfig guest stops silently
when issued the nmi command when xmon is off and no crash dumpers
enabled. Afterwards, an oops is printed by each CPU as expected.
Fixes: ab9dbf771f ("Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently it's possible that a thread on PPC64 LE has its endianness
flipped inadvertently to Big-Endian resulting in a crash once the process
is back from the signal handler.
If giveup_all() is called when regs->msr has the bits MSR.FP and MSR.VEC
disabled (and hence MSR.VSX disabled too) it returns without calling
check_if_tm_restore_required() which copies regs->msr to ckpt_regs->msr if
the process caught a signal whilst in transactional mode. Then once in
setup_tm_sigcontexts() MSR from ckpt_regs.msr is used, but since
check_if_tm_restore_required() was not called previuosly, gp_regs[PT_MSR]
gets a copy of invalid MSR bits as MSR in ckpt_regs was not updated from
regs->msr and so is zeroed. Later when leaving the signal handler once in
sys_rt_sigreturn() the TS bits of gp_regs[PT_MSR] are checked to determine
if restore_tm_sigcontexts() must be called to pull in the correct MSR state
into the user context. Because TS bits are zeroed
restore_tm_sigcontexts() is never called and MSR restored from the user
context on returning from the signal handler has the MSR.LE (the endianness
bit) forced to zero (Big-Endian). That leads, for instance, to 'nop' being
treated as an illegal instruction in the following sequence:
tbegin.
beq 1f
trap
tend.
1: nop
on PPC64 LE machines and the process dies just after returning from the
signal handler.
PPC64 BE is also affected but in a subtle way since forcing Big-Endian on
a BE machine does not change the endianness.
This commit fixes the issue described above by ensuring that once in
setup_tm_sigcontexts() the MSR used is from regs->msr instead of from
ckpt_regs->msr and by ensuring that we pull in only the MSR.FP, MSR.VEC,
and MSR.VSX bits from ckpt_regs->msr.
The fix was tested both on LE and BE machines and no regression regarding
the powerpc/tm selftests was observed.
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The thread switch control register (TSCR) is a per core register
that configures how the CPU shares resources between SMT threads.
Exposing it via sysfs allows us to tune it at run time.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Symbolic macros are unintuitive and hard to read, whereas octal constants
are much easier to interpret. Replace macros for the basic permission
flags (user/group/other read/write/execute) with numeric constants
instead, across the whole powerpc tree.
Introducing a significant number of changes across the tree for no runtime
benefit isn't exactly desirable, but so long as these macros are still
used in the tree people will keep sending patches that add them. Not only
are they hard to parse at a glance, there are multiple ways of coming to
the same value (as you can see with 0444 and 0644 in this patch) which
hurts readability.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge our fixes branch from the 4.15 cycle.
Unusually the fixes branch saw some significant features merged,
notably the RFI flush patches, so we want the code in next to be
tested against that, to avoid any surprises when the two are merged.
There's also some other work on the panic handling that was reverted
in fixes and we now want to do properly in next, which would conflict.
And we also fix a few other minor merge conflicts.
Merge the topic branch we share with kvm-ppc, this brings in two xive
commits, one from Paul to rework HMI handling, and a minor cleanup to
drop an unused flag.
To: linuxppc-dev@lists.ozlabs.org
From: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: nfont@linux.vnet.ibm.com
Subject: [PATCH V6 4/4] powerpc: Enable support for ibm,drc-info devtree property
prom_init.c: Enable support for new DRC device tree property
"ibm,drc-info" in initial handshake between the Linux kernel and
the front end processor.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The overview comments in the powerpc watchdog are out of date after
several iterations and changes of the code. Bring them up to date.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The AMR/IAMR/UAMOR are part of the program context.
Allow it to be accessed via ptrace and through core files.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The value of the pkey, whose protection got violated,
is made available in si_pkey field of the siginfo structure.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Handle Data and Instruction exceptions caused by memory
protection-key.
The CPU will detect the key fault if the HPTE is already
programmed with the key.
However if the HPTE is not hashed, a key fault will not
be detected by the hardware. The software will detect
pkey violation in such a case.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Store and restore the AMR, IAMR and UAMOR register state of the task
before scheduling out and after scheduling in, respectively.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The POWER9 core supports a new feature: ASB_Notify which requires the
support of the Special Purpose Register: TIDR.
The ASB_Notify command, generated by the AFU, will attempt to
wake-up the host thread identified by the particular LPID:PID:TID.
This patch assign a unique TIDR (thread id) for the current thread which
will be used in the process element entry.
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
New Kconfig is added "CONFIG_PPC_IRQ_SOFT_MASK_DEBUG" to add WARN_ON
to alert the invalid transitions. Also moved the code under the
CONFIG_TRACE_IRQFLAGS in arch_local_irq_restore() to new Kconfig.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Fix name of CONFIG option in change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Two new bit mask field "IRQ_DISABLE_MASK_PMU" is introduced to support
the masking of PMI and "IRQ_DISABLE_MASK_ALL" to aid interrupt masking
checking.
Couple of new irq #defs "PACA_IRQ_PMI" and "SOFTEN_VALUE_0xf0*" added
to use in the exception code to check for PMI interrupts.
In the masked_interrupt handler, for PMIs we reset the MSR[EE] and
return. In the __check_irq_replay(), replay the PMI interrupt by
calling performance_monitor_common handler.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To support addition of "bitmask" to MASKABLE_* macros, factor out the
EXCPETION_PROLOG_1 macro.
Make it explicit the interrupt masking supported by a gievn interrupt
handler. Patch correspondingly extends the MASKABLE_* macros with an
addition's parameter. "bitmask" parameter is passed to SOFTEN_TEST
macro to decide on masking the interrupt.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
"paca->soft_enabled" is used as a flag to mask some of interrupts.
Currently supported flags values and their details:
soft_enabled MSR[EE]
0 0 Disabled (PMI and HMI not masked)
1 1 Enabled
"paca->soft_enabled" is initialized to 1 to make the interripts as
enabled. arch_local_irq_disable() will toggle the value when
interrupts needs to disbled. At this point, the interrupts are not
actually disabled, instead, interrupt vector has code to check for the
flag and mask it when it occurs. By "mask it", it update interrupt
paca->irq_happened and return. arch_local_irq_restore() is called to
re-enable interrupts, which checks and replays interrupts if any
occured.
Now, as mentioned, current logic doesnot mask "performance monitoring
interrupts" and PMIs are implemented as NMI. But this patchset depends
on local_irq_* for a successful local_* update. Meaning, mask all
possible interrupts during local_* update and replay them after the
update.
So the idea here is to reserve the "paca->soft_enabled" logic. New
values and details:
soft_enabled MSR[EE]
1 0 Disabled (PMI and HMI not masked)
0 1 Enabled
Reason for the this change is to create foundation for a third mask
value "0x2" for "soft_enabled" to add support to mask PMIs. When
->soft_enabled is set to a value "3", PMI interrupts are mask and when
set to a value of "1", PMI are not mask. With this patch also extends
soft_enabled as interrupt disable mask.
Current flags are renamed from IRQ_[EN?DIS}ABLED to
IRQS_ENABLED and IRQS_DISABLED.
Patch also fixes the ptrace call to force the user to see the softe
value to be alway 1. Reason being, even though userspace has no
business knowing about softe, it is part of pt_regs. Like-wise in
signal context.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a new wrapper function, soft_enabled_return(), added to return
paca->soft_enabled value.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move set_soft_enabled() from powerpc/kernel/irq.c to asm/hw_irq.c, to
encourage updates to paca->soft_enabled done via these access
function. Add "memory" clobber to hint compiler since
paca->soft_enabled memory is the target here.
Renaming it as soft_enabled_set() will make namespaces works better as
prefix than a postfix when new soft_enabled manipulation functions are
introduced.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Two #defines IRQS_ENABLED and IRQS_DISABLED are added to be used when
updating paca->soft_enabled. Replace the hardcoded values used when
updating paca->soft_enabled with IRQ_(EN|DIS)ABLED #define. No logic
change.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have always had softe in pt_regs, and accessible via PT_SOFTE, even
though it is not userspace state.
The value userspace sees should always be 1, because we should never
be in userspace with interrupts soft disabled.
In a subsequent patch we will be changing the semantics of the kernel
softe value, so hard wire the value to 1 to retain the existing
semantics. As far as we know nothing ever looks at it, but better safe
than sorry.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Split out of larger patch, write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This works on top of the single escalation support. When in single
escalation, with this change, we will keep the escalation interrupt
disabled unless the VCPU is in H_CEDE (idle). In any other case, we
know the VCPU will be rescheduled and thus there is no need to take
escalation interrupts in the host whenever a guest interrupt fires.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The prodded flag is only cleared at the beginning of H_CEDE,
so every time we have an escalation, we will cause the *next*
H_CEDE to return immediately.
Instead use a dedicated "irq_pending" flag to indicate that
a guest interrupt is pending for the VCPU. We don't reuse the
existing exception bitmap so as to avoid expensive atomic ops.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The powerpc NMI IPIs may not be recoverable if they are taken in
some sections of code, and also there have been and still are issues
with taking NMIs (in KVM guest code, in firmware, etc) which makes them
a bit dangerous to use.
Generic code like softlockup detector and rcu stall detectors really
hammer on trigger_*_backtrace, which has lead to further problems
because we've implemented it with the NMI.
So stop providing NMI backtraces for now. Importantly, the powerpc code
uses NMI IPIs in crash/debug, and the SMP hardlockup watchdog. So if the
softlockup and rcu hang detection traces are not being printed because
the CPU is stuck with interrupts off, then the hard lockup watchdog
should get it with the NMI IPI.
Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Book3S PACA memory allocation is restricted by the RMA limit and also
must not take SLB faults when accessed in virtual mode. Currently a
fixed 256MB limit is used for this, which is imprecise and sub-optimal.
Update the paca allocation limits to use use the ppc64_rma_size for RMA
limit, and share the safe_stack_limit() that is currently used for stack
allocations that must not take virtual mode faults.
The safe_stack_limit() name is changed to ppc64_bolted_size() to match
ppc64_rma_size and some comments are updated. We also need to use
early_mmu_has_feature() because we are now calling this function prior
to the jump label patching that enables mmu_has_feature().
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Change mmu_has_feature() to early_mmu_has_feature()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hypervisor maintenance interrupts (HMIs) are generated by various
causes, signalled by bits in the hypervisor maintenance exception
register (HMER). In most cases calling OPAL to handle the interrupt
is the correct thing to do, but the "debug trigger" HMIs signalled by
PPC bit 17 (bit 46) of HMER are used to invoke software workarounds
for hardware bugs, and OPAL does not have any code to handle this
cause. The debug trigger HMI is used in POWER9 DD2.0 and DD2.1 chips
to work around a hardware bug in executing vector load instructions to
cache inhibited memory. In POWER9 DD2.2 chips, it is generated when
conditions are detected relating to threads being in TM (transactional
memory) suspended mode when the core SMT configuration needs to be
reconfigured.
The kernel currently has code to detect the vector CI load condition,
but only when the HMI occurs in the host, not when it occurs in a
guest. If a HMI occurs in the guest, it is always passed to OPAL, and
then we always re-sync the timebase, because the HMI cause might have
been a timebase error, for which OPAL would re-sync the timebase, thus
removing the timebase offset which KVM applied for the guest. Since
we don't know what OPAL did, we don't know whether to subtract the
timebase offset from the timebase, so instead we re-sync the timebase.
This adds code to determine explicitly what the cause of a debug
trigger HMI will be. This is based on a new device-tree property
under the CPU nodes called ibm,hmi-special-triggers, if it is
present, or otherwise based on the PVR (processor version register).
The handling of debug trigger HMIs is pulled out into a separate
function which can be called from the KVM guest exit code. If this
function handles and clears the HMI, and no other HMI causes remain,
then we skip calling OPAL and we proceed to subtract the guest
timebase offset from the timebase.
The overall handling for HMIs that occur in the host (i.e. not in a
KVM guest) is largely unchanged, except that we now don't set the flag
for the vector CI load workaround on DD2.2 processors.
This also removes a BUG_ON in the KVM code. BUG_ON is generally not
useful in KVM guest entry/exit code since it is difficult to handle
the resulting trap gracefully.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of calling both of_irq_parse_pci() and irq_create_of_mapping(),
call of_irq_parse_and_map_pci(), which does the same thing. This will allow
making of_irq_parse_pci() a private, static function.
This changes the logic slightly in that the fallback path will also be
taken if irq_create_of_mapping() fails internally.
Signed-off-by: Rob Herring <robh@kernel.org>
[bhelgaas: fold in virq init from Stephen Rothwell <sfr@canb.auug.org.au>]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Commit 177ba7c647 ("powerpc/mm/radix: Limit paca allocation in radix")
limited the paca allocation address to 1G on pSeries because RTAS return
accesses the paca in 32-bit mode:
On return from RTAS we access the paca variables and we have 64 bit
disabled. This requires us to limit paca in 32 bit range.
Fix this by setting ppc64_rma_size to first_memblock_size/1G range.
Avoid this limit by switching to 64-bit mode before accessing any memory.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are several cases outside the normal address space management
where a CPU's entire local TLB is to be flushed:
1. Booting the kernel, in case something has left stale entries in
the TLB (e.g., kexec).
2. Machine check, to clean corrupted TLB entries.
One other place where the TLB is flushed, is waking from deep idle
states. The flush is a side-effect of calling ->cpu_restore with the
intention of re-setting various SPRs. The flush itself is unnecessary
because in the first case, the TLB should not acquire new corrupted
TLB entries as part of sleep/wake (though they may be lost).
This type of TLB flush is coded inflexibly, several times for each CPU
type, and they have a number of problems with ISA v3.0B:
- The current radix mode of the MMU is not taken into account, it is
always done as a hash flushn For IS=2 (LPID-matching flush from host)
and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
the R field does not match the current radix mode.
- ISA v3.0B hash must flush the partition and process table caches as
well.
- ISA v3.0B radix must flush partition and process scoped translations,
partition and process table caches, and also the page walk cache.
So consolidate the flushing code and implement it in C and inline asm
under the mm/ directory with the rest of the flush code. Add ISA v3.0B
cases for radix and hash, and use the radix flush in radix environment.
Provide a way for IS=2 (LPID flush) to specify the radix mode of the
partition. Have KVM pass in the radix mode of the guest.
Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
and move it later in the boot process after, the MMU registers are set
up and before relocation is first turned on.
The TLB flush is no longer called when restoring from deep idle states.
This was not be done as a separate step because booting secondaries
uses the same cpu_restore as idle restore, which needs the TLB flush.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The die() oops path contains a serializing lock to prevent oops
messages from being interleaved. In the case of a system reset
initiated oops (e.g., qemu nmi command), __die was being called
which lacks that synchronisation and oops reports could be
interleaved across CPUs.
A recent patch 4388c9b3a6 ("powerpc: Do not send system reset
request through the oops path") changed this to __die to avoid
the debugger() call, but there is no real harm to calling it twice
if the first time fell through. So go back to using die() here.
This was observed to fix the problem.
Fixes: 4388c9b3a6 ("powerpc: Do not send system reset request through the oops path")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Expose the state of the RFI flush (enabled/disabled) via debugfs, and
allow it to be enabled/disabled at runtime.
eg: $ cat /sys/kernel/debug/powerpc/rfi_flush
1
$ echo 0 > /sys/kernel/debug/powerpc/rfi_flush
$ cat /sys/kernel/debug/powerpc/rfi_flush
0
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
The recent commit 87590ce6e3 ("sysfs/cpu: Add vulnerability folder")
added a generic folder and set of files for reporting information on
CPU vulnerabilities. One of those was for meltdown:
/sys/devices/system/cpu/vulnerabilities/meltdown
This commit wires up that file for 64-bit Book3S powerpc.
For now we default to "Vulnerable" unless the RFI flush is enabled.
That may not actually be true on all hardware, further patches will
refine the reporting based on the CPU/platform etc. But for now we
default to being pessimists.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Trap numbers can have extra bits at the bottom that need to
be filtered out. There are a few cases where we don't do that.
It's possible that we got lucky but better safe than sorry.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The only difference between EXC_COMMON_HV and EXC_COMMON is that the
former adds "2" to the trap number which is supposed to represent the
fact that this is an "HV" interrupt which uses HSRR0/1.
However KVM is the only one who cares and it has its own separate macros.
In fact, we only have one user of EXC_COMMON_HV and it's for an
unknown interrupt case. All the other ones already using EXC_COMMON.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We used to not put the newline between the CPU part and the summary
part on UP kernels. This is a rather pointless ifdef so take it out.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When CONFIG_SWAP is set, the TLB miss handlers have to also take
into account _PAGE_ACCESSED flag. At the moment it is done by
anding _PAGE_ACCESSED into _PAGE_PRESENT using 3 instructions.
This patch uses APG for handling _PAGE_ACCESSED, allowing to
just copy _PAGE_ACCESSED bit into APG field, hence reducing the
action to a single instruction.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As Linux kernel separates KERNEL and USER address spaces, there is
therefore no need to flag USER access at page level.
Today, the 8xx TLB handlers already handle user access in the L1 entry
through Access Protection Groups, it is then natural to move the user
access handling at PMD level once _PAGE_NA allows to handle PAGE_NONE
protection without _PAGE_USER
In the mean time, as we free up one bit in the PTE, we can use it to
include SPS (page size flag) in the PTE and avoid handling it at every
TLB miss hence removing special handling based on compiled page size.
For _PAGE_EXEC, we rework it to use PP PTE bits, avoiding the copy
of _PAGE_EXEC bit into the L1 entry. Unfortunatly we are not
able to put it at the correct location as it conflicts with
NA/RO/RW bits for data entries.
Upper bits of APG in L1 entry overlap with PMD base address. In
order to avoid having to filter that out, we set up all groups so that
upper bits can have any value.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
commit ac29c64089 ("powerpc/mm: Replace _PAGE_USER with
_PAGE_PRIVILEGED") introduced _PAGE_PRIVILEGED for BOOK3S/64
This patch generalises _PAGE_PRIVILEGED for all CPUs, allowing
to have either _PAGE_PRIVILEGED or _PAGE_USER or both.
PPC_8xx has a _PAGE_SHARED flag which is set for and only for
all non user pages. Lets rename it _PAGE_PRIVILEGED to remove
confusion as it has nothing to do with Linux shared pages.
On BookE, there's a _PAGE_BAP_SR which has to be set for kernel
pages: defining _PAGE_PRIVILEGED as _PAGE_BAP_SR will make
this generic
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_PAGE_WRITETHRU is only used in:
* AMIGA_Z2RAM block driver which is never activated on powerPC
* Video/FB driver which is for PPC_PMAC
Therefore, no need to spend time in 8xx TLB miss handlers for
handling it.
And by removing it, we free up bit 20 which then avoids having
to clear it on each TLB miss.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In TLB miss handlers, updating the perf counter is only useful
when performing a perf analysis. As it has a noticeable overhead,
let's only do it when needed.
In order to do so, the exit of the miss handlers will be patched
when starting/stopping 'perf': the first register restore
instruction of each exit point will be replaced by a jump to
the counting code.
Once this is done, CONFIG_PPC_8xx_PERF_EVENT becomes useless as
this feature doesn't add any overhead.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXCEPTION_PROLOG_0 and EXCEPTION_EPILOG_0 were added some
time ago in order to regroup the two mtspr/mfspr to SCRATCH0 and
SCRATCH1 and the mfcr/mtcr in order to ease entry and exit of
function not using the full EXCEPTION_PROLOG.
Since then, the mfcr/mtcr has been taken out, hence just leaving
the two mtspr/mfspr in the macro.
In order to improve readability of the exception functions, we
remove those two macros and copy back the two mtspr/mfspr instead.
As r10 and r11 are used for SCRATCH0 and SCRATCH1, lets also use
r12 for SCRATCH2. It will also improve the readability/maintenance.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CPU6 ERRATA affects only MPC860 revisions prior to C.0. Manufacturing
of those revisiosn was stopped in 1999-2000.
Therefore, it has been almost 20 years since this ERRATA has been
fixed in the silicon.
This patch removes the workaround for that ERRATA.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Certain HMI's such as malfunction error propagate through
all threads/core on the system. If a thread was offline
prior to us crashing the system and jumping to the kdump
kernel, bad things happen when it wakes up due to an HMI
in the kdump kernel.
There are several possible ways to solve this problem
1. Put the offline cores in a state such that they are
not woken up for machine check and HMI errors. This
does not work, since we might need to wake up offline
threads to handle TB errors
2. Ignore HMI errors, setup HMEER to mask HMI errors,
but this still leads the window open for any MCEs
and masking them for the duration of the dump might
be a concern
3. Wake up offline CPUs, as in send them to
crash_ipi_callback (not wake them up as in mark them
online as seen by the hotplug). kexec does a
wake_online_cpus() call, this patch does something
similar, but instead sends an IPI and forces them to
crash_ipi_callback()
This patch takes approach #3.
Care is taken to enable this only for powenv platforms
via crash_wake_offline (a global value set at setup
time). The crash code sends out IPI's to all CPU's
which then move to crash_ipi_callback and kexec_smp_wait().
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Our check was extra cautious, we've audited crash_send_ipi
and it sends an IPI only to online CPU's. Removal of this
check should have not functional impact on crash kdump.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of manually coding the loop with of_find_node_by_type(), let's
switch to the standard macro for iterating over nodes with given type.
Also fixed a couple of refcount leaks in the aforementioned loops.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add required bits to the architecture vector to enable support
of the ibm,dynamic-memory-v2 device tree property.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We currently have code to parse the dynamic reconfiguration LMB
information from the ibm,dynamic-meory device tree property in
multiple locations; numa.c, prom.c, and pseries/hotplug-memory.c.
In anticipation of adding support for a version 2 of the
ibm,dynamic-memory property this patch aims to separate the device
tree information from the device tree format.
Doing this requires a two step process to avoid a possibly very large
bootmem allocation early in boot. During initial boot, new routines
are provided to walk the device tree property and make a call-back
for each LMB.
The second step (introduced in later patches) will allocate an
array of LMB information that can be used directly without needing
to know the DT format.
This approach provides the benefit of consolidating the device tree
property parsing to a single location and (eventually) providing
a common data structure for retrieving LMB information.
This patch introduces a routine to walk the ibm,dynamic-memory
property in the flattened device tree and updates the prom.c code
to use this to initialize memory.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Among the existing architecture specific versions of
copy_siginfo_to_user32 there are several different implementation
problems. Some architectures fail to handle all of the cases in in
the siginfo union. Some architectures perform a blind copy of the
siginfo union when the si_code is negative. A blind copy suggests the
data is expected to be in 32bit siginfo format, which means that
receiving such a signal via signalfd won't work, or that the data is
in 64bit siginfo and the code is copying nonsense to userspace.
Create a single instance of copy_siginfo_to_user32 that all of the
architectures can share, and teach it to handle all of the cases in
the siginfo union correctly, with the assumption that siginfo is
stored internally to the kernel is 64bit siginfo format.
A special case is made for x86 x32 format. This is needed as presence
of both x32 and ia32 on x86_64 results in two different 32bit signal
formats. By allowing this small special case there winds up being
exactly one code base that needs to be maintained between all of the
architectures. Vastly increasing the testing base and the chances of
finding bugs.
As the x86 copy of copy_siginfo_to_user32 the call of the x86
signal_compat_build_tests were moved into sigaction_compat_abi, so
that they will keep running.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function copy_siginfo_from_user32 is used for two things, in ptrace
since the dawn of siginfo for arbirarily modifying a signal that
user space sees, and in sigqueueinfo to send a signal with arbirary
siginfo data.
Create a single copy of copy_siginfo_from_user32 that all architectures
share, and teach it to handle all of the cases in the siginfo union.
In the generic version of copy_siginfo_from_user32 ensure that all
of the fields in siginfo are initialized so that the siginfo structure
can be safely copied to userspace if necessary.
When copying the embedded sigval union copy the si_int member. That
ensures the 32bit values passes through the kernel unchanged.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We'll need that name for a generic implementation soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Lift the code from x86 so that we behave consistently. In the future we
should probably warn if any of these is set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER. Posix and common sense requires
that SI_USER not be a signal specific si_code. As such this use of 0
for the si_code is a pretty horribly broken ABI.
Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL with the result
that uid and pid fields are copied and which might copying the si_addr
field by accident but certainly not by design. Making this a very
flakey implementation.
Utilizing FPE_FIXME and TRAP_FIXME, siginfo_layout() will now return
SIL_FAULT and the appropriate fields will be reliably copied.
Possible ABI fixes includee:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kumar Gala <kumar.gala@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Ref: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Ref: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.
Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().
Use pci_get_domain_bus_and_slot() with a domain number of 0 as the code
is not ready to consume multiple domains and existing code used domain
number 0.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
We want to use the dma_direct_ namespace for a generic implementation,
so rename powerpc to the second best choice: dma_nommu_.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This causes warnings from cpufreq mutex code. This is also rather
unnecessary and ineffective. If we really want to prevent concurrent
unplug, we could take the unplug read lock but I don't see this being
critical.
Fixes: cd77b5ce20 ("powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Because there may be some performance overhead of the RFI flush, add
kernel command line options to disable it.
We add a sensibly named 'no_rfi_flush' option, but we also hijack the
x86 option 'nopti'. The RFI flush is not the same as KPTI, but if we
see 'nopti' we can guess that the user is trying to avoid any overhead
of Meltdown mitigations, and it means we don't have to educate every
one about a different command line option.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.
This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.
The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.
In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.
In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.
For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.
In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>