As explained by commit daf00ae71d ("powerpc/traps: restore
recoverability of machine_check interrupts"), die() can't be called from
within nmi_enter() to nicely kill a process context that was interrupted;
nmi_exit() must be called first.
This adds a function die_mce which takes care of this for machine check
handlers.
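Something along these lines (a sketch of the idea only, not necessarily the
exact code merged):

void die_mce(const char *str, struct pt_regs *regs, long err)
{
	/*
	 * Leave the NMI context set up by the interrupt entry before
	 * calling die(), so the interrupted process context can be
	 * killed cleanly.
	 */
	nmi_exit();
	die(str, regs, err);
}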
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-24-npiggin@gmail.com
Make mm fault handlers all just take the pt_regs * argument and load
DAR/DSISR from that. Make those that return a value return long.
This is done to make the function signatures match other handlers, which
will help with a future patch to add wrappers. Explicit arguments could
be added for performance but that would require more wrapper macro
variants.
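For illustration, the handlers end up shaped roughly like this (a sketch;
the real handlers keep their existing bodies):

long do_page_fault(struct pt_regs *regs)
{
	unsigned long address = regs->dar;
	unsigned long error_code = regs->dsisr;

	/* ... existing fault handling, using address/error_code ... */
	return 0;
}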
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-7-npiggin@gmail.com
This fixes the bad fault reported by KUAP when an io_wqe_worker accesses userspace.
Bug: Read fault blocked by KUAP!
WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
..........
Call Trace:
[c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable)
[c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
[c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
--- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
..........
NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
interrupt: 300
[c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
[c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
[c000000016367990] [c000000000710330] iomap_file_buffered_write+0xa0/0x120
[c0000000163679e0] [c00800000040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
[c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
[c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
[c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
[c000000016367cb0] [c0000000006e2578] io_worker_handle_work+0x498/0x800
[c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
[c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
[c000000016367e10] [c00000000000dbf0] ret_from_kernel_thread+0x5c/0x6c
The kernel considers the thread AMR value for a kernel thread to be
AMR_KUAP_BLOCKED, hence access to userspace is denied. This is of
course not correct, and we should allow userspace access after
kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
AMR value of the operating address space. But the AMR value is
thread-specific, and we inherit the address space, not the thread
access restrictions. Because of this, ignore the AMR value when
accessing userspace via a kernel thread.
current_thread_amr/iamr() are updated, because we use them in the
stack trace below.
....
[ 530.710838] CPU: 13 PID: 5587 Comm: io_wqe_worker-0 Tainted: G D 5.11.0-rc6+ #3
....
NIP [c0000000000aa0c8] pkey_access_permitted+0x28/0x90
LR [c0000000004b9278] gup_pte_range+0x188/0x420
--- interrupt: 700
[c00000001c4ef3f0] [0000000000000000] 0x0 (unreliable)
[c00000001c4ef490] [c0000000004bd39c] gup_pgd_range+0x3ac/0xa20
[c00000001c4ef5a0] [c0000000004bdd44] internal_get_user_pages_fast+0x334/0x410
[c00000001c4ef620] [c000000000852028] iov_iter_get_pages+0xf8/0x5c0
[c00000001c4ef6a0] [c0000000007da44c] bio_iov_iter_get_pages+0xec/0x700
[c00000001c4ef770] [c0000000006a325c] iomap_dio_bio_actor+0x2ac/0x4f0
[c00000001c4ef810] [c00000000069cd94] iomap_apply+0x2b4/0x740
[c00000001c4ef920] [c0000000006a38b8] __iomap_dio_rw+0x238/0x5c0
[c00000001c4ef9d0] [c0000000006a3c60] iomap_dio_rw+0x20/0x80
[c00000001c4ef9f0] [c008000001927a30] xfs_file_dio_aio_write+0x1f8/0x650 [xfs]
[c00000001c4efa60] [c0080000019284dc] xfs_file_write_iter+0xc4/0x130 [xfs]
[c00000001c4efa90] [c000000000669984] io_write+0x104/0x4b0
[c00000001c4efbb0] [c00000000066cea4] io_issue_sqe+0x3d4/0xf50
[c00000001c4efc60] [c000000000670200] io_wq_submit_work+0xb0/0x2f0
[c00000001c4efcb0] [c000000000674268] io_worker_handle_work+0x248/0x4a0
[c00000001c4efd30] [c0000000006746e8] io_wqe_worker+0x228/0x2a0
[c00000001c4efda0] [c00000000019d994] kthread+0x1b4/0x1c0
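A hedged sketch of what current_thread_amr() ends up doing; the kernel
thread fallback shown here is illustrative, see the patch for the exact
default used:

static inline u64 current_thread_amr(void)
{
	if (current->thread.regs)
		return current->thread.regs->amr;

	/*
	 * Kernel thread: don't report AMR_KUAP_BLOCKED, which would deny
	 * userspace access taken via kthread_use_mm().
	 */
	return default_amr;	/* illustrative fallback */
}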
Fixes: 48a8ab4eeb ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode.")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210206025634.521979-1-aneesh.kumar@linux.ibm.com
On many powerpc platforms the discovery and initialisation of
pci_controllers (PHBs) happens inside setup_arch(). This is very early
in boot (pre-initcalls) and means that we're initialising the PHB long
before many basic kernel services (slab allocator, debugfs, a real ioremap)
are available.
On PowerNV this causes an additional problem since we map the PHB registers
with ioremap(). As of commit d538aadc27 ("powerpc/ioremap: warn on early
use of ioremap()") a warning is printed because we're using the "incorrect"
API to set up an MMIO mapping in early boot. The kernel does provide
early_ioremap(), but that is not intended to create long-lived MMIO
mappings and a separate warning is printed by generic code if
early_ioremap() mappings are "leaked."
This is all fixable with dumb hacks like using early_ioremap() to setup
the initial mapping then replacing it with a real ioremap later on in
boot, but it does raise the question: Why the hell are we setting up the
PHBs this early in boot?
The old and wise claim it's due to "hysterical raisins." Aside from amused
grapes there doesn't appear to be any real reason to maintain the current
behaviour. Already most of the newer embedded platforms perform PHB
discovery in an arch_initcall and between the end of setup_arch() and the
start of initcalls none of the generic kernel code does anything PCI
related. On powerpc scanning PHBs occurs in a subsys_initcall so it should
be possible to move the PHB discovery to a core, postcore or arch initcall.
This patch adds the ppc_md.discover_phbs hook and a core_initcall stub that
calls it. The core_initcalls are the earliest to be called, so this will
avoid any possible issues with dependencies between initcalls. This isn't
just an academic issue either, since on pseries and PowerNV EEH init occurs
in an arch_initcall and depends on the pci_controllers being available;
similarly the creation of pci_dns occurs at core_initcall_sync (i.e. between
core and postcore initcalls). These problems need to be addressed separately.
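The stub itself is trivial, along these lines (a sketch of what the patch
adds):

static int __init discover_phbs(void)
{
	if (ppc_md.discover_phbs)
		ppc_md.discover_phbs();

	/* PHBs now exist before the subsys_initcall that scans them. */
	return 0;
}
core_initcall(discover_phbs);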
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Make discover_phbs() static]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201103043523.916109-1-oohall@gmail.com
./arch/powerpc/include/asm/paravirt.h:83:44: error: implicit declaration
of function 'smp_processor_id'; did you mean 'raw_smp_processor_id'?
smp_processor_id is defined in linux/smp.h but it is not included.
The build error happens only when the patch is applied to the 5.3 kernel,
but in mainline it only builds by chance.
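The fix is simply to pull in the header that declares it (sketch):

/* arch/powerpc/include/asm/paravirt.h */
#include <linux/smp.h>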
Fixes: ca3f969dcb ("powerpc/paravirt: Use is_kvm_guest() in vcpu_is_preempted()")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210120132838.15589-1-msuchanek@suse.de
Access to per-cpu variables requires translation to be enabled on
pseries machines running in hash MMU mode. Since part of the MCE
handler runs in real mode, and part of the MCE handling code is shared
between the pseries and powernv platforms, it becomes difficult to
manage these variables differently on different platforms. So keep
these variables in the paca instead of having them as per-cpu
variables, to avoid complications.
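For illustration only, the shape of the change is roughly the following;
the struct and field names here are a sketch, not necessarily the exact
layout merged:

/* MCE state formerly kept in per-cpu variables */
struct mce_info {
	int mce_nest_count;
	int mce_queue_count;
	/* ... MCE event buffers ... */
};

/* In struct paca_struct, reachable in real mode: */
	struct mce_info *mce_info;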
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210128104143.70668-2-ganeshgr@linux.ibm.com
soft_nmi_interrupt() usage requires PPC_WATCHDOG to be configured.
Check the CONFIG definition to declare the prototype.
It fixes this W=1 compile error :
../arch/powerpc/kernel/watchdog.c:250:6: error: no previous prototype for ‘soft_nmi_interrupt’ [-Werror=missing-prototypes]
250 | void soft_nmi_interrupt(struct pt_regs *regs)
| ^~~~~~~~~~~~~~~~~~
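The declaration just needs the same guard as the definition, roughly:

#ifdef CONFIG_PPC_WATCHDOG
void soft_nmi_interrupt(struct pt_regs *regs);
#endif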
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-18-clg@kaod.org
It fixes this W=1 compile error :
../arch/powerpc/mm/book3s64/slb.c:380:6: error: no previous prototype for ‘preload_new_slb_context’ [-Werror=missing-prototypes]
380 | void preload_new_slb_context(unsigned long start, unsigned long sp)
| ^~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-15-clg@kaod.org
It fixes this W=1 compile error :
../arch/powerpc/mm/book3s64/hash_utils.c:1867:6: error: no previous prototype for ‘hpte_insert_repeating’ [-Werror=missing-prototypes]
1867 | long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
| ^~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-14-clg@kaod.org
It fixes this W=1 compile error :
../arch/powerpc/mm/book3s64/hash_utils.c:1515:5: error: no previous prototype for ‘__hash_page’ [-Werror=missing-prototypes]
1515 | int __hash_page(unsigned long trap, unsigned long ea, unsigned long dsisr,
| ^~~~~~~~~~~
../arch/powerpc/mm/book3s64/hash_utils.c:1850:6: error: no previous prototype for ‘low_hash_fault’ [-Werror=missing-prototypes]
1850 | void low_hash_fault(struct pt_regs *regs, unsigned long address, int rc)
| ^~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-13-clg@kaod.org
In commit 8150a153c0 ("powerpc/64s: Use early_mmu_has_feature() in
set_kuap()") we switched the KUAP code to use early_mmu_has_feature(),
to avoid a bug where we called set_kuap() before feature patching had
been done, leading to recursion and crashes.
That path, which called probe_kernel_read() from printk(), has since
been removed, see commit 2ac5a3bf70 ("vsprintf: Do not break early
boot with probing addresses").
Additionally probe_kernel_read() no longer invokes any KUAP routines,
since commit fe557319aa ("maccess: rename probe_kernel_{read,write}
to copy_{from,to}_kernel_nofault") and c331652534 ("powerpc: use
non-set_fs based maccess routines").
So it should now be safe to use mmu_has_feature() in the KUAP
routines, because we shouldn't invoke them prior to feature patching.
This is essentially a revert of commit 8150a153c0 ("powerpc/64s: Use
early_mmu_has_feature() in set_kuap()"), but we've since added a
second usage of early_mmu_has_feature() in get_kuap(), so we convert
that to use mmu_has_feature() as well.
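So the check becomes, roughly (a sketch, assuming the MMU_FTR_BOOK3S_KUAP
feature bit used by the 64s KUAP code):

static inline void set_kuap(unsigned long value)
{
	if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
		return;

	/* ... update the AMR as before ... */
}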
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Depends-on: c331652534 ("powerpc: use non-set_fs based maccess routines").
Link: https://lore.kernel.org/r/20201217005306.895685-1-mpe@ellerman.id.au
The "oprofile" user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.
This commit stops building oprofile for powerpc and removes any
reference to it from directories in arch/powerpc/ apart from
arch/powerpc/oprofile, which will be removed in the next commit (this is
broken into two commits as the size of the commit became very big, ~5k
lines).
Note that the member "oprofile_cpu_type" in "struct cpu_spec" isn't
removed as it was also used by other parts of the code.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
The L1D flush fallback functions are not recoverable vs interrupts,
yet the scv entry flush runs with MSR[EE]=1. This can result in a
timer (soft-NMI) or MCE or SRESET interrupt hitting here and overwriting
the EXRFI save area, which ends up corrupting userspace registers for
scv return.
Fix this by disabling RI and EE for the scv entry fallback flush.
Fixes: f79643787e ("powerpc/64s: flush L1D on kernel entry")
Cc: stable@vger.kernel.org # 5.9+ which also have flush L1D patch backport
Reported-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210111062408.287092-1-npiggin@gmail.com
Skirmisher reported on IRC that the 32-bit LE VDSO was hanging. This
turned out to be due to a branch to self in eg. __kernel_gettimeofday.
Looking at the disassembly with objdump -dR shows why:
00000528 <__kernel_gettimeofday>:
528: f0 ff 21 94 stwu r1,-16(r1)
52c: a6 02 08 7c mflr r0
530: f0 ff 21 94 stwu r1,-16(r1)
534: 14 00 01 90 stw r0,20(r1)
538: 05 00 9f 42 bcl 20,4*cr7+so,53c <__kernel_gettimeofday+0x14>
53c: a6 02 a8 7c mflr r5
540: ff ff a5 3c addis r5,r5,-1
544: c4 fa a5 38 addi r5,r5,-1340
548: f0 00 a5 38 addi r5,r5,240
54c: 01 00 00 48 bl 54c <__kernel_gettimeofday+0x24>
54c: R_PPC_REL24 .__c_kernel_gettimeofday
Because we don't process relocations for the VDSO, this branch remains
a branch from 0x54c to 0x54c.
With the preceding patch to prohibit R_PPC_REL24 relocations, we
instead get a build failure:
0000054c R_PPC_REL24 .__c_kernel_gettimeofday
00000598 R_PPC_REL24 .__c_kernel_clock_gettime
000005e4 R_PPC_REL24 .__c_kernel_clock_gettime64
00000630 R_PPC_REL24 .__c_kernel_clock_getres
0000067c R_PPC_REL24 .__c_kernel_time
arch/powerpc/kernel/vdso32/vdso32.so.dbg: dynamic relocations are not supported
The root cause is that we're branching to `.__c_kernel_gettimeofday`.
But this is 32-bit LE code, which doesn't use function descriptors, so
there are no dot symbols.
The reason we're trying to branch to a dot symbol is because we're
using the DOTSYM macro, but the ifdefs we use to define the DOTSYM
macro do not currently work for 32-bit LE.
So like previous commits we need to differentiate whether the current
compilation unit is 64-bit, rather than the kernel as a whole, i.e.
switch from CONFIG_PPC64 to __powerpc64__.
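Simplified, the macro ends up keyed off the compiler macro rather than the
kernel config (a sketch; the real definition also distinguishes ELFv1 from
ELFv2 on 64-bit):

#ifdef __powerpc64__
#define DOTSYM(a)	GLUE(.,a)	/* dot symbol for function descriptors */
#else
#define DOTSYM(a)	a		/* 32-bit: plain symbol, no descriptors */
#endif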
With that fixed 32-bit LE code gets the empty version of DOTSYM, which
just resolves to the original symbol name, leading to a direct branch
and no relocations:
000003f8 <__kernel_gettimeofday>:
3f8: f0 ff 21 94 stwu r1,-16(r1)
3fc: a6 02 08 7c mflr r0
400: f0 ff 21 94 stwu r1,-16(r1)
404: 14 00 01 90 stw r0,20(r1)
408: 05 00 9f 42 bcl 20,4*cr7+so,40c <__kernel_gettimeofday+0x14>
40c: a6 02 a8 7c mflr r5
410: ff ff a5 3c addis r5,r5,-1
414: f4 fb a5 38 addi r5,r5,-1036
418: f0 00 a5 38 addi r5,r5,240
41c: 85 06 00 48 bl aa0 <__c_kernel_gettimeofday>
Fixes: ab037dd87a ("powerpc/vdso: Switch VDSO to generic C implementation.")
Reported-by: "Will Springer <skirmisher@protonmail.com>"
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201218111619.1206391-3-mpe@ellerman.id.au
Pull powerpc updates from Michael Ellerman:
- Switch to the generic C VDSO, as well as some cleanups of our VDSO
setup/handling code.
- Support for KUAP (Kernel User Access Prevention) on systems using the
hashed page table MMU, using memory protection keys.
- Better handling of PowerVM SMT8 systems where all threads of a core
do not share an L2, allowing the scheduler to make better scheduling
decisions.
- Further improvements to our machine check handling.
- Show registers when unwinding interrupt frames during stack traces.
- Improvements to our pseries (PowerVM) partition migration code.
- Several series from Christophe refactoring and cleaning up various
parts of the 32-bit code.
- Other smaller features, fixes & cleanups.
Thanks to: Alan Modra, Alexey Kardashevskiy, Andrew Donnellan, Aneesh
Kumar K.V, Ard Biesheuvel, Athira Rajeev, Balamuruhan S, Bill Wendling,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King,
Daniel Axtens, David Hildenbrand, Frederic Barrat, Ganesh Goudar,
Gautham R. Shenoy, Geert Uytterhoeven, Giuseppe Sacco, Greg Kurz,
Harish, Jan Kratochvil, Jordan Niethe, Kaixu Xia, Laurent Dufour,
Leonardo Bras, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu
Desnoyers, Nathan Lynch, Nicholas Piggin, Oleg Nesterov, Oliver
O'Halloran, Oscar Salvador, Po-Hsu Lin, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sandipan Das, Sebastian Andrzej
Siewior , Segher Boessenkool, Srikar Dronamraju, Tyrel Datwyler, Uwe
Kleine-König, Vincent Stehlé, Youling Tang, and Zhang Xiaoxu.
* tag 'powerpc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (304 commits)
powerpc/32s: Fix cleanup_cpu_mmu_context() compile bug
powerpc: Add config fragment for disabling -Werror
powerpc/configs: Add ppc64le_allnoconfig target
powerpc/powernv: Rate limit opal-elog read failure message
powerpc/pseries/memhotplug: Quieten some DLPAR operations
powerpc/ps3: use dma_mapping_error()
powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
powerpc/perf: Fix Threshold Event Counter Multiplier width for P10
powerpc/mm: Fix hugetlb_free_pmd_range() and hugetlb_free_pud_range()
KVM: PPC: Book3S HV: Fix mask size for emulated msgsndp
KVM: PPC: fix comparison to bool warning
KVM: PPC: Book3S: Assign boolean values to a bool variable
powerpc: Inline setup_kup()
powerpc/64s: Mark the kuap/kuep functions non __init
KVM: PPC: Book3S HV: XIVE: Add a comment regarding VP numbering
powerpc/xive: Improve error reporting of OPAL calls
powerpc/xive: Simplify xive_do_source_eoi()
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_EOI_FW
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_MASK_FW
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_SHIFT_BUG
...
Pull tracing updates from Steven Rostedt:
"The major update to this release is that there's a new arch config
option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.
Currently, only x86_64 enables it. All the ftrace callbacks now take a
struct ftrace_regs instead of a struct pt_regs. If the architecture
has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
have enough information to read the arguments of the function being
traced, as well as access to the stack pointer.
This way, if a user (like live kernel patching) only cares about the
arguments, then it can avoid using the heavier weight "regs" callback,
that puts in enough information in the struct ftrace_regs to simulate
a breakpoint exception (needed for kprobes).
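The callback prototype after this change looks roughly like the following
(the function name is illustrative):

void my_tracer_func(unsigned long ip, unsigned long parent_ip,
		    struct ftrace_ops *op, struct ftrace_regs *fregs);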
A new config option that audits the timestamps of the ftrace ring
buffer at most every event recorded.
Ftrace recursion protection has been cleaned up to move the protection
to the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace
to do it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that
lists all the places that triggered the recursion protection of the
function tracer. This will show where things need to be fixed as
recursion slows down the function tracer.
The eval enum mapping updates done at boot up are now offloaded to a
work queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes"
* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Offload eval map updates to a work queue
Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
ring-buffer: Add rb_check_bpage in __rb_allocate_pages
ring-buffer: Fix two typos in comments
tracing: Drop unneeded assignment in ring_buffer_resize()
tracing: Disable ftrace selftests when any tracer is running
seq_buf: Avoid type mismatch for seq_buf_init
ring-buffer: Fix a typo in function description
ring-buffer: Remove obsolete rb_event_is_commit()
ring-buffer: Add test to validate the time stamp deltas
ftrace/documentation: Fix RST C code blocks
tracing: Clean up after filter logic rewriting
tracing: Remove the useless value assignment in test_create_synth_event()
livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
MAINTAINERS: assign ./fs/tracefs to TRACING
tracing: Fix some typos in comments
ftrace: Remove unused varible 'ret'
ring-buffer: Add recording of ring buffer recursion into recursed_functions
...
Currently pmac32_defconfig with SMP=y doesn't build:
arch/powerpc/platforms/powermac/smp.c:
error: implicit declaration of function 'cleanup_cpu_mmu_context'
It would be nice for consistency if all platforms clear mm_cpumask and
flush TLBs on unplug, but the TLB invalidation bug described in commit
01b0f0eae0 ("powerpc/64s: Trim offlined CPUs from mm_cpumasks") only
applies to 64s and for now we only have the TLB flush code for that
platform.
So just add an empty version for 32-bit Book3S.
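The 32-bit Book3S stub is just an empty inline, along the lines of:

static inline void cleanup_cpu_mmu_context(void) { }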
Fixes: 01b0f0eae0 ("powerpc/64s: Trim offlined CPUs from mm_cpumasks")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Change log based on comments from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe:
"This sits on top of of the core entry/exit and x86 entry branch from
the tip tree, which contains the generic and x86 parts of this work.
Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL.
With that done, we can get rid of JOBCTL_TASK_WORK from task_work and
signal.c, and also remove a deadlock work-around in io_uring around
knowing that signal based task_work waking is invoked with the sighand
wait queue head lock.
The motivation for this work is to decouple signal notify based
task_work, of which io_uring is a heavy user, from sighand. The
sighand lock becomes a huge contention point, particularly for
threaded workloads where it's shared between threads. Even outside of
threaded applications it's slower than it needs to be.
Roman Gershman <romger@amazon.com> reported that his networked
workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU
after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all
spent hammering on the sighand lock, showing 57% of the CPU time there
[1].
There are further cleanups possible on top of this. One example is
TIF_PATCH_PENDING, where a patch already exists to use
TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more
consolidation, but the work stands on its own as well"
[1] https://github.com/axboe/liburing/issues/215
* tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits)
io_uring: remove 'twa_signal_ok' deadlock work-around
kernel: remove checking for TIF_NOTIFY_SIGNAL
signal: kill JOBCTL_TASK_WORK
io_uring: JOBCTL_TASK_WORK is no longer used by task_work
task_work: remove legacy TWA_SIGNAL path
sparc: add support for TIF_NOTIFY_SIGNAL
riscv: add support for TIF_NOTIFY_SIGNAL
nds32: add support for TIF_NOTIFY_SIGNAL
ia64: add support for TIF_NOTIFY_SIGNAL
h8300: add support for TIF_NOTIFY_SIGNAL
c6x: add support for TIF_NOTIFY_SIGNAL
alpha: add support for TIF_NOTIFY_SIGNAL
xtensa: add support for TIF_NOTIFY_SIGNAL
arm: add support for TIF_NOTIFY_SIGNAL
microblaze: add support for TIF_NOTIFY_SIGNAL
hexagon: add support for TIF_NOTIFY_SIGNAL
csky: add support for TIF_NOTIFY_SIGNAL
openrisc: add support for TIF_NOTIFY_SIGNAL
sh: add support for TIF_NOTIFY_SIGNAL
um: add support for TIF_NOTIFY_SIGNAL
...
Pull asm-generic mmu-context cleanup from Arnd Bergmann:
"This is a cleanup series from Nicholas Piggin, preparing for later
changes. The asm/mmu_context.h header is generalized and common code
is moved to asm-generic/mmu_context.h.
This saves a bit of code and makes it easier to change in the future"
* tag 'asm-generic-mmu-context-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (25 commits)
h8300: Fix generic mmu_context build
m68k: mmu_context: Fix Sun-3 build
xtensa: use asm-generic/mmu_context.h for no-op implementations
x86: use asm-generic/mmu_context.h for no-op implementations
um: use asm-generic/mmu_context.h for no-op implementations
sparc: use asm-generic/mmu_context.h for no-op implementations
sh: use asm-generic/mmu_context.h for no-op implementations
s390: use asm-generic/mmu_context.h for no-op implementations
riscv: use asm-generic/mmu_context.h for no-op implementations
powerpc: use asm-generic/mmu_context.h for no-op implementations
parisc: use asm-generic/mmu_context.h for no-op implementations
openrisc: use asm-generic/mmu_context.h for no-op implementations
nios2: use asm-generic/mmu_context.h for no-op implementations
nds32: use asm-generic/mmu_context.h for no-op implementations
mips: use asm-generic/mmu_context.h for no-op implementations
microblaze: use asm-generic/mmu_context.h for no-op implementations
m68k: use asm-generic/mmu_context.h for no-op implementations
ia64: use asm-generic/mmu_context.h for no-op implementations
hexagon: use asm-generic/mmu_context.h for no-op implementations
csky: use asm-generic/mmu_context.h for no-op implementations
...
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
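A minimal usage sketch of the new API (the helper name and buffer are
illustrative):

static void copy_from_highmem_page(struct page *page, void *buffer)
{
	void *addr = kmap_local_page(page);

	/*
	 * The mapping is task-local; preemption stays enabled, only
	 * migration is disabled while it is in use.
	 */
	memcpy(buffer, addr, PAGE_SIZE);
	kunmap_local(addr);
}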
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
setup_kup() is used by both 64-bit and 32-bit code. However on 64-bit
it must not be __init, because it's used for CPU hotplug, whereas on
32-bit it should be __init because it calls setup_kuap/kuep() which
are __init.
We worked around that problem in the past by marking it __ref, see
commit 67d53f30e2 ("powerpc/mm: fix section mismatch for
setup_kup()").
Marking it __ref basically just omits it from section mismatch
checking, which can lead to bugs, and in fact it did, see commit
44b4c4450f ("powerpc/64s: Mark the kuap/kuep functions non __init")
We can avoid all these problems by just making it static inline.
Because all it does is call other functions, making it inline actually
shrinks the 32-bit vmlinux by ~76 bytes.
Make it __always_inline as pointed out by Christophe.
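The resulting helper is just a thin wrapper, roughly:

static __always_inline void setup_kup(void)
{
	setup_kuep(disable_kuep);
	setup_kuap(disable_kuap);
}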
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201214123011.311024-1-mpe@ellerman.id.au
Pull perf updates from Thomas Gleixner:
"Core:
- Better handling of page table leaves on architectures which have
non-pagetable-aligned huge/large pages. For such
architectures a leaf can actually be part of a larger entry.
- Prevent a deadlock vs exec_update_mutex
Architectures:
- The related updates for page size calculation of leaf entries
- The usual churn to support new CPUs
- Small fixes and improvements all over the place"
* tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
perf/x86/intel: Add Tremont Topdown support
uprobes/x86: Fix fall-through warnings for Clang
perf/x86: Fix fall-through warnings for Clang
kprobes/x86: Fix fall-through warnings for Clang
perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
x86/kprobes: Restore BTF if the single-stepping is cancelled
perf: Break deadlock involving exec_update_mutex
sparc64/mm: Implement pXX_leaf_size() support
powerpc/8xx: Implement pXX_leaf_size() support
arm64/mm: Implement pXX_leaf_size() support
perf/core: Fix arch_perf_get_page_size()
mm: Introduce pXX_leaf_size()
mm/gup: Provide gup_get_pte() more generic
perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
perf/x86/intel/uncore: Add Rocket Lake support
perf/x86/msr: Add Rocket Lake CPU support
perf/x86/cstate: Add Rocket Lake CPU support
perf/x86/intel: Add Rocket Lake CPU support
perf,mm: Handle non-page-table-aligned hugetlbfs
...
This is a simple cleanup to identify easily all flags of the XIVE
interrupt structure. The interrupts flagged with XIVE_IRQ_FLAG_NO_EOI
are the escalations used to wake up vCPUs in KVM. They are handled
very differently from the rest.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201210171450.1933725-3-clg@kaod.org
On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the ibm,thread-groups device-tree
property), we currently print the incorrect shared_cpu_map/list for
L2-cache in the sysfs.
This patch reports the correct shared_cpu_map/list on such platforms.
Example:
On a platform with "ibm,thread-groups" set to
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007
This indicates that threads {0,2,4,6} in the core share the L2-cache
and threads {1,3,5,7} in the core share the L2 cache.
However, without the patch, the shared_cpu_map/list for L2 for CPUs 0,
1 is reported in the sysfs as follows:
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,000000ff
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000ff
With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,00000055
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000aa
This patch also defines cpu_l2_cache_mask() for the !CONFIG_SMP case.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-6-git-send-email-ego@linux.vnet.ibm.com
On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".
This patch adds support for detecting this, and when present, populates
the cpu_l2_cache_mask of every CPU with the core siblings which share
the L2 cache with that CPU, as specified by the "ibm,thread-groups"
property array.
On a platform with the following "ibm,thread-groups" configuration
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007
Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE
The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
00000000 00000002 00000004 00000006
00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a
single group. There are "2" groups of threads, where each group contains
"4" threads. The groups are {0,2,4,6} and {1,3,5,7}.
With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE
The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-5-git-send-email-ego@linux.vnet.ibm.com
Christophe Leroy wrote:
> I can help with powerpc 8xx. It is a 32 bits powerpc. The PGD has 1024
> entries, that means each entry maps 4M.
>
> Page sizes are 4k, 16k, 512k and 8M.
>
> For the 8M pages we use hugepd with a single entry. The two related PGD
> entries point to the same hugepd.
>
> For the other sizes, they are in standard page tables. 16k pages appear
> 4 times in the page table. 512k entries appear 128 times in the page
> table.
>
> When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> hugepd which holds the single 8M entry.
>
> In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
>
> _PAGE_HUGE means it is a 512k page
> _PAGE_SPS means it is not a 4k page
>
> The kernel can be built either with 4k pages as the standard page size, or
> 16k pages. It doesn't change the page table layout though.
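Based on the bits described above, the 8xx leaf-size helper can be sketched
as follows (the 8M hugepd case goes through the hugepd path and is not
shown):

static inline unsigned long pte_leaf_size(pte_t pte)
{
	pte_basic_t val = pte_val(pte);

	if (val & _PAGE_HUGE)		/* 512k page */
		return SZ_512K;
	if (val & _PAGE_SPS)		/* not a 4k page, i.e. 16k */
		return SZ_16K;
	return SZ_4K;
}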
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126121121.364451610@infradead.org