This is a simple cleanup to identify easily all flags of the XIVE
interrupt structure. The interrupts flagged with XIVE_IRQ_FLAG_NO_EOI
are the escalations used to wake up vCPUs in KVM. They are handled
very differently from the rest.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201210171450.1933725-3-clg@kaod.org
On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the ibm,thread-groups device-tree
property), we currently print the incorrect shared_cpu_map/list for
L2-cache in the sysfs.
This patch reports the correct shared_cpu_map/list on such platforms.
Example:
On a platform with "ibm,thread-groups" set to
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007
This indicates that threads {0,2,4,6} in the core share the L2-cache
and threads {1,3,5,7} in the core share the L2 cache.
However, without the patch, the shared_cpu_map/list for L2 for CPUs 0,
1 is reported in the sysfs as follows:
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,000000ff
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000ff
With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,00000055
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000aa
This patch also defines cpu_l2_cache_mask() for !CONFIG_SMP case.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-6-git-send-email-ego@linux.vnet.ibm.com
On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".
This patch adds support for detecting this, and when present, populate
the populating the cpu_l2_cache_mask of every CPU to the core-siblings
which share L2 with the CPU as specified in the by the
"ibm,thread-groups" property array.
On a platform with the following "ibm,thread-group" configuration
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007
Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE
The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
00000000 00000002 00000004 00000006
00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.
With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE
The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-5-git-send-email-ego@linux.vnet.ibm.com
Christophe Leroy wrote:
> I can help with powerpc 8xx. It is a 32 bits powerpc. The PGD has 1024
> entries, that means each entry maps 4M.
>
> Page sizes are 4k, 16k, 512k and 8M.
>
> For the 8M pages we use hugepd with a single entry. The two related PGD
> entries point to the same hugepd.
>
> For the other sizes, they are in standard page tables. 16k pages appear
> 4 times in the page table. 512k entries appear 128 times in the page
> table.
>
> When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> hugepd with holds the single 8M entry.
>
> In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
>
> _PAGE_HUGE means it is a 512k page
> _PAGE_SPS means it is not a 4k page
>
> The kernel can by build either with 4k pages as standard page size, or
> 16k pages. It doesn't change the page table layout though.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126121121.364451610@infradead.org
Exception fixup doesn't require the heady full regs saving,
do it from do_page_fault() directly.
For that, split bad_page_fault() in two parts.
As bad_page_fault() can also be called from other places than
handle_page_fault(), it will still perform exception fixup and
fallback on __bad_page_fault().
handle_page_fault() directly calls __bad_page_fault() as the
exception fixup will now be done by do_page_fault()
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bd07d6fef9237614cd6d318d8f19faeeadaa816b.1607491748.git.christophe.leroy@csgroup.eu
book3s/32 has two main families:
- CPU with 603 cores that don't have HASH PTE table and
perform SW TLB loading.
- Other CPUs based on 604+ cores that have HASH PTE table.
This leads to some complex logic and additionnal code to
support both. This makes sense for distribution kernels
that aim at running on any CPU, but when you are fine
tuning a kernel for an embedded 603 based board you
don't need all the HASH logic.
Allow selection of support for each family, in order to opt
out unneeded parts of code. At least one must be selected.
Note that some of the CPU supporting HASH also support SW TLB
loading, however it is not supported by Linux kernel at the
time being, because they do not have alternate registers in
the TLB miss exception handlers.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8dde0cdb629a71abc29b0d85a52a86e920376cb6.1603348103.git.christophe.leroy@csgroup.eu
sys_rtas() cannot call ibm,suspend-me directly in the same way it
handles other inputs. Instead it must dispatch the request to code
that can first perform the H_JOIN sequence before any call to
ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair
amount of platform-specific code to implement this.
Since a different, more robust implementation of the suspend sequence
is now in the pseries platform code, we want to dispatch the request
there.
Note that invoking ibm,suspend-me via the RTAS syscall is all but
deprecated; this change preserves ABI compatibility for old programs
while providing to them the benefit of the new partition suspend
implementation. This is a behavior change in that the kernel performs
the device tree update and firmware activation before returning, but
experimentation indicates this is tolerated fine by legacy user space.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207215200.1785968-16-nathanl@linux.ibm.com
Provide a documented wrapper function for the ibm,activate-firmware
service, which must be called after a partition migration or
hibernation.
If the function is absent or the call fails, the OS will continue to
run normally with the current firmware, so there is no need to perform
any recovery. Just log it and continue.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207215200.1785968-6-nathanl@linux.ibm.com
The pseries partition suspend sequence requires that all active CPUs
call H_JOIN, which suspends all but one of them with interrupts
disabled. The "chosen" CPU is then to call ibm,suspend-me to complete
the suspend. Upon returning from ibm,suspend-me, the chosen CPU is to
use H_PROD to wake the joined CPUs.
Using on_each_cpu() for this, as rtas_ibm_suspend_me() does to
implement partition migration, is susceptible to deadlock with other
users of on_each_cpu() and with users of stop_machine APIs. The
callback passed to on_each_cpu() is not allowed to synchronize with
other CPUs in the way it is used here.
Complicating the fix is the fact that rtas_ibm_suspend_me() also
occupies the function name that should be used to provide a more
conventional wrapper for ibm,suspend-me. Rename rtas_ibm_suspend_me()
to rtas_ibm_suspend_me_unsafe() to free up the name and indicate that
it should not gain users.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207215200.1785968-4-nathanl@linux.ibm.com
This partially reverts commit eb232b1624 ("powerpc/book3s64/kuap: Improve
error reporting with KUAP") and update the fault handler to print
[ 55.022514] Kernel attempted to access user page (7e6725b70000) - exploit attempt? (uid: 0)
[ 55.022528] BUG: Unable to handle kernel data access on read at 0x7e6725b70000
[ 55.022533] Faulting instruction address: 0xc000000000e8b9bc
[ 55.022540] Oops: Kernel access of bad area, sig: 11 [#1]
....
when the kernel access userspace address without unlocking AMR.
bad_kuap_fault() is added as part of commit 5e5be3aed2 ("powerpc/mm: Detect
bad KUAP faults") to catch userspace access incorrectly blocked by AMR. Hence
retain the full stack dump there even with hash translation. Also, add a comment
explaining the difference between hash and radix.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201208031539.84878-1-aneesh.kumar@linux.ibm.com
The value in CIABR persists across kexec which can lead to unintended
results when the new kernel hits the old kernel's breakpoint. For
example:
0:mon> bi $loadavg_proc_show
0:mon> b
type address
1 inst c000000000519060 loadavg_proc_show+0x0/0x130
0:mon> x
$ kexec -l /mnt/vmlinux --initrd=/mnt/rootfs.cpio.gz --append='xmon=off'
$ kexec -e
$ cat /proc/loadavg
Trace/breakpoint trap
Make sure CIABR is cleared so this does not happen.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207010519.15597-1-jniethe5@gmail.com
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 5.10:
- Three commits fixing possible missed TLB invalidations for
multi-threaded processes when CPUs are hotplugged in and out.
- A fix for a host crash triggerable by host userspace (qemu) in KVM
on Power9.
- A fix for a host crash in machine check handling when running HPT
guests on a HPT host.
- One commit fixing potential missed TLB invalidations when using the
hash MMU on Power9 or later.
- A regression fix for machines with CPUs on node 0 but no memory.
Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
Dronamraju"
* tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
powerpc/numa: Fix a regression on memoryless node 0
powerpc/64s: Trim offlined CPUs from mm_cpumasks
kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation
The placeholder for instruction selection should use the second
argument's operand, which is %1, not %0. This could generate incorrect
assembly code if the memory addressing of operand %0 is a different
form from that of operand %1.
Also remove the %Un placeholder because having %Un placeholders
for two operands which are based on the same local var (ptep) doesn't
make much sense. By the way, it doesn't change the current behaviour
because "<>" constraint is missing for the associated "=m".
[chleroy: revised commit log iaw segher's comments and removed %U0]
Fixes: 9bf2b5cdc5 ("powerpc: Fixes for CONFIG_PTE_64BIT for SMP support")
Cc: <stable@vger.kernel.org> # v2.6.28+
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/96354bd77977a6a933fe9020da57629007fdb920.1603358942.git.christophe.leroy@csgroup.eu
Use SPRN_SPRG_SCRATCH2 as an alternative scratch register in
the early part of DSI prolog in order to avoid clobbering
SPRN_SPRG_SCRATCH0/1 used by other prologs.
The 603 doesn't like a jump from DataLoadTLBMiss to the 10 nops
that are now in the beginning of DSI exception as a result of
the feature section. To workaround this, add a jump as alternative.
It also avoids fetching 10 nops for nothing.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f9f8df2a2be93568768ef1ac793639f7914cf103.1606285014.git.christophe.leroy@csgroup.eu